Act2Cut: Continuous Next-Shot Video Narrative Match on Action-Cut

Cailin Zhuang1,2,3*† Yaoqi Hu2,4* Zheng Dong2,4
Shiwen Zhang1 Haibin Huang1‡ Chi Zhang1‡ Xuelong Li1‡
1 Institute of Artificial Intelligence, China Telecom (TeleAI) · 2 AIGC Research · 3 ShanghaiTech University · 4 Chongqing University of Technology
* Equal contribution. † Work done during internship at TeleAI. ‡ Corresponding author.
Match on Action

Continuous Narrative Generation

ACT Ⅰ

Abstract

Achieving cinematic coherence in video generation requires more than identity or scene consistency: it demands Action Continuity. In filmmaking, the Match on Action (Action-Cut) technique, which cuts mid-motion so that the action carries across the edit, is essential for narrative flow. Current multi-shot models frequently suffer from action resets, momentary flickering, or breaks in physical logic at abrupt shot transitions.

We introduce Act2Cut, the first specialized framework for continuous action-driven multi-shot video generation. It combines Transitional Boundary Residual (TBR) and GLoS-RoPE, which strengthen spatio-temporal coherence between adjacent shots, with Shot Causal Mask (SCM) and Hierarchical Context Mask (HCM), which provide global temporal causality and local element isolation. Act2Cut unifies Text-to-Video, Image-to-Video, and Video-to-Video generation in a single native pass.

ACT Ⅱ

Method & Results

Demo video, architecture overview, and qualitative results.

[Demo video]
[Fig. 3: Architecture diagram]
TBR

Transitional Boundary Residual

Enhances spatial and kinetic consistency at the edit point between adjacent shots. Ensures the terminal frame of shot n and the initial frame of shot n+1 maintain rigorous 3D spatial alignment and smooth momentum continuity, preventing visual "popping" at hard cuts.

GLoS-RoPE

Global-Local Shot RoPE

Decouples temporal position encoding across shot boundaries. Global tokens carry sequence-level context while local tokens remain precisely positioned within their own shot. This enables cross-shot coherence without rotary position embeddings (RoPE) bleeding between unrelated frames.
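
As a rough illustration of the global/local split, each frame can be given two temporal indices: a shot index and a frame index that restarts at zero in every shot. This is only a sketch under assumptions: the helper name `glos_positions` and the two-index layout are hypothetical and are not taken from the GLoS-RoPE formulation itself.

```python
from typing import List, Tuple


def glos_positions(shot_lengths: List[int]) -> List[Tuple[int, int]]:
    """Return (shot_index, local_frame_index) for every frame in the sequence.

    Hypothetical helper: the local index restarts at 0 in each shot, so a
    frame's position within its shot does not depend on earlier shot lengths.
    """
    positions = []
    for shot_idx, length in enumerate(shot_lengths):
        for local_idx in range(length):
            positions.append((shot_idx, local_idx))
    return positions


# Two shots of 3 and 2 frames.
print(glos_positions([3, 2]))
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]
```

In a real model each index would drive its own set of rotary frequencies; how the two are combined is left open here.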

SCM

Shot Causal Mask

Applies a unidirectional causal constraint in self-attention so that shot x_i can attend to the history of x_{i−1}, but not vice versa. Prevents "semantic leakage" where future shot compositions contaminate earlier frames.
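
A block-wise causal mask of this kind can be sketched in a few lines of NumPy. The sketch assumes that tokens attend freely within their own shot and to all earlier shots; the exact visibility rule used by SCM may differ, and `shot_causal_mask` is a hypothetical name.

```python
import numpy as np


def shot_causal_mask(shot_lengths):
    """Boolean mask M where M[q, k] is True if query token q may attend to key k.

    Assumption: attention is unrestricted inside a shot and toward earlier
    shots, and blocked toward later shots (block-wise causal).
    """
    # Label every token with the index of the shot it belongs to.
    shot_ids = np.repeat(np.arange(len(shot_lengths)), shot_lengths)
    # A query may see a key only if the key's shot is not in the future.
    return shot_ids[:, None] >= shot_ids[None, :]


mask = shot_causal_mask([2, 3])
print(mask.astype(int))
# [[1 1 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 1 1]
#  [1 1 1 1 1]
#  [1 1 1 1 1]]
```

Rows belonging to the first shot have zeros over the second shot's columns, so later compositions cannot influence earlier frames.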

HCM

Hierarchical Context Mask

Partitions cross-attention into global, relational, and local windows. Global tokens act as universal anchors; per-shot cinematic language remains independently modulated, ensuring precise control over "what happens" vs. "how it is framed" for each shot.
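
The partitioning can be pictured as a boolean cross-attention mask between video tokens and text tokens. The sketch below is an assumption-laden illustration: the scope tags, the rule that a relational token is visible to shots s and s + 1, and the name `hierarchical_context_mask` are all hypothetical.

```python
import numpy as np


def hierarchical_context_mask(video_shot_ids, text_scopes):
    """Cross-attention mask: rows are video tokens, columns are text tokens.

    video_shot_ids: shot index of each video token.
    text_scopes: one tuple per text token, either
        ("global",)        visible to every shot
        ("local", s)       visible only to shot s
        ("relational", s)  visible to shots s and s + 1 (assumed layout)
    """
    mask = np.zeros((len(video_shot_ids), len(text_scopes)), dtype=bool)
    for col, scope in enumerate(text_scopes):
        for row, shot in enumerate(video_shot_ids):
            if scope[0] == "global":
                mask[row, col] = True
            elif scope[0] == "local":
                mask[row, col] = shot == scope[1]
            elif scope[0] == "relational":
                mask[row, col] = shot in (scope[1], scope[1] + 1)
    return mask


video_shot_ids = [0, 0, 1, 1]
text_scopes = [("global",), ("local", 0), ("local", 1), ("relational", 0)]
mask = hierarchical_context_mask(video_shot_ids, text_scopes)
print(mask.astype(int))
# [[1 1 0 1]
#  [1 1 0 1]
#  [1 0 1 1]
#  [1 0 1 1]]
```

Each shot sees the shared global column and its own local column only, which is one way to keep per-shot framing instructions from leaking into neighbouring shots.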

Dataset · ActCutVid-200K

Data Collection & Processing

The data is sourced from films, TV series, and AI-generated content (Sora2, Seedance2.0). We pair TransNetV2 shot-boundary detection with our novel Similarity Clustering Boundary Optimizer (SCBO), which uses CLIP/DINO features, to calibrate shot boundaries precisely.

  • 0–4 Likert Scale: Action-Cut quality is scored with Gemini-3-Pro, separating Strong Action-Cut (score 4) from Weak Action-Cut (scores 1–3).
  • Hierarchical Annotations: decoupled cinematographic shot-language across Global · Local · Action · Transition layers.
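
To give a feel for feature-based boundary calibration, the sketch below snaps a detector's candidate cut to the largest drop in adjacent-frame cosine similarity within a small window. This is a plain similarity-drop heuristic, not the SCBO algorithm (it performs no clustering, and SCBO's details are not described here). The function name, the `window` parameter, and the toy embeddings are assumptions; in practice the embeddings would come from a model such as CLIP or DINO.

```python
import numpy as np


def refine_boundary(embeddings, candidate, window=3):
    """Snap a candidate cut to the largest adjacent-frame similarity drop nearby.

    embeddings: (num_frames, dim) array of per-frame features.
    candidate: index of the first frame of the new shot, as proposed by a detector.
    Returns the refined index of the first frame of the new shot.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # sim[i] is the cosine similarity between frame i and frame i + 1.
    sim = np.sum(unit[:-1] * unit[1:], axis=1)
    lo = max(candidate - 1 - window, 0)
    hi = min(candidate - 1 + window, len(sim) - 1)
    best = lo + int(np.argmin(sim[lo:hi + 1]))
    return best + 1


# Toy clip: frames 0-4 share one feature, frames 5-9 another, so the true cut is at 5.
embeddings = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 5)
print(refine_boundary(embeddings, candidate=6))  # 5
```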
Hierarchical Prompting: S1 (MCU) → Cut → S2 (OTS)
  • Global (Env · Scene · Style)

    Soft, warm sunlight filters through large windows of a quiet apartment.

  • Local (Static / Dynamic Cinematography)

    Shot 1: Medium Close-Up, focus on hand. Shot 2: Over-the-Shoulder, focus on face.

  • Action (Subject behaviour · Motion)

    [Character 1] is tying up her hair, then turns slightly toward [Character 2].

  • Transition (Edit-point relation)

    Hard Cut · Cut-Out · Shot Reverse Shot.
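
The four layers above map naturally onto a small record type. The following sketch reuses the example text from the list; the class name `ShotPairAnnotation`, its field names, and the bracketed prompt layout are hypothetical and do not describe the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShotPairAnnotation:
    """Hypothetical container for the four annotation layers."""
    global_context: str  # environment, scene, style
    local: List[str]     # per-shot static / dynamic cinematography
    action: str          # subject behaviour and motion
    transition: str      # edit-point relation

    def to_prompt(self) -> str:
        """Flatten the layers into one labelled prompt string."""
        shots = " ".join(f"Shot {i + 1}: {desc}" for i, desc in enumerate(self.local))
        return (
            f"[Global] {self.global_context}\n"
            f"[Local] {shots}\n"
            f"[Action] {self.action}\n"
            f"[Transition] {self.transition}"
        )


example = ShotPairAnnotation(
    global_context="Soft, warm sunlight filters through large windows of a quiet apartment.",
    local=["Medium Close-Up, focus on hand.", "Over-the-Shoulder, focus on face."],
    action="[Character 1] is tying up her hair, then turns slightly toward [Character 2].",
    transition="Hard Cut · Cut-Out · Shot Reverse Shot.",
)
print(example.to_prompt())
```

Keeping the layers as separate fields makes it straightforward to route global text and per-shot text to different attention windows.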

Quantitative Snapshot

Performance Comparison

ActCutBench Quantitative Results

Dataset · ActCutVid-200K
Method            FVD (↓)   CLIP Score (↑)   Action Continuity (↑)   Visual Coherence (↑)
Act2Cut (Ours)    142.3     0.31             0.88                    0.85
HoloCine-14B      156.8     0.29             0.76                    0.84
MSM-14B           151.2     0.28             0.72                    0.81
MSM-1.3B          168.1     0.27             0.65                    0.75

Numbers are placeholder values for layout demonstration.

ACT Ⅵ

BibTeX

@article{zhuang2026act2cut,
  title={Act2Cut: Continuous Next-Shot Video Narrative Match on Action-Cut},
  author={Zhuang, Cailin and Hu, Yaoqi and Dong, Zheng and Zhang, Shiwen and Huang, Haibin and Zhang, Chi and Li, Xuelong},
  journal={ACM Transactions on Graphics (TOG)},
  volume={1},
  number={1},
  year={2026},
  publisher={ACM New York, NY, USA}
}