Action: [Character 1] is tying up her hair. Shot 1: Medium Close-Up, focus on hand. Shot 2: Over-the-Shoulder, focus on face.
Act2Cut: Continuous Next-Shot Video Narrative Match on Action-Cut
Continuous Narrative Generation
Abstract
Achieving cinematic coherence in video generation requires more than just identity or scene consistency—it demands Action Continuity. In filmmaking, the Match on Action (Action-Cut) technique is essential for narrative flow. Current multi-shot models frequently suffer from action reset, instantaneous flickering, or disruptions in physical logic during abrupt shot transitions.
We introduce Act2Cut, the first specialized framework for continuous action-driven multi-shot video generation. Act2Cut introduces Transitional Boundary Residual (TBR) and Global-Local Shot RoPE (GLoS-RoPE) to improve spatio-temporal coherence between adjacent shots, alongside Shot Causal Mask (SCM) and Hierarchical Context Mask (HCM) to achieve global temporal causality and local element isolation. It unifies Text-to-Video, Image-to-Video, and Video-to-Video generation into a single native pass.
Method & Results
Demo video, architecture overview, and qualitative results.
Transitional Boundary Residual
Enhances spatial and kinetic consistency at the edit point between adjacent shots. Ensures the terminal frame of shot n and the initial frame of shot n+1 maintain rigorous 3D spatial alignment and smooth momentum continuity, preventing visual "popping" at hard cuts.
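The two failure modes named here, a position jump and a momentum break at the cut, can be made concrete with a small diagnostic. The sketch below is not the TBR module, whose internals are not described on this page. It is a hypothetical check that assumes one tracked point in shared 3D world coordinates, one frame elapsing across the cut, and constant velocity as the baseline. The function name `boundary_residual` is invented for illustration.

```python
import math

def boundary_residual(prev_track, next_track):
    """Compare motion across a cut for one tracked 3D point.

    prev_track: the last two (x, y, z) positions in shot n.
    next_track: the first two (x, y, z) positions in shot n+1.
    Returns (position_gap, velocity_gap) as Euclidean distances.
    Assumption: constant velocity and exactly one frame across the cut.
    """
    v_prev = [b - a for a, b in zip(prev_track[0], prev_track[1])]
    v_next = [b - a for a, b in zip(next_track[0], next_track[1])]
    # Where the point should be on the first frame of shot n+1.
    predicted = [p + v for p, v in zip(prev_track[1], v_prev)]
    position_gap = math.dist(predicted, next_track[0])
    velocity_gap = math.dist(v_prev, v_next)
    return position_gap, velocity_gap

# A hand moving one unit per frame along x, continued cleanly after the cut.
smooth = boundary_residual([(0, 0, 0), (1, 0, 0)], [(2, 0, 0), (3, 0, 0)])
# The same motion restarted from the origin after the cut ("action reset").
reset = boundary_residual([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 0)])
```

In this toy case the clean continuation gives zero for both gaps, while the reset keeps the velocity but produces a position gap of two units, which is the kind of discontinuity a boundary term would need to penalize.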
Global-Local Shot RoPE
Decouples temporal position encoding across shot boundaries. Global tokens carry sequence-level context while local tokens remain precisely positioned within their own shot — enabling cross-shot coherence without RoPE bleed between unrelated frames.
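As a rough illustration of the decoupling, each frame can be given two temporal indices: a local one that restarts at every shot and a global one that runs over the whole sequence. The sketch below is an assumption about how such indices might be laid out, not the paper's GLoS-RoPE implementation. `shot_positions` is a hypothetical helper, and `rope_angles` only reproduces the standard RoPE angle formula (position times `base ** (-2k / dim)`).

```python
def shot_positions(shot_lengths):
    """Return (shot_index, local_position, global_position) for every frame.

    local_position restarts at 0 in each shot; global_position counts
    frames across the whole multi-shot sequence.
    """
    positions = []
    global_pos = 0
    for shot_index, length in enumerate(shot_lengths):
        for local_pos in range(length):
            positions.append((shot_index, local_pos, global_pos))
            global_pos += 1
    return positions

def rope_angles(position, dim, base=10000.0):
    """Standard RoPE rotation angles for one position (dim must be even)."""
    return [position * base ** (-2 * k / dim) for k in range(dim // 2)]

# Two shots of 2 and 3 frames.
frames = shot_positions([2, 3])
# Local angles for the first frame of shot 1 match those of the first frame
# of shot 0, because both have local position 0.
first_of_shot_0 = rope_angles(frames[0][1], dim=4)
first_of_shot_1 = rope_angles(frames[2][1], dim=4)
```

With local indices, the first frame of each shot receives identical rotation angles regardless of how long the preceding shots were, while the global index still distinguishes the two frames.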
Shot Causal Mask
Applies a unidirectional causal constraint in self-attention so that shot x_i can attend to the history of x_{i−1}, but not vice versa. Prevents "semantic leakage" where future shot compositions contaminate earlier frames.
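A shot-level causal constraint of this kind is commonly expressed as a block-structured attention mask. The following is a minimal sketch under two assumptions not stated above: tokens within the same shot attend to each other in both directions, and a shot can see every earlier shot rather than only the immediately preceding one. The name `shot_causal_mask` is illustrative.

```python
def shot_causal_mask(shot_lengths):
    """Build a boolean self-attention mask over tokens grouped by shot.

    mask[q][k] is True when query token q may attend to key token k,
    that is, when k belongs to the same shot as q or to an earlier shot.
    """
    # Shot index of every token, e.g. [2, 2] -> [0, 0, 1, 1].
    shot_of = [i for i, n in enumerate(shot_lengths) for _ in range(n)]
    total = len(shot_of)
    return [[shot_of[k] <= shot_of[q] for k in range(total)]
            for q in range(total)]

# Two shots with two tokens each.
mask = shot_causal_mask([2, 2])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Tokens in the first shot are blocked from the second shot's columns, while tokens in the second shot see the full history.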
Hierarchical Context Mask
Partitions cross-attention into global, relational, and local windows. Global tokens act as universal anchors; per-shot cinematic language remains independently modulated, ensuring precise control over "what happens" vs. "how it is framed" for each shot.
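One way to read this partition is as a cross-attention mask between video tokens (queries) and prompt tokens (keys). The sketch below is an assumption-heavy illustration, not the paper's HCM: it supposes every prompt token is tagged with a scope, that "local" tokens belong to exactly one shot, and that "relational" tokens describe a relation between specific shots and are visible to all of them. The function name and the tagging scheme are hypothetical.

```python
def context_mask(video_shot_ids, text_scopes):
    """mask[q][k] is True when video token q may read text token k.

    video_shot_ids: shot index of each video token.
    text_scopes: one (kind, shots) pair per text token, where kind is
        "global", "relational", or "local", and shots is the set of shot
        indices the token describes (ignored for "global").
    """
    mask = []
    for shot in video_shot_ids:
        row = []
        for kind, shots in text_scopes:
            if kind == "global":
                row.append(True)           # universal anchor, seen by every shot
            else:
                row.append(shot in shots)  # seen only by the shots it names
        mask.append(row)
    return mask

# Three video tokens: two in shot 0, one in shot 1.
video = [0, 0, 1]
text = [
    ("global", set()),        # e.g. scene and style description
    ("local", {0}),           # framing for shot 0
    ("local", {1}),           # framing for shot 1
    ("relational", {0, 1}),   # assumed: transition between shots 0 and 1
]
hcm = context_mask(video, text)
```

Under these assumptions, the framing text for shot 1 never reaches the video tokens of shot 0, which is the isolation property the section describes.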
Generated Sequences
Native multi-shot generation across three input modalities.
Action: [Character 1] walks toward [Character 2] across the room. Shot 1: Wide. Shot 2: Mid-Shot, dolly-in.
Action: [Character 1] performs a dance move with arm and leg coordination. Shot 1: Wide. Shot 2: Low-Angle Mid-Shot.
Action: [Character 1] sits down and opens a book on the table. Shot 1: Mid-Shot. Shot 2: Cut-In on hands.
Action: [Character 1] crouches down to pick up an object from the floor. Shot 1: Mid-Shot. Shot 2: ECU on object.
Action: [Character 1] turns and looks out of the window thoughtfully. Shot 1: Mid-Shot. Shot 2: Over-the-Shoulder.
Data Collection & Processing
Sourced from films, TV series, and AI-generated content (Sora2, Seedance2.0). We pair TransNetV2 with our novel Similarity Clustering Boundary Optimizer (SCBO) via CLIP/DINO to precisely calibrate shot boundaries.
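To give a sense of what embedding-based boundary calibration can look like, the sketch below refines a coarse cut index by searching a small window for the frame pair with the lowest cosine similarity. This is a deliberately simplified stand-in, not SCBO itself, which is described as clustering-based. It assumes per-frame embeddings (for example from CLIP or DINO) have already been computed, and `refine_boundary` and its `window` parameter are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def refine_boundary(embeddings, coarse_cut, window=2):
    """Return the index c near coarse_cut where frames c-1 and c differ most.

    A cut at index c means frame c is the first frame of the next shot.
    Requires at least two frames and a coarse_cut inside the sequence.
    """
    lo = max(1, coarse_cut - window)
    hi = min(len(embeddings) - 1, coarse_cut + window)
    return min(range(lo, hi + 1),
               key=lambda c: cosine(embeddings[c - 1], embeddings[c]))

# Toy 2D embeddings: the content changes between frame 2 and frame 3,
# but the coarse detector reported the cut one frame early.
frames = [[1, 0], [1, 0], [1, 0.1], [0, 1], [0, 1]]
refined = refine_boundary(frames, coarse_cut=2)
```

Here the coarse index 2 is moved to 3, the first frame whose embedding departs sharply from its predecessor.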
- 0–4 Likert Scale: assessed via Gemini-3-Pro, distinguishing Strong Action-Cut (4) from Weak Action-Cut (1–3).
- Hierarchical Annotations: decoupled cinematographic shot-language across Global · Local · Action · Transition layers.
  - Global (Env · Scene · Style): Soft, warm sunlight filters through large windows of a quiet apartment.
  - Local (Static / Dynamic Cinematography): Shot 1: Medium Close-Up, focus on hand. Shot 2: Over-the-Shoulder, focus on face.
  - Action (Subject behaviour · Motion): [Character 1] is tying up her hair, then turns slightly toward [Character 2].
  - Transition (Edit-point relation): Hard Cut · Cut-Out · Shot Reverse Shot.
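The four annotation layers and the Likert rating can be pictured as a single record per shot pair. The sketch below is a hypothetical schema for illustration only; the field names, the class name, and the example score are assumptions, and the dataset's actual storage format is not described here. The page does not say what a score of 0 means, so the sketch returns `None` for it rather than guessing.

```python
from dataclasses import dataclass

@dataclass
class ShotPairAnnotation:
    global_context: str   # environment, scene, style
    local_shots: list     # one cinematography description per shot
    action: str           # subject behaviour and motion
    transition: str       # edit-point relation
    score: int            # 0-4 Likert rating

    def cut_label(self):
        """Map the Likert score to the Strong / Weak Action-Cut split."""
        if not 0 <= self.score <= 4:
            raise ValueError("score must be in 0..4")
        if self.score == 4:
            return "strong"
        if self.score >= 1:
            return "weak"
        return None  # the meaning of 0 is not defined in the source

example = ShotPairAnnotation(
    global_context="Soft, warm sunlight filters through large windows "
                   "of a quiet apartment.",
    local_shots=["Medium Close-Up, focus on hand.",
                 "Over-the-Shoulder, focus on face."],
    action="[Character 1] is tying up her hair, then turns slightly "
           "toward [Character 2].",
    transition="Hard Cut",
    score=4,  # illustrative value, not taken from the dataset
)
label = example.cut_label()
```

Keeping the layers as separate fields mirrors the decoupling described above: the global and action text can be shared across shots while each entry in `local_shots` stays tied to one shot.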
Performance Comparison
ActCutBench Quantitative Results
Dataset · ActCutVid-200K

| Method | FVD (↓) | CLIP Score (↑) | Action Continuity (↑) | Visual Coherence (↑) |
|---|---|---|---|---|
| Act2Cut (Ours) | 142.3 | 0.31 | 0.88 | 0.85 |
| HoloCine-14B | 156.8 | 0.29 | 0.76 | 0.84 |
| MSM-14B | 151.2 | 0.28 | 0.72 | 0.81 |
| MSM-1.3B | 168.1 | 0.27 | 0.65 | 0.75 |
Numbers are placeholder values for layout demonstration.
BibTeX
@article{zhuang2026act2cut,
title={Act2Cut: Continuous Next-Shot Video Narrative Match on Action-Cut},
author={Zhuang, Cailin and Hu, Yaoqi and Dong, Zheng and Zhang, Shiwen and Huang, Haibin and Zhang, Chi and Li, Xuelong},
journal={ACM Transactions on Graphics (TOG)},
volume={1},
number={1},
year={2026},
publisher={ACM New York, NY, USA}
}