Act2Cut: Continuous Next-Shot Video Narrative Match on Action-Cut

Cailin Zhuang^1,2,3*† Yaoqi Hu^2,4* Zheng Dong^2,4

Shiwen Zhang^1 Haibin Huang^1‡ Chi Zhang^1‡ Xuelong Li^1‡

¹Institute of Artificial Intelligence, China Telecom (TeleAI) ²AIGC Research ³ShanghaiTech University ⁴Chongqing University of Technology

* Equal contribution. † Work done during internship at TeleAI. ‡ Corresponding author.

Match on Action

Continuous Narrative Generation

Seamless shot transitions that preserve physical logic and action continuity across cinematic cuts.

ACT Ⅰ

Abstract

Achieving cinematic coherence in video generation requires more than identity or scene consistency: it demands Action Continuity. In filmmaking, the Match on Action (Action-Cut) technique, which cuts between shots in the middle of a movement so that the action carries across the edit, is essential for narrative flow. Current multi-shot models frequently suffer from action resets, momentary flicker, or broken physical logic at abrupt shot transitions.

We introduce Act2Cut, the first specialized framework for continuous, action-driven multi-shot video generation. Act2Cut uses Transitional Boundary Residual (TBR) and GLoS-RoPE to strengthen spatio-temporal coherence between adjacent shots, alongside Shot Causal Mask (SCM) and Hierarchical Context Mask (HCM) to achieve global temporal causality and local element isolation. It unifies Text-to-Video, Image-to-Video, and Video-to-Video generation in a single native pass.

ACT Ⅱ

Method & Results

Demo video, architecture overview, and qualitative results.

Demo Video (placeholder: replace with actual video)

Architecture Diagram

Fig. 3 (placeholder: replace with actual figure)

TBR

Transitional Boundary Residual

Enhances spatial and kinetic consistency at the edit point between adjacent shots. Ensures the terminal frame of shot n and the initial frame of shot n+1 maintain rigorous 3D spatial alignment and smooth momentum continuity, preventing visual "popping" at hard cuts.

GLoS-RoPE

Global-Local Shot RoPE

Decouples temporal position encoding across shot boundaries. Global tokens carry sequence-level context while local tokens remain precisely positioned within their own shot. This enables cross-shot coherence without positional encodings bleeding between unrelated frames.
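One way to picture this decoupling is to give every frame token two indices: a shot-level index and a within-shot index that restarts at each cut. The sketch below illustrates only that indexing idea. The function name `glos_positions` and the two-index layout are assumptions made for illustration, not the paper's actual formulation of GLoS-RoPE.

```python
def glos_positions(shot_lengths):
    """Return a (shot_index, local_index) pair for every frame token.

    Assumption (not from the paper): local positions restart at 0 at
    each shot boundary, while the shot index orders shots globally.
    """
    positions = []
    for shot_idx, length in enumerate(shot_lengths):
        for local_idx in range(length):
            positions.append((shot_idx, local_idx))
    return positions


# Two shots with 3 and 2 frames respectively.
positions = glos_positions([3, 2])
print(positions)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]
```

Because the local index restarts at every cut, two frames at the same offset in different shots share a local position and differ only in their shot index.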

SCM

Shot Causal Mask

Applies a unidirectional causal constraint in self-attention so that shot x_i can attend to the history of x_{i-1}, but not vice versa. Prevents "semantic leakage" where future shot compositions contaminate earlier frames.
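A shot-level causal constraint of this kind can be sketched as a block-structured attention mask. The example below assumes that tokens attend freely within their own shot and to all earlier shots; the source does not specify the within-shot behavior, so that choice and the name `shot_causal_mask` are illustrative assumptions.

```python
def shot_causal_mask(shot_lengths):
    """Build a boolean mask: mask[q][k] is True if query token q may
    attend to key token k.

    Assumption: attention is bidirectional inside a shot and causal
    across shots (a token never sees tokens from later shots).
    """
    # Map each token position to the index of the shot it belongs to.
    shot_of_token = [s for s, n in enumerate(shot_lengths) for _ in range(n)]
    return [
        [key_shot <= query_shot for key_shot in shot_of_token]
        for query_shot in shot_of_token
    ]


# Two shots of two tokens each: rows are queries, columns are keys.
for row in shot_causal_mask([2, 2]):
    print("".join("1" if allowed else "0" for allowed in row))
# 1100
# 1100
# 1111
# 1111
```

The zero block in the upper right is what keeps later shots from influencing earlier ones.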

HCM

Hierarchical Context Mask

Partitions cross-attention into global, relational, and local windows. Global tokens act as universal anchors; per-shot cinematic language remains independently modulated, ensuring precise control over "what happens" vs. "how it is framed" for each shot.
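The global and local windows can be illustrated with a cross-attention mask in which every video token sees the global text tokens but only the local text tokens of its own shot. This is a simplified sketch: the relational window is omitted because the source does not describe it, and the token layout and the name `hierarchical_context_mask` are assumptions.

```python
def hierarchical_context_mask(shot_lengths, num_global_text, local_text_lengths):
    """Build mask[v][t]: True if video token v may cross-attend to text token t.

    Assumed text layout: [global tokens][shot 0 local][shot 1 local]...
    """
    if len(shot_lengths) != len(local_text_lengths):
        raise ValueError("need one local text segment per shot")

    # Owner of each text token: None marks a global token,
    # otherwise the index of the shot the token describes.
    text_owner = [None] * num_global_text
    for shot_idx, n in enumerate(local_text_lengths):
        text_owner.extend([shot_idx] * n)

    video_shot = [s for s, n in enumerate(shot_lengths) for _ in range(n)]
    return [
        [owner is None or owner == shot for owner in text_owner]
        for shot in video_shot
    ]


# One video token per shot, one global text token, one local token per shot.
mask = hierarchical_context_mask([1, 1], 1, [1, 1])
print(mask)  # [[True, True, False], [True, False, True]]
```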

Qualitative Results

Generated Sequences

Native multi-shot generation across three input modalities.

T01

Generated

Multi-Shot Sequence

Prompt · Tying hair

Action: [Character 1] is tying up her hair. Shot 1: Medium Close-Up, focus on hand. Shot 2: Over-the-Shoulder, focus on face.

T02

Generated

Multi-Shot Sequence

Prompt · Walk toward

Action: [Character 1] walks toward [Character 2] across the room. Shot 1: Wide. Shot 2: Mid-Shot, dolly-in.

T03

Generated

Multi-Shot Sequence

Prompt · Dance move

Action: [Character 1] performs a dance move with arm and leg coordination. Shot 1: Wide. Shot 2: Low-Angle Mid-Shot.

T04

Generated

Multi-Shot Sequence

Prompt · Sits, opens

Action: [Character 1] sits down and opens a book on the table. Shot 1: Mid-Shot. Shot 2: Cut-In on hands.

T05

Generated

Multi-Shot Sequence

Prompt · Crouches

Action: [Character 1] crouches down to pick up an object from the floor. Shot 1: Mid-Shot. Shot 2: ECU on object.

T06

Generated

Multi-Shot Sequence

Prompt · To window

Action: [Character 1] turns and looks out of the window thoughtfully. Shot 1: Mid-Shot. Shot 2: Over-the-Shoulder.

Dataset · ActCutVid-200K

Data Collection & Processing

Clips are sourced from films, TV series, and AI-generated content (Sora2, Seedance2.0). We pair TransNetV2 with our novel CLIP/DINO-based Similarity Clustering Boundary Optimizer (SCBO) to precisely calibrate shot boundaries.
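The idea of calibrating a detected boundary with frame embeddings can be sketched as a local search: starting from a candidate cut, pick the nearby frame pair whose embeddings are least similar. This is a deliberately simple stand-in and not the authors' SCBO algorithm. The function names, the `window` parameter, and the toy 2-D embeddings are all hypothetical; in practice the embeddings would come from a model such as CLIP or DINO.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def refine_boundary(embeddings, candidate, window=2):
    """Return the index i near `candidate` where frames i-1 and i are
    least similar, i.e. the most likely position of the cut.

    Assumes at least two frames; `window` is a hypothetical search radius.
    """
    lo = max(1, candidate - window)
    hi = min(len(embeddings) - 1, candidate + window)
    return min(range(lo, hi + 1),
               key=lambda i: cosine(embeddings[i - 1], embeddings[i]))


# Toy embeddings: the content changes between frame 2 and frame 3.
frames = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
print(refine_boundary(frames, candidate=2))  # 3
```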

0–4 Likert Scale: assessed via Gemini-3-Pro, distinguishing Strong Action-Cut (4) from Weak Action-Cut (1–3).
Hierarchical Annotations: decoupled cinematographic shot-language across Global · Local · Action · Transition layers.
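The score-to-label mapping described above can be written as a small helper. The function name `action_cut_label` is hypothetical, and because the source does not say what a score of 0 denotes, the sketch returns `None` for it rather than guessing.

```python
def action_cut_label(score):
    """Map a 0-4 Likert score to the labels named in the text.

    4 -> "strong", 1-3 -> "weak". The source does not name score 0,
    so it is left unlabeled (None) here as an explicit assumption.
    """
    if not isinstance(score, int) or not 0 <= score <= 4:
        raise ValueError("score must be an integer between 0 and 4")
    if score == 4:
        return "strong"
    if score >= 1:
        return "weak"
    return None


print([action_cut_label(s) for s in range(5)])
# [None, 'weak', 'weak', 'weak', 'strong']
```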

Hierarchical Prompting S1 → S2

S1 · MCU

Cut

S2 · OTS

Global (Env · Scene · Style): Soft, warm sunlight filters through large windows of a quiet apartment.

Local (Static / Dynamic Cinematography): Shot 1: Medium Close-Up, focus on hand. Shot 2: Over-the-Shoulder, focus on face.

Action (Subject behaviour · Motion): [Character 1] is tying up her hair, then turns slightly toward [Character 2].

Transition (Edit-point relation): Hard Cut · Cut-Out · Shot Reverse Shot.
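The four annotation layers in this example map naturally onto a small record type. The sketch below shows one possible representation; the class name `HierarchicalPrompt`, its field names, and the rendered text format are assumptions for illustration and do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class HierarchicalPrompt:
    """Hypothetical container for the Global / Local / Action / Transition layers."""
    global_context: str
    local_shots: list  # one cinematography description per shot
    action: str
    transition: str

    def render(self):
        """Flatten the layers into a single multi-line prompt string."""
        lines = [f"Global: {self.global_context}"]
        lines += [f"Shot {i}: {shot}"
                  for i, shot in enumerate(self.local_shots, start=1)]
        lines.append(f"Action: {self.action}")
        lines.append(f"Transition: {self.transition}")
        return "\n".join(lines)


prompt = HierarchicalPrompt(
    global_context="Soft, warm sunlight filters through large windows of a quiet apartment.",
    local_shots=["Medium Close-Up, focus on hand.", "Over-the-Shoulder, focus on face."],
    action="[Character 1] is tying up her hair, then turns slightly toward [Character 2].",
    transition="Hard Cut · Cut-Out · Shot Reverse Shot.",
)
print(prompt.render())
```

Keeping the layers as separate fields means the per-shot framing can be changed without touching the action description, which mirrors the decoupling described in the annotations.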

Quantitative Snapshot

Performance Comparison

ActCutBench Quantitative Results

Dataset · ActCutVid-200K

Method	FVD (↓)	CLIP Score (↑)	Action Continuity (↑)	Visual Coherence (↑)
Act2Cut (Ours)	142.3	0.31	0.88	0.85
HoloCine-14B	156.8	0.29	0.76	0.84
MSM-14B	151.2	0.28	0.72	0.81
MSM-1.3B	168.1	0.27	0.65	0.75

Numbers are placeholder values for layout demonstration.

ACT Ⅵ

BibTeX

@article{zhuang2026act2cut,
  title={Act2Cut: Continuous Next-Shot Video Narrative Match on Action-Cut},
  author={Zhuang, Cailin and Hu, Yaoqi and Dong, Zheng and Zhang, Shiwen and Huang, Haibin and Zhang, Chi and Li, Xuelong},
  journal={ACM Transactions on Graphics (TOG)},
  volume={1},
  number={1},
  year={2026},
  publisher={ACM New York, NY, USA}
}