ACT Ⅴ

Qualitative Comparison

Text → Video, side-by-side against open-source baselines.

Model | T01 Tying hair | T02 Walk toward | T03 Dance move
HoloCine-14B
MSM-14B
★ Act2Cut

Action: [Character 1] is tying up her hair. Shot 1: MCU on hand. Shot 2: OTS on face.

Action: [Character 1] walks toward [Character 2]. Shot 1: Wide. Shot 2: Mid-Shot, dolly-in.

Action: [Character 1] performs a dance move. Shot 1: Wide. Shot 2: Low-Angle Mid-Shot.
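The three prompts above share one textual pattern: an `Action:` clause followed by numbered `Shot N:` cinematography specs. As a minimal sketch, assuming every prompt follows exactly this pattern, the Python snippet below splits a prompt into its action and per-shot specs. The names `ShotPrompt` and `parse_prompt` are hypothetical and introduced only for illustration; they are not part of any Act2Cut release.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ShotPrompt:
    """Hypothetical container for one multi-shot prompt (illustration only)."""
    action: str
    shots: list[str] = field(default_factory=list)


# Assumption: each shot spec is introduced by a "Shot <number>:" marker.
_SHOT_MARKER = re.compile(r"Shot\s+\d+:")


def parse_prompt(text: str) -> ShotPrompt:
    """Split 'Action: ... Shot 1: ... Shot 2: ...' into action and shot specs."""
    head, *shot_specs = _SHOT_MARKER.split(text.strip())
    head = head.strip()
    if not head.startswith("Action:"):
        raise ValueError("prompt must start with 'Action:'")
    # Drop the 'Action:' label and the trailing sentence period.
    action = head[len("Action:"):].strip().rstrip(".")
    shots = [spec.strip().rstrip(".") for spec in shot_specs]
    return ShotPrompt(action=action, shots=shots)


if __name__ == "__main__":
    prompt = (
        "Action: [Character 1] walks toward [Character 2]. "
        "Shot 1: Wide. Shot 2: Mid-Shot, dolly-in."
    )
    parsed = parse_prompt(prompt)
    print(parsed.action)
    for index, shot in enumerate(parsed.shots, start=1):
        print(index, shot)
```

Keeping the shot specs as an ordered list preserves the cut order, which is what distinguishes Shot 1 from Shot 2 in each comparison column.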

ACT Ⅳ

Benchmark Results

Evaluation on ActCutBench, VistoryBench, and VBench. ★ = Act2Cut (Ours).

ActCutBench — Global Content Sub-Bench

Evaluates narrative consistency (NC), visual coherence (VC), action continuity (AC), subject correctness (SC), and aesthetics (Aes) across generated multi-shot sequences. For all metrics, higher is better (↑).

Method | Base Model | NC ↑ (Env, Sce, Sty, CS, CA/OM) | VC ↑ (Env/Sce, Sty, CS, CA/OM) | AC ↑ (CA, OM) | SC ↑ (CC, CID) | Aes ↑
Open-Source Methods — ActCutBench
HoloCine-14B Wan2.2-T2V-14B
MSM-1.3B Wan2.1-TI2V-1.3B
MSM-14B Wan2.1-T2V-14B
★ Act2Cut (Ours) Wan2.2-TI2V-5B
ActCutBench-Lite — Ablation
★ Act2Cut (full) Wan2.2-TI2V-5B
w/o GLoS-RoPE Wan2.2-TI2V-5B
w/o TBR Wan2.2-TI2V-5B
w/o SCM Wan2.2-TI2V-5B
w/o HCM Wan2.2-TI2V-5B
ActCutBench — Cinematography Sub-Bench

Evaluates adherence to static cinematography (SC), dynamic cinematography (DC), and relation cinematography (RC) specifications per generated shot.

Method | Base Model | Static Cinematography (SC) ↑ | Dynamic (DC) ↑ | Relation (RC) ↑
Sub-metrics, in column order: SS, LT, FF, FV, AE, AA, AD, DoF, CM, CZ, CR, ST, Cont., Narr.
Open-Source Methods — ActCutBench
HoloCine-14B Wan2.2-T2V-14B
MSM-1.3B Wan2.1-TI2V-1.3B
MSM-14B Wan2.1-T2V-14B
★ Act2Cut (Ours) Wan2.2-TI2V-5B
VistoryBench — Story Visualization

Comprehensive benchmark for story visualization across character consistency, scene coherence, and narrative fidelity metrics.

Method | Base Model | Char. Consist. ↑ | Scene Coher. ↑ | Style Consist. ↑ | Narrative ↑ | Overall ↑
HoloCine-14B Wan2.2-T2V-14B
MSM-14B Wan2.1-T2V-14B
★ Act2Cut (Ours) Wan2.2-TI2V-5B
VBench — Video Quality

Holistic evaluation covering video quality, semantic alignment, temporal consistency, and perceptual fidelity for generated video sequences.

Method | Base Model | Quality ↑ | Semantic ↑ | Temporal ↑ | Aesthetic ↑ | Total ↑
HoloCine-14B Wan2.2-T2V-14B
MSM-14B Wan2.1-T2V-14B
★ Act2Cut (Ours) Wan2.2-TI2V-5B
[Figure placeholder: metric visualization chart (radar / bar chart), to be added when data is available.]
ACT Ⅵ

BibTeX

@article{zhuang2026act2cut,
  title={Act2Cut: Continuous Next-Shot Video Narrative Match on Action-Cut},
  author={Zhuang, Cailin and Hu, Yaoqi and Dong, Zheng and Zhang, Shiwen and Huang, Haibin and Zhang, Chi and Li, Xuelong},
  journal={ACM Transactions on Graphics (TOG)},
  volume={1},
  number={1},
  year={2026},
  publisher={ACM New York, NY, USA}
}