Qualitative Comparison
Text → Video, side-by-side against open-source baselines.
Action: [Character 1] is tying up her hair. Shot 1: MCU on hand. Shot 2: OTS on face.
Action: [Character 1] walks toward [Character 2]. Shot 1: Wide. Shot 2: Mid-Shot, dolly-in.
Action: [Character 1] performs a dance move. Shot 1: Wide. Shot 2: Low-Angle Mid-Shot.
Benchmark Results
ActCutBench, ViStoryBench, and VBench evaluation. ★ = Act2Cut (Ours).
Evaluates narrative consistency (NC), visual coherence (VC), action continuity (AC), subject correctness (SC), and aesthetics (Aes) across generated multi-shot sequences. For all metrics, higher is better (↑).
| Method | Base Model | NC: Env ↑ | NC: Sce ↑ | NC: Sty ↑ | NC: CS ↑ | NC: CA/OM ↑ | VC: Env/Sce ↑ | VC: Sty ↑ | VC: CS ↑ | VC: CA/OM ↑ | AC: CA ↑ | AC: OM ↑ | SC: CC ↑ | SC: CID ↑ | Aes ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-Source Methods — ActCutBench | |||||||||||||||
| HoloCine-14B | Wan2.2-T2V-14B | — | — | — | — | — | — | — | — | — | — | — | — | — | |
| MSM-1.3B | Wan2.1-TI2V-1.3B | — | — | — | — | — | — | — | — | — | — | — | — | — | |
| MSM-14B | Wan2.1-T2V-14B | — | — | — | — | — | — | — | — | — | — | — | — | — | |
| ★ Act2Cut (Ours) | Wan2.2-TI2V-5B | — | — | — | — | — | — | — | — | — | — | — | — | — | |
| ActCutBench-Lite — Ablation | |||||||||||||||
| ★ Act2Cut (full) | Wan2.2-TI2V-5B | — | — | — | — | — | — | — | — | — | — | — | — | — | |
| w/o GLoS-RoPE | Wan2.2-TI2V-5B | — | — | — | — | — | — | — | — | — | — | — | — | — | |
| w/o TBR | Wan2.2-TI2V-5B | — | — | — | — | — | — | — | — | — | — | — | — | — | |
| w/o SCM | Wan2.2-TI2V-5B | — | — | — | — | — | — | — | — | — | — | — | — | — | |
| w/o HCM | Wan2.2-TI2V-5B | — | — | — | — | — | — | — | — | — | — | — | — | — | |
Evaluates adherence to static cinematography (SC), dynamic cinematography (DC), and relation cinematography (RC) specifications per generated shot.
| Method | Base Model | SC: SS ↑ | SC: LT ↑ | SC: FF ↑ | SC: FV ↑ | SC: AE ↑ | SC: AA ↑ | SC: AD ↑ | SC: DoF ↑ | DC: CM ↑ | DC: CZ ↑ | DC: CR ↑ | DC: ST ↑ | RC: Cont. ↑ | RC: Narr. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-Source Methods — ActCutBench | |||||||||||||||
| HoloCine-14B | Wan2.2-T2V-14B | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| MSM-1.3B | Wan2.1-TI2V-1.3B | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| MSM-14B | Wan2.1-T2V-14B | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| ★ Act2Cut (Ours) | Wan2.2-TI2V-5B | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
Comprehensive benchmark for story visualization across character consistency, scene coherence, and narrative fidelity metrics.
| Method | Base Model | Char. Consist. ↑ | Scene Coher. ↑ | Style Consist. ↑ | Narrative ↑ | Overall ↑ |
|---|---|---|---|---|---|---|
| HoloCine-14B | Wan2.2-T2V-14B | — | — | — | — | — |
| MSM-14B | Wan2.1-T2V-14B | — | — | — | — | — |
| ★ Act2Cut (Ours) | Wan2.2-TI2V-5B | — | — | — | — | — |
Holistic evaluation covering video quality, semantic alignment, temporal consistency, and perceptual fidelity for generated video sequences.
| Method | Base Model | Quality ↑ | Semantic ↑ | Temporal ↑ | Aesthetic ↑ | Total ↑ |
|---|---|---|---|---|---|---|
| HoloCine-14B | Wan2.2-T2V-14B | — | — | — | — | — |
| MSM-14B | Wan2.1-T2V-14B | — | — | — | — | — |
| ★ Act2Cut (Ours) | Wan2.2-TI2V-5B | — | — | — | — | — |
BibTeX
@article{zhuang2026act2cut,
title={Act2Cut: Continuous Next-Shot Video Narrative Match on Action-Cut},
author={Zhuang, Cailin and Hu, Yaoqi and Dong, Zheng and Zhang, Shiwen and Huang, Haibin and Zhang, Chi and Li, Xuelong},
journal={ACM Transactions on Graphics (TOG)},
volume={1},
number={1},
year={2026},
publisher={ACM New York, NY, USA}
}