Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

A training-efficient and generalizable video reward model that separates CoT reasoning from final reward prediction.

Yuan Wang¹,²*§ · Ouxiang Li¹§ · Yulong Xu²‡ · Borui Liao² · Jiajun Liang²† · Jinghan Li¹ · Meng Wang² · Xintao Wang² · Pengfei Wan² · Kuien Liu³ · Xiang Wang¹†

¹University of Science and Technology of China · ²Kling Team, Kuaishou Technology · ³Institute of Software, Chinese Academy of Sciences
§ Equal Contribution † Corresponding Authors ‡ Project Lead


Abstract

The Dilemma of Video Reward Modeling

How can we harness the interpretability and generalization introduced by CoT reasoning during reward modeling while shielding the training process from the optimization instability of a coupled sampling chain?

Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma:

Discriminative RMs regress rewards directly from MLLM features without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling.

Generative RMs with CoT reasoning exhibit superior interpretability and generalization, but suffer from optimization bottlenecks due to the coupling of reasoning and scoring within a single autoregressive chain.

To harness the generalization benefits of CoT reasoning while mitigating training instability, we introduce DeScore, a decoupled "think-then-score" paradigm. An MLLM first generates an explicit CoT; a dedicated discriminative scoring module (a learnable query token plus a regression head) then predicts the final scalar reward. The model is optimized with a two-stage framework.

Overview · Figure 1

Three Video Reward Modeling Paradigms

Existing models follow two paradigms, each with critical drawbacks. DeScore decouples CoT reasoning from scoring to get the best of both worlds.

Figure 1: Overview and Motivation of DeScore. (a) Video Reward Modeling Paradigms. Existing video reward models generally follow two paradigms: Discriminative RMs directly regress rewards without explicit thinking (e.g., CoT), and Generative RMs couple thinking and scoring within a single autoregressive sampling chain. DeScore improves both paradigms based on two observations: First, (b) Preference Accuracy shows that incorporating CoT enables Generative RMs to outperform Discriminative RMs, highlighting the necessity of explicit thinking for generalization. Second, (c) Training Stability reveals that coupling thinking and scoring requires the final score to be optimized through GRPO loss, leading to pronounced training fluctuations. In contrast, discriminative training with BT loss exhibits smooth convergence. Motivated by these findings, DeScore introduces a decoupled "think-then-score" paradigm that effectively leverages the generalization benefits of CoT reasoning while preserving the training stability inherent to discriminative optimization.

Method · Figure 2

DeScore Training Framework

An MLLM generates chain-of-thought reasoning, then a learnable query token's hidden state is projected to a scalar reward via a regression head — trained across two stages.

Figure 2: Our DeScore framework. (a) During inference, DeScore first uses an MLLM to generate CoT from the multi-modal input, then appends a learnable query token whose last hidden state is projected by a regression head into the final video reward. Training follows a two-stage paradigm: (b) In the discriminative cold-start stage, DeScore is trained with BT loss on pre-collected CoT data, where random CoT masking encourages the scoring module to use both multi-modal inputs and reasoning tokens. (c) In the dual-objective RL stage, the GRPO loss optimizes CoT rollouts guided by rule-based rollout rewards, while the BT loss simultaneously calibrates the video reward, decoupling reasoning refinement from reward scoring.
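The inference path in panel (a) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: `generate_cot` and `encode_last_hidden` are hypothetical stand-ins for the MLLM's generation and forward pass, `QUERY_TOKEN` is an invented name, inputs are simplified to strings, and the regression head is assumed to be a single linear layer.

```python
from typing import Callable, List, Sequence, Tuple

QUERY_TOKEN = "<score_query>"  # hypothetical name for the learnable query token


def linear_head(hidden: Sequence[float], weight: Sequence[float], bias: float) -> float:
    """Project a hidden-state vector to a scalar (assumed: one linear layer)."""
    if len(hidden) != len(weight):
        raise ValueError("hidden state and weight must have the same length")
    return sum(h * w for h, w in zip(hidden, weight)) + bias


def think_then_score(
    multimodal_input: str,
    generate_cot: Callable[[str], str],
    encode_last_hidden: Callable[[str], List[float]],
    weight: Sequence[float],
    bias: float,
) -> Tuple[str, float]:
    # Step 1: the MLLM writes its chain-of-thought.
    cot = generate_cot(multimodal_input)
    # Step 2: append the query token and read its last hidden state.
    hidden = encode_last_hidden(multimodal_input + cot + QUERY_TOKEN)
    # Step 3: the regression head maps that state to a scalar reward.
    return cot, linear_head(hidden, weight, bias)


# Toy stand-ins so the sketch runs end to end.
def fake_generate(_: str) -> str:
    return "<think>motion is smooth, subject stays consistent</think>"


def fake_encode(sequence: str) -> List[float]:
    assert sequence.endswith(QUERY_TOKEN)
    return [2.0, 1.0]


cot, reward = think_then_score(
    "[video][prompt]", fake_generate, fake_encode, weight=[0.5, -1.0], bias=0.25
)
```

The point of the sketch is that the reward is read from a hidden state rather than sampled as text, so it is a continuous value and does not depend on decoding a score token.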

Stage 1

Discriminative Cold Start

Fine-tunes the backbone and scoring module on pre-collected CoT data using Bradley-Terry (BT) loss. A random CoT masking strategy drops the CoT from some training samples, forcing the scoring module to jointly leverage raw multimodal inputs and reasoning tokens. This prevents over-reliance on either source and improves robustness.
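As a rough illustration of the two ingredients of this stage, the sketch below implements the standard pairwise BT loss, -log σ(r_preferred - r_rejected), and a simple per-sample CoT drop. The masking probability `mask_prob` and the string-based input construction are assumptions made for the example; the source does not state the masking rate or how masked samples are formatted.

```python
import math
import random
from typing import Optional


def bt_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_w - r_l))."""
    margin = reward_preferred - reward_rejected
    # Equivalent to softplus(-margin), written to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def maybe_mask_cot(cot: str, mask_prob: float, rng: random.Random) -> Optional[str]:
    """Drop the CoT with probability mask_prob (hypothetical hyperparameter)."""
    return None if rng.random() < mask_prob else cot


def build_scoring_input(multimodal_input: str, cot: Optional[str]) -> str:
    """Concatenate input and CoT; a masked sample keeps only the raw input."""
    return multimodal_input if cot is None else multimodal_input + cot


rng = random.Random(0)
cot = maybe_mask_cot("<think>camera shake in the second half</think>", 0.5, rng)
scoring_input = build_scoring_input("[video][prompt]", cot)
loss = bt_loss(reward_preferred=1.2, reward_rejected=0.4)
```

The loss shrinks as the preferred video's reward pulls ahead of the rejected one, and equals log 2 when the two rewards are equal.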

Stage 2

Dual-Objective Reinforcement Learning

Two independent objectives run simultaneously. GRPO loss refines CoT quality guided by rule-based rollout rewards. Auxiliary BT loss continuously calibrates the scalar reward head, ensuring CoT improvements translate directly to reward accuracy and preventing "reward drift."
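A minimal sketch of how the two objectives might fit together, assuming the usual GRPO recipe of standardizing rollout rewards within a group sampled for the same input. The population standard deviation, the `eps` constant, and the weighted-sum combination with `bt_weight` are assumptions for illustration; the source only states that both losses are applied simultaneously.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rollout_rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of CoT rollouts for the same input."""
    mu = mean(rollout_rewards)
    sigma = pstdev(rollout_rewards)
    # Rollouts better than the group average get a positive advantage.
    return [(r - mu) / (sigma + eps) for r in rollout_rewards]


def dual_objective(grpo_loss: float, bt_loss: float, bt_weight: float = 1.0) -> float:
    """Combine both terms (assumed form: weighted sum with a hypothetical bt_weight)."""
    return grpo_loss + bt_weight * bt_loss


# Four CoT rollouts for one video pair, scored by the rule-based rollout reward.
advantages = group_relative_advantages([2.1, 1.4, 2.6, 0.9])
total = dual_objective(grpo_loss=0.35, bt_loss=0.52, bt_weight=1.0)
```

In this reading, the advantages drive updates to the reasoning tokens only, while the BT term keeps supervising the scalar reward, which is what keeps the score itself out of the sampled chain.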

Composite Rollout Reward R(o): three rule-based signals

📐

R_fmt — Format

Binary 0/1: output must strictly follow <think></think> with JSON sub-dimension scores.

🎯

R_qual — Quality

N_correct / N_total: accuracy of sub-dimension scores (subject, dynamics, camera, environment, style) vs. ground truth.

📏

R_len — Length

0 if <500 tokens; scales linearly 500–2000; capped at 1.0 for ≥2000 tokens. Encourages detailed reasoning.
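The three signals above can be sketched as plain Python functions. Several details are assumptions, since the source does not specify them: the exact output layout (here, a `<think>` block followed by a JSON object keyed by the five sub-dimensions), the reading of the length rule as a linear ramp from 0 at 500 tokens to 1 at 2000 tokens, and the unweighted sum used to combine the signals.

```python
import json
import re
from typing import Dict, Optional

SUB_DIMENSIONS = ("subject", "dynamics", "camera", "environment", "style")
# Assumed layout: <think>reasoning</think> followed by one JSON object.
OUTPUT_PATTERN = re.compile(r"^\s*<think>(.+?)</think>\s*(\{.*\})\s*$", re.DOTALL)


def parse_scores(output: str) -> Optional[Dict[str, int]]:
    """Return the sub-dimension scores, or None if the format is violated."""
    match = OUTPUT_PATTERN.match(output)
    if match is None:
        return None
    try:
        scores = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    if not isinstance(scores, dict) or set(scores) != set(SUB_DIMENSIONS):
        return None
    return scores


def format_reward(output: str) -> float:
    """R_fmt: binary 0/1 format check."""
    return 1.0 if parse_scores(output) is not None else 0.0


def quality_reward(predicted: Dict[str, int], ground_truth: Dict[str, int]) -> float:
    """R_qual: N_correct / N_total over the sub-dimensions."""
    correct = sum(predicted.get(d) == ground_truth[d] for d in SUB_DIMENSIONS)
    return correct / len(SUB_DIMENSIONS)


def length_reward(num_tokens: int, low: int = 500, high: int = 2000) -> float:
    """R_len: 0 below `low`, 1 at or above `high`, linear ramp in between (assumed)."""
    if num_tokens < low:
        return 0.0
    if num_tokens >= high:
        return 1.0
    return (num_tokens - low) / (high - low)


def composite_reward(output: str, ground_truth: Dict[str, int], num_tokens: int) -> float:
    """R(o) as an unweighted sum; the combination rule is an assumption."""
    scores = parse_scores(output)
    if scores is None:
        return length_reward(num_tokens)  # format and quality both score 0
    return 1.0 + quality_reward(scores, ground_truth) + length_reward(num_tokens)
```

A rollout that matches four of the five ground-truth sub-dimension scores with 1,250 tokens of reasoning would receive 1.0 + 0.8 + 0.5 under these assumptions.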

Experiments

State-of-the-Art on Most Benchmarks

DeScore is evaluated on one in-domain dataset (1,469 pairs) and two OOD benchmarks: GenAI-Bench (1.9k pairs from early T2V models) and VideoGen-Bench (26.5k pairs from current SOTA models). Metrics are preference accuracy with and without ties.
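To make the two metrics concrete, here is one way preference accuracy with and without ties could be computed from scalar rewards. This is a sketch under assumptions: the fixed `tie_threshold` is hypothetical, and benchmark protocols may choose or calibrate the tie threshold differently.

```python
from typing import List, Tuple

# (reward for video A, reward for video B, human label in {"A", "B", "tie"})
Pair = Tuple[float, float, str]


def predict(score_a: float, score_b: float, tie_threshold: float) -> str:
    """Predict a tie when the reward gap is within the threshold."""
    diff = score_a - score_b
    if abs(diff) <= tie_threshold:
        return "tie"
    return "A" if diff > 0 else "B"


def accuracy_with_ties(pairs: List[Pair], tie_threshold: float) -> float:
    """Three-way accuracy over all pairs, including human-labeled ties."""
    if not pairs:
        raise ValueError("no pairs to evaluate")
    correct = sum(predict(a, b, tie_threshold) == label for a, b, label in pairs)
    return correct / len(pairs)


def accuracy_without_ties(pairs: List[Pair]) -> float:
    """Two-way accuracy after dropping pairs that humans labeled as ties."""
    decided = [p for p in pairs if p[2] != "tie"]
    if not decided:
        raise ValueError("no non-tie pairs to evaluate")
    # With a zero threshold, exactly equal rewards count as incorrect.
    correct = sum(predict(a, b, 0.0) == label for a, b, label in decided)
    return correct / len(decided)


pairs = [(0.9, 0.1, "A"), (0.2, 0.8, "B"), (0.50, 0.52, "tie"), (0.3, 0.7, "A")]
acc_with = accuracy_with_ties(pairs, tie_threshold=0.05)
acc_without = accuracy_without_ties(pairs)
```

Accuracy with ties is the harder metric because the model must also recognize near-equal pairs, which is consistent with the lower numbers in the "w/ Tie" columns below.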

Model	In-domain Acc w/ Tie	GenAI Acc w/ Tie	GenAI Acc w/o Tie	VideoGen Acc w/ Tie	VideoGen Acc w/o Tie
Discriminative Video Reward Models
VideoScore	0.552	0.490	0.720	0.372	0.503
VideoAlign	0.642	0.494	0.728	0.538	0.722
Generative Video Reward Models
VisionReward	0.571	0.525	0.724	0.465	0.611
UnifiedReward	0.492	0.458	0.686	0.303	0.564
UnifiedReward-Thinking	0.578	0.548	0.709	0.428	0.582
VideoScore2	0.617	0.391	0.616	0.301	0.497
Our Method
DeScore (Ours)	0.734	0.504	0.765	0.568	0.768

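DeScore achieves the best result in four of the five columns. The exception is GenAI-Bench accuracy with ties, where UnifiedReward-Thinking (0.548) and VisionReward (0.525) score higher than DeScore (0.504). DeScore reaches these results while using 76% less training data than comparable models.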

Figure: Performance vs. Training Data Size. DeScore (red star) consistently outperforms existing models by a large margin while requiring only a fraction of the training data, highlighting its extreme training efficiency and robust semantic understanding.

Figure: Qualitative Comparison of Different Video Reward Models. We compare VideoAlign, UnifiedReward-Thinking, and our DeScore, marking high- and low-quality reasoning within their responses. DeScore consistently yields accurate rewards and robust reasoning across varied prompts, demonstrating its superior interpretability and generalization. Decoupling reasoning from scoring yields two clear advantages: (1) error tolerance, where DeScore remains accurate even with imperfect CoT (e.g., top-left case), and (2) fine-grained discrimination, where it can still produce differentiated scores when generative models output identical reward tokens (e.g., bottom-left case). These results show that DeScore effectively combines reasoning interpretability with robust reward prediction.

Improving Video Generation Quality (VBench)

DeScore is integrated into Longcat-GRPO and Flow-DPO post-training frameworks on Wan-2.1-1.3B, consistently improving all VBench quality dimensions.

Model	Subject Consistency ↑	Background Consistency ↑	Aesthetic Quality ↑	Image Quality ↑	Dynamic Degree ↑
Wan-2.1-1.3B (baseline)	0.951	0.961	0.547	0.669	0.527
+ Longcat-GRPO w/ DeScore	0.969	0.973	0.645	0.706	0.541
+ Flow-DPO w/ DeScore	0.969	0.972	0.615	0.700	0.542

Citation

BibTeX

If you find DeScore useful in your research, please consider citing:

@misc{wang2026thinkscoredecoupledreasoning,
  title={Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling},
  author={Yuan Wang and Ouxiang Li and Yulong Xu and Borui Liao and Jiajun Liang
          and Jinghan Li and Meng Wang and Xintao Wang and Pengfei Wan
          and Kuien Liu and Xiang Wang},
  year={2026},
  eprint={2605.05922},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.05922}
}