LongVQUBench: Benchmarking Long-Term Video Quality Understanding of Vision-Language Models

Nema, Arpita; Zhu, Hanwei; Zhang, Xi; Lin, Weisi

ECCV 2026

Nanyang Technological University, Singapore

1,200 Videos

1,500 QA Pairs

~12 min Avg. Duration

14 LVLMs Evaluated

LongVQUBench features perceptual quality reasoning questions that require integrating visual evidence across extended video durations, moving beyond what a single frame can reveal.

GQU · Global Quality Understanding

Which distortion type has the strongest influence on the viewer's global perception of video quality?

A. Spatial sharpen
B. Temporal jitter
C. Frame rate inconsistency
D. Color banding

Abstract

The evaluation of long-term video quality understanding remains an open challenge for large vision–language models (LVLMs). Existing video quality benchmarks predominantly focus on short clips and isolated distortions, overlooking the temporal continuity, cumulative degradation, and reasoning complexity inherent in long-duration content. To address these limitations, we present LongVQUBench, a comprehensive benchmark for long-term video quality understanding. LongVQUBench contains over 1,200 diverse videos spanning movies, documentaries, surveillance footage, egocentric recordings, and animated content, accompanied by 1,500 multiple-choice and open-ended questions for validation and testing.

To assess perceptual reasoning across different temporal scopes, we introduce three progressively complex evaluation levels: (i) local event quality understanding (LQU) for analyzing localized distortions; (ii) cross-event quality reasoning (CQR) for integrating multiple degraded events; and (iii) global quality understanding (GQU) for holistic perceptual evaluation over extended durations. Furthermore, a needle distortion question-answering (NDQA) paradigm is embedded across all three levels, where subtle spatial or temporal artifacts are sparsely inserted to probe fine-grained detection and reasoning capabilities. Extensive experiments on 14 state-of-the-art LVLMs reveal significant performance degradation with increasing video length and reasoning depth, highlighting their limited capacity for long-range temporal integration and perceptual attribution.

Long-term Video Quality Understanding · Video Quality Benchmark · Large Vision–Language Models
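
To make the needle distortion (NDQA) idea from the abstract concrete, the following is a minimal sketch of sparsely inserting a short, subtle artifact into an otherwise untouched long video. It is an illustration under stated assumptions, not the benchmark's actual distortion pipeline: the function names (`insert_needle`, `sample_needle_start`), the brightness-dip artifact, the `gain` value, and the flat-list frame representation are all hypothetical.

```python
import random

def insert_needle(frames, start, length, gain=0.85):
    """Return (distorted_frames, (start, end)) with a brief brightness dip.

    frames: list of frames, each a flat list of pixel intensities in [0, 1].
    The dip stands in for a subtle temporal artifact (assumption); it is
    not the distortion used by LongVQUBench.
    """
    end = start + length
    if start < 0 or length <= 0 or end > len(frames):
        raise ValueError("needle segment must lie inside the video")
    distorted = [list(f) for f in frames]  # copy so the source stays intact
    for t in range(start, end):
        # Scale intensities and clamp back into [0, 1].
        distorted[t] = [min(1.0, max(0.0, p * gain)) for p in frames[t]]
    return distorted, (start, end)

def sample_needle_start(num_frames, length, rng):
    """Pick a uniformly random start index such that the needle fits."""
    return rng.randrange(0, num_frames - length + 1)

# Toy "video": 600 frames of 16 pixels each (hypothetical sizes).
rng = random.Random(0)
video = [[rng.uniform(0.1, 1.0) for _ in range(16)] for _ in range(600)]
start = sample_needle_start(len(video), length=5, rng=rng)
distorted, span = insert_needle(video, start, length=5)
needle_fraction = (span[1] - span[0]) / len(video)  # share of frames touched
```

The returned `span` gives the ground-truth location of the needle, which is the kind of information a detection or localization question could be checked against. Here the needle covers 5 of 600 frames, so a model that samples frames sparsely may never see it.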

Benchmark Design

Hierarchical Evaluation Framework

Three progressively complex levels probe a model's capabilities, from local distortion detection to global perceptual quality judgment.

LQU · Level 1

Local Event Quality Understanding

Detect, localize, classify, and assess the severity of a single, temporally bounded distortion event such as localized blur, flicker, or compression noise.

CQR · Level 2

Cross-Event Quality Reasoning

Compare, associate, and integrate multiple distortion events distributed across extended temporal spans. Evaluates cumulative effects and temporal relations.

GQU · Level 3

Global Quality Understanding

Synthesize a holistic perceptual judgment over the entire video by tracking quality trends, identifying dominant degradations, and evaluating overall perceptual stability.

Representative questions across the three evaluation levels of LongVQUBench.
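
Because results are reported across the three levels, a per-level accuracy breakdown is a natural way to score multiple-choice answers. The sketch below shows one way this might be done; the item fields (`id`, `level`, `answer`), the function name `accuracy_by_level`, and the rule that a missing prediction counts as wrong are assumptions for illustration, not the benchmark's published format or protocol. Open-ended questions would need a different scoring scheme and are not covered here.

```python
from collections import defaultdict

LEVELS = ("LQU", "CQR", "GQU")

def accuracy_by_level(items, predictions):
    """Compute multiple-choice accuracy separately for each level.

    items: list of dicts with hypothetical keys 'id', 'level', 'answer'.
    predictions: dict mapping item id -> predicted option letter.
    A missing prediction counts as wrong (assumption).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        level = item["level"]
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        total[level] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        pred = predictions.get(item["id"], "").strip().upper()
        if pred == item["answer"].strip().upper():
            correct[level] += 1
    # Report only levels that actually have questions.
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}

# Toy data, not taken from the benchmark.
items = [
    {"id": "q1", "level": "LQU", "answer": "A"},
    {"id": "q2", "level": "LQU", "answer": "C"},
    {"id": "q3", "level": "CQR", "answer": "B"},
    {"id": "q4", "level": "GQU", "answer": "D"},
]
predictions = {"q1": "a", "q2": "B", "q3": "B"}  # q4 left unanswered
scores = accuracy_by_level(items, predictions)
```

Keeping the three scores separate, rather than pooling them into one number, makes it possible to see whether accuracy drops as the required temporal scope grows from a single event to the whole video.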

Dataset

Dataset Statistics

Videos are sourced from LongVideoBench, MLVU, and LongVideo-Reason-Eval, and span documentaries, surveillance, vlogs, cooking, animated infographics, and news.

Video content category distribution.

Video duration range distribution.

Citation

BibTeX

@inproceedings{nema2026longvqubench,
  title     = {LongVQUBench: Benchmarking Long-Term Video Quality Understanding of Vision-Language Models},
  author    = {Nema, Arpita and Zhu, Hanwei and Zhang, Xi and Lin, Weisi},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026},
  url       = {https://longvqubench.github.io}
}
