The emergence of advanced multimodal large language models (MLLMs) has significantly enhanced AI assistants' ability to process complex information across modalities. Egocentric videos, which directly capture user focus, actions, and context in a unified coordinate frame, offer an exciting opportunity to enable proactive and personalized AI user experiences with MLLMs. However, existing benchmarks overlook the crucial role of gaze as an indicator of user intent.
To address this gap, we introduce EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that leverages gaze information to improve the understanding of long daily-life videos. EgoGazeVQA consists of 1,757 gaze-based QA pairs from 913 videos, generated by MLLMs and refined by human annotators. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions. In contrast, our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues.
We propose EgoGazeVQA, the first MLLM benchmark that incorporates essential gaze signals for understanding user intent in egocentric settings. We present examples of Spatial, Temporal, and Causal Intent QA, demonstrating how gaze information improves MLLMs' performance. Radar charts compare model performance across different scenarios and activities, showing consistent gains with our gaze-guided prompting strategy.
Construction pipeline of EgoGazeVQA. We craft the benchmark in three stages. Stage 1: Egocentric video clips are processed to extract frame captions and gaze coordinates that capture user focus. Stage 2: An MLLM generates spatial/temporal-aware and intention-related Q&A pairs using a customized prompt. Stage 3: Human annotators manually review the generated Q&A pairs across multiple quality dimensions to ensure high-quality data.
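For concreteness, here is a minimal sketch of Stages 1-2 of this pipeline, assuming frame captions and normalized gaze coordinates have already been extracted and that the MLLM call is represented by a placeholder; the exact prompt wording and the `GazedFrame` structure are illustrative, not the paper's implementation.

```python
# Sketch of Stages 1-2: pair frame captions with gaze coordinates and
# assemble a customized QA-generation prompt for an MLLM (illustrative only).
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GazedFrame:
    timestamp: float                  # seconds from clip start
    caption: str                      # frame caption from a captioning model
    gaze_xy: Tuple[float, float]      # normalized gaze coordinates in [0, 1]


def build_qa_generation_prompt(frames: List[GazedFrame]) -> str:
    """Assemble a prompt asking an MLLM for spatial-, temporal-, and
    intent-related QA pairs grounded in the user's gaze."""
    lines = ["You are given egocentric video frames with the user's gaze point."]
    for f in frames:
        lines.append(
            f"[t={f.timestamp:.1f}s] caption: {f.caption}; "
            f"gaze at (x={f.gaze_xy[0]:.2f}, y={f.gaze_xy[1]:.2f})"
        )
    lines.append(
        "Generate multiple-choice QA pairs about what the user is looking at "
        "(spatial), when they attend to it (temporal), and why (intent)."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    frames = [
        GazedFrame(1.0, "a hand reaching toward a kettle on the stove", (0.62, 0.48)),
        GazedFrame(3.5, "a cup being placed next to the kettle", (0.55, 0.52)),
    ]
    # Stage 2 would send this prompt to an MLLM; Stage 3 is human review.
    print(build_qa_generation_prompt(frames))
```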
Gaze-guided prompting strategies in EgoGazeVQA. We experiment with three gaze-guided prompting strategies: (Left) Gaze as Textual Prompt - gaze coordinates are presented as text inputs to guide model responses; (Center) Gaze as Visual Prompt - highlights gaze points directly on video frames; (Right) Gaze Salience Maps as Prompt - utilizes heatmaps of gaze trajectories to provide contextual cues for understanding spatial and temporal intent.
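The sketch below illustrates one plausible realization of the three strategies, assuming a video frame given as a NumPy array and normalized gaze points; the prompt wording, marker style, and Gaussian parameters are our own choices, not taken from the paper.

```python
# Illustrative implementations of the three gaze-guided prompting strategies.
import numpy as np
from PIL import Image, ImageDraw


def gaze_as_text(gaze_points):
    """(Left) Gaze as Textual Prompt: serialize gaze coordinates as text."""
    coords = "; ".join(f"(t={t:.1f}s, x={x:.2f}, y={y:.2f})" for t, x, y in gaze_points)
    return f"The user's gaze fixations were: {coords}. Answer with this focus in mind."


def gaze_as_visual_prompt(frame: np.ndarray, x: float, y: float, radius: int = 12) -> Image.Image:
    """(Center) Gaze as Visual Prompt: mark the gaze point directly on the frame."""
    img = Image.fromarray(frame)
    draw = ImageDraw.Draw(img)
    cx, cy = x * img.width, y * img.height
    draw.ellipse([cx - radius, cy - radius, cx + radius, cy + radius],
                 outline=(255, 0, 0), width=3)
    return img


def gaze_salience_map(h: int, w: int, gaze_points, sigma: float = 30.0) -> np.ndarray:
    """(Right) Gaze Salience Map: accumulate a Gaussian heatmap over the trajectory."""
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float32)
    for _, x, y in gaze_points:
        cx, cy = x * w, y * h
        heat += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return heat / max(float(heat.max()), 1e-8)  # normalize to [0, 1]


if __name__ == "__main__":
    gaze = [(1.0, 0.62, 0.48), (3.5, 0.55, 0.52)]
    frame = np.zeros((480, 640, 3), dtype=np.uint8)
    print(gaze_as_text(gaze))
    overlay = gaze_as_visual_prompt(frame, gaze[0][1], gaze[0][2])
    heat = gaze_salience_map(480, 640, gaze)
    print(overlay.size, heat.shape, float(heat.max()))
```

In practice the textual variant is appended to the question prompt, while the visual and salience variants replace or augment the raw frames fed to the MLLM.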
| Method | Strategy | Spatial (%) | Temporal (%) | Causal (%) | Average (%) |
|---|---|---|---|---|---|
| Human | - | 80.7 | 75.6 | 95.2 | 83.8 |
| Qwen2.5-VL-72B | w/o gaze | 57.1 | 45.2 | 79.3 | 60.5 |
| Qwen2.5-VL-72B | Textual | 60.0 | 50.7 | 84.2 | 65.0 |
| Qwen2.5-VL-72B | Visual | 59.7 | 48.1 | 84.1 | 63.9 |
| Qwen2.5-VL-72B | Salience Map | 64.3 | 50.3 | 84.3 | 66.3 |
Key Findings: The best-performing method (Qwen2.5-VL-72B + Salience Map) achieves 66.3% average accuracy, a +5.8-point improvement over the w/o-gaze baseline. However, a significant gap remains compared to human performance (83.8%), highlighting the challenge and the room for improvement in gaze-guided egocentric video understanding.
Visual examples from our EgoGazeVQA benchmark and gaze-guided prompting. The successful examples show how gaze signals significantly enhance MLLMs' performance on tasks demanding precise spatial reasoning and intent interpretation, such as distinguishing closely situated objects in the kitchen and accurately inferring user focus in complex scenes. Challenging cases reveal limitations under drastic body motion or gaze saccades, which can mislead the model.
@misc{peng2025eyemllmbenchmarkingegocentric,
title={In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting},
author={Taiying Peng and Jiacheng Hua and Miao Liu and Feng Lu},
year={2025},
eprint={2509.07447},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.07447}
}