In the Eye of MLLM

Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting

1State Key Laboratory of VR Technology and Systems, School of CSE, Beihang University   
2College of AI, Tsinghua University
NeurIPS D&B 2025

Abstract

The emergence of advanced multimodal large language models (MLLMs) has significantly enhanced AI assistants' ability to process complex information across modalities. Egocentric videos, which directly capture user focus, actions, and context in a unified coordinate frame, offer an exciting opportunity to enable proactive and personalized AI user experiences with MLLMs. However, existing benchmarks overlook the crucial role of gaze as an indicator of user intent.

To address this gap, we introduce EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that leverages gaze information to improve the understanding of longer daily-life videos. EgoGazeVQA consists of 1,757 gaze-based QA pairs from 913 videos, generated by MLLMs and refined by human annotators. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions. In contrast, our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues.

EgoGazeVQA Introduction

We propose EgoGazeVQA, the first MLLM benchmark that incorporates essential gaze signals for understanding user intent in egocentric settings. We present examples of Spatial, Temporal, and Causal Intent QA, demonstrating how gaze information improves MLLMs' performance. Radar charts compare model performance across different scenarios and activities, showing consistent gains with our gaze-guided prompting strategy.

Method

Construction Pipeline

Construction pipeline of EgoGazeVQA. We construct the benchmark in three stages. Stage 1: Egocentric video clips are processed to extract frame captions and gaze coordinates that capture user focus. Stage 2: An MLLM generates spatial/temporal-aware and intention-related Q&A pairs using a customized prompt. Stage 3: Human annotators manually review the generated Q&A pairs along several quality dimensions to ensure high-quality data.
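
As a rough illustration of how these stages fit together, the sketch below assembles a Stage 2 generation prompt from Stage 1 outputs. The dataclass, helper names, and prompt wording are hypothetical stand-ins for illustration, not the exact pipeline or prompts used to build EgoGazeVQA.

# Minimal sketch of Stages 1-2: per-frame captions and gaze coordinates are
# packed into a customized prompt for an MLLM to draft intent QA pairs.
# (Names, prompt text, and data layout are illustrative assumptions.)
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FrameRecord:
    timestamp: float              # seconds into the clip
    caption: str                  # frame caption produced in Stage 1
    gaze_xy: Tuple[float, float]  # normalized gaze coordinates in [0, 1]

def build_qa_generation_prompt(frames: List[FrameRecord]) -> str:
    """Stage 2: assemble the prompt asking an MLLM to write
    spatial-, temporal-, and intent-related QA pairs grounded in gaze."""
    lines = [
        "You are given per-frame captions and the camera wearer's gaze point.",
        "Write multiple-choice QA pairs about the wearer's spatial, temporal,",
        "and causal intent, grounded in what the gaze is fixating on.",
    ]
    for f in frames:
        lines.append(
            f"t={f.timestamp:.1f}s  gaze=({f.gaze_xy[0]:.2f}, {f.gaze_xy[1]:.2f})  caption: {f.caption}"
        )
    return "\n".join(lines)

# Toy usage; Stage 3 (human review of the generated pairs) happens offline.
frames = [
    FrameRecord(1.0, "a hand reaches toward a kettle on the stove", (0.41, 0.55)),
    FrameRecord(3.5, "the kettle is lifted over a mug", (0.62, 0.48)),
]
print(build_qa_generation_prompt(frames))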

Gaze-Guided Prompting Strategies

Gaze-guided prompting strategies in EgoGazeVQA. We experiment with three gaze-guided prompting strategies: (Left) Gaze as Textual Prompt - gaze coordinates are provided as text inputs to guide model responses; (Center) Gaze as Visual Prompt - gaze points are highlighted directly on video frames; (Right) Gaze Salience Maps as Prompt - heatmaps of gaze trajectories provide contextual cues for understanding spatial and temporal intent.
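
The snippet below sketches what each strategy could look like in code. The marker radius, Gaussian sigma, colors, and normalization are assumptions chosen for illustration, not the settings used in the paper.

# Illustrative implementations of the three gaze prompting strategies.
# (Marker size, sigma, and normalization choices are assumptions.)
import numpy as np
import cv2

def gaze_as_text(gaze_xy, width, height):
    """Gaze as Textual Prompt: state the gaze location in pixel coordinates."""
    x, y = int(gaze_xy[0] * width), int(gaze_xy[1] * height)
    return f"The wearer's gaze falls at pixel ({x}, {y}) of the {width}x{height} frame."

def gaze_as_visual_prompt(frame, gaze_xy, radius=12):
    """Gaze as Visual Prompt: draw the gaze point directly on the frame."""
    h, w = frame.shape[:2]
    center = (int(gaze_xy[0] * w), int(gaze_xy[1] * h))
    marked = frame.copy()
    cv2.circle(marked, center, radius, color=(0, 0, 255), thickness=3)
    return marked

def gaze_salience_map(frame_shape, gaze_track, sigma=25.0):
    """Gaze Salience Map: accumulate a Gaussian heatmap over the gaze trajectory."""
    h, w = frame_shape[:2]
    heat = np.zeros((h, w), dtype=np.float32)
    for gx, gy in gaze_track:
        heat[min(int(gy * h), h - 1), min(int(gx * w), w - 1)] += 1.0
    heat = cv2.GaussianBlur(heat, ksize=(0, 0), sigmaX=sigma)
    return heat / (heat.max() + 1e-8)  # normalized to [0, 1]

# Toy usage on a blank frame and a short gaze trajectory.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
track = [(0.40, 0.55), (0.45, 0.52), (0.62, 0.48)]
print(gaze_as_text(track[-1], 640, 480))
overlay = gaze_as_visual_prompt(frame, track[-1])
salience = gaze_salience_map(frame.shape, track)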

Key Contributions

  • First Gaze-Guided VQA Benchmark: EgoGazeVQA is the first egocentric video QA benchmark that leverages gaze information to evaluate MLLMs' understanding of human intentions in daily-life videos
  • Three Gaze-Guided Prompting Strategies: We introduce Gaze as Textual Prompt, Gaze as Visual Prompt, and Sequential Gaze Salience Maps to enhance MLLMs' ability to interpret spatial, temporal, and intent-related cues
  • Comprehensive Evaluation: Extensive experiments on 7 state-of-the-art MLLMs (GPT-4o mini, Gemini 2.0, Qwen2.5-VL, InternVL2.5, etc.) demonstrate significant performance improvements with gaze guidance
  • Fine-tuning Analysis: Empirical study showing how LoRA fine-tuning enhances models' ability to leverage gaze signals and how gaze estimation accuracy affects prompting effectiveness (see the sketch after this list)
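
As a rough sketch of the fine-tuning setup referenced in the last item, the snippet below wires LoRA adapters into a Hugging Face model with the peft library. The stand-in text-only checkpoint, target modules, and hyperparameters are placeholder assumptions, not the paper's configuration.

# LoRA fine-tuning sketch with Hugging Face peft (hyperparameters and the
# stand-in text-only checkpoint are assumptions; the paper fine-tunes MLLMs).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "Qwen/Qwen2.5-7B-Instruct"          # placeholder base checkpoint
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

lora_cfg = LoraConfig(
    r=16,                                     # low-rank adapter dimension
    lora_alpha=32,                            # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],      # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()            # only the adapters are trainable
# From here, train on gaze-augmented QA prompts with a standard Trainer loop.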

Experimental Results

Main Results on EgoGazeVQA Benchmark

Method           Strategy      Spatial  Temporal  Causal  Average
Human            -             80.7     75.6      95.2    83.8
Qwen2.5-VL-72B   w/o gaze      57.1     45.2      79.3    60.5
Qwen2.5-VL-72B   Textual       60.0     50.7      84.2    65.0
Qwen2.5-VL-72B   Visual        59.7     48.1      84.1    63.9
Qwen2.5-VL-72B   Salience Map  64.3     50.3      84.3    66.3

All values are accuracy (%).

Key Findings: The best-performing configuration (Qwen2.5-VL-72B + Salience Map) achieves 66.3% average accuracy, a +5.8-point improvement over the gaze-free baseline (60.5%). However, a significant gap remains relative to human performance (83.8%), highlighting both the difficulty of the benchmark and the room for improvement in gaze-guided egocentric video understanding.
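
For readers checking the numbers, the reported averages are consistent with an unweighted mean over the three per-category accuracies (our reading of the table, not a stated formula):

# Sanity check of the reported averages, assuming an unweighted mean over
# the Spatial / Temporal / Causal accuracies.
baseline = (57.1 + 45.2 + 79.3) / 3      # w/o gaze      -> ~60.5
salience = (64.3 + 50.3 + 84.3) / 3      # salience map  -> 66.3
print(round(baseline, 1), round(salience, 1), round(salience - baseline, 1))
# 60.5 66.3 5.8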

Qualitative Results

Qualitative Examples

Visual examples from our EgoGazeVQA benchmark and gaze-guided prompting. The successful examples show how gaze signals substantially improve MLLMs' performance on tasks demanding precise spatial reasoning and intent interpretation, such as distinguishing closely situated objects in a kitchen and accurately inferring user focus in complex scenes. The challenging cases reveal limitations when drastic body motion or gaze saccades mislead the model.

Citation

@misc{peng2025eyemllmbenchmarkingegocentric,
    title={In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting}, 
    author={Taiying Peng and Jiacheng Hua and Miao Liu and Feng Lu},
    year={2025},
    eprint={2509.07447},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2509.07447}
}