EgoGazeVQA

Multi-Modal Conversational Balancer for Egocentric Video Understanding

1University Name 1    2University Name 2    3Institution Name 3
NeurIPS 2025

Abstract

Understanding egocentric videos poses unique challenges due to the first-person perspective and the dynamic nature of human activities. We present EgoGazeVQA, a multi-modal framework that leverages gaze information as an additional modality to enhance video question answering in egocentric scenarios. Our approach introduces a conversational balancer mechanism that integrates visual, textual, and gaze signals to produce more accurate and contextually relevant answers.

Through extensive experiments on multiple egocentric video datasets, we demonstrate that incorporating gaze patterns significantly improves the model's understanding of user intentions and scene dynamics. Our method achieves state-of-the-art performance on the EgoExo4D and EGTEA benchmarks, with improvements of X% and Y%, respectively, over previous baselines.

Method

Model Architecture

Figure 1: Overview of the EgoGazeVQA architecture. Our model consists of three main components: (a) Visual Encoder, (b) Gaze Attention Module, and (c) Conversational Balancer.
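
The exact layer configurations are given in the paper. As a rough illustration only, a minimal PyTorch-style sketch of components (b) and (c) is shown below; the module names, dimensions, and fusion details are hypothetical and not the released implementation. Visual tokens are assumed to come from a frozen video backbone, and the question from a separate text encoder.

# Hypothetical sketch of the Gaze Attention Module and Conversational Balancer.
# Names, dimensions, and fusion details are illustrative, not the paper's code.
import torch
import torch.nn as nn

class GazeAttentionModule(nn.Module):
    """Uses gaze embeddings as queries to extract gaze-relevant visual features."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens, gaze_tokens):
        # visual_tokens: (B, N, dim) from a frozen backbone; gaze_tokens: (B, T, dim)
        attended, _ = self.attn(gaze_tokens, visual_tokens, visual_tokens)
        return attended

class ConversationalBalancer(nn.Module):
    """Learns per-modality weights before fusing visual, gaze, and text features."""
    def __init__(self, dim=512, n_modalities=3):
        super().__init__()
        self.gate = nn.Linear(dim * n_modalities, n_modalities)

    def forward(self, feats):
        # feats: list of n_modalities pooled features, each of shape (B, dim)
        weights = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)
        return sum(w.unsqueeze(-1) * f for w, f in zip(weights.unbind(-1), feats))

In this sketch, pooled visual, gaze-attended, and question features of shape (B, 512) are passed to ConversationalBalancer as a three-element list, and the fused vector is fed to the answer head.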

Key Contributions

  • Gaze-Aware Attention: Novel attention mechanism that uses gaze patterns to guide visual feature extraction
  • Multi-Modal Fusion: Effective integration of visual, textual, and gaze modalities through learnable balancing
  • Temporal Reasoning: Enhanced temporal understanding through gaze trajectory analysis (see the sketch after this list)
  • Large-Scale Dataset: Introduction of EgoGazeVQA dataset with 50K+ annotated QA pairs
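
The temporal-reasoning contribution relies on the gaze trajectory, i.e., the time-ordered sequence of fixation points. As a hypothetical illustration only (not necessarily the paper's pipeline), raw (x, y, t) gaze points could be summarized into a single trajectory feature as follows:

# Hypothetical sketch: encoding a gaze trajectory for temporal reasoning.
# The paper's actual gaze feature extraction may differ.
import torch
import torch.nn as nn

class GazeTrajectoryEncoder(nn.Module):
    """Summarizes a time-ordered sequence of (x, y, t) gaze points into one vector."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(3, dim)   # lift raw (x, y, t) to the model dimension
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, gaze_points):
        # gaze_points: (B, T, 3) with normalized coordinates and timestamps
        hidden, _ = self.gru(self.proj(gaze_points))
        return hidden[:, -1]            # last step summarizes the whole trajectory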

Experimental Results

Quantitative Results

Method               EgoExo4D    EGTEA    EPIC-Kitchens
Baseline                 45.2     38.7             41.3
Method A                 48.5     42.1             44.6
Method B                 51.3     44.8             47.2
EgoGazeVQA (Ours)        58.7     52.3             54.1

Ablation Study

Qualitative Results

  • Result 1: Scene understanding with gaze guidance
  • Result 2: Temporal reasoning visualization
  • Result 3: Multi-modal attention maps
  • Result 4: Comparison with baselines

EgoGazeVQA Dataset

  • 10K+ videos
  • 50K+ QA pairs
  • 100K+ gaze points
  • 500+ hours

Dataset Features

  • High-quality egocentric videos with synchronized gaze tracking
  • Diverse scenarios including cooking, assembly, and navigation
  • Multi-level question types: spatial, temporal, and causal reasoning (see the example record after this list)
  • Professional annotations with quality control
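
For illustration, a single annotation might be stored as the record below; the field names and value formats are hypothetical, and the released schema may differ.

# Hypothetical example of one EgoGazeVQA annotation record (not the released schema).
example_record = {
    "video_id": "kitchen_0042",
    "question": "What does the wearer look at before picking up the knife?",
    "answer": "the cutting board",
    "question_type": "temporal",            # one of: spatial, temporal, causal
    "clip_span": [12.4, 18.9],              # start/end of the relevant segment, in seconds
    "gaze": [                               # synchronized gaze fixations
        {"t": 12.6, "x": 0.41, "y": 0.58},  # normalized image coordinates
        {"t": 13.1, "x": 0.47, "y": 0.55},
    ],
}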

Download

The dataset will be publicly available upon paper acceptance.

Interactive Demo


Citation

@inproceedings{pan2025egogaze,
    title={EgoGazeVQA: Multi-Modal Conversational Balancer for Egocentric Video Understanding},
    author={Pan, Taiyi and Author2 and Author3 and Author4},
    booktitle={Advances in Neural Information Processing Systems},
    year={2025}
}