Understanding egocentric videos poses unique challenges due to the first-person perspective and dynamic nature of human activities. We present EgoGazeVQA, a novel multi-modal framework that leverages gaze information as an additional modality to enhance video question answering in egocentric scenarios. Our approach introduces a conversational balancer mechanism that effectively integrates visual, textual, and gaze signals to provide more accurate and contextually relevant answers.
Through extensive experiments on multiple egocentric video datasets, we demonstrate that incorporating gaze patterns significantly improves the model's understanding of user intentions and scene dynamics. Our method achieves state-of-the-art performance on the EgoExo4D and EGTEA benchmarks, outperforming the strongest prior baseline by 7.4 and 7.5 points, respectively (see Table 1).
Figure 1: Overview of the EgoGazeVQA architecture. Our model consists of three main components: (a) Visual Encoder, (b) Gaze Attention Module, and (c) Conversational Balancer.
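The page does not include a reference implementation, so below is a minimal sketch of how the Gaze Attention Module and Conversational Balancer named in Figure 1 could be wired together, assuming a PyTorch setting. The class names, feature dimensions, and the softmax-gated fusion are illustrative assumptions, not the authors' actual design.

```python
import torch
import torch.nn as nn


class GazeAttentionModule(nn.Module):
    """Hypothetical gaze-guided cross-attention: per-frame visual features
    attend to embeddings of gaze fixation points."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.gaze_proj = nn.Linear(2, dim)  # (x, y) fixation -> embedding
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats, gaze_points):
        # frame_feats: (B, T, dim) per-frame visual features
        # gaze_points: (B, T, 2) normalized fixation coordinates
        gaze_emb = self.gaze_proj(gaze_points)
        attended, _ = self.attn(frame_feats, gaze_emb, gaze_emb)
        return attended


class ConversationalBalancer(nn.Module):
    """Hypothetical gated fusion that balances visual, gaze-attended, and
    textual (question) features before answer decoding."""

    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, 3), nn.Softmax(dim=-1))
        self.out = nn.Linear(dim, dim)

    def forward(self, visual, gaze, text):
        # visual, gaze, text: (B, dim) pooled modality features
        weights = self.gate(torch.cat([visual, gaze, text], dim=-1))  # (B, 3)
        fused = (weights[:, 0:1] * visual
                 + weights[:, 1:2] * gaze
                 + weights[:, 2:3] * text)
        return self.out(fused)


# Toy forward pass with random tensors; it only checks shapes, nothing more.
B, T, D = 2, 8, 256
frames = torch.randn(B, T, D)       # stand-in for Visual Encoder output
gaze_xy = torch.rand(B, T, 2)       # stand-in for recorded gaze points
question = torch.randn(B, D)        # stand-in for encoded question text

gaze_module = GazeAttentionModule(D)
balancer = ConversationalBalancer(D)

gaze_feats = gaze_module(frames, gaze_xy).mean(dim=1)   # pool over time
fused = balancer(frames.mean(dim=1), gaze_feats, question)
print(fused.shape)  # torch.Size([2, 256])
```

Gating the three modalities with a learned softmax is one simple way to realize the idea of balancing gaze evidence against visual and textual context; the mechanism used in EgoGazeVQA itself may differ.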
Table 1: Comparison with baseline methods on egocentric video QA benchmarks.

| Method | EgoExo4D | EGTEA | Epic-Kitchens |
|---|---|---|---|
| Baseline | 45.2 | 38.7 | 41.3 |
| Method A | 48.5 | 42.1 | 44.6 |
| Method B | 51.3 | 44.8 | 47.2 |
| EgoGazeVQA (Ours) | **58.7** | **52.3** | **54.1** |
Qualitative results: scene understanding with gaze guidance, temporal reasoning visualization, multi-modal attention maps, and comparison with baselines.
Dataset statistics: videos, QA pairs, gaze points, and hours of footage.
@inproceedings{pan2025egogaze,
  title={EgoGazeVQA: Multi-Modal Conversational Balancer for Egocentric Video Understanding},
  author={Pan, Taiyi and Author2 and Author3 and Author4},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}