Enhancing Neural Image Captioning with Eye-Tracking


In this work, we treat gaze data captured with eye-tracking as a proxy for human attention during language production. Specifically, we hypothesize that training an image captioning system on real gaze data sequentially aligned with individual speakers' utterances can yield more human-like captions. We aim to model the production process of a single speaker incrementally, in contrast to the common approach of aggregating gaze data over many individuals to represent generic saliency.
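One way to make this idea concrete is a minimal sketch (not the actual system described here, whose architecture is unspecified): at each word-production step, image region features are pooled using the individual speaker's fixation weights at that step, rather than a saliency map aggregated over many viewers. The function name `gaze_weighted_context` and the toy region features are illustrative assumptions.

```python
# Hypothetical sketch: per-step gaze-conditioned pooling of region features.
def gaze_weighted_context(region_feats, fixation_weights):
    """Pool image region features with one time step's gaze weights.

    region_feats: list of R feature vectors (one per image region)
    fixation_weights: length-R fixation weights for one step (sum to 1)
    """
    dim = len(region_feats[0])
    ctx = [0.0] * dim
    for w, feat in zip(fixation_weights, region_feats):
        for i, x in enumerate(feat):
            ctx[i] += w * x
    return ctx

# Toy example: 3 regions with 2-dim features; the speaker fixates region 0
# while producing the first word and region 2 while producing the second,
# so the context vector follows the gaze sequence rather than a static map.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
gaze_per_step = [[1.0, 0.0, 0.0],   # step 0: fixating region 0
                 [0.0, 0.0, 1.0]]   # step 1: fixating region 2
contexts = [gaze_weighted_context(regions, g) for g in gaze_per_step]
# contexts[0] == [1.0, 0.0]; contexts[1] == [0.5, 0.5]
```

In a real captioning model the pooled context vector would condition the decoder at each generation step, which is what makes the modeling incremental and speaker-specific.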

Abstract to be presented at the Symposium for Integrating Generic and Contextual Knowledge