Georgia Tech Egocentric Activity Datasets

Our repository of egocentric activity datasets!
This page captures our effort on the GTEA dataset series.
Our latest and largest version is the EGTEA Gaze+ dataset (recommended!).
We are working on further developing EGTEA Gaze+. Stay tuned!


GTEA

This dataset contains 7 types of daily activities, each performed by 4 different subjects. The camera is mounted on a cap worn by the subject.


We highly recommend using EGTEA Gaze+ in place of this dataset!

Please consider citing the following papers when using this dataset:

Alireza Fathi, Xiaofeng Ren, James M. Rehg,
Learning to Recognize Objects in Egocentric Activities, CVPR, 2011

Yin Li, Zhefan Ye, James M. Rehg,
Delving into Egocentric Actions, CVPR, 2015

GTEA Gaze

This dataset was collected using Tobii eye-tracking glasses. It consists of 17 sequences performed by 14 different subjects.

To record the sequences, we stocked a table with various kinds of food, dishes, and snacks. We asked each subject to wear the Tobii glasses, calibrated the gaze, and then asked the subject to take a seat and make whatever food they felt like having.

The beginning and ending times of the actions are annotated. Each action consists of a verb and a set of nouns, for example, pouring milk into a cup. In our experiments we extract images from the videos at 15 frames per second, and the action annotations are given in terms of frame numbers. The following sequences are used for training: 1, 6, 7, 8, 10, 12, 13, 14, 16, 17, 18, 21, 22; the following sequences are used for testing: 2, 3, 5, 20.
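For reference, the train/test split above can be written down directly in code. The Action record in the sketch below is only a hypothetical in-memory representation of a (verb, nouns, start frame, end frame) annotation; it does not reflect the exact layout of the released label files.

    from typing import NamedTuple, Tuple

    # Train/test split by sequence number, as listed above.
    TRAIN_SEQUENCES = [1, 6, 7, 8, 10, 12, 13, 14, 16, 17, 18, 21, 22]
    TEST_SEQUENCES = [2, 3, 5, 20]

    FPS = 15  # images are extracted from the videos at 15 frames per second

    class Action(NamedTuple):
        # Hypothetical form of one annotation: a verb plus a set of nouns,
        # delimited by start and end frame numbers.
        verb: str
        nouns: Tuple[str, ...]
        start_frame: int
        end_frame: int

        def duration_seconds(self) -> float:
            # Frame-based annotations convert back to time via the 15 fps rate.
            return (self.end_frame - self.start_frame) / FPS

    # Example annotation: "pouring milk into cup" (frame numbers are made up).
    example = Action("pour", ("milk", "cup"), 120, 210)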

Please consider citing the following paper when using this dataset:

Alireza Fathi, Yin Li, James M. Rehg,
Learning to Recognize Daily Actions using Gaze, ECCV, 2012

GTEA Gaze+

We collected this dataset using SMI eye-tracking glasses. We are more than halfway through the annotation, and the data collected and annotated so far are available here. The current version contains 37 videos with gaze tracking and action annotations. Audio files are also available upon request.

We collected this dataset at Georgia Tech's AwareHome. It consists of seven meal-preparation activities, performed by 26 subjects. Subjects perform the activities based on the given cooking recipes (get the recipes here).
The activities are: American Breakfast, Pizza, Snack, Greek Salad, Pasta Salad, Turkey Sandwich, and Cheese Burger. The SMI glasses record an HD video of the subject's activities at 24 frames per second, and they also record the subject's gaze at 30 fps.
For each activity, we used ELAN to annotate its actions. An activity is a meal-preparation task such as making pizza, and an action is a short temporal segment such as putting sauce on the pizza crust, dicing the green peppers, or washing the mushrooms.
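Because video is recorded at 24 fps while gaze is sampled at 30 fps, pairing gaze points with video frames requires a small time alignment. The sketch below illustrates that arithmetic under the simplifying assumption that both streams start at time zero; the actual SMI recordings may carry their own timestamps.

    VIDEO_FPS = 24.0  # the SMI glasses record HD video at 24 frames per second
    GAZE_HZ = 30.0    # gaze is recorded at 30 samples per second

    def gaze_sample_to_frame(sample_index: int) -> int:
        # Map the i-th gaze sample to the nearest video frame index.
        t = sample_index / GAZE_HZ        # time of the gaze sample, in seconds
        return int(round(t * VIDEO_FPS))  # nearest 24 fps frame

    def frame_to_gaze_sample(frame_index: int) -> int:
        # Inverse mapping: the nearest gaze sample for a given video frame.
        t = frame_index / VIDEO_FPS
        return int(round(t * GAZE_HZ))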


We highly recommend using EGTEA Gaze+ in place of this dataset!

American Breakfast

Video

P1 P2 P3 P4 P5 P6

Pizza (Special)

Video

P1 P2 P3 P4 P5 P6

Afternoon Snack

Video

P1 P2 P3 P4 P5 P6

Greek Salad

Video

P1 P2 P3 P4 P6

Pasta Salad

Video

P1 P2 P3 P4

Turkey Sandwich

Video

P1 P2 P3 P4 P6

Cheese Burger

Video

P1 P2 P3 P4 P6

Gaze & Action Labels

In January 2016 we mistakenly posted raw labels. Please re-download the cleaned action labels if you received the incorrect version.

Gaze Labels Hand Masks


Please consider citing the following papers when using this dataset:

Alireza Fathi, Yin Li, James M. Rehg,
Learning to Recognize Daily Actions using Gaze, ECCV, 2012

Yin Li, Zhefan Ye, James M. Rehg,
Delving into Egocentric Actions, CVPR, 2015

Extended GTEA Gaze+

EGTEA Gaze+ is our largest and most comprehensive dataset for FPV actions and gaze. It subsumes GTEA Gaze+ and comes with HD videos (1280x960), audio, gaze tracking data, frame-level action annotations, and pixel-level hand masks at sampled frames.

Specifically, EGTEA Gaze+ contains 28 hours of (de-identified) cooking activities from 86 unique sessions of 32 subjects. The videos come with audio and gaze tracking (30 Hz). We further provide human annotations of actions (human-object interactions) and hand masks.


The action annotations include 10,325 instances of fine-grained actions, such as "Cut bell pepper" or "Pour condiment (from) condiment container into salad".
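Each action name pairs a verb with the objects it acts on. As a rough illustration only (the released annotation files have their own format), such a label can be split into its verb and object phrase:

    def split_action_label(label: str):
        # Split an action name such as "Cut bell pepper" into a (verb, object
        # phrase) pair. This is just a readability aid, not the format of the
        # released annotation files.
        verb, _, objects = label.partition(" ")
        return verb, objects

    print(split_action_label("Cut bell pepper"))   # ('Cut', 'bell pepper')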


The hand annotations consist of 15,176 hand masks from 13,847 video frames.

Please consider citing the following paper when using this dataset:

Yin Li, Miao Liu, James M. Rehg,
In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video, ECCV, 2018

Special thanks to BasicFinder for providing the hand annotations.

Contact

For general questions or bug reports, please contact
Miao Liu (mliu328@gatech.edu).