Yes, each step-by-step instruction has a corresponding subgoal in the training and validation trajectories. If you exploit this alignment during training, please consult the submission guidelines before making leaderboard submissions.
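For example, here is a minimal sketch of reading that alignment, assuming the standard `traj_data.json` layout where each entry in `plan['low_actions']` carries a `high_idx` pointing at its subgoal, and each annotation in `turk_annotations['anns']` lists one instruction per subgoal in `high_descs`:

```python
import json
from collections import defaultdict

with open('traj_data.json') as f:
    traj = json.load(f)

# Group low-level actions by the subgoal ('high_idx') they belong to.
actions_by_subgoal = defaultdict(list)
for a in traj['plan']['low_actions']:
    actions_by_subgoal[a['high_idx']].append(a['api_action']['action'])

# Pair each step-by-step instruction with its subgoal's actions.
ann = traj['turk_annotations']['anns'][0]
for high_idx, instr in enumerate(ann['high_descs']):
    print(instr, '->', actions_by_subgoal[high_idx])
```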
You should be able to achieve a >99% success rate on training and validation tasks with the ground-truth actions and masks from the dataset. Occasionally, non-deterministic behavior in THOR can lead to failures, but these are extremely rare.
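For reference, a rough sketch of replaying the ground-truth low-level actions with the raw THOR API. Scene initialization (object poses, agent pose) and interaction-mask handling are omitted; the repo's replay scripts restore the full initial state first. The `floor_plan` key is an assumption about the `traj_data.json` scene layout:

```python
import json
from ai2thor.controller import Controller

with open('traj_data.json') as f:
    traj = json.load(f)

controller = Controller()                      # starts a THOR instance
controller.reset(traj['scene']['floor_plan'])  # e.g. 'FloorPlan5'

for a in traj['plan']['low_actions']:
    # api_action is already a THOR-compatible step dict.
    event = controller.step(a['api_action'])
    if not event.metadata['lastActionSuccess']:
        print('failed:', a['api_action']['action'])
```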
Mask prediction is an important part of the ALFRED challenge. Unlike non-interactive environments (e.g., vision-language navigation), here the agent must specify exactly what it wants to interact with.
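One common way to turn a predicted interaction mask into an object selection is to compare it against the environment's instance masks and pick the best-overlapping object. The sketch below assumes a dict like `event.instance_masks` (objectId -> boolean mask) from THOR with instance segmentation enabled; it is an illustration, not necessarily how the ALFRED environment resolves masks internally:

```python
import numpy as np

def best_object_for_mask(pred_mask, instance_masks):
    """Pick the objectId whose instance mask has the highest IoU
    with a predicted binary interaction mask."""
    best_id, best_iou = None, 0.0
    pred = pred_mask.astype(bool)
    for obj_id, mask in instance_masks.items():
        inter = np.logical_and(pred, mask).sum()
        union = np.logical_or(pred, mask).sum()
        iou = inter / union if union > 0 else 0.0
        if iou > best_iou:
            best_id, best_iou = obj_id, iou
    return best_id, best_iou
```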
Why do the `feat_conv.pt` files in the Full Dataset have 10 more frames than the number of images?
The last 10 frames are copies of the features from the last image frame.
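If you want the features to line up 1:1 with the images, you can simply drop the trailing copies. A minimal sketch, where the `raw_images` directory name and the ResNet feature shape are assumptions about the trajectory layout:

```python
import os
import torch

num_images = len(os.listdir('raw_images'))  # 'raw_images' dir name is an assumption
feats = torch.load('feat_conv.pt')          # assumed shape: (num_images + 10, 512, 7, 7)
assert feats.shape[0] == num_images + 10
feats = feats[:num_images]                  # now aligned 1:1 with the images
```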
You can use `augment_trajectories.py` to replay all the trajectories and augment the visual observations. At each step, use the THOR API to look around and take 6-12 shots of the surroundings, then stitch these shots together into a panoramic image for that frame (see the sketch below). You might have to set `'forceAction': True` for smooth MoveAhead/Rotate/Look actions. Note that capturing panoramic images at test time incurs the additional cost of the agent looking around.
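A rough sketch of capturing such a panorama at a single step. It assumes a THOR version whose `RotateLeft` accepts a `degrees` argument, and it uses naive side-by-side concatenation rather than real image stitching:

```python
import numpy as np

def capture_panorama(controller, num_shots=6):
    """Rotate the agent in place and stitch the RGB frames into a panorama."""
    shots = []
    step = 360 // num_shots
    for _ in range(num_shots):
        event = controller.step(action='RotateLeft',
                                degrees=step, forceAction=True)
        shots.append(event.frame)
    # After num_shots rotations the agent is back at its original heading.
    # Naive horizontal concatenation; real stitching would blend overlaps.
    return np.concatenate(shots, axis=1)
```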