A dataset of uncurated 360° video content with spatial audio.

Spatial Media

360° Video + FOA

88,000+ Clips

(10s each)

246 Hours


We collected a dataset of 360° video with first-order ambisonics (FOA) from YouTube, containing clips from a diverse set of topics such as musical performances, vlogs, and sports. The dataset was cleaned by removing videos that 1) did not contain valid ambisonics, 2) contained only still images, or 3) contained a significant amount of post-production sound such as voice-overs and background music. To evaluate a model's ability to localize objects in a 360° scene, we also provide semantic segmentation predictions produced by a state-of-the-art ResNet101 Panoptic FPN model trained on the MS-COCO dataset. For more information about the dataset, please see our paper.
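As a concrete illustration of the first cleaning criterion, the sketch below shows one simple heuristic for flagging tracks that cannot be valid first-order ambisonics: FOA (in the ACN/SN3D convention used by YouTube) carries exactly four channels (W, Y, Z, X), and the directional channels of a real recording are not silent. This is an assumption-laden sketch for illustration, not the paper's actual filtering pipeline:

```python
import numpy as np

def looks_like_foa(audio: np.ndarray, eps: float = 1e-6) -> bool:
    """Heuristic FOA check. `audio` is a (channels, samples) float array.

    First-order ambisonics must have exactly 4 channels; directional
    channels (Y, Z, X) of a genuine recording carry non-trivial energy.
    """
    if audio.ndim != 2 or audio.shape[0] != 4:
        return False  # not a 4-channel track, so not FOA
    w, y, z, x = audio
    directional_energy = np.mean(y**2) + np.mean(z**2) + np.mean(x**2)
    return directional_energy > eps

# Example: a synthetic source straight ahead, SN3D-encoded (W=s, X=s)
t = np.linspace(0, 1, 16000)
s = np.sin(2 * np.pi * 440 * t)
foa = np.stack([s, 0 * s, 0 * s, s])
print(looks_like_foa(foa))               # True
print(looks_like_foa(np.stack([s, s])))  # stereo track -> False
```

A production filter would add further checks (e.g., rejecting mono audio upmixed into four identical channels), but channel count alone already removes many invalid uploads.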

Learning Representations from Audio-Visual Spatial Alignment
Pedro Morgado*, Yi Li*, Nuno Vasconcelos
Advances in Neural Information Processing Systems (NeurIPS), 2020.
[ PDF | Suppl | ArXiv | Code | BibTeX | Video ]


We provide YouTube URLs and segment timestamps. If you experience issues downloading or processing the dataset, please email the authors for assistance.

Training set (IDs)

Test set (IDs)
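The released ID lists can be consumed roughly as follows. Note that the CSV column layout (YouTube ID, segment start time) and the fixed 10-second clip length are assumptions for illustration; check the released files for the actual format, and treat the `yt-dlp` invocation as a sketch rather than a supported download script:

```python
import csv
import io

CLIP_LEN = 10.0  # each clip is 10 s long (see the stats above)

def parse_segments(fp):
    """Yield (youtube_id, start, end) triples from an ID/timestamp file.

    Assumes a two-column CSV: YouTube video ID, segment start in seconds.
    """
    for row in csv.reader(fp):
        vid, start = row[0], float(row[1])
        yield vid, start, start + CLIP_LEN

def download_command(vid, start, end):
    # Illustrative yt-dlp section download for one clip.
    return (f"yt-dlp 'https://youtube.com/watch?v={vid}' "
            f"--download-sections '*{start:.1f}-{end:.1f}'")

# Hypothetical file contents for demonstration:
sample = io.StringIO("abc123XYZ_w,35.0\nabc123XYZ_w,45.0\n")
clips = list(parse_segments(sample))
print(clips)  # [('abc123XYZ_w', 35.0, 45.0), ('abc123XYZ_w', 45.0, 55.0)]
print(download_command(*clips[0]))
```

Some videos may become unavailable over time, so it is worth logging failed IDs and retrying; the authors can assist with persistent issues, as noted above.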


Semantic Segmentation


This work was partially funded by NSF award IIS-1924937 and by NVIDIA GPU donations. We also acknowledge and thank the Nautilus platform, which was used for some of the experiments in the paper.