We collected a dataset of 360° videos with first-order ambisonics from YouTube, spanning a diverse set of topics such as musical performances, vlogs, and sports. The dataset was cleaned by removing videos that 1) did not contain valid ambisonics, 2) contained only still images, or 3) contained a significant amount of post-production sound such as voice-overs and background music. To evaluate a model's ability to localize objects in a 360° scene, we also provide semantic segmentation predictions generated by a state-of-the-art ResNet-101 Panoptic FPN model trained on the MS-COCO dataset. For more information about the dataset, please see our paper.
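For readers unfamiliar with first-order ambisonics (FOA), the sketch below shows how the four FOA channels encode sound direction and how they can be decoded into a virtual microphone aimed at an arbitrary direction. It assumes the ambiX convention (ACN channel order W, Y, Z, X with SN3D normalization), which is the convention used for YouTube spatial audio; the function name and the synthetic encoding example are illustrative, not part of the released dataset tooling.

```python
import numpy as np

def foa_virtual_mic(w, y, z, x, azimuth, elevation):
    """Decode first-order ambisonics (ambiX: ACN order, SN3D normalization)
    into a virtual cardioid microphone pointed at (azimuth, elevation),
    both given in radians."""
    # Direction cosines of the look direction.
    dx = np.cos(azimuth) * np.cos(elevation)
    dy = np.sin(azimuth) * np.cos(elevation)
    dz = np.sin(elevation)
    # Cardioid decode: 0.5 * (omni + figure-of-eight steered at the target).
    return 0.5 * (w + x * dx + y * dy + z * dz)

# Illustrative example: encode a mono source at azimuth 90 degrees
# (listener's left) into FOA, then decode toward and away from it.
t = np.linspace(0, 1, 16000)
mono = np.sin(2 * np.pi * 440 * t)
az_src, el_src = np.pi / 2, 0.0
w = mono                                        # omnidirectional component
y = mono * np.sin(az_src) * np.cos(el_src)      # left-right component
z = mono * np.sin(el_src)                       # up-down component
x = mono * np.cos(az_src) * np.cos(el_src)      # front-back component

toward = foa_virtual_mic(w, y, z, x, np.pi / 2, 0.0)   # aimed at the source
away = foa_virtual_mic(w, y, z, x, -np.pi / 2, 0.0)    # aimed opposite
```

Steering the virtual microphone toward the source recovers the mono signal, while aiming away attenuates it to zero; this directional selectivity is what makes spatial alignment between audio and 360° video learnable.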
Learning Representations from Audio-Visual Spatial Alignment
Pedro Morgado*, Yi Li*, Nuno Vasconcelos
Advances in Neural Information Processing Systems (NeurIPS), 2020.
[ PDF | Suppl | ArXiv | Code | BibTeX | Video ]
This work was partially funded by NSF award IIS-1924937 and NVIDIA GPU donations. We also acknowledge and thank the use of the Nautilus platform for some of the experiments in the paper.