Canonical Correlation Inference for Mapping Abstract Scenes to Text
This project uses canonical correlation analysis (CCA) to map images and text into a shared space, and then uses that shared space to map unseen images to corresponding captions. We use the Abstract Scenes dataset developed at Microsoft; a couple of example scenes are shown below.
To cite this work:

@techreport{jiang-16,
  title={Canonical Correlation Inference for Mapping Abstract Scenes to Text},
  author={Helen Jiang and Nikos Papasarantopoulos and Shay B. Cohen},
  eprint={arXiv},
  year={2016}
}
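As a rough illustration of the approach (this is not the code used in the paper), the Python sketch below fits CCA on hypothetical precomputed image and caption feature matrices, then ranks candidate captions for an unseen image by cosine similarity in the shared space. All dimensions and variable names here are placeholders.

# Minimal sketch of CCA-based caption retrieval (not the authors' code).
# X and Y stand in for precomputed image and caption features of aligned
# training pairs; both are hypothetical random matrices here.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))   # stand-in image features (n x d_img)
Y = rng.normal(size=(500, 64))    # stand-in caption features (n x d_txt)

cca = CCA(n_components=32)
cca.fit(X, Y)

# Project an unseen image and the candidate captions into the shared space.
x_new = rng.normal(size=(1, 128))
x_proj = cca.transform(x_new)     # shape (1, 32)
_, y_proj = cca.transform(X, Y)   # caption side of the training pairs

# Rank candidate captions by cosine similarity in the shared space.
def cosine(a, b):
    return (a @ b.T) / (np.linalg.norm(a, axis=1, keepdims=True)
                        * np.linalg.norm(b, axis=1))

ranking = np.argsort(-cosine(x_proj, y_proj)[0])
print("top candidate caption indices:", ranking[:5])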
Click here to download the ranked captions. The format of the file is as follows.
The file has 300 lines, one per ranked image. Each line contains fields separated by the ^ character, in the following order:
- The image name in the abstract scene dataset (in the RenderedScenes/ directory)
- Gold-standard caption 1
- Gold-standard caption 2
- Gold-standard caption 3
- Gold-standard caption 4
- Gold-standard caption 5
- Gold-standard caption 6
- Gold-standard caption 7
- Gold-standard caption 8
- The gold-standard caption that was rated
- The caption from the statistical machine translation system of Ortiz et al. that was rated
- The CCA caption that was rated
- Average rating (by 2-3 subjects) for the gold caption, on a scale from 1 (least relevant) to 5 (most relevant)
- Average rating for the SMT caption
- Average rating for the CCA caption
Not all images have eight gold-standard captions, so some of the caption fields may be empty. A minimal parsing sketch for this format follows.
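For illustration, here is a small Python parser for the field layout listed above; the filename ranked_captions.txt is a placeholder.

# Sketch of a parser for the ranked-captions file described above.
# Fields follow the order listed in this section: image name, eight
# gold captions, the three rated captions, and three average ratings.
FIELDS = (
    ["image_name"]
    + [f"gold_caption_{i}" for i in range(1, 9)]
    + ["rated_gold_caption", "rated_smt_caption", "rated_cca_caption",
       "gold_rating", "smt_rating", "cca_rating"]
)

def parse_line(line):
    values = line.rstrip("\n").split("^")
    record = dict(zip(FIELDS, values))
    # Ratings are averages over 2-3 subjects on a 1-5 scale.
    for key in ("gold_rating", "smt_rating", "cca_rating"):
        record[key] = float(record[key])
    return record

with open("ranked_captions.txt", encoding="utf-8") as f:
    records = [parse_line(line) for line in f]
assert len(records) == 300  # one line per ranked image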
Click here to download the splits that we used for the training/development/tuning/test sets. These are the same splits used by Ortiz et al. Each file in this gzipped tarball contains a list of pointers to the scenes used for the relevant set. The human-ranked images were taken from the test set (the first 300 images).
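As a sketch, the tarball can be read with Python's standard tarfile module; the archive name splits.tar.gz is a placeholder, since only the structure (one pointer list per set) is described here.

# Sketch of reading the splits tarball described above.
import tarfile

splits = {}
with tarfile.open("splits.tar.gz", "r:gz") as tar:
    for member in tar.getmembers():
        if not member.isfile():
            continue
        with tar.extractfile(member) as f:
            # Each file is a list of pointers to the scenes for one set.
            splits[member.name] = f.read().decode("utf-8").splitlines()

for name, pointers in splits.items():
    print(name, len(pointers))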