-
Deep Visual-Semantic Alignments for Generating Image Descriptions下载
资源介绍
We present a model that generates natural language descriptions
of images and their regions. Our approach leverages
datasets of images and their sentence descriptions to
learn about the inter-modal correspondences between language
and visual data. Our alignment model is based on a
novel combination of Convolutional Neural Networks over
image regions, bidirectional Recurrent Neural Networks
over sentences, and a structured objective that aligns the
two modalities through a multimodal embedding. We then
describe a Multimodal Recurrent Neural Network architecture
that uses the inferred alignments to learn to generate
novel descriptions of image regions. We demonstrate that
our alignment model produces state of the art results in retrieval
experiments on Flickr8K, Flickr30K and MSCOCO
datasets. We then show that the generated descriptions significantly
outperform retrieval baselines on both full images
and on a new dataset of region-level annotations.