Implications of Multimodal Deep Learning for Textual and Visual Data
| Content Provider | Semantic Scholar |
|---|---|
| Author | Pham, Hieu |
| Copyright Year | 2015 |
| Abstract | While deep learning has been successful in a wide range of tasks across different fields, most current neural network based systems still learn from merely one source of knowledge, such as text, images, or waveforms. In this work, we propose a novel method that acquires knowledge from both textual and visual data. Our analyses also show that it is possible to learn a function that maps an arbitrary image or sequence of words to vectors in a joint high-dimensional semantic space. These vectors are then evaluated on two vision-cross-language tasks: image-caption relevance ranking and image retrieval from a text query. Unlike prior work in the same direction, we only have a decent model on the text data and a weak model on the image data. Furthermore, both the quantity and quality of our training text corpus are far better than those of our image dataset. Starting from the poor performance of our vision model, we achieve a 15× improvement after incorporating the knowledge from the text data. |
| File Format | PDF, HTM / HTML |
| Alternate Webpage(s) | http://cs231n.stanford.edu/reports/2015/pdfs/hyhieu_final.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |
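
The abstract above describes mapping images and word sequences into a joint semantic space and ranking image-caption pairs by similarity in that space. The following is a minimal sketch of that joint-embedding setup, not the author's implementation: the dimensions, function names, and random projection matrices standing in for trained encoders are all assumptions for illustration.

```python
import numpy as np

# Hypothetical dimensions; the report does not specify these values.
IMG_FEAT_DIM = 4096   # e.g. CNN features for an image
TXT_FEAT_DIM = 300    # e.g. averaged word vectors for a caption
JOINT_DIM = 512       # shared semantic space

rng = np.random.default_rng(0)

# Stand-ins for the learned mapping functions described in the abstract:
# each modality is projected into the same joint space. In the paper these
# mappings are learned; here they are random placeholders.
W_img = rng.normal(scale=0.01, size=(IMG_FEAT_DIM, JOINT_DIM))
W_txt = rng.normal(scale=0.01, size=(TXT_FEAT_DIM, JOINT_DIM))

def embed_image(img_feat: np.ndarray) -> np.ndarray:
    """Map an image feature vector into the joint semantic space."""
    v = img_feat @ W_img
    return v / (np.linalg.norm(v) + 1e-8)

def embed_text(txt_feat: np.ndarray) -> np.ndarray:
    """Map a caption feature vector into the joint semantic space."""
    v = txt_feat @ W_txt
    return v / (np.linalg.norm(v) + 1e-8)

def rank_captions(img_feat, caption_feats):
    """Return caption indices sorted by cosine similarity to the image,
    i.e. the image-caption relevance ranking task from the abstract."""
    img_vec = embed_image(img_feat)
    sims = [float(img_vec @ embed_text(c)) for c in caption_feats]
    return sorted(range(len(sims)), key=lambda i: -sims[i])

# Toy usage: rank three random "captions" against one random "image".
image = rng.normal(size=IMG_FEAT_DIM)
captions = [rng.normal(size=TXT_FEAT_DIM) for _ in range(3)]
print(rank_captions(image, captions))
```

Image retrieval from a text query, the second task mentioned in the abstract, is the symmetric operation: embed the query text once and rank a collection of image vectors by the same cosine similarity.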