LM101-045: How to Build a Deep Learning Machine for Answering Questions about Images
This is the fourth of a short subsequence of podcasts which provides a summary of events associated with Dr. Golden’s recent visit to the 2015 Neural Information Processing Systems Conference. This is one of the top conferences in the field of Machine Learning. This episode describes a deep learning machine that can answer simple questions about images.
Hello everyone! Welcome to the forty-fifth podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner. This is the fourth of a short subsequence of podcasts in the Learning Machines 101 series which is designed to give a brief overview of my personal experience and perspectives on my visit to the 2015 Neural Information Processing Systems Conference in Montreal. Episode 41 provides a brief overview of the conference which is considered one of the top conferences in the field of machine learning.
In this episode we discuss just one of the 102 posters which were presented on the first night of the 2015 Neural Information Processing Systems Conference. The poster is titled “Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering” and it is Poster 9 in the Proceedings of the 2015 Neural Information Processing Systems Conference. This presentation describes a system which can answer simple questions about images. For example, the system might be presented with an image or photograph of a red bus at a bus stop and the examiner might ask the system “What is the color of the bus?”. A correct response would be “The bus is red.” In another example, the system might be presented with an image which contains a bag of green apples and several bananas and the examiner might ask the system “What is the yellow item?”. A correct response would be “bananas”. A third example is an image consisting of two plates. Each plate contains a sandwich and some cooked broccoli and the question might be: “Please look carefully and tell me what is the name of the vegetable on the plate?” A correct response would be “broccoli”.
The system described in the poster, which the authors call the mQA model, consists of four components. The system is trained with triplets of information consisting of an image, a question represented as a word sequence, and an answer represented as a word sequence. The system is then tested on images and questions which it has not previously seen.
The first component uses a recurrent network of the type described in Episode 36 of Learning Machines 101 titled “How to Predict the Future from the Distant Past using Recurrent Neural Networks”. A question to the system is represented as a sequence of words. The network learns to predict the second word in the sequence from a compressed representation of the first word. As a consequence of this learning process, the compressed representation of the first word is updated to become a compressed representation of the first word followed by the second word in the sequence. Then, the network learns to predict the third word in the sequence from the compressed representation of the first two words and, as a consequence of this learning process, the compressed representation now contains important information about the first three words in the sequence. The compressed representation of the sequence of words used to predict the final word in the sentence is used as a representation of the sentence. The resulting compressed representation has three very important benefits. First, the number of elements in the compressed representation is small relative to the total number of possible word sequences. Second, the compressed representation has the property that word sequences with similar content and similar temporal sequencing will have similar compressed representations. Third, the basis for this induced similarity is a property of the entire training data set of word sequences. Fine distinctions in similarity will be induced only if such discriminatory mechanisms are required to learn the set of training sequences. If the set of training data can be learned without making certain fine-grained distinctions, then those fine-grained distinctions will not be learned by the learning machine.
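For listeners who like to see ideas in code, here is a minimal sketch of how a recurrent network folds a word sequence into one fixed-size compressed representation. It is only an illustration under loud assumptions: it uses a plain tanh recurrent update rather than whatever recurrent unit the authors used, the weights are random rather than learned by next-word prediction, and the vocabulary, the sizes, and the names `encode_sequence`, `w_in`, and `w_rec` are all made up for this example.

```python
import math
import random

def make_matrix(rows, cols, rng, scale=0.5):
    """Return a rows x cols matrix of small random weights (untrained)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def encode_sequence(words, embeddings, w_in, w_rec):
    """Fold a word sequence into one fixed-size state vector, one word at a time."""
    hidden_size = len(w_rec)
    state = [0.0] * hidden_size  # the compressed representation starts out empty
    for word in words:
        x = embeddings[word]
        # New state mixes the current word with the summary of all earlier words.
        state = [
            math.tanh(
                sum(w_in[i][j] * x[j] for j in range(len(x)))
                + sum(w_rec[i][k] * state[k] for k in range(hidden_size))
            )
            for i in range(hidden_size)
        ]
    return state

rng = random.Random(0)
vocab = ["what", "is", "the", "color", "of", "bus"]
EMBED, HIDDEN = 4, 3  # hypothetical sizes chosen for readability
embeddings = {w: [rng.uniform(-1, 1) for _ in range(EMBED)] for w in vocab}
w_in = make_matrix(HIDDEN, EMBED, rng)
w_rec = make_matrix(HIDDEN, HIDDEN, rng)

question = ["what", "is", "the", "color", "of", "the", "bus"]
code = encode_sequence(question, embeddings, w_in, w_rec)
reordered = encode_sequence(list(reversed(question)), embeddings, w_in, w_rec)
```

Notice that `code` always has `HIDDEN` elements no matter how long the question is, and that the same words in a different order give a different vector, which is the sense in which the representation captures temporal sequencing.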
The second component of the mQA model is a convolutional neural network as described in Episode 29 (Convolutional Neural Networks and Rectilinear Units). We have discussed convolutional neural networks in previous episodes of Learning Machines 101 but we will review the basic ideas behind such networks again. The convolutional neural network works by processing an image of pixels and identifying statistical regularities at the level of small groups of pixels in the image. Another level of the network then examines these constructed statistical regularities and identifies more abstract statistical regularities such as corners and edges. The next level of the network then examines these more abstract statistical regularities such as corners and edges and identifies parts of images such as a patch of carpet texture, a piece of furniture leg texture, or a patch of dog fur. The next level of the network examines these statistical regularities to form even higher level features and so on…until the highest levels of the network might be able to identify objects such as: chairs, faces, pillows, and people.
Such networks, as noted in Episodes 23, 29, and 30 of Learning Machines 101, are highly structured networks which have many parameters but also have a very specific network architecture. In particular, a feature detector which looks at a small region of a transformed image basically scans the image to detect the presence or absence of that feature in different parts of the transformed image…this scanning process is then used to generate a feature map whose elements indicate where the feature was detected. Multiple feature maps are learned at each level of processing. Next, a max-pooling layer is typically used which looks at small regions of the feature map to determine whether there is sufficient evidence that the feature is present in that region. Typically, the final output layer of the network is a softmax layer which assigns a probability to each possible output label. In particular, the “evidence” for each category label is computed and then exponentiated. The probability of a particular category label is then defined as the exponentiated evidence for that category label divided by the sum of the exponentiated evidence for all category labels. The softmax layer is closely related to the multinomial logistic regression model.
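The three operations just described (scanning a feature detector across an image, max-pooling the resulting feature map, and converting evidence into probabilities with a softmax) can be sketched in a few lines of plain Python. This is a toy illustration, not the network used in the poster: the 5x5 image, the hand-written `vertical_edge` detector, and the evidence values passed to `softmax` are all invented for the example, whereas a real convolutional network learns its detectors from data.

```python
import math

def feature_map(image, kernel):
    """Slide the kernel over the image; each output is the detector's response there."""
    kh, kw = len(kernel), len(kernel[0])
    rows = len(image) - kh + 1
    cols = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[a][b] * image[r + a][c + b] for a in range(kh) for b in range(kw))
            for c in range(cols)
        ]
        for r in range(rows)
    ]

def max_pool(fmap, size=2):
    """Keep the strongest response in each non-overlapping size x size region."""
    return [
        [
            max(fmap[r + a][c + b] for a in range(size) for b in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]

def softmax(evidence):
    """Exponentiate each score and divide by the sum of all exponentiated scores."""
    peak = max(evidence)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(e - peak) for e in evidence]
    total = sum(exps)
    return [e / total for e in exps]

# Toy image: dark on the left (columns 0-1), bright on the right (columns 2-4).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]  # hypothetical hand-written edge detector

fmap = feature_map(image, vertical_edge)  # 4x4 map of detector responses
pooled = max_pool(fmap)                   # 2x2 summary of where the edge was found
probs = softmax([2.0, 1.0, 0.1])          # made-up evidence for three category labels
```

The feature map is nonzero only where the dark-to-bright edge sits, the pooled map keeps that information at a coarser resolution, and the softmax output is a set of positive numbers that sum to one.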
In the mQA model, the authors of this article did not actually train a convolutional neural network but simply used an existing convolutional neural network called GoogLeNet which had already been trained on the ImageNet classification task. The output softmax units of the network were disconnected, and the outputs of the top-most remaining layer were connected into the system. Thus, the convolutional neural network was essentially used as an image preprocessing mechanism designed to generate a compressed semantic representation of the image in terms of a collection of feature maps. This type of strategy has become increasingly common in the deployment of deep learning networks. Training a deep learning network takes a tremendous amount of time and effort but as a result of the training one obtains a feature coding module which can potentially be deployed in other applications.
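The idea of disconnecting the softmax output units and reusing everything underneath them can be sketched as follows. This is a deliberately tiny stand-in and not GoogLeNet: the `PRETRAINED` dictionary, its weights, and the function names `classify` and `extract_features` are hypothetical, and a real pretrained network would have many layers and millions of weights loaded from disk.

```python
import math

def relu_layer(weights, inputs):
    """Fully connected layer followed by a rectified linear unit."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs))) for row in weights]

def softmax(scores):
    """Turn label evidence into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical "pretrained" weights, frozen: nothing below ever modifies them.
PRETRAINED = {
    "hidden": [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]],  # 3 inputs -> 2 features
    "classifier": [[1.0, -1.0], [-1.0, 1.0]],         # 2 features -> 2 labels
}

def classify(pixels):
    """Original use of the network: features -> label evidence -> probabilities."""
    features = relu_layer(PRETRAINED["hidden"], pixels)
    evidence = [sum(w * f for w, f in zip(row, features)) for row in PRETRAINED["classifier"]]
    return softmax(evidence)

def extract_features(pixels):
    """Reuse: stop before the classifier and hand the features to another model."""
    return relu_layer(PRETRAINED["hidden"], pixels)

pixels = [1.0, 0.5, 0.2]
image_code = extract_features(pixels)  # compressed image representation
label_probs = classify(pixels)         # what the network was originally trained to output
```

The downstream system only ever sees `image_code`, so the expensive training of the image network is paid for once and its feature coding is reused.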
The third component of the mQA model, which generates the answer as a sequence of words, is similar to the first component which processes the verbal question as a sequence of words. The next word in the answer is predicted from the previous word in the answer and as a result a compressed representation is formed. This final compressed semantic representation of the answer is the output of this third component of the model. Note that this component of the mQA model is used in the training phase of the model because during the training process the model is provided with the answers to the questions. During the testing phase, this third component is not used. Also note that the parameters of the third component of the model associated with knowledge of individual words are shared with the first component of the model which processes word sequences.
The fourth component of the model is similar to a feedforward network with softmax output units as described in Episode 23, “How to Build a Deep Learning Machine”. This is called the “fusing network”, which combines the compressed representations of the image, question word sequence, and answer word sequence to generate a multimodal compressed representation. Then, after this process, an additional recurrent neural network is used to predict the next word in the generated answer given the multimodal compressed representation and the previous word in the generated answer.
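Here is a rough sketch of what fusing three compressed representations and then scoring candidate next words might look like. Please treat it as a simplification under stated assumptions: each input is linearly projected into a shared space, the projections are added and squashed with tanh, and a single softmax layer scores the next word, which stands in for the additional recurrent network mentioned above. The weights are random, and the vocabulary, the sizes, and the names `fuse` and `next_word_distribution` are invented for this illustration.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn word evidence into a probability for each word."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(image_code, question_code, answer_code, projections):
    """Project each input into a shared space, add them, and squash with tanh."""
    parts = [
        matvec(projections["image"], image_code),
        matvec(projections["question"], question_code),
        matvec(projections["answer"], answer_code),
    ]
    return [math.tanh(sum(values)) for values in zip(*parts)]

def next_word_distribution(fused, output_weights, vocab):
    """Score every vocabulary word from the fused (multimodal) representation."""
    return dict(zip(vocab, softmax(matvec(output_weights, fused))))

rng = random.Random(1)

def rand_matrix(rows, cols):
    """Untrained random weights, for illustration only."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

vocab = ["the", "bus", "is", "red", "<end>"]  # hypothetical answer vocabulary
FUSED = 4                                      # hypothetical size of the fused vector
projections = {name: rand_matrix(FUSED, 3) for name in ("image", "question", "answer")}
output_weights = rand_matrix(len(vocab), FUSED)

# Made-up compressed representations; the answer code is zero before any word is generated.
fused = fuse([0.4, 0.6, 0.1], [0.2, -0.3, 0.5], [0.0, 0.0, 0.0], projections)
distribution = next_word_distribution(fused, output_weights, vocab)
predicted = max(distribution, key=distribution.get)
```

Because the weights are untrained, `predicted` is meaningless here. The point is only the data flow: three vectors go in, one multimodal vector comes out, and that vector is turned into a probability for each candidate word.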
The data set consists of 158,392 images with 316,193 Chinese question-answer pairs and their English translations. Each image has at least two question-answer pairs as annotations. The original annotations for the images were generated using Baidu’s online crowdsourcing server. Crowdsourcing is a method where the general public has the opportunity to view the images and generate question-answer pairs about the images. The average length of each question was about 7 words and the average length of each answer was about 4 words. About 1,000 triplets consisting of an image, a question, and an answer were extracted from the original data set for the purpose of testing the system. That is, the system was not taught to answer these 1,000 image questions.
The evaluation process was based upon a Visual Turing Test. Turing tests are discussed in Episode 5 and Episode 6 of this podcast series. The Visual Turing Test was set up in the following manner. The 1,000 image questions from the test data set were presented to the mQA learning machine to generate 1,000 answers. Thus, for each image question, one has an answer from the learning machine and additionally an answer from the test data set which was generated by human crowdsourcing. It is possible that the mQA learning machine was not really processing the image but was guessing an answer based only upon the question. So, for example, if the question was “What is the color of the bus?”, the system might be able to figure out without looking at the image that it needs to generate a color and perhaps the system only has been trained on four different colors. Thus, the system could guess the correct answer with a probability of 25% without processing the image information. To explore whether the system was generating its answers using only linguistic cues, the mQA system was also tested on the 1,000 image questions in a “blind-mode” where it basically wore a blindfold on its visual system. This mode of operation was called “blind-mQA”.
Each of 12 human judges was then asked to decide whether the answer for a particular image question was generated by a human or by a machine. Note that the 12 human judges rated 3,000 answers in total since there were 1,000 image questions and each had an answer generated by human crowdsourcing, an answer generated by the mQA system, and an answer generated by the blind-mQA system.
Here are the results from the Visual Turing Test. The human judges correctly identified human answers when a human answer was provided about 95% of the time. The human judges incorrectly classified a response as human when, in fact, it was generated by mQA about 65% of the time. For blind-mQA, however, the human judges were only fooled 34% of the time.
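To make the bookkeeping behind percentages like these concrete, here is a small sketch of how one could tally the fraction of answers from each source that judges labeled as human. The judgments in `toy` are made up purely to exercise the function and are not the study's data, and the function name `pass_rates` is my own.

```python
from collections import defaultdict

def pass_rates(judgments):
    """Map each answer source to the fraction of its answers judged to be human.

    judgments: iterable of (source, judged_human) pairs.
    """
    judged = defaultdict(int)
    passed = defaultdict(int)
    for source, judged_human in judgments:
        judged[source] += 1
        if judged_human:
            passed[source] += 1
    return {source: passed[source] / judged[source] for source in judged}

# Made-up toy judgments, NOT the data from the poster.
toy = [
    ("human", True), ("human", True), ("human", True), ("human", False),
    ("mQA", True), ("mQA", True), ("mQA", False), ("mQA", False),
    ("blind-mQA", True), ("blind-mQA", False), ("blind-mQA", False), ("blind-mQA", False),
]
rates = pass_rates(toy)
```

The gap between the rate for the full system and the rate for the blindfolded system is what indicates that the image is actually being used.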
It is also interesting to discuss the qualitative nature of the failure cases of the mQA model. In one image a boy is in what looks like a bay or lagoon and is reaching for a flying yellow Frisbee. The question is “What is the handsome boy doing?”. The computer generated response is: “surfing”. Another image shows a bowl containing apples and oranges and some bananas next to the bowl. The mQA response to the question “Which fruit is there in the plate?” is “bananas and oranges”. A third image shows several buses on the street and the question “What is the type of vehicle?” yielded “train” by the mQA rather than “bus”.
Although these results are extremely impressive, it’s still not clear how far this type of methodology can be extended. The questions and answers are relatively short and focus on very explicit information. In addition, the entire database has only about two question-answer pairs per image. Even though the system is only trained upon one of the two question-answer pairs and tested on the other pair, this type of setup is quite different from the real world. It is possible that certain statistical regularities peculiar to each image act to constrain the set of possible answers and that this set of possible answers is relatively small. It would be interesting to see if systems could be developed in the future which can answer many different questions about the same image.
The answers to these questions, however, cannot be obtained by proving mathematical theorems but require computer experiments on different databases possessing different statistical structures. This is a fundamental difference between the design of a traditional software engineering algorithm and these types of algorithms. If you hire a software engineer to program up an algorithm that can search a database for files that contain certain keywords, then in principle the engineer can develop this algorithm and implement it in software and it will work correctly. There is not an “experimental component” to the algorithm design. Furthermore, if you took that algorithm and moved it to an entirely new database whose files were in the same format as the original database, the performance characteristics of the algorithm would remain the same.
In contrast to the traditional software engineering algorithm, statistical machine learning algorithms such as the mQA algorithm will exhibit different performance characteristics in different statistical environments. This is because statistical learning algorithms are learning machines which pick up complex statistical regularities from the environment. Therefore, the performance of the mQA algorithm is not only functionally dependent upon the implemented software algorithm; it is also functionally dependent upon the statistical characteristics of the data which it processes. This means that if one applies the same mQA algorithm (like other statistical learning machines) to two different data sets, the algorithm might exhibit spectacular performance on one data set but horrible performance on the other data set. Poor performance does not necessarily mean that the algorithm is incorrectly implemented. Thus, for practical applications, considerable experimental work is required to properly characterize the performance of the mQA in the class of statistical environments in which it will be applied. In addition, theoretical work is required to understand in what types of statistical environments such systems will exhibit good generalization performance and in what types of statistical environments they will exhibit poor generalization performance.
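A tiny experiment makes this point concrete. The sketch below runs exactly the same learning algorithm in two different statistical environments and gets very different accuracy. To be clear about the assumptions: this is a nearest-centroid classifier on synthetic one-dimensional data, chosen only because it fits in a few lines, and it has nothing to do with the mQA model itself. The `gap` values and function names are invented for the illustration.

```python
import random

def make_data(rng, gap, n=200):
    """Two classes of 1-D points whose means differ by `gap` (unit-variance noise)."""
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n)]
    data += [(rng.gauss(gap, 1.0), 1) for _ in range(n)]
    rng.shuffle(data)
    return data

def train(data):
    """Learn one centroid (the mean) per class."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in data if y == label]
        means[label] = sum(values) / len(values)
    return means

def accuracy(means, data):
    """Fraction of points whose nearest centroid matches the true label."""
    hits = sum(1 for x, y in data if min(means, key=lambda k: abs(x - means[k])) == y)
    return hits / len(data)

def run_experiment(gap, seed=0):
    """Train and test the same algorithm in one statistical environment."""
    rng = random.Random(seed)
    model = train(make_data(rng, gap))
    return accuracy(model, make_data(rng, gap))  # fresh test set from the same environment

easy = run_experiment(gap=6.0)  # classes far apart
hard = run_experiment(gap=0.2)  # classes almost on top of each other
```

The code in `train` and `accuracy` is identical in both runs, so the drop from `easy` to `hard` is not a bug. It is a property of the data, which is why such systems have to be characterized experimentally in the environments where they will be used.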
Well…this was a description of only one poster out of the 102 posters presented on the first night of the Neural Information Processing Systems conference in Montreal! I have about 4 or 5 more posters I want to discuss out of these 102 posters presented on the first night of the conference. After I complete those discussions, we will move to the second day of the conference!!!
I have provided in the show notes at: www.learningmachines101.com hyperlinks to all of the papers published at the Neural Information Processing Systems conference since 1987, the workshop and conference schedule for the Neural Information Processing Systems conference, and links to related episodes of Learning Machines 101!
If you are a member of the Learning Machines 101 community, please update your user profile. If you look carefully you can provide specific information about your interests on the user profile when you register for Learning Machines 101 or when you receive the bi-monthly Learning Machines 101 email update!
You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear” link!
Or if you are not a member of the Learning Machines 101 community, when you join the community by visiting our website at: www.learningmachines101.com you will have the opportunity to update your user profile at that time.
Also check out the Statistical Machine Learning Forum on LinkedIn and Twitter at “lm101talk”.
Also check us out at PINTEREST as well!
From time to time, I will review the profiles of members of the Learning Machines 101 community and do my best to talk about topics of interest to the members of this group! So please make sure to visit the website: www.learningmachines101.com and update your profile!
So thanks for your participation in today’s show! I greatly appreciate your support and interest in the show!!
“Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering”
(Poster 9 from the Proceedings of the 2015 Neural Information Processing Systems Conference)
Related Episodes of Learning Machines 101:
Episode 5 How to decide if a Machine is Artificially Intelligent
Episode 6 How to interpret Turing Test Results
Episode 22 (Learning in Monte Carlo Markov Chain Machines)
Episode 23 (Deep Learning and Feedforward Networks)
Episode 29 (Convolutional Neural Networks and Rectilinear Units)