Image Captioning is the process of generating textual descriptions of an image. It uses both Natural Language Processing and Computer Vision to generate captions.
- Introduction
- Motive
- Prerequisites
- Data Collection
- Interpreting Data
- Data Cleaning
- Loading the Training Set
- Data Preprocessing — Images
- Data Preprocessing — Captions
- Data Preparation using Generator Function
- Word Embeddings
- Model Architecture
- Inference
- Evaluation
- Conclusion and Future Work
- References
Have you ever wondered how easy it is for humans to look at a picture and describe it in an appropriate language? Even a 5-year-old can do this with ease. However, the same task is challenging for computers.
Can you write a computer program that takes an image as input and produces a relevant caption as output for the given image? With the advent of Deep Learning, this problem can be solved with the right dataset.
Andrej Karpathy, now the Director of AI at Tesla, researched this problem extensively in his PhD thesis at Stanford.
The aim is to explain, in simple terms, how Deep Learning can be used to solve the problem of generating a caption for a given image, hence the name Image Captioning.
A well-known example is Microsoft's CaptionBot, a state-of-the-art image captioning system.
Image captioning has several practical applications:
- Image tagging for e-commerce sites, photo-sharing services, and online catalogs.
- Automatic image descriptions to assist blind and visually impaired users.
- CCTV cameras: beyond monitoring, cameras could generate captions for what they see and raise real-time alarms for malicious activity in places such as malls or roads.
- Caption-based image search: imagine an image search as good as Google Search, where every image is first converted into a caption and the search is then performed over those captions.
Before diving into image captioning, it's essential to understand the following Deep Learning concepts:
- Multi-layer Perceptrons
- Convolutional Neural Networks (CNNs)
- Recurrent Neural Networks (RNN)
- Transfer Learning
- Gradient Descent
- Backpropagation
- Overfitting
- Probability
- Text Processing
- Python (syntax and data structures)
- Keras library
The image captioning process involves an Encoder, a Convolutional Neural Network (CNN) that extracts features from the input image. The last hidden state of the CNN is fed to the Decoder, a Recurrent Neural Network (RNN) that performs language modeling and generates the caption one word at a time.
Understanding Encoder and Decoder
There are numerous open-source datasets available for image captioning, such as:
- Flickr_8k_dataset (containing 8k images)
- Flickr_30k_dataset (containing 30k images)
- MS_COCO_dataset (containing 180k images)
For this project, we will use the Flickr_8k_dataset.
The dataset consists of approximately 8,000 images, each paired with five captions that describe it. The dataset is divided into the following parts:
- Test Set: 1,000 images
- Training Set: 6,000 images
- Dev Set: 1,000 images
The dataset is in the form of [image → captions]: each image is paired with output captions that serve as the training targets for the captioning model. The captions.txt file lists each image name together with its five captions (indexed 0-4).
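As a rough illustration, this file can be parsed into a dictionary mapping each image name to its list of captions. The sketch below assumes each line holds the image name (with a `#0`-`#4` caption index) and the caption, separated by a tab; the exact separator may differ in your copy of the dataset.

```python
# Sketch: parse the captions file into {image_name: [caption, ...]}.
# Assumes lines like "1000268201_693b08cb0e.jpg#0<TAB>A child in a pink dress ...".

def load_captions(path="captions.txt"):
    captions = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_part, caption = line.split("\t", 1)
            image_name = image_part.split("#")[0]   # drop the "#0".."#4" index
            captions.setdefault(image_name, []).append(caption)
    return captions

captions = load_captions()
print(len(captions), "images,", sum(len(c) for c in captions.values()), "captions")
```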
Data cleaning involves tasks like removing duplicates, handling missing values, and ensuring consistency in the data. It is essential to prepare the data for further processing and training.
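A minimal cleaning pass for the captions might look like the sketch below: lowercasing, stripping punctuation, and dropping single-letter or numeric tokens. These particular rules are a common convention for this dataset, not something prescribed here; `captions` is the dictionary from the previous sketch.

```python
import string

def clean_caption(caption):
    """Lowercase, remove punctuation, and drop numeric or single-letter tokens."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    words = [w for w in words if w.isalpha() and len(w) > 1]
    return " ".join(words)

# Apply to every caption of every image.
cleaned = {img: [clean_caption(c) for c in caps] for img, caps in captions.items()}
```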
To train an image captioning model, we need to load the training set, which consists of 6,000 images along with their relevant captions.
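One way to build the training dictionary is sketched below: read the list of training image names, keep only their cleaned captions, and wrap each caption in `startseq`/`endseq` marker tokens so the decoder knows where a sentence begins and ends. The file name `Flickr_8k.trainImages.txt` and the marker tokens are assumptions based on common practice with this dataset.

```python
def load_image_names(path):
    """Read one image file name per line."""
    with open(path, encoding="utf-8") as f:
        return set(line.strip() for line in f if line.strip())

train_images = load_image_names("Flickr_8k.trainImages.txt")  # ~6,000 names

# Keep only the captions of training images and add start/end markers.
train_captions = {
    img: ["startseq " + c + " endseq" for c in caps]
    for img, caps in cleaned.items()
    if img in train_images
}
```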
Preprocessing images includes tasks like resizing, normalizing, and extracting features using a pre-trained CNN. These features are crucial for the image captioning model.
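The text does not name a specific pre-trained CNN, so the sketch below assumes InceptionV3, a common choice for Flickr8k: each image is resized to 299x299, normalized, and passed through the network with its classification head removed, giving a 2,048-dimensional feature vector per image.

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image as keras_image
from tensorflow.keras.models import Model

# InceptionV3 without its final softmax layer acts as the encoder.
base = InceptionV3(weights="imagenet")
encoder = Model(base.input, base.layers[-2].output)   # 2048-d bottleneck features

def extract_features(img_path):
    """Resize, normalize, and encode a single image into a 2048-d vector."""
    img = keras_image.load_img(img_path, target_size=(299, 299))
    x = keras_image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return encoder.predict(x, verbose=0)[0]            # shape: (2048,)
```

In practice these vectors are computed once for every image and stored in a dictionary (here called `features`, keyed by image name), so the CNN does not have to run during caption training.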
Caption preprocessing involves tasks like tokenization, padding, and preparing captions for training.
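Tokenization can be done with the Keras `Tokenizer`, as in the sketch below: build a vocabulary from the training captions, map words to integer indices, and record the longest caption length for later padding. The variable names carry over from the earlier sketches and are illustrative.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

all_train_captions = [c for caps in train_captions.values() for c in caps]

tokenizer = Tokenizer()                      # word -> integer index
tokenizer.fit_on_texts(all_train_captions)
vocab_size = len(tokenizer.word_index) + 1   # +1 because index 0 is reserved for padding
max_length = max(len(c.split()) for c in all_train_captions)

print("vocabulary size:", vocab_size, "| longest caption:", max_length)
```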
Data preparation is a crucial step in training an image captioning model. A generator function is used to create batches of data for training and validation.
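A generator for this problem typically expands each caption into several training samples: given the image features and the first k words, predict word k+1. The sketch below assumes the `features`, `train_captions`, `tokenizer`, `max_length`, and `vocab_size` objects from the earlier sketches.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(captions_dict, features, tokenizer, max_length, vocab_size, batch_images=32):
    """Yield ([image_features, partial_caption], next_word) batches indefinitely."""
    while True:
        X1, X2, y = [], [], []
        for n, (img, caps) in enumerate(captions_dict.items(), start=1):
            for cap in caps:
                seq = tokenizer.texts_to_sequences([cap])[0]
                for i in range(1, len(seq)):
                    in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
                    out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]
                    X1.append(features[img])
                    X2.append(in_seq)
                    y.append(out_word)
            if n % batch_images == 0:
                yield (np.array(X1), np.array(X2)), np.array(y)
                X1, X2, y = [], [], []
```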
Word embeddings are used to represent words in a format that can be understood by the model. These embeddings are essential for the language model in the image captioning process.
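Word embeddings can either be learned from scratch with a trainable `Embedding` layer or initialized from pre-trained vectors such as GloVe. The file name `glove.6B.200d.txt` below is an assumption; the text does not say which embeddings, if any, were used.

```python
import numpy as np

embedding_dim = 200
embedding_matrix = np.zeros((vocab_size, embedding_dim))

# Load pre-trained GloVe vectors and copy them into the rows of the embedding matrix.
with open("glove.6B.200d.txt", encoding="utf-8") as f:
    glove = {line.split()[0]: np.asarray(line.split()[1:], dtype="float32") for line in f}

for word, idx in tokenizer.word_index.items():
    vector = glove.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector   # words missing from GloVe keep all-zero rows
```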
The model architecture for image captioning includes both the encoder and decoder. It combines Computer Vision and Natural Language Processing to generate captions for images.
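A common way to realize this encoder-decoder design in Keras is the "merge" architecture sketched below: the image features pass through a dense layer, the partial caption passes through an embedding plus LSTM, and the two branches are added before a softmax over the vocabulary. The layer sizes (256 units, 200-d embeddings) are typical choices, not values taken from the original project, and `max_length`, `vocab_size`, `embedding_dim`, and `embedding_matrix` come from the earlier sketches.

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

# Image-feature branch: 2048-d encoder output -> 256-d representation.
image_input = Input(shape=(2048,))
fe = Dense(256, activation="relu")(Dropout(0.5)(image_input))

# Partial-caption branch: embedding + LSTM language model.
caption_input = Input(shape=(max_length,))
embedding_layer = Embedding(vocab_size, embedding_dim, mask_zero=True)
se = LSTM(256)(Dropout(0.5)(embedding_layer(caption_input)))

# Merge both branches and predict a distribution over the next word.
merged = add([fe, se])
hidden = Dense(256, activation="relu")(merged)
output = Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[image_input, caption_input], outputs=output)

# Initialize the embedding layer with the pre-trained GloVe matrix and freeze it.
embedding_layer.set_weights([embedding_matrix])
embedding_layer.trainable = False

model.compile(loss="categorical_crossentropy", optimizer="adam")
```

Training can then use the generator from the previous section, for example `model.fit(data_generator(train_captions, features, tokenizer, max_length, vocab_size), steps_per_epoch=len(train_captions) // 32, epochs=20)`; the epoch count here is arbitrary.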
Inference involves using the trained model to generate captions for new images. It showcases the practical application of image captioning.
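Greedy decoding is the simplest inference strategy: start from `startseq`, repeatedly predict the most probable next word, and stop at `endseq` or when the maximum caption length is reached. The sketch below assumes the model, tokenizer, and image features from the earlier sketches.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_length):
    """Greedy search: append the most likely next word until 'endseq' or max_length."""
    index_word = {i: w for w, i in tokenizer.word_index.items()}
    text = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([photo_features.reshape(1, -1), seq], verbose=0)
        word = index_word.get(int(np.argmax(yhat)))
        if word is None or word == "endseq":
            break
        text += " " + word
    return text.replace("startseq", "").strip()
```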
Evaluating the performance of the image captioning model is essential to ensure that the generated captions are relevant and accurate.
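A common automatic metric for caption quality is the BLEU score, which compares a generated caption against the five reference captions. The sketch below uses NLTK's `corpus_bleu`; whether the original project relied on BLEU or on manual inspection is not stated here, and `test_captions` is assumed to be built from the test split in the same way as `train_captions`, with test image features available in `features`.

```python
from nltk.translate.bleu_score import corpus_bleu

references, hypotheses = [], []
for img, caps in test_captions.items():
    generated = generate_caption(model, tokenizer, features[img], max_length)
    references.append([c.split() for c in caps])   # five reference captions per image
    hypotheses.append(generated.split())

print("BLEU-1: %.3f" % corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))
print("BLEU-2: %.3f" % corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0)))
```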
The conclusion summarizes the key findings of the project, and future work discusses potential improvements and extensions of the image captioning model.
References provide citations and sources for the research and resources used in the project.