Image Captionbot for Assistive Technology
Abstract
Generating short descriptions of images is a difficult task because of the complexity of image features and the vastness of language contexts. An image may contain a wide variety of information, so extracting the context of the information contained in the image and generating a sentence from that context is a complex task. However, such a system can help blind people understand their surroundings without assistance from others. Deep learning techniques have emerged as an effective approach for building this kind of system. In this project, we will use VGG16, one of the best-performing CNN architectures for image classification, to extract features from images. An embedding layer and an LSTM will be used for generating the text description, and these two networks will be combined to form an image caption generation network. We will then train the model on data prepared from the Flickr8k dataset. The trained model will be used to generate captions for new images, and each generated caption will be converted to audio to assist the blind.
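The encoder-decoder design described above can be sketched at a small scale. The following NumPy toy, with made-up dimensions and random untrained weights (all names and sizes here are illustrative assumptions, not the project's actual implementation), shows the data flow: a VGG16-style feature vector initializes the decoder state, and an embedding layer plus an LSTM cell greedily emits token ids through a vocabulary projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (assumptions for illustration): VGG16's fc2 layer
# yields a 4096-d feature; vocabulary and hidden size are kept tiny.
FEAT, EMBED, HIDDEN, VOCAB = 4096, 32, 64, 10

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal LSTM cell with random (untrained) weights."""
    def __init__(self, in_dim, hid):
        self.W = rng.normal(0, 0.1, (4 * hid, in_dim + hid))
        self.b = np.zeros(4 * hid)

    def step(self, x, h, c):
        # Gates computed from the concatenated input and previous state.
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g
        h = o * np.tanh(c)
        return h, c

# Image side: a placeholder for a VGG16 feature vector, projected into
# the decoder's hidden space (this stands in for "combining" the networks).
img_feat = rng.normal(size=FEAT)
W_img = rng.normal(0, 0.01, (HIDDEN, FEAT))

# Text side: embedding table, LSTM cell, and vocabulary projection.
embed = rng.normal(0, 0.1, (VOCAB, EMBED))
W_out = rng.normal(0, 0.1, (VOCAB, HIDDEN))
cell = LSTMCell(EMBED, HIDDEN)

def greedy_caption(feature, max_len=5, start_token=0):
    """Greedily decode token ids; with random weights the output is noise."""
    h = np.tanh(W_img @ feature)   # initial hidden state from the image
    c = np.zeros(HIDDEN)
    token, caption = start_token, []
    for _ in range(max_len):
        h, c = cell.step(embed[token], h, c)
        token = int(np.argmax(W_out @ h))  # pick the most likely next word
        caption.append(token)
    return caption

caption = greedy_caption(img_feat)
print(caption)  # five token ids in [0, VOCAB)
```

In a trained system, the random matrices would be learned from the Flickr8k image-caption pairs, and the token ids would map back to vocabulary words before being passed to a text-to-speech engine.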