This project was implemented as part of an internship at Odnoklassniki.
Try it out on Colab:
The system generates captions for images, intended for blind and visually impaired people. The architecture consists of two models:
- YOLOv3 - state-of-the-art, real-time object detection system.
- Sber ru-GPT models - autoregressive transformer language models.
For the first model, pretrained YOLOv3 weights were used; the network was trained on the 80 classes of the MS COCO dataset.
For the second model, the weights of two pretrained models, ruGPT3Small and ruGPT3Medium, were used. Both were then fine-tuned on a Russian-language dataset containing object labels and captions built from them.
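
How the detector output is handed to the language model is not spelled out above. The following is a minimal sketch, assuming the detector yields `(label, confidence)` pairs and the language model is prompted with a comma-separated list of labels. The function name `labels_to_prompt`, the `min_confidence` threshold and the prompt format are all hypothetical and may differ from the project's actual code.

```python
from typing import Iterable, List, Tuple


def labels_to_prompt(
    detections: Iterable[Tuple[str, float]],
    min_confidence: float = 0.5,
) -> str:
    """Turn detector output into a text prompt for the caption model.

    Hypothetical format: unique labels, in detection order, joined by commas.
    """
    labels: List[str] = []
    for label, confidence in detections:
        # Drop weak detections and avoid repeating a label.
        if confidence >= min_confidence and label not in labels:
            labels.append(label)
    return ", ".join(labels)


# Made-up detections for illustration only.
detections = [("vase", 0.91), ("cup", 0.77), ("cup", 0.64), ("chair", 0.32)]
prompt = labels_to_prompt(detections)
print(prompt)
```

In a real pipeline the resulting string would be passed to the fine-tuned language model, which continues it with a caption.
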
This produced three models: ruGPT3Small fine-tuned for 2 epochs, ruGPT3Small fine-tuned for 10 epochs, and ruGPT3Medium fine-tuned for 5 epochs. Overall, ruGPT3Small (2 epochs) performed best in tests, while ruGPT3Medium produces more eloquent captions.
To install the dependencies, run:
pip install -r requirements.txt
You also need to download the weights for YOLOv3 and the GPT model:
- Download the pretrained YOLOv3 weights from https://pjreddie.com/media/files/yolov3.weights
- Download the GPT model weights from https://drive.google.com/drive/folders/1WFpM3jFpGHSq3GESIKMnTzyRZvdRf9mN?usp=sharing
To run on a GPU, make sure the GPU drivers and CUDA are installed beforehand.
The project has been tested with Python 3.7.11.
Select a model and an image, then run:
python caption.py -m chosen_model -i your_image.jpg
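
For readers who want to see how such a command line can be parsed, here is a minimal sketch using `argparse`. Only the `-m` and `-i` flags come from the command above. The long option names, the help texts and the example model name `small_2_epochs` are assumptions, not taken from `caption.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two flags used in the command above."""
    parser = argparse.ArgumentParser(description="Caption an image.")
    # -m and -i appear in the README; the long names are assumptions.
    parser.add_argument("-m", "--model", required=True,
                        help="which fine-tuned model to use")
    parser.add_argument("-i", "--image", required=True,
                        help="path to the input image")
    return parser


# Parse an explicit argument list so the sketch runs without a shell.
args = build_parser().parse_args(["-m", "small_2_epochs", "-i", "your_image.jpg"])
print(args.model, args.image)
```
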
Inference time was measured on an Intel Core i5 CPU.
Name | Download | Time per image |
---|---|---|
Small (2 epochs) | model | 7.2 s |
Small (10 epochs) | model | 6.6 s |
Medium (5 epochs) | model | 10.9 s |
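
Per-image timings like these can be collected with a small wall-clock helper. This is a sketch under stated assumptions: `seconds_per_image` and `fake_caption` are hypothetical names, and `fake_caption` only stands in for the real detector and language model call. It does not reproduce how the figures in the table were obtained.

```python
import time
from typing import Callable


def seconds_per_image(
    caption_fn: Callable[[str], str],
    image_path: str,
    repeats: int = 3,
) -> float:
    """Return the average wall-clock seconds of one caption_fn call."""
    start = time.perf_counter()
    for _ in range(repeats):
        caption_fn(image_path)
    return (time.perf_counter() - start) / repeats


def fake_caption(image_path: str) -> str:
    # Placeholder for the real detector + language model call.
    time.sleep(0.01)
    return "caption for " + image_path


elapsed = seconds_per_image(fake_caption, "your_image.jpg")
print(f"{elapsed:.3f} s per image")
```

Averaging over several runs smooths out one-off costs such as loading weights on the first call.
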
Name | vase.jpg | man.jpg | sofa.jpg | cats.jpg |
---|---|---|---|---|
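Small (2 epochs) | Ваза и чашка на столе (A vase and a cup on a table) | Человек с мобильным телефоном и галстуком (A person with a mobile phone and a tie) | Человек сидит на диване с мобильным телефоном (A person sits on a sofa with a mobile phone) | Два кота смотрят на человека на завтраке за столом со стулом и чашей (Two cats look at a person at breakfast at a table with a chair and a bowl) |
Small (10 epochs) | Маленькая вазочка со стеклянной кружкой на столе (A small vase with a glass mug on a table) | Человек, стоящий перед мобильным телефоном в галстуке (A person standing in front of a mobile phone in a tie) | Люди на диване с мобильными телефонами (People on a sofa with mobile phones) | Два больших коричневых кота смотрят на человека на столе со стулом в чаше (Two big brown cats look at a person on a table with a chair in a bowl) |
Medium (5 epochs) | Белая керамическая ваза с розами и белая чашка на деревянном столе (A white ceramic vase with roses and a white cup on a wooden table) | Мужчина в костюме с мобильным телефоном и галстуком на шее (A man in a suit with a mobile phone and a tie around his neck) | Мужчина сидит на диване и разговаривает по мобильному телефону (A man sits on a sofa and talks on a mobile phone) | Две серые кошки сидят напротив человека, обедающего на кухонном столе возле стульев и чаши (Two gray cats sit opposite a person having lunch at a kitchen table near chairs and a bowl) |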