chore: polish benchmark doc #839

Merged
merged 7 commits into from Oct 10, 2022
Changes from 5 commits
2 changes: 1 addition & 1 deletion docs/index.md
@@ -177,8 +177,8 @@ It means the client and the server are now connected. Well done!

user-guides/client
user-guides/server
user-guides/retriever
user-guides/benchmark
user-guides/retriever
user-guides/faq
```

47 changes: 25 additions & 22 deletions docs/user-guides/benchmark.rst
@@ -1,27 +1,27 @@
CLIP Benchmark
==============
Benchmark
=========

In order to evaluate the performance of different CLIP models, we conducted a benchmark on a series of tasks using different datasets.
We fork the `CLIP benchmark repo <https://github.com/LAION-AI/CLIP_benchmark>`_, slightly modify the codebase and apply it to all Vision Transformers (ViT) and ResNet (RN) CLIP models.
You can find the benchmark results in the following tables.
The best results are highlighted in bold.
They can be used as a guide to choose the best model for your application.
In order to understand the zero-shot performance of CLIP and its limitations, we conducted a benchmark
across a variety of computer vision datasets (the dataset details are in the appendix). Here, thanks to the
open-source `CLIP Benchmark toolkit <https://github.com/LAION-AI/CLIP_benchmark>`_, we can easily reproduce the results.

Basic statistics
----------------
We hope that this benchmark can help you to better understand the performance of CLIP models and choose the best model for your application.

In the table below, we include the disk usage (in delta) and the peak RAM and VRAM usage (in delta) when running on a single Nvidia TITAN RTX GPU (24GB VRAM) with a Intel® Core™ i7-10700K Processor (128GB RAM) for a series of text and image encoding tasks with ``batch_size=8`` using PyTorch runtime.
We use ``clip_client==3.7.0``, ``clip_server==3.7.0``, ``jina==3.10.1`` and ``docarray==0.17.0``.
We also include the QPS (Queries Per Second) for the text and image encoding tasks using ``clip_client`` with PyTorch runtime.

Size and Efficiency
-------------------------

We first present the model's size and efficiency in terms of query time and memory usage (including the peak RAM and VRAM usage).
All of the results are obtained on a single Nvidia TITAN RTX GPU (24GB VRAM) with default server settings.
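
For intuition, a text QPS figure of this kind could be approximated with a short ``clip_client`` script along the following lines. This is a minimal sketch, not the exact benchmark harness: the server address, sample sentences and request count are placeholders, and it assumes a ``clip_server`` instance running the model under test.

```python
import time

from clip_client import Client

# Placeholder address; assumes a clip_server with the model under test is listening here.
client = Client('grpc://0.0.0.0:51000')

# Eight short sentences per request, mirroring the batch_size=8 setting above.
texts = ['a photo of a cat'] * 8

client.encode(texts)  # warm-up call, excluded from the timing

num_requests = 100
start = time.perf_counter()
for _ in range(num_requests):
    client.encode(texts)
elapsed = time.perf_counter() - start

# Count each encoded sentence as one query.
print(f'approximate text QPS: {num_requests * len(texts) / elapsed:.0f}')
```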

+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
| Model | Disk Usage (MB) | Peak RAM Usage (GB) | Peak VRAM Usage (GB) | Text QPS | Image QPS |
+========================================+==================+======================+=======================+===========+============+
| RN50::openai | **244** | 2.99 | **1.36** | 1019 | 269 |
| RN50::openai | 244 | 2.99 | 1.36 | 1019 | 269 |
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
| RN50::yfcc15m | 389 | 2.86 | **1.36** | 1083 | 262 |
| RN50::yfcc15m | 389 | 2.86 | 1.36 | 1083 | 262 |
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
| RN50::cc12m | 389 | **2.84** | **1.36** | 1064 | 264 |
| RN50::cc12m | 389 | 2.84 | 1.36 | 1064 | 264 |
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
| RN101::openai | 278 | 3.05 | 1.40 | 1047 | 222 |
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
@@ -35,7 +35,7 @@ We also include the QPS (Queries Per Second) for the text and image encoding tas
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
| ViT-B-32::openai | 338 | 3.20 | 1.40 | 1064 | 286 |
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
| ViT-B-32::laion2b_e16 | 577 | 2.93 | 1.40 | **1120** | **292** |
| ViT-B-32::laion2b_e16 | 577 | 2.93 | 1.40 | 1120 | 292 |
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
| ViT-B-32::laion400m_e31 | 577 | 2.93 | 1.40 | 1080 | 287 |
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
@@ -78,10 +78,13 @@ We also include the QPS (Queries Per Second) for the text and image encoding tas



CLIP benchmark
--------------
Zero-shot Performance
----------------------------

In this section, we will report the zero-shot performance of the models on classification and retrieval tasks across different datasets.


Zero-shot retrieval
Zero-shot Retrieval
+++++++++++++++++++

In zero-shot retrieval benchmark, each model is evaluated on the following datasets: `COCO Caption <https://github.com/tylin/coco-caption>`_, `Flickr8k <http://hockenmaier.cs.illinois.edu/8k-pictures.html>`_ and `Flickr30k <https://shannon.cs.illinois.edu/DenotationGraph/>`_.

Suggested change
In zero-shot retrieval benchmark, each model is evaluated on the following datasets: `COCO Caption <https://github.com/tylin/coco-caption>`_, `Flickr8k <http://hockenmaier.cs.illinois.edu/8k-pictures.html>`_ and `Flickr30k <https://shannon.cs.illinois.edu/DenotationGraph/>`_.
In the zero-shot retrieval benchmark, each model is evaluated on the following datasets: `COCO Caption <https://github.com/tylin/coco-caption>`_, `Flickr8k <http://hockenmaier.cs.illinois.edu/8k-pictures.html>`_ and `Flickr30k <https://shannon.cs.illinois.edu/DenotationGraph/>`_.
The best results are highlighted in bold (higher is better).
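
For intuition, the retrieval metrics in the tables that follow boil down to Recall@K computed over an image-text similarity matrix. Below is a small, self-contained sketch of that computation, with random vectors standing in for real CLIP embeddings (the actual toolkit also handles dataset loading, multiple captions per image and prompt handling):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings: 100 images and their matching captions (caption i belongs to image i).
image_emb = rng.normal(size=(100, 512))
text_emb = rng.normal(size=(100, 512))

# L2-normalize so the dot product is cosine similarity.
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

def recall_at_k(queries, gallery, k=5):
    # For query i, the correct gallery item is assumed to be item i.
    sims = queries @ gallery.T                   # (num_queries, num_gallery)
    topk = np.argsort(-sims, axis=1)[:, :k]      # indices of the k most similar gallery items
    hits = (topk == np.arange(len(queries))[:, None]).any(axis=1)
    return hits.mean()

# Random embeddings give roughly chance-level recall; real CLIP features score far higher.
print('text-to-image Recall@5:', recall_at_k(text_emb, image_emb, k=5))
print('image-to-text Recall@5:', recall_at_k(image_emb, text_emb, k=5))
```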

@@ -151,7 +154,7 @@ From the table, we observe that the ViT models outperform the RN models in gener
More specifically, the ``ViT-H-14::laion2b_s32b_b79k`` model and ``ViT-g-14::laion2b_s12b_b42k`` model achieve the best and second-best results on all zero-shot retrieval tasks.
For ViT models, the results of the same base model are better on those pre-trained with larger datasets (e.g., ``ViT-B-32::openai`` vs ``ViT-B-32::laion400m_e31`` vs ``ViT-B-32::laion2b-s34b-b79k``).

Zero-shot classification
Zero-shot Classification
++++++++++++++++++++++++

In zero-shot classification benchmark, each model is evaluated on the following datasets: `ImageNetV2 <https://github.com/modestyachts/ImageNetV2>`_, `VOC2007 <http://host.robots.ox.ac.uk/pascal/VOC/voc2007/>`_ and 19 `VTAB datasets <https://github.com/google-research/task_adaptation>`_.

Suggested change
In zero-shot classification benchmark, each model is evaluated on the following datasets: `ImageNetV2 <https://github.com/modestyachts/ImageNetV2>`_, `VOC2007 <http://host.robots.ox.ac.uk/pascal/VOC/voc2007/>`_ and 19 `VTAB datasets <https://github.com/google-research/task_adaptation>`_.
In the zero-shot classification benchmark, each model is evaluated on the following datasets: `ImageNetV2 <https://github.com/modestyachts/ImageNetV2>`_, `VOC2007 <http://host.robots.ox.ac.uk/pascal/VOC/voc2007/>`_ and 19 `VTAB datasets <https://github.com/google-research/task_adaptation>`_.
The best results are highlighted in bold (higher is better).
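
Conceptually, zero-shot classification embeds each class name as a text prompt and assigns an image to the class whose prompt is most similar in the shared embedding space. The sketch below illustrates this idea with ``clip_client``; the server address, class names and image URL are placeholders, and the benchmark toolkit additionally averages over multiple prompt templates per class.

```python
import numpy as np

from clip_client import Client

client = Client('grpc://0.0.0.0:51000')  # placeholder address of a running clip_server

classes = ['cat', 'dog', 'bird']
prompts = [f'a photo of a {c}' for c in classes]

# Encode the class prompts and the query image with the same CLIP model.
text_emb = client.encode(prompts)                                # shape: (3, dim)
image_emb = client.encode(['https://example.com/query.jpg'])[0]  # placeholder image URL

# Cosine similarity between the image and every class prompt.
text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
image_emb = image_emb / np.linalg.norm(image_emb)
scores = text_emb @ image_emb

print('predicted class:', classes[int(np.argmax(scores))])
```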

@@ -219,8 +222,8 @@ Similar to retrieval results, the ``ViT-H-14::laion2b_s32b_b79k`` model and ``Vi
All models tend to perform well on ImageNetV2, VOC2007, VTAB natural and VTAB specialized (except for Retinopathy) datasets, whereas they perform poorly on VTAB structured datasets.
We do not observe any significant difference between the ViT models of the same base model.

Select the best model
+++++++++++++++++++++
Select the right model
-----------------------

In general, you can select the best model for your application from several perspectives: disk usage, peak RAM and VRAM usage, QPS, and, most importantly, the zero-shot performance.
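
For example, the tables above can be treated as a small lookup: first filter out the models that do not fit your resource budget, then keep the fastest (or best-scoring) one among the rest. A toy sketch using a few rows copied from the efficiency table; the VRAM constraint here is only an example, and in practice you would also weigh the zero-shot results above.

```python
# A few rows copied from the efficiency table above.
models = [
    {'name': 'RN50::openai',          'disk_mb': 244, 'vram_gb': 1.36, 'text_qps': 1019, 'image_qps': 269},
    {'name': 'RN101::openai',         'disk_mb': 278, 'vram_gb': 1.40, 'text_qps': 1047, 'image_qps': 222},
    {'name': 'ViT-B-32::openai',      'disk_mb': 338, 'vram_gb': 1.40, 'text_qps': 1064, 'image_qps': 286},
    {'name': 'ViT-B-32::laion2b_e16', 'disk_mb': 577, 'vram_gb': 1.40, 'text_qps': 1120, 'image_qps': 292},
]

# Example constraint: peak VRAM must stay under 1.38 GB; among the rest, maximize image QPS.
candidates = [m for m in models if m['vram_gb'] <= 1.38]
best = max(candidates, key=lambda m: m['image_qps'])
print(best['name'])  # -> RN50::openai under this particular constraint
```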
