chore: polish benchmark doc #839
Merged
Commits
749f5f9  chore: update benchmark intro (numb3r3)
8590c86  chore: minor revision (numb3r3)
f24c996  chore: minor revision (numb3r3)
eff9c92  chore: minor revision (numb3r3)
01a1468  chore: minor revision (numb3r3)
78036ae  chore: minor revision (numb3r3)
4bc1f4d  chore: minor revision (numb3r3)
@@ -1,27 +1,27 @@
-CLIP Benchmark
-==============
+Benchmark
+=========
 
-In order to evaluate the performance of different CLIP models, we conducted a benchmark on a series of tasks using different datasets.
-We fork the `CLIP benchmark repo <https://github.com/LAION-AI/CLIP_benchmark>`_, slightly modify the codebase and apply it to all Vision Transformers (ViT) and ResNet (RN) CLIP models.
-You can find the benchmark results in the following tables.
-The best results are highlighted in bold.
-They can be used as a guide to choose the best model for your application.
+In order to understand the zero-shot performance of CLIP and its limitations, we conducted a benchmark
+across a variety of computer vision datasets (the dataset details are in the appendix). Here, thanks to the
+open-source `CLIP Benchmark toolkit <https://github.com/LAION-AI/CLIP_benchmark>`_, we can easily reproduce the results.
 
-Basic statistics
-----------------
+We hope that this benchmark can help you to better understand the performance of CLIP models and choose the best model for your application.
 
-In the table below, we include the disk usage (in delta) and the peak RAM and VRAM usage (in delta) when running on a single Nvidia TITAN RTX GPU (24GB VRAM) with a Intel® Core™ i7-10700K Processor (128GB RAM) for a series of text and image encoding tasks with ``batch_size=8`` using PyTorch runtime.
-We use ``clip_client==3.7.0``, ``clip_server==3.7.0``, ``jina==3.10.1`` and ``docarry==0.17.0``.
-We also include the QPS (Queries Per Second) for the text and image encoding tasks using ``clip_client`` with PyTorch runtime.
+Size and Efficiency
+-------------------------
 
+We first present the model's size and efficiency in terms of query time and memory usage (including the peak RAM and VRAM usage).
+All of the results are obtained on a single Nvidia TITAN RTX GPU (24GB VRAM) with default server settings.
 
 +----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
 | Model | Disk Usage (MB) | Peak RAM Usage (GB) | Peak VRAM Usage (GB) | Text QPS | Image QPS |
 +========================================+==================+======================+=======================+===========+============+
-| RN50::openai | **244** | 2.99 | **1.36** | 1019 | 269 |
+| RN50::openai | 244 | 2.99 | 1.36 | 1019 | 269 |
 +----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
-| RN50::yfcc15m | 389 | 2.86 | **1.36** | 1083 | 262 |
+| RN50::yfcc15m | 389 | 2.86 | 1.36 | 1083 | 262 |
 +----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
-| RN50::cc12m | 389 | **2.84** | **1.36** | 1064 | 264 |
+| RN50::cc12m | 389 | 2.84 | 1.36 | 1064 | 264 |
 +----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
 | RN101::openai | 278 | 3.05 | 1.40 | 1047 | 222 |
 +----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
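As a side note, the Text QPS and Image QPS columns above could be sanity-checked with a short timing script against a running ``clip_server`` instance. This is only a sketch: the gRPC address, the payloads, and the request counts are illustrative assumptions, not the settings used for the published numbers.

.. code-block:: python

    # Rough QPS estimate against a locally running clip_server instance.
    # The address, the repeated prompt, and the image URL are placeholders.
    import time

    from clip_client import Client

    c = Client('grpc://0.0.0.0:51000')  # assumes a local server on the default port

    texts = ['a photo of a dog'] * 800
    start = time.perf_counter()
    c.encode(texts)
    print(f'text QPS ~ {len(texts) / (time.perf_counter() - start):.0f}')

    images = ['https://example.org/photo.jpg'] * 80  # replace with reachable image URLs
    start = time.perf_counter()
    c.encode(images)
    print(f'image QPS ~ {len(images) / (time.perf_counter() - start):.0f}')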
@@ -35,7 +35,7 @@ We also include the QPS (Queries Per Second) for the text and image encoding tas
 +----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
 | ViT-B-32::openai | 338 | 3.20 | 1.40 | 1064 | 286 |
 +----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
-| ViT-B-32::laion2b_e16 | 577 | 2.93 | 1.40 | **1120** | **292** |
+| ViT-B-32::laion2b_e16 | 577 | 2.93 | 1.40 | 1120 | 292 |
 +----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
 | ViT-B-32::laion400m_e31 | 577 | 2.93 | 1.40 | 1080 | 287 |
 +----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
@@ -78,10 +78,13 @@ We also include the QPS (Queries Per Second) for the text and image encoding tas
 
 
 
-CLIP benchmark
---------------
+Zero-shot Performance
+----------------------------
 
-Zero-shot retrieval
+In this section, we will report the zero-shot performance of the models on classification and retrieval tasks across different datasets.
+
+
+Zero-shot Retrieval
 +++++++++++++++++++
 
 In zero-shot retrieval benchmark, each model is evaluated on the following datasets: `COCO Caption <https://github.com/tylin/coco-caption>`_, `Flickr8k <http://hockenmaier.cs.illinois.edu/8k-pictures.html>`_ and `Flickr30k <https://shannon.cs.illinois.edu/DenotationGraph/>`_.
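For readers skimming the PR, the retrieval metric behind these tables reduces to ranking image embeddings by cosine similarity for each caption and checking whether the paired image lands in the top-k. A minimal NumPy sketch of recall@k, with random stand-in embeddings where the real benchmark would use the encoder outputs:

.. code-block:: python

    # Recall@k for text-to-image retrieval from pre-computed embeddings.
    # text_emb[i] is assumed to describe image_emb[i] (one ground-truth image per caption).
    import numpy as np

    def recall_at_k(text_emb: np.ndarray, image_emb: np.ndarray, k: int = 5) -> float:
        # L2-normalise so that the dot product equals cosine similarity
        t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
        v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
        sims = t @ v.T                           # (num_texts, num_images)
        topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar images
        hits = (topk == np.arange(len(t))[:, None]).any(axis=1)
        return float(hits.mean())

    # Stand-in embeddings; replace with the outputs of clip_client's encode()
    rng = np.random.default_rng(0)
    texts, images = rng.normal(size=(100, 512)), rng.normal(size=(100, 512))
    print(recall_at_k(texts, images, k=5))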
@@ -151,7 +154,7 @@ From the table, we observe that the ViT models outperform the RN models in gener
 More specifically, the ``ViT-H-14::laion2b_s32b_b79k`` model and ``ViT-g-14::laion2b_s12b_b42k`` model achieve the best and second-best results on all zero-shot retrieval tasks.
 For ViT models, the results of the same base model are better on those pre-trained with larger datasets (e.g., ``ViT-B-32::openai`` vs ``ViT-B-32::laion400m_e31`` vs ``ViT-B-32::laion2b-s34b-b79k``).
 
-Zero-shot classification
+Zero-shot Classification
 ++++++++++++++++++++++++
 
 In zero-shot classification benchmark, each model is evaluated on the following datasets: `ImageNetV2 <https://github.com/modestyachts/ImageNetV2>`_, `VOC2007 <http://host.robots.ox.ac.uk/pascal/VOC/voc2007/>`_ and 19 `VTAB datasets <https://github.com/google-research/task_adaptation>`_.
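Zero-shot classification follows the same embedding-comparison idea: encode one text prompt per class and assign each image to the class with the most similar prompt. The sketch below is illustrative only; the class names, prompt template, image URL, and server address are assumptions, and the actual benchmark typically uses the prompt templates defined in the CLIP Benchmark toolkit.

.. code-block:: python

    # Zero-shot classification sketch: pick the class whose prompt embedding is
    # closest to the image embedding. All concrete values here are placeholders.
    import numpy as np
    from clip_client import Client

    c = Client('grpc://0.0.0.0:51000')  # assumes a local clip_server

    classes = ['dog', 'cat', 'bird']
    prompt_emb = c.encode([f'a photo of a {name}' for name in classes])
    # URL-like strings are sent as images (assumption; adjust to a reachable image)
    image_emb = c.encode(['https://example.org/photo.jpg'])

    # cosine similarity between the image and every class prompt
    p = prompt_emb / np.linalg.norm(prompt_emb, axis=1, keepdims=True)
    i = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    print('predicted class:', classes[int(np.argmax(i @ p.T))])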
@@ -219,8 +222,8 @@ Similar to retrieval results, the ``ViT-H-14::laion2b_s32b_b79k`` model and ``Vi
 All models tend to perform well on ImageNetV2, VOC2007, VTAB natural and VTAB specialized (except for Retinopathy) datasets, whereas they perform poorly on VTAB structured datasets.
 We do not observe any significant difference between the ViT models of the same base model.
 
-Select the best model
-+++++++++++++++++++++
+Select the right model
+-----------------------
 
 In general, you can select the best model for your application from different perspectives: disk usage, peak RAM and VRAM usages, QPS, and most importantly, the performance.
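One way to make these "different perspectives" concrete is a weighted score over the columns of the Size and Efficiency table. The snippet below is a toy illustration: the three models and their disk/VRAM/QPS figures are taken from the table earlier in this diff, while the weights are arbitrary and the zero-shot accuracy still needs to be folded in from the tables that follow.

.. code-block:: python

    # Toy ranking of candidate models by a weighted, normalised score across
    # benchmark columns (smaller disk/VRAM is better, higher QPS is better).
    # The weights are placeholders; extend the tuples with zero-shot accuracy
    # from the classification/retrieval tables for a complete picture.
    candidates = {
        # name: (disk MB, peak VRAM GB, image QPS) -- values from the table above
        'RN50::openai':     (244, 1.36, 269),
        'RN101::openai':    (278, 1.40, 222),
        'ViT-B-32::openai': (338, 1.40, 286),
    }

    def score(stats, w_disk=0.2, w_vram=0.3, w_qps=0.5):
        disk, vram, qps = stats
        # normalise each metric to [0, 1] across the candidate set
        disks, vrams, qpss = zip(*candidates.values())
        s_disk = 1 - (disk - min(disks)) / (max(disks) - min(disks))
        s_vram = 1 - (vram - min(vrams)) / (max(vrams) - min(vrams))
        s_qps = (qps - min(qpss)) / (max(qpss) - min(qpss))
        return w_disk * s_disk + w_vram * s_vram + w_qps * s_qps

    best = max(candidates, key=lambda name: score(candidates[name]))
    print('best trade-off under these weights:', best)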