chore: polish benchmark doc #839

Merged
merged 7 commits into from Oct 10, 2022
Changes from 5 commits
2 changes: 1 addition & 1 deletion docs/index.md
@@ -177,8 +177,8 @@ It means the client and the server are now connected. Well done!

user-guides/client
user-guides/server
user-guides/retriever
user-guides/benchmark
user-guides/retriever
user-guides/faq
```

47 changes: 25 additions & 22 deletions docs/user-guides/benchmark.rst
@@ -1,27 +1,27 @@
CLIP Benchmark
==============
Benchmark
=========

In order to evaluate the performance of different CLIP models, we conducted a benchmark on a series of tasks using different datasets.
We fork the `CLIP benchmark repo <https://github.com/LAION-AI/CLIP_benchmark>`_, slightly modify the codebase and apply it to all Vision Transformers (ViT) and ResNet (RN) CLIP models.
You can find the benchmark results in the following tables.
The best results are highlighted in bold.
They can be used as a guide to choose the best model for your application.
In order to understand the zero-shot performance of CLIP and its limitations, we conducted a benchmark
across a variety of computer vision datasets (the dataset details are in the appendix). Here, thanks to the
open-source `CLIP Benchmark toolkit <https://github.com/LAION-AI/CLIP_benchmark>`_, we can easily reproduce the results.

Basic statistics
----------------
We hope that this benchmark can help you to better understand the performance of CLIP models and choose the best model for your application.

In the table below, we include the disk usage (in delta) and the peak RAM and VRAM usage (in delta) when running on a single Nvidia TITAN RTX GPU (24GB VRAM) with a Intel® Core™ i7-10700K Processor (128GB RAM) for a series of text and image encoding tasks with ``batch_size=8`` using PyTorch runtime.
We use ``clip_client==3.7.0``, ``clip_server==3.7.0``, ``jina==3.10.1`` and ``docarray==0.17.0``.
We also include the QPS (Queries Per Second) for the text and image encoding tasks using ``clip_client`` with PyTorch runtime.

Size and Efficiency
-------------------------

We first present the model's size and efficiency in terms of query time and memory usage (including the peak RAM and VRAM usage).
All of the results are obtained on a single Nvidia TITAN RTX GPU (24GB VRAM) with default server settings.
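
For intuition, a text QPS figure of this kind could be approximated with a short ``clip_client`` script along the following lines. This is a minimal sketch, not the exact benchmark harness: the server address, sample sentences and request count are placeholders, and it assumes a ``clip_server`` instance running the model under test.

```python
import time

from clip_client import Client

# Placeholder address; assumes a clip_server with the model under test is listening here.
client = Client('grpc://0.0.0.0:51000')

# Eight short sentences per request, mirroring the batch_size=8 setting above.
texts = ['a photo of a cat'] * 8

client.encode(texts)  # warm-up call, excluded from the timing

num_requests = 100
start = time.perf_counter()
for _ in range(num_requests):
    client.encode(texts)
elapsed = time.perf_counter() - start

# Count each encoded sentence as one query.
print(f'approximate text QPS: {num_requests * len(texts) / elapsed:.0f}')
```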

+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
| Model | Disk Usage (MB) | Peak RAM Usage (GB) | Peak VRAM Usage (GB) | Text QPS | Image QPS |
+========================================+==================+======================+=======================+===========+============+
| RN50::openai | **244** | 2.99 | **1.36** | 1019 | 269 |
| RN50::openai | 244 | 2.99 | 1.36 | 1019 | 269 |
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
| RN50::yfcc15m | 389 | 2.86 | **1.36** | 1083 | 262 |
| RN50::yfcc15m | 389 | 2.86 | 1.36 | 1083 | 262 |
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
| RN50::cc12m | 389 | **2.84** | **1.36** | 1064 | 264 |
| RN50::cc12m | 389 | 2.84 | 1.36 | 1064 | 264 |
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
| RN101::openai | 278 | 3.05 | 1.40 | 1047 | 222 |
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
@@ -35,7 +35,7 @@ We also include the QPS (Queries Per Second) for the text and image encoding tas
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
| ViT-B-32::openai | 338 | 3.20 | 1.40 | 1064 | 286 |
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
| ViT-B-32::laion2b_e16 | 577 | 2.93 | 1.40 | **1120** | **292** |
| ViT-B-32::laion2b_e16 | 577 | 2.93 | 1.40 | 1120 | 292 |
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
| ViT-B-32::laion400m_e31 | 577 | 2.93 | 1.40 | 1080 | 287 |
+----------------------------------------+------------------+----------------------+-----------------------+-----------+------------+
@@ -78,10 +78,13 @@ We also include the QPS (Queries Per Second) for the text and image encoding tas



CLIP benchmark
--------------
Zero-shot Performance
----------------------------

In this section, we will report the zero-shot performance of the models on classification and retrieval tasks across different datasets.


Zero-shot retrieval
Zero-shot Retrieval
+++++++++++++++++++

In zero-shot retrieval benchmark, each model is evaluated on the following datasets: `COCO Caption <https://github.com/tylin/coco-caption>`_, `Flickr8k <http://hockenmaier.cs.illinois.edu/8k-pictures.html>`_ and `Flickr30k <https://shannon.cs.illinois.edu/DenotationGraph/>`_.

Suggested change
In zero-shot retrieval benchmark, each model is evaluated on the following datasets: `COCO Caption <https://github.com/tylin/coco-caption>`_, `Flickr8k <http://hockenmaier.cs.illinois.edu/8k-pictures.html>`_ and `Flickr30k <https://shannon.cs.illinois.edu/DenotationGraph/>`_.
In the zero-shot retrieval benchmark, each model is evaluated on the following datasets: `COCO Caption <https://github.com/tylin/coco-caption>`_, `Flickr8k <http://hockenmaier.cs.illinois.edu/8k-pictures.html>`_ and `Flickr30k <https://shannon.cs.illinois.edu/DenotationGraph/>`_.
The best results are highlighted in bold (higher is better).
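
For intuition, the retrieval metrics in the tables that follow boil down to Recall@K computed over an image-text similarity matrix. Below is a small, self-contained sketch of that computation, with random vectors standing in for real CLIP embeddings (the actual toolkit also handles dataset loading, multiple captions per image and prompt handling):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings: 100 images and their matching captions (caption i belongs to image i).
image_emb = rng.normal(size=(100, 512))
text_emb = rng.normal(size=(100, 512))

# L2-normalize so the dot product is cosine similarity.
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

def recall_at_k(queries, gallery, k=5):
    # For query i, the correct gallery item is assumed to be item i.
    sims = queries @ gallery.T                   # (num_queries, num_gallery)
    topk = np.argsort(-sims, axis=1)[:, :k]      # indices of the k most similar gallery items
    hits = (topk == np.arange(len(queries))[:, None]).any(axis=1)
    return hits.mean()

# Random embeddings give roughly chance-level recall; real CLIP features score far higher.
print('text-to-image Recall@5:', recall_at_k(text_emb, image_emb, k=5))
print('image-to-text Recall@5:', recall_at_k(image_emb, text_emb, k=5))
```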

@@ -151,7 +154,7 @@ From the table, we observe that the ViT models outperform the RN models in gener
More specifically, the ``ViT-H-14::laion2b_s32b_b79k`` model and ``ViT-g-14::laion2b_s12b_b42k`` model achieve the best and second-best results on all zero-shot retrieval tasks.
For ViT models, the results of the same base model are better on those pre-trained with larger datasets (e.g., ``ViT-B-32::openai`` vs ``ViT-B-32::laion400m_e31`` vs ``ViT-B-32::laion2b-s34b-b79k``).

Zero-shot classification
Zero-shot Classification
++++++++++++++++++++++++

In zero-shot classification benchmark, each model is evaluated on the following datasets: `ImageNetV2 <https://github.com/modestyachts/ImageNetV2>`_, `VOC2007 <http://host.robots.ox.ac.uk/pascal/VOC/voc2007/>`_ and 19 `VTAB datasets <https://github.com/google-research/task_adaptation>`_.

Suggested change
In zero-shot classification benchmark, each model is evaluated on the following datasets: `ImageNetV2 <https://github.com/modestyachts/ImageNetV2>`_, `VOC2007 <http://host.robots.ox.ac.uk/pascal/VOC/voc2007/>`_ and 19 `VTAB datasets <https://github.com/google-research/task_adaptation>`_.
In the zero-shot classification benchmark, each model is evaluated on the following datasets: `ImageNetV2 <https://github.com/modestyachts/ImageNetV2>`_, `VOC2007 <http://host.robots.ox.ac.uk/pascal/VOC/voc2007/>`_ and 19 `VTAB datasets <https://github.com/google-research/task_adaptation>`_.
The best results are highlighted in bold (higher is better).
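
Conceptually, zero-shot classification embeds each class name as a text prompt and assigns an image to the class whose prompt is most similar in the shared embedding space. The sketch below illustrates this idea with ``clip_client``; the server address, class names and image URL are placeholders, and the benchmark toolkit additionally averages over multiple prompt templates per class.

```python
import numpy as np

from clip_client import Client

client = Client('grpc://0.0.0.0:51000')  # placeholder address of a running clip_server

classes = ['cat', 'dog', 'bird']
prompts = [f'a photo of a {c}' for c in classes]

# Encode the class prompts and the query image with the same CLIP model.
text_emb = client.encode(prompts)                                # shape: (3, dim)
image_emb = client.encode(['https://example.com/query.jpg'])[0]  # placeholder image URL

# Cosine similarity between the image and every class prompt.
text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
image_emb = image_emb / np.linalg.norm(image_emb)
scores = text_emb @ image_emb

print('predicted class:', classes[int(np.argmax(scores))])
```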

@@ -219,8 +222,8 @@ Similar to retrieval results, the ``ViT-H-14::laion2b_s32b_b79k`` model and ``Vi
All models tend to perform well on ImageNetV2, VOC2007, VTAB natural and VTAB specialized (except for Retinopathy) datasets, whereas they perform poorly on VTAB structured datasets.
We do not observe any significant difference between the ViT models of the same base model.

Select the best model
+++++++++++++++++++++
Select the right model
-----------------------

In general, you can select the best model for your application from several perspectives: disk usage, peak RAM and VRAM usage, QPS, and, most importantly, the zero-shot performance.
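
For example, the tables above can be treated as a small lookup: first filter out the models that do not fit your resource budget, then keep the fastest (or best-scoring) one among the rest. A toy sketch using a few rows copied from the efficiency table; the VRAM constraint here is only an example, and in practice you would also weigh the zero-shot results above.

```python
# A few rows copied from the efficiency table above.
models = [
    {'name': 'RN50::openai',          'disk_mb': 244, 'vram_gb': 1.36, 'text_qps': 1019, 'image_qps': 269},
    {'name': 'RN101::openai',         'disk_mb': 278, 'vram_gb': 1.40, 'text_qps': 1047, 'image_qps': 222},
    {'name': 'ViT-B-32::openai',      'disk_mb': 338, 'vram_gb': 1.40, 'text_qps': 1064, 'image_qps': 286},
    {'name': 'ViT-B-32::laion2b_e16', 'disk_mb': 577, 'vram_gb': 1.40, 'text_qps': 1120, 'image_qps': 292},
]

# Example constraint: peak VRAM must stay under 1.38 GB; among the rest, maximize image QPS.
candidates = [m for m in models if m['vram_gb'] <= 1.38]
best = max(candidates, key=lambda m: m['image_qps'])
print(best['name'])  # -> RN50::openai under this particular constraint
```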
