Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, Jianfeng Gao

2022-04-19 · Fairness · Few-Shot Object Detection · Image Classification · Zero-Shot Image Classification · Few-Shot Image Classification · Zero-Shot Object Detection · Object Detection

Abstract

Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, it remains challenging to evaluate the transferability of these models due to the lack of easy-to-use evaluation toolkits and public benchmarks. To tackle this, we build ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark and toolkit for evaluating (pre-trained) language-augmented visual models. ELEVATER is composed of three components. (i) Datasets. As downstream evaluation suites, it consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge. (ii) Toolkit. An automatic hyper-parameter tuning toolkit is developed to facilitate model evaluation on downstream tasks. (iii) Metrics. A variety of evaluation metrics are used to measure sample-efficiency (zero-shot and few-shot) and parameter-efficiency (linear probing and full model fine-tuning). ELEVATER is a platform for Computer Vision in the Wild (CVinW), and is publicly released at https://computer-vision-in-the-wild.github.io/ELEVATER/
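For context, the zero-shot setting mentioned in the abstract follows the standard CLIP-style protocol: class names are turned into text prompts, images and prompts are embedded by the language-augmented model, and the most similar prompt gives the prediction. The sketch below is a minimal, illustrative version of that zero-shot step using OpenAI's clip package; it is not the ELEVATER toolkit itself (which is released at the project URL above), and the image path and class names are placeholders.

```python
# Minimal zero-shot classification sketch (CLIP-style). Not the ELEVATER toolkit.
# Assumes: pip install torch pillow git+https://github.com/openai/CLIP.git
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # same backbone as the ICinW row below

# Placeholder inputs: swap in a real image and the target dataset's class names.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
class_names = ["dog", "cat", "car"]
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    # Cosine similarity between the image and each class prompt, turned into probabilities.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

pred = class_names[probs.argmax().item()]
print(pred, probs.tolist())
```

The few-shot, linear-probing, and full fine-tuning regimes listed in the abstract start from the same frozen encoders and differ only in how much of the model is updated and on how many labeled examples.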

Results

Task | Dataset | Metric | Value | Model
Object Detection | ODinW Full-shot 35 Tasks | AP | 62.6 | GLIP-T
Object Detection | ELEVATER | AP | 62.6 | GLIP-T
Object Detection | ODinW | Average Score | 11.4 | GLIP (Tiny A)
Zero-Shot Image Classification | ICinW | Average Score | 56.64 | CLIP (ViT-B/32)
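The "Average Score" entries are the benchmark's headline numbers: the per-dataset metric (top-1 accuracy for the ICinW classification suite, AP for the ODinW detection suite) averaged over all datasets in the suite. A minimal sketch of that aggregation is below; the dataset names and scores are illustrative placeholders, not the actual per-dataset results.

```python
# Illustrative aggregation of per-dataset scores into a single "Average Score".
# Dataset names and numbers are placeholders, not real ELEVATER results.
from statistics import mean

per_dataset_score = {
    "cifar10": 89.1,   # hypothetical zero-shot top-1 accuracy (%)
    "food101": 83.4,
    "eurosat": 41.2,
    # ... one entry per dataset in the suite (20 for ICinW, 35 for ODinW)
}

average_score = mean(per_dataset_score.values())
print(f"Average Score over {len(per_dataset_score)} datasets: {average_score:.2f}")
```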

Related Papers

A Reproducibility Study of Product-side Fairness in Bundle Recommendation (2025-07-18)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
FedGA: A Fair Federated Learning Framework Based on the Gini Coefficient (2025-07-17)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)