Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

GRiT: A Generative Region-to-text Transformer for Object Understanding

Jialian Wu, JianFeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang

2022-12-01 | Descriptive | object-detection | Dense Captioning | Object Detection

Abstract

This paper presents a Generative RegIon-to-Text transformer, GRiT, for object understanding. The spirit of GRiT is to formulate object understanding as <region, text> pairs, where region locates objects and text describes objects. For example, the text in object detection denotes class names while that in dense captioning refers to descriptive sentences. Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions. With the same model architecture, GRiT can understand objects via not only simple nouns, but also rich descriptive sentences including object attributes or actions. Experimentally, we apply GRiT to object detection and dense captioning tasks. GRiT achieves 60.4 AP on COCO 2017 test-dev for object detection and 15.5 mAP on Visual Genome for dense captioning. Code is available at https://github.com/JialianW/GRiT
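The <region, text> formulation described in the abstract can be illustrated with a minimal sketch. This is not GRiT's actual code or API: the names `RegionText`, `understand_objects`, and the `toy_*` stand-ins are hypothetical, the `(x1, y1, x2, y2)` box convention is assumed, and passing a `task` string to the decoder is an assumption made only to show how one pipeline could emit either class names or descriptive sentences. The three callables mirror the components the abstract names (visual encoder, foreground object extractor, text decoder).

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Assumed box convention for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class RegionText:
    """One <region, text> pair: where the object is and how it is described."""
    box: Box
    text: str
    score: float


def understand_objects(
    image: Any,
    encode: Callable[[Any], Any],
    extract_regions: Callable[[Any], List[Tuple[Box, float]]],
    decode_text: Callable[[Any, Box, str], str],
    task: str,
) -> List[RegionText]:
    """Run the three stages in order and collect <region, text> pairs."""
    features = encode(image)                      # visual encoder
    pairs = []
    for box, score in extract_regions(features):  # foreground object extractor
        text = decode_text(features, box, task)   # text decoder
        pairs.append(RegionText(box=box, text=text, score=score))
    return pairs


# Toy stand-ins (hypothetical) so the sketch runs without any model weights.
def toy_encode(image: Any) -> dict:
    return {"image": image}


def toy_extract(features: dict) -> List[Tuple[Box, float]]:
    return [((10.0, 20.0, 110.0, 220.0), 0.9)]


def toy_decode(features: dict, box: Box, task: str) -> str:
    # Same region, different text depending on the task.
    if task == "detection":
        return "dog"
    return "a brown dog lying on the grass"
```

Calling `understand_objects` with `task="detection"` and then with `task="dense_captioning"` on the same input returns the same box paired with a class name in the first case and a sentence in the second, which is the point of treating both tasks as <region, text> prediction.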

Results

Task | Dataset | Metric | Value | Model
Object Detection | COCO test-dev | box mAP | 60.4 | GRiT (ViT-H, single-scale testing)
Object Detection | COCO-O | Average mAP | 42.9 | GRiT (ViT-H)
Object Detection | COCO-O | Effective Robustness | 15.72 | GRiT (ViT-H)
3D | COCO test-dev | box mAP | 60.4 | GRiT (ViT-H, single-scale testing)
3D | COCO-O | Average mAP | 42.9 | GRiT (ViT-H)
3D | COCO-O | Effective Robustness | 15.72 | GRiT (ViT-H)
2D Classification | COCO test-dev | box mAP | 60.4 | GRiT (ViT-H, single-scale testing)
2D Classification | COCO-O | Average mAP | 42.9 | GRiT (ViT-H)
2D Classification | COCO-O | Effective Robustness | 15.72 | GRiT (ViT-H)
2D Object Detection | COCO test-dev | box mAP | 60.4 | GRiT (ViT-H, single-scale testing)
2D Object Detection | COCO-O | Average mAP | 42.9 | GRiT (ViT-H)
2D Object Detection | COCO-O | Effective Robustness | 15.72 | GRiT (ViT-H)
Dense Captioning | Visual Genome | mAP | 15.5 | GRiT (ViT-B)
16k | COCO test-dev | box mAP | 60.4 | GRiT (ViT-H, single-scale testing)
16k | COCO-O | Average mAP | 42.9 | GRiT (ViT-H)
16k | COCO-O | Effective Robustness | 15.72 | GRiT (ViT-H)

Related Papers

DiffRhythm+: Controllable and Flexible Full-Length Song Generation with Preference Optimization (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)