Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Pix2seq: A Language Modeling Framework for Object Detection

Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, Geoffrey Hinton

2021-09-22 | ICLR 2022 | Object Detection, Language Modelling

Abstract

We present Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we cast object detection as a language modeling task conditioned on the observed pixel inputs. Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and we train a neural network to perceive the image and generate the desired sequence. Our approach is based mainly on the intuition that if a neural network knows about where and what the objects are, we just need to teach it how to read them out. Beyond the use of task-specific data augmentations, our approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset, compared to highly specialized and well optimized detection algorithms.
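To make the idea of expressing object descriptions as sequences of discrete tokens more concrete, here is a minimal sketch of one possible encoding. It is not taken from this page or from the official code: the bin count (NUM_BINS), the coordinate order (xmin, ymin, xmax, ymax), the vocabulary layout (coordinate bins first, then class tokens, then an end-of-sequence token) and all function names are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 1000                # assumed number of coordinate bins (hypothetical)
NUM_CLASSES = 80               # COCO has 80 object categories
EOS = NUM_BINS + NUM_CLASSES   # assumed end-of-sequence token id

Box = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax) in pixels
LabeledBox = Tuple[Box, int]               # box plus class index in [0, NUM_CLASSES)


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to a bin index in [0, num_bins - 1]."""
    frac = min(max(value / size, 0.0), 1.0)
    return min(int(frac * num_bins), num_bins - 1)


def dequantize(token: int, size: float, num_bins: int = NUM_BINS) -> float:
    """Map a bin index back to the pixel coordinate at the center of that bin."""
    return (token + 0.5) / num_bins * size


def objects_to_tokens(objects: Sequence[LabeledBox],
                      width: float, height: float) -> List[int]:
    """Flatten labeled boxes into one sequence: 5 tokens per object, then EOS."""
    tokens: List[int] = []
    for (xmin, ymin, xmax, ymax), label in objects:
        tokens += [
            quantize(xmin, width),
            quantize(ymin, height),
            quantize(xmax, width),
            quantize(ymax, height),
            NUM_BINS + label,   # class tokens sit after the coordinate bins
        ]
    tokens.append(EOS)
    return tokens


def tokens_to_objects(tokens: Sequence[int],
                      width: float, height: float) -> List[LabeledBox]:
    """Invert objects_to_tokens, up to quantization error."""
    body = list(tokens)
    if EOS in body:
        body = body[: body.index(EOS)]
    objects: List[LabeledBox] = []
    for i in range(0, len(body) - 4, 5):
        x0, y0, x1, y1, cls = body[i : i + 5]
        box = (dequantize(x0, width), dequantize(y0, height),
               dequantize(x1, width), dequantize(y1, height))
        objects.append((box, cls - NUM_BINS))
    return objects


if __name__ == "__main__":
    demo = [((160.0, 120.0, 320.0, 360.0), 17)]
    seq = objects_to_tokens(demo, width=640, height=480)
    print(seq)
    print(tokens_to_objects(seq, width=640, height=480))
```

Under these assumptions, a sequence model conditioned on the image only has to predict integers from a fixed vocabulary, and the decoded boxes differ from the originals by at most one bin width.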

Results

Task              Dataset       Metric  Value  Model
Object Detection  COCO minival  box AP  50     Pix2seq (ViT-L)
Object Detection  COCO minival  box AP  47.3   Pix2seq (R50-C4)
Object Detection  COCO minival  box AP  47.1   Pix2seq (ViT-B)
Object Detection  COCO minival  AP50    63.2   Pix2seq (R101-DC5)
Object Detection  COCO minival  AP75    48.6   Pix2seq (R101-DC5)
Object Detection  COCO minival  APL     60.4   Pix2seq (R101-DC5)
Object Detection  COCO minival  APM     48.9   Pix2seq (R101-DC5)
Object Detection  COCO minival  APS     28.2   Pix2seq (R101-DC5)
Object Detection  COCO minival  box AP  45     Pix2seq (R101-DC5)
Object Detection  COCO minival  AP50    61     Pix2seq (R50-DC5)
Object Detection  COCO minival  AP75    46.1   Pix2seq (R50-DC5)
Object Detection  COCO minival  APL     58.6   Pix2seq (R50-DC5)
Object Detection  COCO minival  APM     47     Pix2seq (R50-DC5)
Object Detection  COCO minival  APS     26.6   Pix2seq (R50-DC5)
Object Detection  COCO minival  box AP  43.2   Pix2seq (R50-DC5)
Object Detection  COCO minival  box AP  42.6   Pix2seq (R50)

Identical results are also listed under the task labels 3D, 2D Classification, 2D Object Detection, and 16k.

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)