Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, Geoffrey Hinton
We present Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we cast object detection as a language modeling task conditioned on the observed pixel inputs. Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and we train a neural network to perceive the image and generate the desired sequence. Our approach is based mainly on the intuition that if a neural network knows where and what the objects are, we just need to teach it how to read them out. Beyond the use of task-specific data augmentations, our approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset compared to highly specialized and well-optimized detection algorithms.
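To make the idea of expressing object descriptions as sequences of discrete tokens concrete, here is a minimal sketch. It is not the authors' implementation: the coordinate order `(ymin, xmin, ymax, xmax)`, the default bin count, the choice to place class tokens after the coordinate bins in a shared vocabulary, and the function names `quantize`, `objects_to_tokens`, and `tokens_to_objects` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed layout: (ymin, xmin, ymax, xmax) in pixels. This ordering is an
# illustrative choice; the abstract above does not specify one.
Box = Tuple[float, float, float, float]


def quantize(value: float, size: float, num_bins: int) -> int:
    """Map a coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    frac = min(max(value / size, 0.0), 1.0)
    return min(int(frac * num_bins), num_bins - 1)


def objects_to_tokens(objects: List[Tuple[Box, int]], height: float,
                      width: float, num_bins: int = 1000) -> List[int]:
    """Flatten (box, class_id) pairs into one token sequence, 5 tokens per object.

    Class ids are shifted by num_bins so that coordinate tokens and class
    tokens occupy disjoint ranges of a single vocabulary (an assumption).
    """
    tokens: List[int] = []
    for (ymin, xmin, ymax, xmax), class_id in objects:
        tokens += [
            quantize(ymin, height, num_bins),
            quantize(xmin, width, num_bins),
            quantize(ymax, height, num_bins),
            quantize(xmax, width, num_bins),
            num_bins + class_id,
        ]
    return tokens


def tokens_to_objects(tokens: List[int], height: float, width: float,
                      num_bins: int = 1000) -> List[Tuple[Box, int]]:
    """Invert objects_to_tokens, decoding each bin to its center coordinate."""
    objects: List[Tuple[Box, int]] = []
    for i in range(0, len(tokens), 5):
        chunk = tokens[i:i + 5]
        if len(chunk) < 5:  # ignore an incomplete trailing object
            break
        y0, x0, y1, x1, cls = chunk
        box = (
            (y0 + 0.5) / num_bins * height,
            (x0 + 0.5) / num_bins * width,
            (y1 + 0.5) / num_bins * height,
            (x1 + 0.5) / num_bins * width,
        )
        objects.append((box, cls - num_bins))
    return objects


# One object of (hypothetical) class 17 in a 480x640 image, using 100 bins.
example_tokens = objects_to_tokens(
    [((120.0, 160.0, 360.0, 480.0), 17)], height=480, width=640, num_bins=100)
# example_tokens == [25, 25, 75, 75, 117]
```

Under these assumptions a detector only needs to emit integers from a fixed vocabulary, which is what allows detection to be trained like a language modeling task. Decoding recovers each coordinate to within one bin width, so the bin count trades vocabulary size against localization precision.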
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | COCO minival | box AP | 50 | Pix2seq (ViT-L) |
| Object Detection | COCO minival | box AP | 47.3 | Pix2seq (R50-C4) |
| Object Detection | COCO minival | box AP | 47.1 | Pix2seq (ViT-B) |
| Object Detection | COCO minival | AP50 | 63.2 | Pix2seq (R101-DC5) |
| Object Detection | COCO minival | AP75 | 48.6 | Pix2seq (R101-DC5) |
| Object Detection | COCO minival | APL | 60.4 | Pix2seq (R101-DC5) |
| Object Detection | COCO minival | APM | 48.9 | Pix2seq (R101-DC5) |
| Object Detection | COCO minival | APS | 28.2 | Pix2seq (R101-DC5) |
| Object Detection | COCO minival | box AP | 45 | Pix2seq (R101-DC5) |
| Object Detection | COCO minival | AP50 | 61 | Pix2seq (R50-DC5) |
| Object Detection | COCO minival | AP75 | 46.1 | Pix2seq (R50-DC5) |
| Object Detection | COCO minival | APL | 58.6 | Pix2seq (R50-DC5) |
| Object Detection | COCO minival | APM | 47 | Pix2seq (R50-DC5) |
| Object Detection | COCO minival | APS | 26.6 | Pix2seq (R50-DC5) |
| Object Detection | COCO minival | box AP | 43.2 | Pix2seq (R50-DC5) |
| Object Detection | COCO minival | box AP | 42.6 | Pix2seq (R50) |