Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow

Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy, Yohann Cabon, Vaibhav Arora, Romain Brégier, Gabriela Csurka, Leonid Antsfeld, Boris Chidlovskii, Jérôme Revaud

Published: 2022-11-18 | Venue: ICCV 2023 | Tasks: Stereo Matching, Optical Flow Estimation, Self-Supervised Learning

Abstract

Despite impressive performance for high-level downstream tasks, self-supervised pre-training methods have not yet fully delivered on dense geometric vision tasks such as stereo matching or optical flow. The application of self-supervised concepts, such as instance discrimination or masked image modeling, to geometric tasks is an active area of research. In this work, we build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene which makes it well suited for binocular downstream tasks. The applicability of this concept has so far been limited in at least two ways: (a) by the difficulty of collecting real-world image pairs -- in practice only synthetic data have been used -- and (b) by the lack of generalization of vanilla transformers to dense downstream tasks for which relative position is more meaningful than absolute position. We explore three avenues of improvement. First, we introduce a method to collect suitable real-world image pairs at large scale. Second, we experiment with relative positional embeddings and show that they enable vision transformers to perform substantially better. Third, we scale up vision transformer based cross-completion architectures, which is made possible by the use of large amounts of data. With these improvements, we show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques like correlation volume, iterative estimation, image warping or multi-scale reasoning, thus paving the way towards universal vision models.
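As a rough illustration of the cross-view completion idea described in the abstract (masking one view and completing it with the help of a second view of the same scene), the input preparation might be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names, the flat list-of-patches representation, and the 0.9 masking ratio are all assumptions made for illustration.

```python
import random


def sample_patch_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans, True where a first-view patch is hidden."""
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def build_cross_view_inputs(view1_patches, view2_patches, mask_ratio=0.9, seed=0):
    """Split two views into model inputs and reconstruction targets.

    Hypothetical helper: the first view is partially masked, the second
    view is left fully visible. A model would see `visible` and
    `reference`, and be trained to reconstruct `targets`.
    """
    rng = random.Random(seed)
    mask = sample_patch_mask(len(view1_patches), mask_ratio, rng)
    # Visible first-view patches, kept with their position index.
    visible = [(i, p) for i, p in enumerate(view1_patches) if not mask[i]]
    # Masked first-view patches are the reconstruction targets.
    targets = [(i, p) for i, p in enumerate(view1_patches) if mask[i]]
    # The second view is never masked.
    reference = list(enumerate(view2_patches))
    return visible, reference, targets


# Toy usage with placeholder strings standing in for image patches.
view1 = [f"v1_patch_{i}" for i in range(10)]
view2 = [f"v2_patch_{i}" for i in range(10)]
visible, reference, targets = build_cross_view_inputs(view1, view2)
```

Because the second view stays fully visible, the masked content can in principle be recovered from it, which is the property the abstract says makes the framework well suited to binocular tasks.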

Results

Task                     Dataset       Metric                   Value  Model
Optical Flow Estimation  Sintel-clean  Average End-Point Error  1.092  CroCo-Flow
Optical Flow Estimation  Sintel-final  Average End-Point Error  2.436  CroCo-Flow
Optical Flow Estimation  KITTI 2015    Fl-all                   3.64   CroCo-Flow
Optical Flow Estimation  KITTI 2015    Fl-fg                    5.94   CroCo-Flow
Optical Flow Estimation  KITTI 2012    Average End-Point Error  0.8    CroCo-Flow
Optical Flow Estimation  KITTI 2012    Noc                      0.5    CroCo-Flow
Optical Flow Estimation  KITTI 2012    Out-Noc                  1.57   CroCo-Flow
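
For readers unfamiliar with the metrics in the table, the sketch below shows how an average end-point error and a KITTI 2015-style outlier percentage are commonly computed. It is an illustrative sketch, not the benchmarks' evaluation code: the function names and the list-of-tuples flow representation are assumptions, and benchmark-specific pixel selection (valid, foreground, or non-occluded pixels) is omitted. The outlier rule used here (error above 3 pixels and above 5% of the ground-truth magnitude) follows the usual KITTI 2015 convention.

```python
import math


def end_point_errors(pred, gt):
    """Per-pixel Euclidean distance between predicted and ground-truth flow."""
    return [math.hypot(pu - gu, pv - gv) for (pu, pv), (gu, gv) in zip(pred, gt)]


def average_epe(pred, gt):
    """Mean end-point error over all pixels, in pixels."""
    errors = end_point_errors(pred, gt)
    return sum(errors) / len(errors)


def outlier_percentage(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of pixels whose error exceeds both thresholds."""
    errors = end_point_errors(pred, gt)
    outliers = 0
    for err, (gu, gv) in zip(errors, gt):
        gt_magnitude = math.hypot(gu, gv)
        if err > abs_thresh and err > rel_thresh * gt_magnitude:
            outliers += 1
    return 100.0 * outliers / len(errors)


# Toy flow fields with four pixels, each a (u, v) displacement in pixels.
gt = [(10.0, 0.0), (0.0, 5.0), (100.0, 0.0), (3.0, 4.0)]
pred = [(10.0, 0.0), (3.0, 1.0), (104.0, 0.0), (3.0, 4.0)]

print(average_epe(pred, gt))         # 2.25
print(outlier_percentage(pred, gt))  # 25.0
```

In the toy data, the third pixel has a 4-pixel error but is not counted as an outlier, because that error is below 5% of its 100-pixel ground-truth motion. Lower is better for both metrics.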

Related Papers

$S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation (2025-07-17)
Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder (2025-07-14)
An Efficient Approach for Muscle Segmentation and 3D Reconstruction Using Keypoint Tracking in MRI Scan (2025-07-11)
Learning to Track Any Points from Human Motion (2025-07-08)
Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts (2025-07-07)