


Self-supervised Character-to-Character Distillation for Text Recognition

Tongkun Guan, Wei Shen, Xue Yang, Qi Feng, Zekun Jiang, Xiaokang Yang

Published: 2022-11-01 · Venue: ICCV 2023
Tags: Super-Resolution, Scene Text Recognition, Representation Learning, Self-Supervised Learning, SSIM, Data Augmentation, self-supervised scene text recognition, Text Segmentation, Self-Learning

Abstract

When handling complicated text images (e.g., irregular structures, low resolution, heavy occlusion, and uneven illumination), existing supervised text recognition methods are data-hungry. Although these methods employ large-scale synthetic text images to reduce the dependence on annotated real images, the domain gap still limits the recognition performance. Therefore, exploring the robust text feature representations on unlabeled real images by self-supervised learning is a good solution. However, existing self-supervised text recognition methods conduct sequence-to-sequence representation learning by roughly splitting the visual features along the horizontal axis, which limits the flexibility of the augmentations, as large geometric-based augmentations may lead to sequence-to-sequence feature inconsistency. Motivated by this, we propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate general text representation learning. Specifically, we delineate the character structures of unlabeled real images by designing a self-supervised character segmentation module. Following this, CCD easily enriches the diversity of local characters while keeping their pairwise alignment under flexible augmentations, using the transformation matrix between two augmented views from images. Experiments demonstrate that CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution. Code is available at https://github.com/TongkunGuan/CCD.
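The abstract says that character regions stay pairwise aligned across augmentations through the transformation matrix between the two augmented views. The sketch below illustrates that general idea only; it is not the authors' implementation. It assumes each view is produced from the source image by a known 3x3 affine matrix, so the view-to-view matrix is `T_b · T_a⁻¹`. All function names (`translation`, `rotation`, `view_to_view`, and so on) are hypothetical helpers introduced for illustration.

```python
import math

def translation(tx, ty):
    """3x3 affine matrix that shifts points by (tx, ty)."""
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]

def rotation(degrees):
    """3x3 affine matrix that rotates points about the origin."""
    t = math.radians(degrees)
    return [[math.cos(t), -math.sin(t), 0.0],
            [math.sin(t), math.cos(t), 0.0],
            [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def invert_affine(m):
    """Inverse of an affine matrix whose last row is [0, 0, 1]."""
    a, b, tx = m[0]
    c, d, ty = m[1]
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("affine matrix is not invertible")
    ia, ib = d / det, -b / det
    ic, id_ = -c / det, a / det
    return [[ia, ib, -(ia * tx + ib * ty)],
            [ic, id_, -(ic * tx + id_ * ty)],
            [0.0, 0.0, 1.0]]

def apply(m, point):
    """Apply an affine matrix to a 2D point (x, y)."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])

def view_to_view(t_a, t_b):
    """Matrix mapping coordinates in view A to coordinates in view B,
    assuming both views were produced from the same source image."""
    return matmul(t_b, invert_affine(t_a))

if __name__ == "__main__":
    t_a = translation(2.0, 3.0)   # augmentation that produced view A
    t_b = rotation(30.0)          # augmentation that produced view B
    char_center = (5.0, 7.0)      # a character location in the source image
    in_a = apply(t_a, char_center)
    in_b = apply(view_to_view(t_a, t_b), in_a)
    print(in_a, in_b)
```

With such a matrix, a character region located in one view can be carried into the other view's coordinates, which is what lets strong geometric augmentations coexist with character-level correspondence.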

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Scene Parsing | SVT | Accuracy | 97.8 | CCD-ViT-Base(ARD_2.8M) |
| Scene Parsing | SVT | Accuracy | 96.4 | CCD-ViT-Small(ARD_2.8M) |
| Scene Parsing | SVT | Accuracy | 96 | CCD-ViT-Tiny(ARD_2.8M) |
| Scene Parsing | SVTP | Accuracy | 96.1 | CCD-ViT-Base |
| Scene Parsing | SVTP | Accuracy | 92.7 | CCD-ViT-Small |
| Scene Parsing | SVTP | Accuracy | 91.6 | CCD-ViT-Tiny |
| Scene Parsing | CUTE80 | Accuracy | 98.3 | CCD-ViT-Small(ARD_2.8M) |
| Scene Parsing | CUTE80 | Accuracy | 98.3 | CCD-ViT-Base(ARD_2.8M) |
| Scene Parsing | CUTE80 | Accuracy | 95.8 | CCD-ViT-Tiny(ARD_2.8M) |
| Scene Parsing | WOST | 1:1 Accuracy | 86 | CCD-ViT-Base |
| Scene Parsing | HOST | 1:1 Accuracy | 77.3 | CCD-ViT-Base |
| Scene Parsing | IIIT5k | Accuracy | 98 | CCD-ViT-Small(ARD_2.8M) |
| Scene Parsing | IIIT5k | Accuracy | 98 | CCD-ViT-Base(ARD_2.8M) |
| Scene Parsing | IIIT5k | Accuracy | 97.1 | CCD-ViT-Tiny(ARD_2.8M) |
| Scene Parsing | ICDAR2013 | Accuracy | 98.3 | CCD-ViT-Base(ARD_2.8M) |
| Scene Parsing | ICDAR2013 | Accuracy | 98.3 | CCD-ViT-Small(ARD_2.8M) |
| Scene Parsing | ICDAR2013 | Accuracy | 97.5 | CCD-ViT-Tiny(ARD_2.8M) |
| 2D Semantic Segmentation | SVT | Accuracy | 97.8 | CCD-ViT-Base(ARD_2.8M) |
| 2D Semantic Segmentation | SVT | Accuracy | 96.4 | CCD-ViT-Small(ARD_2.8M) |
| 2D Semantic Segmentation | SVT | Accuracy | 96 | CCD-ViT-Tiny(ARD_2.8M) |
| 2D Semantic Segmentation | SVTP | Accuracy | 96.1 | CCD-ViT-Base |
| 2D Semantic Segmentation | SVTP | Accuracy | 92.7 | CCD-ViT-Small |
| 2D Semantic Segmentation | SVTP | Accuracy | 91.6 | CCD-ViT-Tiny |
| 2D Semantic Segmentation | CUTE80 | Accuracy | 98.3 | CCD-ViT-Small(ARD_2.8M) |
| 2D Semantic Segmentation | CUTE80 | Accuracy | 98.3 | CCD-ViT-Base(ARD_2.8M) |
| 2D Semantic Segmentation | CUTE80 | Accuracy | 95.8 | CCD-ViT-Tiny(ARD_2.8M) |
| 2D Semantic Segmentation | WOST | 1:1 Accuracy | 86 | CCD-ViT-Base |
| 2D Semantic Segmentation | HOST | 1:1 Accuracy | 77.3 | CCD-ViT-Base |
| 2D Semantic Segmentation | IIIT5k | Accuracy | 98 | CCD-ViT-Small(ARD_2.8M) |
| 2D Semantic Segmentation | IIIT5k | Accuracy | 98 | CCD-ViT-Base(ARD_2.8M) |
| 2D Semantic Segmentation | IIIT5k | Accuracy | 97.1 | CCD-ViT-Tiny(ARD_2.8M) |
| 2D Semantic Segmentation | ICDAR2013 | Accuracy | 98.3 | CCD-ViT-Base(ARD_2.8M) |
| 2D Semantic Segmentation | ICDAR2013 | Accuracy | 98.3 | CCD-ViT-Small(ARD_2.8M) |
| 2D Semantic Segmentation | ICDAR2013 | Accuracy | 97.5 | CCD-ViT-Tiny(ARD_2.8M) |
| self-supervised scene text recognition | TextZoom | Average PSNR (dB) | 21.84 | CCD-ViT-Small |
| self-supervised scene text recognition | TextZoom | SSIM | 0.7843 | CCD-ViT-Small |
| self-supervised scene text recognition | TextSeg | IoU (%) | 84.8 | CCD-ViT-Small |
| self-supervised scene text recognition | Scene Text Recognition Benchmarks | Average Accuracy | 84.9 | CCD-ViT-Small |
| Scene Text Recognition | SVT | Accuracy | 97.8 | CCD-ViT-Base(ARD_2.8M) |
| Scene Text Recognition | SVT | Accuracy | 96.4 | CCD-ViT-Small(ARD_2.8M) |
| Scene Text Recognition | SVT | Accuracy | 96 | CCD-ViT-Tiny(ARD_2.8M) |
| Scene Text Recognition | SVTP | Accuracy | 96.1 | CCD-ViT-Base |
| Scene Text Recognition | SVTP | Accuracy | 92.7 | CCD-ViT-Small |
| Scene Text Recognition | SVTP | Accuracy | 91.6 | CCD-ViT-Tiny |
| Scene Text Recognition | CUTE80 | Accuracy | 98.3 | CCD-ViT-Small(ARD_2.8M) |
| Scene Text Recognition | CUTE80 | Accuracy | 98.3 | CCD-ViT-Base(ARD_2.8M) |
| Scene Text Recognition | CUTE80 | Accuracy | 95.8 | CCD-ViT-Tiny(ARD_2.8M) |
| Scene Text Recognition | WOST | 1:1 Accuracy | 86 | CCD-ViT-Base |
| Scene Text Recognition | HOST | 1:1 Accuracy | 77.3 | CCD-ViT-Base |
| Scene Text Recognition | IIIT5k | Accuracy | 98 | CCD-ViT-Small(ARD_2.8M) |
| Scene Text Recognition | IIIT5k | Accuracy | 98 | CCD-ViT-Base(ARD_2.8M) |
| Scene Text Recognition | IIIT5k | Accuracy | 97.1 | CCD-ViT-Tiny(ARD_2.8M) |
| Scene Text Recognition | ICDAR2013 | Accuracy | 98.3 | CCD-ViT-Base(ARD_2.8M) |
| Scene Text Recognition | ICDAR2013 | Accuracy | 98.3 | CCD-ViT-Small(ARD_2.8M) |
| Scene Text Recognition | ICDAR2013 | Accuracy | 97.5 | CCD-ViT-Tiny(ARD_2.8M) |
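
For readers unfamiliar with the PSNR figures reported for TextZoom, the following is a minimal sketch of the standard peak signal-to-noise ratio definition, `10 · log10(MAX² / MSE)`. It assumes 8-bit grayscale images represented as nested lists of equal size; it is not the evaluation script used by the authors, and the function name `psnr` is a hypothetical helper.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized
    grayscale images given as lists of rows of pixel values."""
    if len(reference) != len(estimate):
        raise ValueError("images must have the same number of rows")
    squared_error = 0.0
    count = 0
    for ref_row, est_row in zip(reference, estimate):
        if len(ref_row) != len(est_row):
            raise ValueError("rows must have the same length")
        for r, e in zip(ref_row, est_row):
            squared_error += (float(r) - float(e)) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    mse = squared_error / count
    if mse == 0.0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)

if __name__ == "__main__":
    high_res = [[10, 20], [30, 40]]
    restored = [[11, 21], [31, 41]]  # every pixel off by one
    print(round(psnr(high_res, restored), 2))
```

Because PSNR is logarithmic, small dB differences such as the 0.24 dB gain cited in the abstract correspond to modest but measurable reductions in mean squared error.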
