ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings

Jangyeong Jeon, Sangyeon Cho, Minuk Ma, Junyoung Kim

2024-08-28

Natural Language Inference, Sentence Embedding, Semantic Textual Similarity, Contrastive Learning, STS, Sentence-Embedding


Abstract

This paper examines Code-Switching (CS), the phenomenon in which two languages intertwine within a single utterance. CS between English and Korean in particular remains under-researched. We highlight that the Equivalence Constraint (EC) theory currently applied to CS in other languages may only partially capture the complexities of English-Korean CS, owing to the intrinsic grammatical differences between the two languages. To mitigate such challenges, we introduce a novel Koglish dataset tailored to English-Korean CS scenarios. First, we construct the Koglish-GLUE dataset to demonstrate the importance of, and need for, CS datasets across various tasks. We find that multilingual foundation language models produce different outcomes when trained on a monolingual dataset versus a CS dataset. Motivated by this, we hypothesize that SimCSE, which has shown strengths in monolingual sentence embedding, has limitations in CS scenarios. To verify this, we construct a novel Koglish-NLI (Natural Language Inference) dataset using a CS augmentation-based approach. Building on this CS-augmented Koglish-NLI dataset, we propose ConCSE, a unified contrastive learning and augmentation method for code-switched embeddings that highlights the semantics of CS sentences. Experimental results validate the proposed ConCSE, which achieves an average performance improvement of 1.77% on the Koglish-STS (Semantic Textual Similarity) tasks.
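For readers unfamiliar with contrastive sentence embedding, the following is a minimal sketch of the generic InfoNCE objective with in-batch negatives, the kind of loss used by SimCSE-style methods. It is not ConCSE's actual objective, which the abstract does not specify. Treating a code-switched sentence and its English counterpart as a positive pair is an assumption made here for illustration, and the function names, the toy vectors, and the temperature value of 0.05 are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    anchors[i] and positives[i] form a positive pair; every other
    positives[j] (j != i) acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # -log softmax probability assigned to the true positive.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-d "embeddings" (illustrative values only, not model outputs).
cs_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # assumed: code-switched sentences
en_embeddings = [[0.9, 0.1], [0.1, 0.9]]  # assumed: English counterparts

aligned = info_nce_loss(cs_embeddings, en_embeddings)
shuffled = info_nce_loss(cs_embeddings, en_embeddings[::-1])
print(aligned, shuffled)
```

When each anchor is paired with its true counterpart the loss is close to zero, and when the pairs are shuffled it is much larger. That gap is the signal a contrastive encoder is trained to widen.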
