Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark

Ningyu Zhang, Mosha Chen, Zhen Bi, Xiaozhuan Liang, Lei Li, Xin Shang, Kangping Yin, Chuanqi Tan, Jian Xu, Fei Huang, Luo Si, Yuan Ni, Guotong Xie, Zhifang Sui, Baobao Chang, Hui Zong, Zheng Yuan, Linfeng Li, Jun Yan, Hongying Zan, Kunli Zhang, Buzhou Tang, Qingcai Chen

Published: 2021-06-15 · ACL 2022
Tasks: Medical Concept Normalization · Natural Language Inference · Named Entity Recognition · Sentence-Pair Classification · Semantic Similarity · Medical Relation Extraction · Named Entity Recognition (NER) · Intent Classification · Sentence Classification
Links: Paper · PDF · Code (official) · Code

Abstract

Artificial Intelligence (AI), along with the recent progress in biomedical language understanding, is gradually changing medical practice. With the development of biomedical language understanding benchmarks, AI applications have become widely used in the medical field. However, most benchmarks are limited to English, which makes it challenging to replicate many of their successes in other languages. To facilitate research in this direction, we collect real-world biomedical data and present the first Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark: a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification, together with an associated online platform for model evaluation, comparison, and analysis. To establish evaluation on these tasks, we report empirical results with 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models still perform far below the human ceiling. Our benchmark is released at https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414&lang=en-us.
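
The sketch below illustrates the kind of evaluation the abstract describes: fine-tuning a pre-trained Chinese encoder on one CBLUE sentence-pair task (here CHIP-STS). It is not the authors' official baseline code; the file names chip_sts_train.json / chip_sts_dev.json and the fields text1, text2, label are hypothetical stand-ins for data downloaded from the Tianchi page, and hfl/chinese-macbert-large is used as an example of one of the evaluated model families.

```python
# Minimal sketch (not the official CBLUE baseline): fine-tune a pre-trained
# Chinese encoder on a CBLUE-style sentence-pair classification task.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "hfl/chinese-macbert-large"  # example checkpoint from one evaluated model family

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical local JSON files converted from the Tianchi download.
data = load_dataset("json", data_files={"train": "chip_sts_train.json",
                                        "validation": "chip_sts_dev.json"})

def encode(batch):
    # Sentence-pair input: the tokenizer joins the two texts with a [SEP] token.
    return tokenizer(batch["text1"], batch["text2"],
                     truncation=True, max_length=128, padding="max_length")

data = data.map(encode, batched=True)

args = TrainingArguments(output_dir="cblue_chip_sts",
                         per_device_train_batch_size=16,
                         num_train_epochs=3,
                         learning_rate=2e-5)

trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"],
                  eval_dataset=data["validation"])
trainer.train()
print(trainer.evaluate())
```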

Results

Task | Dataset | Metric | Value | Model
Natural Language Inference | KUAKE-QQR | Accuracy | 84.7 | BERT-base
Natural Language Inference | KUAKE-QTR | Accuracy | 62.9 | MacBERT-large
Language Modelling | CHIP-STS | Macro F1 | 85.6 | MacBERT-large
Medical Relation Extraction | CMeIE | Micro F1 | 55.9 | RoBERTa-wwm-ext-large
Intent Classification | KUAKE-QIC | Accuracy | 85.5 | RoBERTa-wwm-ext-base
Named Entity Recognition (NER) | CMeEE | Micro F1 | 62.4 | MacBERT-large
Text Classification | CHIP-CTC | Macro F1 | 70.9 | RoBERTa-large
Sentence Pair Modeling | CHIP-STS | Macro F1 | 85.6 | MacBERT-large
Sentence Classification | CHIP-CTC | Macro F1 | 70.9 | RoBERTa-large
Classification | CHIP-CTC | Macro F1 | 70.9 | RoBERTa-large
Semantic Similarity | CHIP-STS | Macro F1 | 85.6 | MacBERT-large
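
The Value column mixes three metrics (Accuracy, Macro F1, Micro F1). As a reference for how such numbers are conventionally computed, here is a small sketch using scikit-learn on made-up labels; the official CBLUE scores come from the benchmark's own evaluation platform, so treat this only as an illustration of the standard metric definitions.

```python
# Illustrative only: standard Accuracy / Macro F1 / Micro F1 on hypothetical labels.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 2, 1, 1, 0, 2]   # hypothetical gold labels for a multi-class CBLUE task
y_pred = [0, 2, 1, 0, 0, 1]   # hypothetical model predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1
print("Micro F1:", f1_score(y_true, y_pred, average="micro"))  # counts pooled across classes
```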

Related Papers

SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
LRCTI: A Large Language Model-Based Framework for Multi-Step Evidence Retrieval and Reasoning in Cyber Threat Intelligence Credibility Verification (2025-07-15)
DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification (2025-07-08)
Flippi: End To End GenAI Assistant for E-Commerce (2025-07-08)
SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression (2025-07-08)
FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection (2025-07-06)
LineRetriever: Planning-Aware Observation Reduction for Web Agents (2025-06-30)
Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models (2025-06-28)