
GLM-130B: An Open Bilingual Pre-trained Model

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang

Published: 2022-10-05 | Tasks: Multi-task Language Understanding, Long-Context Understanding, Quantization, Language Modelling
Links: Paper | PDF | Code (official) | 8 community code implementations

Abstract

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B (davinci) on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B -- the largest Chinese language model -- across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post training, with almost no performance loss, making it the first among 100B-scale models and more importantly, allowing its effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.
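The abstract's INT4 claim refers to weight-only quantization applied directly to the trained checkpoint, with no post-training calibration step. As a rough illustration of why this fits on commodity GPUs: 130B parameters at 4 bits each is about 65 GB of weights, within reach of 4×24 GB or 8×11 GB cards. Below is a minimal sketch of symmetric absmax weight quantization, the standard technique in this family; the function names, row-wise scaling granularity, and shapes are assumptions for illustration, not GLM-130B's actual kernels, which pack two 4-bit codes per byte and dequantize inside fused GPU kernels.

```python
# A minimal sketch of weight-only symmetric ("absmax") INT4 quantization.
# Illustrative only: names and shapes are assumptions, not the GLM-130B code.
import torch

def quantize_int4(w: torch.Tensor):
    """Quantize a 2-D weight matrix row-wise to the symmetric INT4 range [-7, 7]."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float weight matrix from INT4 codes and per-row scales."""
    return q.float() * scale

w = torch.randn(4096, 4096)          # stand-in for one transformer weight matrix
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale)
print("mean abs error:", (w - w_hat).abs().mean().item())
```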

Results

Task | Dataset | Metric | Value | Model
Transfer Learning | MMLU | Average (%) | 44.8 | GLM-130B
Multi-Task Learning | MMLU | Average (%) | 44.8 | GLM-130B
Language Modelling | BIG-bench-lite | Accuracy | 15.11 | GLM-130B (3-shot)
Language Modelling | BIG-bench-lite | Accuracy | 14.91 | GLM-130B (1-shot)
Language Modelling | BIG-bench-lite | Accuracy | 13.31 | GLM-130B (0-shot)
Language Modelling | LAMBADA | Accuracy | 80.2 | GLM-130B (bidirectional attention)
Language Modelling | The Pile | Bits per byte | 0.634 | GLM-130B
Language Modelling | The Pile | Bits per byte | 0.65 | Jurassic-1
Language Modelling | The Pile | Bits per byte | 0.742 | GPT-3
Language Modelling | CLUE (AFQMC) | Accuracy | 71.2 | GLM-130B
Language Modelling | CLUE (AFQMC) | Accuracy | 69.0 | ERNIE 3.0 Titan-260B
Language Modelling | CLUE (C3) | Accuracy | 77.5 | GLM-130B
Language Modelling | CLUE (C3) | Accuracy | 54.9 | ERNIE 3.0 Titan-260B
Language Modelling | CLUE (CMNLI) | Accuracy | 77.0 | GLM-130B
Language Modelling | CLUE (CMNLI) | Accuracy | 51.7 | ERNIE 3.0 Titan-260B
Language Modelling | CLUE (CMRC2018) | Accuracy | 55.7 | GLM-130B
Language Modelling | CLUE (CMRC2018) | Accuracy | 16.6 | ERNIE 3.0 Titan-260B
Language Modelling | CLUE (DRCD) | Accuracy | 77.1 | GLM-130B
Language Modelling | CLUE (DRCD) | Accuracy | 29.5 | ERNIE 3.0 Titan-260B
Language Modelling | CLUE (OCNLI_50K) | Accuracy | 74.7 | GLM-130B
Language Modelling | CLUE (OCNLI_50K) | Accuracy | 44.6 | ERNIE 3.0 Titan-260B
Language Modelling | CLUE (WSC1.1) | Accuracy | 83.9 | GLM-130B
Language Modelling | CLUE (WSC1.1) | Accuracy | 81.1 | ERNIE 3.0 Titan-260B
Language Modelling | FewCLUE (BUSTM) | Accuracy | 77.5 | GLM-130B
Language Modelling | FewCLUE (BUSTM) | Accuracy | 64.4 | ERNIE 3.0 Titan-260B
Language Modelling | FewCLUE (CHID-FC) | Accuracy | 90.1 | GLM-130B
Language Modelling | FewCLUE (CHID-FC) | Accuracy | 87.1 | ERNIE 3.0 Titan-260B
Language Modelling | FewCLUE (CLUEWSC-FC) | Accuracy | 77.4 | GLM-130B
Language Modelling | FewCLUE (CLUEWSC-FC) | Accuracy | 53.5 | ERNIE 3.0 Titan-260B
Language Modelling | FewCLUE (EPRSTMT) | Accuracy | 92.5 | GLM-130B
Language Modelling | FewCLUE (EPRSTMT) | Accuracy | 88.8 | ERNIE 3.0 Titan-260B
Language Modelling | FewCLUE (OCNLI-FC) | Accuracy | 73.8 | GLM-130B
Language Modelling | FewCLUE (OCNLI-FC) | Accuracy | 53.8 | ERNIE 3.0 Titan-260B
Long-Context Understanding | Ada-LEval (BestAnswer) | 1k | 39.8 | ChatGLM3-6b-32k
Long-Context Understanding | Ada-LEval (BestAnswer) | 2k | 18.8 | ChatGLM3-6b-32k
Long-Context Understanding | Ada-LEval (BestAnswer) | 4k | 9.0 | ChatGLM3-6b-32k
Long-Context Understanding | Ada-LEval (BestAnswer) | 6k | 5.0 | ChatGLM3-6b-32k
Long-Context Understanding | Ada-LEval (BestAnswer) | 8k | 3.4 | ChatGLM3-6b-32k
Long-Context Understanding | Ada-LEval (BestAnswer) | 12k | 0.9 | ChatGLM3-6b-32k
Long-Context Understanding | Ada-LEval (BestAnswer) | 16k | 0.5 | ChatGLM3-6b-32k
Long-Context Understanding | Ada-LEval (BestAnswer) | 1k | 31.2 | ChatGLM2-6b-32k
Long-Context Understanding | Ada-LEval (BestAnswer) | 2k | 10.9 | ChatGLM2-6b-32k
Long-Context Understanding | Ada-LEval (BestAnswer) | 4k | 4.5 | ChatGLM2-6b-32k
Long-Context Understanding | Ada-LEval (BestAnswer) | 6k | 1.6 | ChatGLM2-6b-32k
Long-Context Understanding | Ada-LEval (BestAnswer) | 8k | 1.6 | ChatGLM2-6b-32k
Long-Context Understanding | Ada-LEval (BestAnswer) | 16k | 0.3 | ChatGLM2-6b-32k
Long-Context Understanding | Ada-LEval (TSort) | 2k | 2.3 | ChatGLM3-6b-32k
Long-Context Understanding | Ada-LEval (TSort) | 4k | 2.4 | ChatGLM3-6b-32k
Long-Context Understanding | Ada-LEval (TSort) | 8k | 2.0 | ChatGLM3-6b-32k
Long-Context Understanding | Ada-LEval (TSort) | 16k | 0.7 | ChatGLM3-6b-32k
Long-Context Understanding | Ada-LEval (TSort) | 2k | 0.9 | ChatGLM2-6b-32k
Long-Context Understanding | Ada-LEval (TSort) | 4k | 0.2 | ChatGLM2-6b-32k
Long-Context Understanding | Ada-LEval (TSort) | 8k | 0.7 | ChatGLM2-6b-32k
Long-Context Understanding | Ada-LEval (TSort) | 16k | 0.9 | ChatGLM2-6b-32k

Related Papers

Efficient Deployment of Spiking Neural Networks on SpiNNaker2 for DVS Gesture Recognition Using Neuromorphic Intermediate Representation (2025-09-04)
Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
An End-to-End DNN Inference Framework for the SpiNNaker2 Neuromorphic MPSoC (2025-07-18)
Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
Angle Estimation of a Single Source with Massive Uniform Circular Arrays (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)