Lucas Georges Gabriel Charpentier, David Samuel
This paper introduces a novel modification of the transformer architecture, tailored for data-efficient pretraining of language models. We evaluate data efficiency by participating in the BabyLM challenge, where our solution won both the strict and strict-small tracks. Our approach allows each transformer layer to select which outputs of previous layers to process. The empirical results verify the potential of this simple modification and show that not all layers are equally important.
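One way to read the layer-selection idea above is as a learned weighted combination of the outputs of all previous layers, which each layer receives as its input instead of only the output of the layer directly below. The PyTorch sketch below is a minimal illustration under that reading; the module name `LayerWeightedCombination`, the softmax normalisation, and the zero initialisation are illustrative assumptions (the latter loosely motivated by the "(zero init)" variant in the results table), not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class LayerWeightedCombination(nn.Module):
    """Hypothetical sketch: layer i receives a learned weighted sum of the
    outputs of the embedding and all previous layers, rather than only the
    output of layer i-1."""

    def __init__(self, num_previous: int):
        super().__init__()
        # One learnable scalar per previous output; zero initialisation is an
        # assumption for illustration.
        self.weights = nn.Parameter(torch.zeros(num_previous))

    def forward(self, previous_outputs: list[torch.Tensor]) -> torch.Tensor:
        # previous_outputs: [embedding, layer_1, ..., layer_{i-1}],
        # each of shape (batch, seq_len, hidden_dim)
        coeffs = torch.softmax(self.weights, dim=0)           # normalised weights
        stacked = torch.stack(previous_outputs, dim=0)        # (num_previous, batch, seq, dim)
        return torch.einsum("p,pbsd->bsd", coeffs, stacked)   # weighted combination
```

A transformer stack would then feed `LayerWeightedCombination(i)(all_previous_outputs)` into layer `i`; how the weights are normalised and initialised in the actual ELC-BERT models is not specified in this summary.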
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Natural Language Inference | RTE | Accuracy | 63.0 | ELC-BERT-base 98M (zero init) |
| Natural Language Inference | RTE | Accuracy | 55.4 | ELC-BERT-small 24M |
| Natural Language Inference | RTE | Accuracy | 54.7 | LTG-BERT-base 98M |
| Natural Language Inference | RTE | Accuracy | 53.7 | LTG-BERT-small 24M |
| Natural Language Inference | MultiNLI | Matched Accuracy | 84.4 | ELC-BERT-base 98M (zero init) |
| Natural Language Inference | MultiNLI | Mismatched Accuracy | 84.5 | ELC-BERT-base 98M (zero init) |
| Natural Language Inference | MultiNLI | Matched Accuracy | 83.0 | LTG-BERT-base 98M |
| Natural Language Inference | MultiNLI | Mismatched Accuracy | 83.4 | LTG-BERT-base 98M |
| Natural Language Inference | MultiNLI | Matched Accuracy | 79.2 | ELC-BERT-small 24M |
| Natural Language Inference | MultiNLI | Mismatched Accuracy | 79.9 | ELC-BERT-small 24M |
| Natural Language Inference | MultiNLI | Matched Accuracy | 78.0 | LTG-BERT-small 24M |
| Natural Language Inference | MultiNLI | Mismatched Accuracy | 78.8 | LTG-BERT-small 24M |
| Linguistic Acceptability | CoLA | Accuracy | 82.7 | LTG-BERT-base 98M |
| Linguistic Acceptability | CoLA | Accuracy | 82.6 | ELC-BERT-base 98M |
| Linguistic Acceptability | CoLA | Accuracy | 77.6 | LTG-BERT-small 24M |
| Linguistic Acceptability | CoLA | Accuracy | 76.1 | ELC-BERT-small 24M |