Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable Benchmarks · All SotA · Datasets · Papers · Methods

Community

Submit Results · About

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

158 machine learning methods and techniques

All · Audio · Computer Vision · General · Graphs · Natural Language Processing · Reinforcement Learning · Sequential

BPE

Byte Pair Encoding

Byte Pair Encoding, or BPE, is a subword segmentation algorithm that encodes rare and unknown words as sequences of subword units. The intuition is that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). Lei Mao has a detailed blog post that explains how this works.
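The merge procedure can be sketched in a few lines of Python. This is a toy sketch of the algorithm, not the reference implementation; `learn_bpe` and its helpers are illustrative names. Starting from a character-level vocabulary (with an end-of-word marker, as in the BPE paper), the most frequent adjacent symbol pair is merged repeatedly:

```python
from collections import Counter

def get_pair_counts(vocab):
    # vocab: dict mapping a tuple of symbols to the word's corpus frequency
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    # replace every adjacent occurrence of `pair` with the fused symbol
    merged = {}
    a, b = pair
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(word_freqs, num_merges):
    # start from characters, with an end-of-word marker as in the BPE paper
    vocab = {tuple(word) + ("</w>",): f for word, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges
```

On the paper's example vocabulary {low, lower, newest, widest}, the first learned merges are ('e', 's') and ('es', 't'), building up subword units such as "est".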

Natural Language Processing · Introduced 2000 · 18980 papers

Focus

Natural Language Processing · Introduced 2000 · 15341 papers

Transformer

A Transformer is a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. Before Transformers, the dominant sequence transduction models were based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The Transformer also employs an encoder and decoder, but removing recurrence in favor of attention mechanisms allows for significantly more parallelization than methods like RNNs and CNNs.
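The core operation can be sketched in NumPy. This is a minimal single-head sketch for illustration (no masking, batching, or learned projections); the function name is ours, not code from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Each output position is a convex combination of all value vectors, which is what lets the model draw global dependencies in a single step rather than through recurrence.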

Natural Language Processing · Introduced 2000 · 14004 papers

Diffusion

Diffusion models generate samples by gradually removing noise from a signal, and their training objective can be expressed as a reweighted variational lower-bound (https://arxiv.org/abs/2006.11239).
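The forward (noising) side of this process has a closed form, q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I). A minimal NumPy sketch, assuming a DDPM-style linear beta schedule (`forward_diffuse` is an illustrative name):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form (DDPM-style forward process)."""
    alpha_bar = np.cumprod(1.0 - betas)[t]      # abar_t = prod_{s<=t} (1 - beta_s)
    eps = rng.standard_normal(x0.shape)         # the noise the model must predict
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps
```

Training then amounts to regressing eps from (x_t, t) with a mean-squared error, which is the reweighted variational bound mentioned above; sampling runs the learned denoiser in reverse.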

Natural Language Processing · Introduced 2000 · 13850 papers

BERT

BERT, or Bidirectional Encoder Representations from Transformers, improves upon standard Transformers by removing the unidirectionality constraint through a masked language model (MLM) pre-training objective. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of each masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows pre-training a deep bidirectional Transformer. In addition to the masked language model, BERT uses a next sentence prediction task that jointly pre-trains text-pair representations.

There are two steps in BERT: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has a separate fine-tuned model, even though they are all initialized with the same pre-trained parameters.
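The MLM corruption step can be sketched as follows. This is a simplified sketch of the 80/10/10 recipe described in the BERT paper (select ~15% of positions; of those, 80% become [MASK], 10% a random token, 10% stay unchanged); the function and token names are illustrative:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style MLM corruption of a token sequence."""
    rng = rng or random.Random(0)
    corrupted = list(tokens)
    labels = [None] * len(tokens)       # None = position is not predicted
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok             # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token (helps reduce train/test mismatch)
    return corrupted, labels
```

The loss is computed only at positions where `labels` is set, so the model cannot simply copy the input at predicted positions.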

Natural Language Processing · Introduced 2000 · 6938 papers

GPT-4

GPT-4 is a transformer based model pre-trained to predict the next token in a document.

Natural Language Processing · Introduced 2000 · 2871 papers

RAG

Retrieval-Augmented Generation, or RAG, is a type of language generation model that combines pre-trained parametric and non-parametric memory for language generation. Specifically, the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. For a query x, Maximum Inner Product Search (MIPS) is used to find the top-K documents z_i. For the final prediction y, the retrieved document is treated as a latent variable and marginalized out over the seq2seq predictions given the different documents.
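The retrieval and marginalization steps can be sketched with brute-force MIPS in NumPy. This is a toy sketch: real systems use approximate MIPS libraries, and the per-document generator probabilities here are stand-ins for seq2seq outputs; the function names are illustrative:

```python
import numpy as np

def mips_top_k(query, doc_vecs, k):
    """Exact maximum inner product search over a dense document index."""
    scores = doc_vecs @ query
    top = np.argsort(-scores)[:k]
    return top, scores[top]

def rag_marginalize(doc_scores, gen_probs):
    """p(y|x) = sum_z p(z|x) p(y|x,z): weight each document's generator
    probability by the softmax of its retrieval score."""
    w = np.exp(doc_scores - doc_scores.max())
    w /= w.sum()
    return float(w @ gen_probs)
```

With equal retrieval scores this reduces to averaging the generator probabilities; sharper scores concentrate the prediction on the best-matching documents.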

Natural Language Processing · Introduced 2000 · 1286 papers

GPT

GPT is a Transformer-based architecture and training procedure for natural language processing tasks. Training follows a two-stage procedure. First, a language modeling objective is used on the unlabeled data to learn the initial parameters of a neural network model. Subsequently, these parameters are adapted to a target task using the corresponding supervised objective.

Natural Language Processing · Introduced 2000 · 1212 papers

LLaMA

LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. It is based on the transformer architecture with various improvements that were subsequently proposed. The main differences from the original architecture are:

- RMSNorm is used as the normalizing function to improve training stability, normalizing the input of each transformer sub-layer instead of the output.
- The ReLU non-linearity is replaced by the SwiGLU activation function to improve performance.
- Absolute positional embeddings are removed; instead, rotary positional embeddings (RoPE) are added at each layer of the network.

Natural Language Processing · Introduced 2000 · 1062 papers

GPT-2

GPT-2 is a Transformer architecture that was notable for its size (1.5 billion parameters) on its release. The model is pretrained on WebText, a dataset of text scraped from 45 million website links. It largely follows the previous GPT architecture with some modifications:

- Layer normalization is moved to the input of each sub-block, similar to a pre-activation residual network, and an additional layer normalization is added after the final self-attention block.
- A modified initialization accounts for the accumulation on the residual path with model depth: weights of residual layers are scaled at initialization by a factor of 1/sqrt(N), where N is the number of residual layers.
- The vocabulary is expanded to 50,257 tokens. The context size is expanded from 512 to 1024 tokens, and a larger batch size of 512 is used.

Natural Language Processing · Introduced 2000 · 768 papers

T5

T5, or Text-to-Text Transfer Transformer, is a Transformer-based architecture that uses a text-to-text approach. Every task – including translation, question answering, and classification – is cast as feeding the model text as input and training it to generate some target text. This allows the same model, loss function, hyperparameters, etc. to be used across a diverse set of tasks. The changes compared to BERT include:

- adding a causal decoder to the bidirectional architecture, and
- replacing the fill-in-the-blank cloze task with a mix of alternative pre-training tasks.

Natural Language Processing · Introduced 2000 · 708 papers

Seq2Seq

Sequence to Sequence

Seq2Seq, or Sequence To Sequence, is a model used in sequence prediction tasks, such as language modelling and machine translation. The idea is to use one LSTM, the encoder, to read the input sequence one timestep at a time to obtain a large fixed-dimensional vector representation (a context vector), and then to use another LSTM, the decoder, to extract the output sequence from that vector. The second LSTM is essentially a recurrent neural network language model, except that it is conditioned on the input sequence. (Note that this page refers to the original seq2seq model, not sequence-to-sequence models in general.)

Natural Language Processing · Introduced 2000 · 700 papers

GloVe

GloVe Embeddings

GloVe Embeddings are a type of word embedding that encode the co-occurrence probability ratio between two words as vector differences. GloVe uses a weighted least squares objective J that minimizes the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences:

J = sum_{i,j=1}^{V} f(X_ij) (w_i^T w~_j + b_i + b~_j − log X_ij)^2

where w_i and b_i are the word vector and bias respectively of word i, w~_j and b~_j are the context word vector and bias respectively of word j, X_ij is the number of times word j occurs in the context of word i, and f is a weighting function that assigns lower weights to rare and frequent co-occurrences.
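The objective can be written directly as a loop over nonzero co-occurrence counts. A sketch for illustration (`glove_loss` is an illustrative name), using the paper's weighting function f(x) = min((x / x_max)^alpha, 1) with the paper's defaults x_max = 100 and alpha = 0.75:

```python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """J = sum_{ij} f(X_ij) (w_i . w~_j + b_i + b~_j - log X_ij)^2,
    summed over nonzero entries of the co-occurrence matrix X."""
    total = 0.0
    for i, j in zip(*np.nonzero(X)):
        f = min((X[i, j] / x_max) ** alpha, 1.0)
        err = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
        total += f * err * err
    return total
```

In training, gradients of this loss with respect to W, W_ctx, and the biases are followed with AdaGrad; the final embedding is typically the sum W + W_ctx.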

Natural Language Processing · Introduced 2000 · 357 papers

OPT

OPT is a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters. The models use the AdamW optimizer with a weight decay of 0.1 and follow a linear learning rate schedule, warming up from 0 to the maximum learning rate over the first 2000 steps in OPT-175B, or over 375M tokens in the smaller models, and decaying down to 10% of the maximum learning rate over 300B tokens. The batch sizes range from 0.5M to 4M tokens depending on the model size and are kept constant throughout the course of training.

Natural Language Processing · Introduced 2000 · 285 papers

CAM

Class-activation map

Class activation maps (CAMs) can be used to interpret the prediction decision made by a convolutional neural network (CNN). Image source: Learning Deep Features for Discriminative Localization.

Natural Language Processing · Introduced 2000 · 267 papers

fastText

fastText embeddings exploit subword information to construct word embeddings. Representations are learnt for character n-grams, and words are represented as the sum of their n-gram vectors. This extends word2vec-type models with subword information, which helps the embeddings capture suffixes and prefixes. Once a word is represented using character n-grams, a skipgram model is trained to learn the embeddings.
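The n-gram extraction step can be sketched as follows (illustrative function name; fastText hashes n-grams into a fixed number of buckets rather than storing them as strings, which this sketch omits). Boundary markers < and > distinguish prefixes and suffixes from interior substrings:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, plus the full marked word,
    in the style of fastText's subword decomposition."""
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)           # the whole word is kept as its own unit
    return grams
```

For "where" with n = 3 this yields units such as "<wh", "whe", "her", "ere", "re>", so "her" as a substring of "where" is distinct from the standalone word "<her>".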

Natural Language Processing · Introduced 2000 · 240 papers

ELMo

Embeddings from Language Models, or ELMo, is a type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. A biLM combines both a forward and backward LM; ELMo jointly maximizes the log likelihood of the forward and backward directions. To add ELMo to a supervised model, the weights of the biLM are frozen, the ELMo vector ELMo_k is concatenated with the context-independent token representation x_k of each token position k, and the ELMo-enhanced representation is passed into the task RNN.

Natural Language Processing · Introduced 2000 · 234 papers

mBERT

mBERT

Natural Language Processing · Introduced 2000 · 198 papers

XLM-R

XLM-R

Natural Language Processing · Introduced 2000 · 176 papers

Electric

Electric is an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts, but Electric does not use masking or output a full distribution over tokens that could occur in a context. Instead, it assigns a scalar energy score to each input token indicating how likely it is given its context. Specifically, Electric models p(x_t | x_{−t}), the probability of token x_t given its surrounding context x_{−t}, without masking or a softmax layer. Electric first maps the unmasked input x = [x_1, ..., x_n] into contextualized vector representations h(x) = [h_1, ..., h_n] using a transformer network. It assigns position t an energy score E(x)_t = w^T h(x)_t using a learned weight vector w. The energy function defines a distribution over the possible tokens at position t as

p(x_t | x_{−t}) = exp(−E(x)_t) / Σ_{x' ∈ V} exp(−E(REPLACE(x, t, x'))_t)

where REPLACE(x, t, x') denotes replacing the token at position t with x', and V is the vocabulary, in practice usually word pieces. Unlike BERT, which produces the probabilities for all possible tokens with a softmax layer, each candidate x' must be passed as input to the transformer. As a result, computing p exactly is prohibitively expensive because the partition function requires running the transformer |V| times; unlike most EBMs, the intractability is due more to the expensive scoring function than to a large sample space.

Natural Language Processing · Introduced 2000 · 167 papers

BLOOM

BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total).

Natural Language Processing · Introduced 2000 · 116 papers

Flan-T5

Flan-T5 is the instruction fine-tuned version of T5, the Text-to-Text Transfer Transformer language model.

Natural Language Processing · Introduced 2000 · 107 papers

MTS

Matching The Statements

Natural Language Processing · Introduced 2000 · 107 papers

Performer

Performer is a Transformer architecture which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. To approximate softmax attention-kernels, Performers use a Fast Attention Via positive Orthogonal Random features approach (FAVOR+), leveraging new methods for approximating softmax and Gaussian kernels.
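The FAVOR+ idea can be sketched in NumPy: positive random features phi(x) = exp(w·x − ||x||²/2) / sqrt(m), with w ~ N(0, I), satisfy E[phi(x)·phi(y)] = exp(x·y), so softmax attention can be rearranged as phi(Q) (phi(K)^T V) in time linear in sequence length. A simplified, unbatched sketch under those assumptions (illustrative names; the full method also uses orthogonal random features and further stabilization):

```python
import numpy as np

def favor_plus_features(X, W):
    """Positive random features for the softmax kernel:
    phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m), so E[phi(x).phi(y)] = exp(x.y)."""
    m = W.shape[0]
    proj = X @ W.T                                  # (n, m) random projections
    norms = 0.5 * (X ** 2).sum(axis=-1, keepdims=True)
    return np.exp(proj - norms) / np.sqrt(m)

def performer_attention(Q, K, V, W):
    """Linear-time approximation of softmax attention via FAVOR+ features."""
    d = Q.shape[-1]
    q = favor_plus_features(Q / d ** 0.25, W)       # fold 1/sqrt(d) into inputs
    k = favor_plus_features(K / d ** 0.25, W)
    num = q @ (k.T @ V)                             # never forms the n x n matrix
    den = q @ k.sum(axis=0)                         # row-wise normalizer
    return num / den[:, None]
```

Because phi is positive, the normalizer is always positive, which is what makes the estimator stable compared with earlier random-feature approximations of softmax.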

Natural Language Processing · Introduced 2000 · 103 papers

Patching

Activation Patching

Activation patching studies a model's computation by altering its latent representations (e.g., the token embeddings in transformer-based language models) during inference.

Natural Language Processing · Introduced 2000 · 102 papers

DeBERTa

DeBERTa is a Transformer-based neural language model that aims to improve the BERT and RoBERTa models with two techniques: a disentangled attention mechanism and an enhanced mask decoder. The disentangled attention mechanism is where each word is represented unchanged using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangle matrices on their contents and relative positions. The enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve model’s generalization on downstream tasks.

Natural Language Processing · Introduced 2000 · 90 papers

Longformer

Longformer is a modified Transformer architecture. Traditional Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this, Longformer uses an attention pattern that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. The attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. The attention patterns utilised include: sliding window attention, dilated sliding window attention and global + sliding window. These can be viewed in the components section of this page.
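The sliding-window pattern with a few global tokens can be illustrated as a boolean attention mask. This is a simplified sketch (real implementations compute banded attention directly instead of materializing a dense n × n mask); the function name is illustrative:

```python
import numpy as np

def sliding_window_mask(n, window, global_idx=()):
    """Boolean mask: position i may attend to j iff |i - j| <= window,
    and positions in global_idx attend (and are attended to) everywhere."""
    i = np.arange(n)
    mask = np.abs(i[:, None] - i[None, :]) <= window
    for g in global_idx:
        mask[g, :] = True   # global token attends to all positions
        mask[:, g] = True   # all positions attend to the global token
    return mask
```

Each row has at most 2·window + 1 local entries (plus the global tokens), so the cost of attention grows linearly with sequence length instead of quadratically.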

Natural Language Processing · Introduced 2000 · 87 papers

Adaptive Input Representations

Adaptive Input Embeddings extend the adaptive softmax to input word representations. The factorization assigns more capacity to frequent words and reduces the capacity for less frequent words with the benefit of reducing overfitting to rare words.

Natural Language Processing · Introduced 2000 · 66 papers

SBERT

Sentence-BERT

Natural Language Processing · Introduced 2000 · 64 papers

Transformer-XL

Transformer-XL (meaning extra long) is a Transformer architecture that introduces the notion of recurrence to the deep self-attention network. Instead of computing the hidden states from scratch for each new segment, Transformer-XL reuses the hidden states obtained in previous segments. The reused hidden states serve as memory for the current segment, which builds up a recurrent connection between the segments. As a result, modeling very long-term dependency becomes possible because information can be propagated through the recurrent connections. As an additional contribution, the Transformer-XL uses a new relative positional encoding formulation that generalizes to attention lengths longer than the one observed during training.

Natural Language Processing · Introduced 2000 · 64 papers

Pythia

Pythia is a suite of decoder-only autoregressive language models all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. The model architecture and hyperparameters largely follow GPT-3, with a few notable deviations based on recent advances in best practices for large scale language modeling.

Natural Language Processing · Introduced 2000 · 60 papers

XLM

XLM is a Transformer-based architecture that is pre-trained using one of three language modelling objectives:

1. Causal Language Modeling (CLM) – models the probability of a word given the previous words in a sentence.
2. Masked Language Modeling (MLM) – the masked language modeling objective of BERT.
3. Translation Language Modeling (TLM) – a (new) translation language modeling objective for improving cross-lingual pre-training.

The authors find that both the CLM and MLM approaches provide strong cross-lingual features that can be used for pretraining models.

Natural Language Processing · Introduced 2000 · 57 papers

ERNIE

ERNIE is a transformer-based model consisting of two stacked modules: 1) a textual encoder and 2) a knowledgeable encoder, which is responsible for integrating extra token-oriented knowledge information into the textual information. The knowledgeable encoder consists of stacked aggregators designed to encode both tokens and entities and to fuse their heterogeneous features. To integrate knowledge into the representations, a special pre-training task is adopted for ERNIE: it involves randomly masking token-entity alignments and training the model to predict all corresponding entities based on the aligned tokens (aka a denoising entity auto-encoder).

Natural Language Processing · Introduced 2000 · 54 papers

PEGASUS

PEGASUS proposes a transformer-based model for abstractive summarization. It uses a special self-supervised pre-training objective called gap-sentences generation (GSG) that's designed to perform well on summarization-related downstream tasks. As reported in the paper, "both GSG and MLM are applied simultaneously to this example as pre-training objectives. Originally there are three sentences. One sentence is masked with [MASK1] and used as target generation text (GSG). The other two sentences remain in the input, but some tokens are randomly masked by [MASK2]."

Natural Language Processing · Introduced 2000 · 53 papers

SimCSE

SimCSE is a contrastive learning framework for generating sentence embeddings. It first uses an unsupervised approach, which takes an input sentence and predicts itself under a contrastive objective, with only standard dropout used as noise. The authors find that dropout acts as minimal "data augmentation" of the hidden representations, while removing it leads to representation collapse. A supervised approach is then used, which incorporates annotated pairs from natural language inference datasets into the contrastive framework, using "entailment" pairs as positives and "contradiction" pairs as hard negatives.
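The contrastive objective can be sketched in NumPy, given two embedding matrices for the same batch of sentences (e.g., from two dropout passes of the encoder). `simcse_loss` is an illustrative name; the temperature default matches the paper's 0.05:

```python
import numpy as np

def simcse_loss(z1, z2, temperature=0.05):
    """InfoNCE over a batch: z1[i] and z2[i] embed the same sentence (the
    positive pair); the other in-batch sentences serve as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = (z1 @ z2.T) / temperature                   # cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)        # stable log-softmax
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # cross-entropy on diagonal
```

The loss is near zero when each sentence is far more similar to its own second view than to any other sentence in the batch.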

Natural Language Processing · Introduced 2000 · 52 papers

GLM

GLM is a bilingual (English and Chinese) pre-trained transformer-based language model that follows the traditional decoder-only autoregressive language modeling architecture. It leverages autoregressive blank infilling as its training objective.

Natural Language Processing · Introduced 2000 · 46 papers

SRS

Sticker Response Selector

Sticker Response Selector, or SRS, is a model for multi-turn dialog that automatically selects a sticker response. SRS first employs a convolution-based sticker image encoder and a self-attention-based multi-turn dialog encoder to obtain representations of the stickers and utterances. Next, a deep interaction network conducts deep matching between the sticker and each utterance in the dialog history. SRS then learns the short-term and long-term dependencies between all interaction results with a fusion network to output the final matching score.

Natural Language Processing · Introduced 2000 · 45 papers

ULMFiT

Universal Language Model Fine-tuning

Universal Language Model Fine-tuning, or ULMFiT, is an architecture and transfer learning method that can be applied to NLP tasks. It uses a 3-layer AWD-LSTM architecture for its representations. Training consists of three steps: 1) general language model pre-training on Wikipedia-based text, 2) fine-tuning the language model on the target task, and 3) fine-tuning the classifier on the target task. As different layers capture different types of information, they are fine-tuned to different extents using discriminative fine-tuning. Training uses slanted triangular learning rates (STLR), a learning rate schedule that first linearly increases the learning rate and then linearly decays it. Fine-tuning the target classifier is done with gradual unfreezing: rather than fine-tuning all layers at once, which risks catastrophic forgetting, ULMFiT gradually unfreezes the model starting from the last layer (i.e., the one closest to the output), as this contains the least general knowledge. First the last layer is unfrozen and all unfrozen layers are fine-tuned for one epoch; then the next frozen layer group is unfrozen and fine-tuned, repeating until all layers are fine-tuned to convergence at the last iteration.
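The slanted triangular schedule can be written directly from the paper's formula, using its default hyperparameters cut_frac = 0.1 and ratio = 32 (the function name and lr_max value are illustrative):

```python
def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate (ULMFiT): a short linear warm-up over
    the first cut_frac of the T training steps, then a long linear decay back
    down to lr_max / ratio."""
    cut = int(T * cut_frac)                       # step at which the peak occurs
    if t < cut:
        p = t / cut                               # warm-up fraction
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # decay fraction
    return lr_max * (1 + p * (ratio - 1)) / ratio
```

With T = 1000 steps the rate rises from lr_max/32 to lr_max at step 100, then decays linearly back to lr_max/32 at step 1000, matching the short-increase/long-decay shape the method calls for.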

Natural Language Processing · Introduced 2000 · 40 papers

PCT

Perceptual control theoretic architecture

Natural Language Processing · Introduced 2000 · 38 papers

GPT-Neo

GPT-Neo is an implementation of model- and data-parallel GPT-3-like models using the mesh-tensorflow library. Source: EleutherAI/GPT-Neo

Natural Language Processing · Introduced 2000 · 38 papers

ETC

Extended Transformer Construction

Extended Transformer Construction, or ETC, is an extension of the Transformer architecture with a new attention mechanism that extends the original in two main ways: (1) it allows scaling up the input length from 512 to several thousand tokens; and (2) it can ingest structured inputs instead of just linear sequences. The key ideas that enable this are a new global-local attention mechanism coupled with relative position encodings. ETC also allows lifting weights from existing BERT models, saving computational resources during training.

Natural Language Processing · Introduced 2000 · 36 papers

DART

Difficulty-Aware Rejection Tuning

🎯 DART-Math Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving 📝 Paper@arXiv | 🤗 Datasets&Models@HF | 🐱 Code@GitHub 🐦 Thread@X(Twitter) | 🐶 中文博客@知乎 | 📊 Leaderboard@PapersWithCode | 📑 BibTeX Datasets: DART-Math DART-Math datasets are the state-of-the-art and data-efficient open-source instruction tuning datasets for mathematical reasoning. DART-Math-Hard contains \585k mathematical QA pair samples constructed by applying DARS-Prop2Diff to the query set from MATH and GSK8K training sets, achieves SOTA on many challenging mathematical reasoning benchmarks. It introduces a deliberate bias towards hard queries, opposite to vanilla rejection sampling. Performance produced by DART-Math-Hard is usually but not necessarily slightly better (\1% absolutely) than DART-Math-Uniform, which contains \591k samples constructed by applying DARS-Uniform. Comparison between Mathematical Instruction Tuning Datasets Most of previous datasets are constructed with ChatGPT, and many of them are not open-source, especially for ones of the best performance. 
| Math SFT Dataset | # of Samples | MATH | GSM8K | College | Synthesis Agent(s) | Open-Source |
| :--- | ---: | ---: | ---: | ---: | :--- | :---: |
| WizardMath | 96k | 32.3 | 80.4 | 23.1 | GPT-4 | ✗ |
| MetaMathQA | 395k | 29.8 | 76.5 | 19.3 | GPT-3.5 | ✓ |
| MMIQC | 2294k | 37.4 | 75.4 | 28.5 | GPT-4+GPT-3.5+Human | ✓ |
| Orca-Math | 200k | -- | -- | -- | GPT-4 | ✓ |
| Xwin-Math-V1.1 | 1440k | 45.5 | 84.9 | 27.6 | GPT-4 | ✗ |
| KPMath-Plus | 1576k | 46.8 | 82.1 | -- | GPT-4 | ✗ |
| MathScaleQA | 2021k | 35.2 | 74.8 | 21.8 | GPT-3.5+Human | ✗ |
| DART-Math-Uniform | 591k | 43.5 | 82.6 | 26.9 | DeepSeekMath-7B-RL | ✓ |
| DART-Math-Hard | 585k | 45.5 | 81.1 | 29.4 | DeepSeekMath-7B-RL | ✓ |

<sup>MATH and GSM8K are in-domain, while College (Math) is out-of-domain. Performance here is of models fine-tuned from Mistral-7B, except for Xwin-Math-V1.1, which is based on Llama2-7B. Bold/italic marks the best/second-best score.</sup>

Dataset Construction: DARS - Difficulty-Aware Rejection Sampling

Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries. Motivated by this observation, we propose Difficulty-Aware Rejection Sampling (DARS) to collect more responses for more difficult queries.
Specifically, we introduce two strategies to increase the number of correct responses for difficult queries: (1) Uniform, which involves sampling responses for each query until each query accumulates k_u correct responses, where k_u is a preset hyperparameter determined by the desired size of the synthetic dataset; (2) Prop2Diff, where we continue sampling responses until the number of correct responses for each query is proportional to its difficulty score. The most challenging queries will receive k_p responses, where k_p is a hyperparameter. This method introduces a deliberate bias in the opposite direction to vanilla rejection sampling, towards more difficult queries, inspired by previous works demonstrating that difficult samples can be more effective for enhancing model capabilities (Sorscher et al., 2022; Liu et al., 2024b). See Figure 1 (Right) for examples of DART-Math-Uniform by DARS-Uniform and DART-Math-Hard by DARS-Prop2Diff.

Citation

If you find our data, model or code useful for your work, please kindly cite our paper:

```latex
@article{tong2024dartmath,
  title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving},
  author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He},
  year={2024},
  eprint={2407.13690},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.13690},
}
```
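The two DARS budget-allocation strategies can be sketched as follows. This is a hedged illustration, not the DART-Math pipeline: the `sample_response` and `is_correct` callables are hypothetical placeholders for the model sampler and answer checker, and the proportional-target formula is a simplification.

```python
def dars_collect(queries, difficulty, k_u=4, k_p=8, mode="uniform",
                 sample_response=None, is_correct=None, max_tries=100):
    """Collect verified-correct responses per query under DARS-style budgets.

    mode="uniform":   every query needs k_u correct responses.
    mode="prop2diff": per-query targets scale with difficulty, so the
                      hardest query gets k_p correct responses.
    """
    max_d = max(difficulty.values())
    dataset = []
    for q in queries:
        if mode == "uniform":
            target = k_u
        else:  # prop2diff: proportional to the query's difficulty score
            target = max(1, round(k_p * difficulty[q] / max_d))
        correct, tries = [], 0
        while len(correct) < target and tries < max_tries:
            tries += 1
            r = sample_response(q)
            if is_correct(q, r):        # rejection sampling keeps only
                correct.append((q, r))  # responses that verify as correct
        dataset.extend(correct)
    return dataset
```

Unlike vanilla rejection sampling, where easy queries dominate because they yield correct responses more often, the per-query target caps easy queries and forces continued sampling on hard ones.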

Natural Language Processing · Introduced 2000 · 32 papers

CodeT5

CodeT5 is a Transformer-based model for code understanding and generation based on the T5 architecture. It utilizes an identifier-aware pre-training objective that considers the crucial token type information (identifiers) from code. Specifically, the denoising Seq2Seq objective of T5 is extended with two identifier tagging and prediction tasks that enable the model to better leverage token type information from programming languages, namely the identifiers assigned by developers. To improve natural language-programming language alignment, a bimodal dual learning objective is used for bidirectional conversion between natural language and programming language.
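As an illustration of the identifier-tagging target, each code token can be labeled by whether it is a developer-chosen identifier. The toy labeler below uses Python's standard-library tokenizer and is an assumption for exposition only; CodeT5 itself tags subword tokens across multiple programming languages.

```python
import io
import keyword
import tokenize

def identifier_tags(code):
    """Label each Python token 1 if it is a developer-chosen identifier,
    else 0. Toy stand-in for an identifier-tagging training target;
    not CodeT5's actual subword-level labeling.
    """
    toks, tags = [], []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        # skip layout-only tokens that carry no labelable content
        if tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER,
                        tokenize.INDENT, tokenize.DEDENT):
            continue
        toks.append(tok.string)
        is_ident = tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)
        tags.append(1 if is_ident else 0)
    return toks, tags
```

For `"def add(a, b): return a + b"` this tags `add`, `a`, and `b` as identifiers while keywords like `def` and `return`, and operators, get label 0 — the kind of token-type signal the pre-training objective exposes to the model.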

Natural Language Processing · Introduced 2000 · 32 papers

USE

Multilingual Universal Sentence Encoder

Natural Language Processing · Introduced 2000 · 26 papers

BLOOMZ

BLOOMZ is a multitask prompted finetuning (MTF) variant of BLOOM.

Natural Language Processing · Introduced 2000 · 26 papers

CodeGen

CodeGen is an autoregressive Transformer trained with a next-token prediction language modeling objective on a natural language corpus and programming language data curated from GitHub.
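The next-token prediction objective amounts to averaging the negative log-softmax probability assigned to each observed token. A minimal pure-Python sketch of that loss (illustrative only, not the CodeGen training code):

```python
import math

def next_token_nll(logits, target_ids):
    """Average next-token negative log-likelihood, the language-modeling
    objective autoregressive Transformers like CodeGen are trained with.

    `logits[t]` holds vocabulary scores for position t+1, and
    `target_ids[t]` is the token actually observed at that position.
    """
    total = 0.0
    for scores, target in zip(logits, target_ids):
        # log-partition of the softmax over the vocabulary
        log_z = math.log(sum(math.exp(s) for s in scores))
        # -log softmax(scores)[target] = log_z - scores[target]
        total += log_z - scores[target]
    return total / len(target_ids)
```

With uniform scores over a 2-token vocabulary the loss is ln 2; sharply peaked scores on the correct token drive it toward zero.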

Natural Language Processing · Introduced 2000 · 25 papers