Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

158 machine learning methods and techniques

Discriminative Adversarial Search

Discriminative Adversarial Search, or DAS, is a sequence decoding approach which aims to alleviate the effects of exposure bias and to optimize on the data distribution itself rather than for external metrics. Inspired by generative adversarial networks (GANs), wherein a discriminator is used to improve the generator, DAS differs from GANs in that the generator parameters are not updated at training time and the discriminator is only used to drive sequence generation at inference time.

Natural Language Processing · Introduced 2000 · 2 papers

DualCL

Dual Contrastive Learning

Contrastive learning has achieved remarkable success in representation learning via self-supervision in unsupervised settings. However, effectively adapting contrastive learning to supervised learning tasks remains a challenge in practice. In this work, we introduce a dual contrastive learning (DualCL) framework that simultaneously learns the features of input samples and the parameters of classifiers in the same space. Specifically, DualCL regards the parameters of the classifiers as augmented samples associated with different labels and then exploits contrastive learning between the input samples and the augmented samples. Empirical studies on five benchmark text classification datasets and their low-resource versions demonstrate the improvement in classification accuracy and confirm DualCL's ability to learn discriminative representations.
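
A rough numpy sketch of the sample-to-classifier direction of this idea (the full DualCL objective is dual, also contrasting classifier features against samples; the function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def dual_contrastive_loss(feats, cls_feats, labels, tau=0.1):
    """Sketch of a DualCL-style loss term: classifier parameters
    (cls_feats, one vector per label) are treated as augmented samples
    living in the same space as input features, and each input feature
    is pulled toward the classifier vector of its own label."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    cls_feats = cls_feats / np.linalg.norm(cls_feats, axis=1, keepdims=True)
    logits = feats @ cls_feats.T / tau              # (batch, num_labels)
    # softmax cross-entropy against the true label's classifier vector
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

Aligning features with the correct classifier vectors drives this term toward zero; misalignment inflates it.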

Natural Language Processing · Introduced 2000 · 2 papers

Sandwich Transformer

A Sandwich Transformer is a variant of a Transformer that reorders sublayers in the architecture to achieve better performance. The reordering is based on the authors' analysis that models with more self-attention toward the bottom and more feedforward sublayers toward the top tend to perform better in general.
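
The reordering follows the pattern $s^k (sf)^{n-k} f^k$ from the paper, where $s$ is a self-attention sublayer, $f$ a feedforward sublayer, and $k$ is the sandwich coefficient. A one-line sketch (function name ours):

```python
def sandwich_order(n, k):
    """Sublayer ordering of a sandwich transformer with 2n sublayers and
    sandwich coefficient k: k self-attention ('s') sublayers at the bottom,
    n-k interleaved ('s','f') pairs in the middle, and k feedforward ('f')
    sublayers at the top."""
    return ["s"] * k + ["s", "f"] * (n - k) + ["f"] * k
```

With k = 0 this reduces to the standard interleaved Transformer ordering.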

Natural Language Processing · Introduced 2000 · 2 papers

Funnel Transformer

Funnel Transformer is a type of Transformer that gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. By re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, the model capacity is further improved. In addition, to perform token-level predictions as required by common pretraining objectives, Funnel Transformer is able to recover a deep representation for each token from the reduced hidden sequence via a decoder. The model keeps the same overall skeleton of interleaved S-Attn and P-FFN sub-modules wrapped by residual connection and layer normalization. Differently, to achieve representation compression and computation reduction, the model employs an encoder that gradually reduces the sequence length of the hidden states as the layer gets deeper. In addition, for tasks involving per-token predictions like pretraining, a simple decoder is used to reconstruct a full sequence of token-level representations from the compressed encoder output. Compression is achieved via a pooling operation.
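
A toy numpy sketch of the length compression step (simplified: the paper pools only the attention query so that keys and values keep the unpooled sequence; here the whole hidden sequence is pooled, and the function name is ours):

```python
import numpy as np

def pool_hidden(h, stride=2):
    """Mean-pool a (seq_len, d) hidden-state sequence along the length
    dimension, halving the sequence the way Funnel Transformer's encoder
    compresses hidden states between blocks."""
    seq_len, d = h.shape
    trim = seq_len - seq_len % stride       # drop a ragged tail, if any
    return h[:trim].reshape(-1, stride, d).mean(axis=1)
```

Stacking blocks of attention around such pooling steps yields the funnel shape; the decoder later up-samples back to full length for token-level objectives.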

Natural Language Processing · Introduced 2000 · 2 papers

MacBERT

MacBERT is a Transformer-based model for Chinese NLP that alters RoBERTa in several ways, most notably with a modified masking strategy. MacBERT shares the same pre-training tasks as BERT, with several modifications to the MLM task:

- Whole word masking and n-gram masking strategies are used for selecting candidate tokens, with percentages of 40%, 30%, 20%, and 10% for word-level unigrams through 4-grams.
- Instead of masking with the [MASK] token, which never appears in the fine-tuning stage, similar words are used for masking. A similar word is obtained with the Synonyms toolkit, which is based on word2vec similarity calculations. If an n-gram is selected for masking, similar words are found individually; in rare cases, when there is no similar word, the strategy degrades to random word replacement.
- 15% of input words are selected for masking, of which 80% are replaced with similar words, 10% are replaced with a random word, and the remaining 10% are kept unchanged.
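
The corruption procedure can be sketched as follows (a hypothetical simplification: the `synonyms` dict stands in for the Synonyms toolkit, span gaps and names are ours, and real whole-word masking operates on word boundaries):

```python
import random

NGRAM_P = [0.4, 0.3, 0.2, 0.1]  # unigram .. 4-gram selection weights

def mac_mask(tokens, synonyms, mask_rate=0.15, seed=0):
    """Sketch of MacBERT-style MLM corruption: select n-gram spans until
    ~15% of tokens are covered; replace 80% of selected tokens with a
    similar word, 10% with a random word, and keep 10% unchanged."""
    rng = random.Random(seed)
    out, vocab = list(tokens), list(tokens)
    budget, i, masked = max(1, int(len(tokens) * mask_rate)), 0, []
    while budget > 0 and i < len(tokens):
        n = rng.choices([1, 2, 3, 4], weights=NGRAM_P)[0]
        for j in range(i, min(i + n, len(tokens))):
            if budget <= 0:
                break
            r = rng.random()
            if r < 0.8:                              # similar-word masking
                out[j] = synonyms.get(tokens[j], rng.choice(vocab))
            elif r < 0.9:                            # random replacement
                out[j] = rng.choice(vocab)
            masked.append(j)                         # else keep original
            budget -= 1
        i += n + rng.randint(1, 3)                   # gap between spans
    return out, masked
```

The `masked` indices are the positions the MLM objective is computed on.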

Natural Language Processing · Introduced 2000 · 2 papers

Packed Levitated Markers

Packed Levitated Markers, or PL-Marker, is a span representation approach for named entity recognition that considers the dependencies between spans (pairs) by strategically packing the markers in the encoder. A pair of levitated markers, emphasizing a span, consists of a start marker and an end marker which share the same position embeddings as the span's start and end tokens respectively. In addition, both levitated markers adopt a restricted attention: they are visible to each other, but not to the text tokens or to other pairs of markers. Based on these features, the levitated markers do not affect the attended context of the original text tokens, which allows a series of related spans to be flexibly packed with their levitated markers in the encoding phase, thus modeling their dependencies.

Natural Language Processing · Introduced 2000 · 2 papers

AutoTinyBERT

AutoTinyBERT is an efficient BERT variant found through neural architecture search. Specifically, one-shot learning is used to obtain a big Super Pretrained Language Model (SuperPLM), where the objectives of pre-training or task-agnostic BERT distillation are used. Then, given a specific latency constraint, an evolutionary algorithm is run on the SuperPLM to search for optimal architectures. Finally, the corresponding sub-models are extracted based on the optimal architectures and further trained.

Natural Language Processing · Introduced 2000 · 2 papers

SIRM

Skim and Intensive Reading Model

Skim and Intensive Reading Model, or SIRM, is a deep neural network for inferring implied textual meaning. It consists of two main components, namely the skim reading component and the intensive reading component. N-gram features are quickly extracted from the skim reading component, which is a combination of several convolutional neural networks, as skim (entire) information. The intensive reading component enables a hierarchical investigation of both local (sentence) and global (paragraph) representation, which encapsulates the current embedding and the contextual information with a dense connection.

Natural Language Processing · Introduced 2000 · 2 papers

TernaryBERT

TernaryBERT is a Transformer-based model which ternarizes the weights of a pretrained BERT model to ternary values {-1, 0, +1}, with different granularities for word embeddings and the weights in the Transformer layers. Instead of directly using knowledge distillation to compress a model, it is used to improve the performance of the ternarized student model, which has the same size as the teacher model. In this way, the knowledge is transferred from the highly-accurate teacher model to the ternarized student model with smaller capacity.
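
A minimal numpy sketch of approximation-based ternarization in the style of Ternary Weight Networks, one of the schemes TernaryBERT builds on (the threshold heuristic 0.7·mean|w| and the function name are from that line of work, applied here per weight group as an illustration only):

```python
import numpy as np

def ternarize(w, delta_factor=0.7):
    """Map a weight array to alpha * t with t in {-1, 0, +1}: zero out
    weights below a threshold proportional to the mean magnitude, keep the
    signs of the rest, and rescale by the mean magnitude of the survivors."""
    delta = delta_factor * np.abs(w).mean()
    t = np.sign(w) * (np.abs(w) > delta)            # ternary pattern
    alpha = np.abs(w[t != 0]).mean() if np.any(t) else 0.0
    return alpha * t, t.astype(int)
```

The quantized tensor `alpha * t` replaces `w` in the forward pass, while full-precision latent weights are kept for gradient updates during quantization-aware training.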

Natural Language Processing · Introduced 2000 · 2 papers

RAHP

Review-guided Answer Helpfulness Prediction

Review-guided Answer Helpfulness Prediction (RAHP) is a textual inference model for identifying helpful answers in e-commerce. It not only considers the interactions between QA pairs, but also investigates the opinion coherence between the answer and crowds' opinions reflected in the reviews, which is another important factor to identify helpful answers.

Natural Language Processing · Introduced 2000 · 2 papers

DynaBERT

DynaBERT is a BERT variant which can flexibly adjust its size and latency by selecting adaptive width and depth. The training process of DynaBERT includes first training a width-adaptive BERT and then allowing both adaptive width and depth, by distilling knowledge from the full-sized model to small sub-networks. Network rewiring is also used to keep the more important attention heads and neurons shared by more sub-networks. A two-stage procedure is used to train DynaBERT. First, knowledge distillation transfers the knowledge from a fixed teacher model to student sub-networks with adaptive width in DynaBERT_W. Then, knowledge distillation transfers the knowledge from a trained DynaBERT_W to student sub-networks with adaptive width and depth in DynaBERT.

Natural Language Processing · Introduced 2000 · 2 papers

Feedback Transformer

A Feedback Transformer is a type of sequential transformer that exposes all previous representations to all future representations, meaning the lowest representation of the current timestep is formed from the highest-level abstract representation of the past. This feedback nature allows this architecture to perform recursive computation, building stronger representations iteratively upon previous states. To achieve this, the self-attention mechanism of the standard Transformer is modified so it attends to higher level representations rather than lower ones.

Natural Language Processing · Introduced 2000 · 2 papers

Step-DPO

Step-wise Direct Preference Optimization

Step-DPO, or Step-wise Direct Preference Optimization, is a variant of Direct Preference Optimization for long-chain reasoning that treats individual reasoning steps, rather than entire answers, as the units of preference optimization, learning from preference pairs of correct and incorrect intermediate steps.

Natural Language Processing · Introduced 2000 · 2 papers

Lbl2Vec

Natural Language Processing · Introduced 2000 · 2 papers

FastSGT

Fast Schema Guided Tracker, or FastSGT, is a fast and robust BERT-based model for state tracking in goal-oriented dialogue systems. The model employs carry-over mechanisms for transferring values between slots, enabling switching between services and accepting values offered by the system during dialogue. It also uses multi-head attention projections in some of the decoders for better modelling of the encoder outputs. The model architecture is illustrated in the Figure. It consists of four main modules: (1) Utterance Encoder, (2) Schema Encoder, (3) State Decoder, and (4) State Tracker. The first three modules constitute the NLU component and are based on neural networks, whereas the State Tracker is a rule-based module. BERT is used for both encoders in the model. The Utterance Encoder is a BERT model which encodes the user and system utterances at each turn. The Schema Encoder is also a BERT model which encodes the schema descriptions of intents, slots, and values into schema embeddings. These schema embeddings help the decoders to transfer or share knowledge between different services by having some language understanding of each slot, intent, or value. The schema and utterance embeddings are passed to the State Decoder, a multi-task module. This module consists of five sub-modules producing the information necessary to track the state of the dialogue. Finally, the State Tracker module takes the previous state along with the current outputs of the State Decoder and predicts the current state of the dialogue by aggregating and summarizing the information across turns.

Natural Language Processing · Introduced 2000 · 1 paper

MixLoRA

MixLoRA is a type of PEFT method which constructs a resource-efficient sparse MoE model based on LoRA.

Natural Language Processing · Introduced 2000 · 1 paper

BinaryBERT

BinaryBERT is a BERT-variant that applies quantization in the form of weight binarization. Specifically, ternary weight splitting is proposed which initializes BinaryBERT by equivalently splitting from a half-sized ternary network. To obtain BinaryBERT, we first train a half-sized ternary BERT model, and then apply a ternary weight splitting operator to obtain the latent full-precision and quantized weights as the initialization of the full-sized BinaryBERT. We then fine-tune BinaryBERT for further refinement.
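
A simplified sketch of the splitting idea on quantized values (the paper's operator also splits the latent full-precision weights and allows per-branch scales; the uniform half-scale split and function name below are our illustration of the key invariant, that the split pair reproduces the ternary weight exactly):

```python
def split_ternary(t, alpha):
    """Split each ternary weight in {-alpha, 0, +alpha} into the sum of two
    binary weights in {-alpha/2, +alpha/2}, so the two binary branches
    together behave exactly like the ternary network at initialization."""
    half = alpha / 2
    b1, b2 = [], []
    for w in t:
        if w > 0:
            b1.append(half); b2.append(half)        # +a = a/2 + a/2
        elif w < 0:
            b1.append(-half); b2.append(-half)      # -a = -a/2 - a/2
        else:
            b1.append(half); b2.append(-half)       # 0 = a/2 - a/2
    return b1, b2
```

Because the sum is exact, fine-tuning starts from a model whose loss equals that of the trained half-sized ternary network.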

Natural Language Processing · Introduced 2000 · 1 paper

TransferQA

TransferQA is a transferable generative QA model, built upon T5 that combines extractive QA and multi-choice QA via a text-to-text transformer framework, and tracks both categorical slots and non-categorical slots in DST. In addition, it introduces two effective ways to construct unanswerable questions, namely, negative question sampling and context truncation, which enable the model to handle “none” value slots in the zero-shot DST setting.

Natural Language Processing · Introduced 2000 · 1 paper

lda2vec

lda2vec builds representations over both words and documents by mixing word2vec’s skipgram architecture with Dirichlet-optimized sparse topic mixtures. The Skipgram Negative-Sampling (SGNS) objective of word2vec is modified to utilize document-wide feature vectors while simultaneously learning continuous document weights loading onto topic vectors. The total loss term $\mathcal{L}$ is the sum of the Skipgram Negative Sampling losses $\mathcal{L}_{ij}^{neg}$ with the addition of a Dirichlet-likelihood term over document weights, $\mathcal{L}^{d}$:

$$\mathcal{L} = \mathcal{L}^{d} + \sum_{ij} \mathcal{L}_{ij}^{neg}$$

The loss is computed using a context vector $\vec{c}_j$, pivot word vector $\vec{w}_j$, target word vector $\vec{w}_i$, and negatively-sampled word vectors $\vec{w}_l$:

$$\mathcal{L}_{ij}^{neg} = \log \sigma(\vec{c}_j \cdot \vec{w}_i) + \sum_{l} \log \sigma(-\vec{c}_j \cdot \vec{w}_l)$$

Natural Language Processing · Introduced 2000 · 1 paper

NormFormer

NormFormer is a type of Pre-LN transformer that adds three normalization operations to each layer: a Layer Norm after self attention, head-wise scaling of self-attention outputs, and a Layer Norm after the first fully connected layer. The modifications introduce a small number of additional learnable parameters, which provide a cost-effective way for each layer to change the magnitude of its features, and therefore the magnitude of the gradients to subsequent components.
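
The three additions can be sketched in numpy on top of a Pre-LN block (a simplified single-layer sketch with our own names; the dummy `attn` callable stands in for multi-head self-attention and returns per-head outputs, and learned gains/biases on the norms are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def normformer_block(x, attn, ffn_in, ffn_out, head_scales):
    """One NormFormer-style layer: (1) LayerNorm applied to the attention
    output before the residual add, (2) per-head scaling of attention
    outputs, (3) LayerNorm after the first fully connected layer."""
    heads = attn(layer_norm(x))                     # Pre-LN attention, (H, S, hd)
    scaled = heads * head_scales[:, None, None]     # head-wise scaling
    attn_out = scaled.transpose(1, 0, 2).reshape(x.shape)
    x = x + layer_norm(attn_out)                    # post-attention LayerNorm
    h = np.maximum(layer_norm(x) @ ffn_in, 0.0)     # Pre-LN FFN, ReLU
    return x + layer_norm(h) @ ffn_out              # LN after first FC layer
```

The extra norms and head scales add only a handful of parameters per layer but let each layer rescale its contribution to the residual stream.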

Natural Language Processing · Introduced 2000 · 1 paper

Categorical Modularity

A novel low-resource intrinsic metric to evaluate word embedding quality based on graph modularity.

Natural Language Processing · Introduced 2000 · 1 paper

SMITH

Siamese Multi-depth Transformer-based Hierarchical Encoder

SMITH, or Siamese Multi-depth Transformer-based Hierarchical Encoder, is a Transformer-based model for document representation learning and matching. It contains several design choices to adapt self-attention models for long text inputs. For the model pre-training, a masked sentence block language modeling task is used in addition to the original masked word language model task used in BERT, to capture sentence block relations within a document. Given a sequence of sentence block representation, the document level Transformers learn the contextual representation for each sentence block and the final document representation.

Natural Language Processing · Introduced 2000 · 1 paper

LayoutReader

LayoutReader is a sequence-to-sequence model for reading order detection that uses both textual and layout information, where the layout-aware language model LayoutLM is leveraged as an encoder. The generation step in the encoder-decoder structure is modified to generate the reading order sequence. In the encoding stage, LayoutReader packs the pair of source and target segments into a contiguous input sequence of LayoutLM and carefully designs the self-attention mask to control the visibility between tokens. As shown in the Figure, LayoutReader allows the tokens in the source segment to attend to each other while preventing the tokens in the target segment from attending to the rightward context. With 1 meaning allowing and 0 meaning preventing, the mask is:

$$M_{i,j} = \begin{cases} 1, & \text{if } i, j \text{ are both in the source segment} \\ 1, & \text{if } j \le i \\ 0, & \text{otherwise} \end{cases}$$

where $i, j$ are indices in the packed input sequence, so they may be from source or target segments. In the decoding stage, since the source and target are reordered sequences, the prediction candidates can be constrained to the source segment; therefore, the model predicts indices in the source sequence. The probability of predicting the $k$-th source token at the $i$-th time step is a softmax over the source indices, computed from the $k$-th input embedding $e_k$ of the source segment, the hidden state $h_i$ at the $i$-th time step, and a bias $b_k$:

$$P(y_i = k \mid y_{<i}) = \operatorname{softmax}_k\left(e_k^{\top} h_i + b_k\right)$$

Natural Language Processing · Introduced 2000 · 1 paper

DDParser

Baidu Dependency Parser

DDParser, or Baidu Dependency Parser, is a Chinese dependency parser trained on a large-scale manually labeled dataset called Baidu Chinese Treebank (DuCTB). For the $i$-th word, the input vector $x_i$ is the concatenation of the word embedding and the character-level representation: $x_i = e_i^{word} \oplus e_i^{char}$, where $e_i^{char}$ is the output vector after feeding the word's character sequence into a BiLSTM layer. Experimental results on the DuCTB dataset show that replacing POS tag embeddings with $e_i^{char}$ leads to an improvement. For the encoder, three BiLSTM layers are employed over the input vectors for context encoding; denote the output vector of the top-layer BiLSTM for the $i$-th word as $h_i$. The dependency parser of Dozat and Manning is used: dimension-reducing MLPs are applied to each recurrent output vector before applying the biaffine transformation, which has the advantage of stripping away information not relevant to the current decision. Biaffine attention is then used in both the dependency arc classifier and the relation classifier; the computations of all symbols are shown in the Figure. For the decoder, the first-order Eisner algorithm is used to ensure that the output is a projective tree. Based on the dependency tree built by the biaffine parser, a word sequence is obtained through the in-order traversal of the tree; the output is a projective tree only if the word sequence is in order.

Natural Language Processing · Introduced 2000 · 1 paper

Sinkhorn Transformer

The Sinkhorn Transformer is a type of transformer that uses Sparse Sinkhorn Attention as a building block. This component is a plug-in replacement for dense fully-connected attention (as well as local attention, and sparse attention alternatives), and allows for reduced memory complexity as well as sparse attention.

Natural Language Processing · Introduced 2000 · 1 paper

Vulnerability-constrained Decoding

Vulnerability-constrained Decoding is a sequence decoding approach that aims to avoid generating vulnerabilities in generated code.

Natural Language Processing · Introduced 2000 · 1 paper

HEGCN

Hierarchical Entity Graph Convolutional Network

HEGCN, or Hierarchical Entity Graph Convolutional Network, is a model for multi-hop relation extraction across documents. Documents in a document chain are encoded using a bi-directional long short-term memory (BiLSTM) layer. On top of the BiLSTM layer, two graph convolutional networks (GCNs) are used, one after another in a hierarchy. In the first level of the GCN hierarchy, a separate entity mention graph is constructed for each document in the chain using all the entities mentioned in that document; each mention of an entity in a document is a separate node in the graph. A GCN over this entity mention graph captures the relations among the entity mentions in the document. A unified entity-level graph is then constructed across all the documents in the chain: each node of this entity-level graph represents a unique entity in the document chain, and each entity common to two documents is represented by a single node. A GCN over this entity-level graph captures the relations among the entities across the documents. The representations of the subject-entity and object-entity nodes are concatenated and passed to a feed-forward layer with softmax for relation classification.

Natural Language Processing · Introduced 2000 · 1 paper

ClipBERT

ClipBERT is a framework for end-to-end learning for video-and-language tasks that employs sparse sampling, where only a single or a few sparsely sampled short clips from a video are used at each training step. Two aspects distinguish ClipBERT from previous work. First, in contrast to densely extracting video features (adopted by most existing methods), ClipBERT sparsely samples only one single or a few short clips from the full-length videos at each training step. The hypothesis is that visual features from sparse clips already capture key visual and semantic information in the video, as consecutive clips usually contain similar semantics from a continuous scene. Thus, a handful of clips are sufficient for training instead of the full video. Predictions from multiple densely-sampled clips are then aggregated to obtain the final video-level prediction during inference, which is less computationally demanding. The second differentiating aspect concerns the initialization of model weights (i.e., transfer through pre-training). The authors use 2D architectures (e.g., ResNet-50) instead of 3D features as the visual backbone for video encoding, allowing them to harness the power of image-text pretraining for video-text understanding along with the advantages of low memory cost and runtime efficiency.

Natural Language Processing · Introduced 2000 · 1 paper

BP-Transformer

The BP-Transformer (BPT) is a type of Transformer that is motivated by the need to find a better balance between capability and computational complexity for self-attention. The architecture partitions the input sequence into different multi-scale spans via binary partitioning (BP). It incorporates an inductive bias of attending to context information from fine-grained to coarse-grained as the relative distance increases: the farther away the context information is, the coarser its representation. BPT can be regarded as a graph neural network whose nodes are the multi-scale spans. A token node attends to smaller-scale spans for closer context and larger-scale spans for more distant context. The representations of nodes are updated with Graph Self-Attention.
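
The span nodes come from a recursive halving of the sequence, which can be sketched directly (function name ours; the real model additionally wires up which spans each token attends to by distance):

```python
def binary_partition(lo, hi, spans=None):
    """Collect the multi-scale span nodes of BP-Transformer for positions
    [lo, hi): recursively split the interval in half, keeping every span
    down to single tokens. For a length-n sequence (n a power of two) this
    yields 2n - 1 nodes, the nodes of a balanced binary tree."""
    if spans is None:
        spans = []
    spans.append((lo, hi))
    if hi - lo > 1:
        mid = (lo + hi) // 2
        binary_partition(lo, mid, spans)
        binary_partition(mid, hi, spans)
    return spans
```

A token then attends to O(log n) of these nodes, fine-grained ones nearby and coarse-grained ones far away, instead of all n tokens.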

Natural Language Processing · Introduced 2000 · 1 paper

CubeRE

CubeRE first encodes each input sentence using a language model encoder to obtain a contextualized sequence representation. It then captures the interaction between each possible head and tail entity as a pair representation for predicting entity-relation label scores. To reduce the computational cost, each sentence is pruned to retain only words with higher entity scores. Finally, the interaction between each possible relation triplet and qualifier is captured to predict qualifier label scores and decode the outputs.

Natural Language Processing · Introduced 2000 · 1 paper

PAR Transformer

PAR Transformer is a Transformer model that uses 63% fewer self-attention blocks, replacing them with feed-forward blocks, while retaining test accuracies. It is based on the Transformer-XL architecture and uses neural architecture search to find an efficient pattern of blocks in the transformer architecture.

Natural Language Processing · Introduced 2000 · 1 paper

Seq2Edits

Seq2Edits is an open-vocabulary approach to sequence editing for natural language processing (NLP) tasks with a high degree of overlap between input and output texts. In this approach, each sequence-to-sequence transduction is represented as a sequence of edit operations, where each operation either replaces an entire source span with target tokens or keeps it unchanged. For text normalization, sentence fusion, sentence splitting & rephrasing, text simplification, and grammatical error correction, the approach improves explainability by associating each edit operation with a human-readable tag. Rather than generating the target sentence as a series of tokens, the model predicts a sequence of edit operations that, when applied to the source sentence, yields the target sentence. Each edit operates on a span in the source sentence and either copies, deletes, or replaces it with one or more target tokens. Edits are generated auto-regressively from left to right using a modified Transformer architecture to facilitate learning of long-range dependencies.

Natural Language Processing · Introduced 2000 · 1 paper

PermuteFormer

PermuteFormer is a Performer-based model with relative position encoding that scales linearly on long sequences. PermuteFormer applies position-dependent transformation on queries and keys to encode positional information into the attention module. This transformation is carefully crafted so that the final output of self-attention is not affected by absolute positions of tokens. Each token’s query / key feature is illustrated as a row of blocks in the figure, and its elements are marked with different colors. The position-aware permutation permutes elements of each token’s query / key feature along the head size dimension in each attention head. Depending on the token’s position, the permutation applied to query / key feature is different.

Natural Language Processing · Introduced 2000 · 1 paper

PanGu-$α$

PanGu-$α$ is an autoregressive language model (ALM) with up to 200 billion parameters pretrained on a large corpus of text, mostly in Chinese. The architecture of PanGu-$α$ is based on the Transformer, which has been extensively used as the backbone of a variety of pretrained language models such as BERT and GPT. Different from them, an additional query layer is developed on top of the Transformer layers, which aims to explicitly induce the expected output.

Natural Language Processing · Introduced 2000 · 1 paper

SC-GPT

SC-GPT is a multi-layer Transformer neural language model, trained in three steps: (i) pre-trained on plain text, similar to GPT-2; (ii) continuously pre-trained on large amounts of dialog-act-labeled utterance corpora to acquire the ability of controllable generation; (iii) fine-tuned for a target domain using very limited amounts of domain labels. Unlike GPT-2, SC-GPT generates semantically controlled responses that are conditioned on the given semantic form, similar to SC-LSTM but requiring far fewer domain labels to generalize to new domains. It is pre-trained on a large set of annotated NLG corpora to acquire the controllable generation ability, and fine-tuned with only a few domain-specific labels to adapt to new domains.

Natural Language Processing · Introduced 2000 · 1 paper

TopK Copy

TopK Copy is a cross-attention guided copy mechanism for entity extraction where only the Top-$K$ most important attention heads are used for computing copy distributions. The motivation is that attention heads may not be equally important, and that some heads can be pruned with only a marginal decrease in overall performance. Attention probabilities produced by insignificant attention heads may be noisy, so computing copy distributions without these heads could improve the model’s ability to infer the importance of each token in the input document.

Natural Language Processing · Introduced 2000 · 1 paper

Unigram Segmentation

Unigram Segmentation is a subword segmentation algorithm based on a unigram language model. It provides multiple segmentations with probabilities. The language model allows for emulating the noise generated during the segmentation of actual data. The unigram language model makes the assumption that each subword occurs independently, so the probability of a subword sequence $\mathbf{x} = (x_1, \ldots, x_M)$ is formulated as the product of the subword occurrence probabilities $p(x_i)$:

$$P(\mathbf{x}) = \prod_{i=1}^{M} p(x_i), \qquad x_i \in \mathcal{V}$$

where $\mathcal{V}$ is a pre-determined vocabulary. The most probable segmentation $\mathbf{x}^{*}$ for the input sentence $X$ is then given by:

$$\mathbf{x}^{*} = \arg\max_{\mathbf{x} \in S(X)} P(\mathbf{x})$$

where $S(X)$ is the set of segmentation candidates built from the input sentence $X$. $\mathbf{x}^{*}$ is obtained with the Viterbi algorithm.
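
The Viterbi search over segmentation candidates is a short dynamic program over end positions (a minimal sketch with our own names; real implementations also handle out-of-vocabulary characters and sample from the n-best lattice for subword regularization):

```python
import math

def viterbi_segment(text, logp):
    """Most probable segmentation under a unigram LM: subwords are assumed
    independent, so the score of a segmentation is the sum of subword
    log-probabilities. `logp` maps subword -> log-probability; best[j] is
    the best score of any segmentation of text[:j]."""
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            piece = text[i:j]
            if piece in logp and best[i] + logp[piece] > best[j]:
                best[j] = best[i] + logp[piece]
                back[j] = i
    pieces, j = [], n                     # backtrack the argmax path
    while j > 0:
        pieces.append(text[back[j]:j])
        j = back[j]
    return pieces[::-1]
```

Changing the relative log-probabilities changes which segmentation wins, which is exactly the noise the model exploits during training.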

Natural Language Processing · Introduced 2000 · 1 paper

DAHSF

Digestion Algorithm in Hierarchical Symbolic Forests

A proposed foundation-model framework for deep learning based on a digestion algorithm applied in hierarchical symbolic forests.

Natural Language Processing · Introduced 2000 · 1 paper

Chinese Pre-trained Unbalanced Transformer

CPT, or Chinese Pre-trained Unbalanced Transformer, is a pre-trained unbalanced Transformer for Chinese natural language understanding (NLU) and natural language generation (NLG) tasks. CPT consists of three parts: a shared encoder, an understanding decoder, and a generation decoder. The two specific decoders with a shared encoder are pre-trained with masked language modeling (MLM) and denoising auto-encoding (DAE) tasks, respectively. With the partially shared architecture and multi-task pre-training, CPT can (1) learn specific knowledge of both NLU and NLG tasks with the two decoders and (2) be flexibly fine-tuned to fully exploit the potential of the model.

Natural Language Processing · Introduced 2000 · 1 paper

Augmented SBERT

Augmented SBERT is a data augmentation strategy for pairwise sentence scoring that uses a BERT cross-encoder to improve the performance of SBERT bi-encoders. Given a pre-trained, well-performing cross-encoder, sentence pairs are sampled according to a certain sampling strategy and labeled using the cross-encoder. These weakly labeled examples are called the silver dataset, and they are merged with the gold training dataset. The bi-encoder is then trained on this extended training dataset.
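
The recipe reduces to a small labeling loop (a schematic sketch; the function name is ours, `cross_encoder` stands in for a trained pair scorer, and the real method's sampling strategies, such as BM25 or semantic-search pair mining, are abstracted into `sample_pairs`):

```python
def augment_sbert(cross_encoder, sample_pairs, gold):
    """Weakly label sampled sentence pairs with a cross-encoder to build a
    'silver' dataset, then merge it with the gold dataset; the combined
    data is what the bi-encoder (SBERT) is trained on."""
    silver = [(a, b, cross_encoder(a, b)) for a, b in sample_pairs]
    return gold + silver
```

The key design choice is that the slow but accurate cross-encoder is only run offline to create labels, while the deployed bi-encoder stays cheap at inference time.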

Natural Language Processing · Introduced 2000 · 1 paper

PolyNorm

Polynomial Composition Activations

Transformers have found extensive applications across various domains due to their powerful fitting capabilities. This success can be partially attributed to their inherent nonlinearity. Thus, in addition to the ReLU function employed in the original transformer architecture, researchers have explored alternative modules such as GeLU and SwishGLU to enhance nonlinearity and thereby augment representational capacity. In this paper, we propose a novel category of polynomial composition activations (PolyCom), designed to optimize the dynamics of transformers. Theoretically, we provide a comprehensive mathematical analysis of PolyCom, highlighting its enhanced expressivity and efficacy relative to other activation functions. Notably, we demonstrate that networks incorporating PolyCom achieve the optimal approximation rate, indicating that PolyCom networks require minimal parameters to approximate general smooth functions in Sobolev spaces. We conduct empirical experiments on the pre-training configurations of large language models (LLMs), including both dense and sparse architectures. By substituting conventional activation functions with PolyCom, we enable LLMs to capture higher-order interactions within the data, thus improving performance metrics in terms of accuracy and convergence rates. Extensive experimental results demonstrate the effectiveness of our method, showing substantial improvements over other activation functions. Code is available at https://github.com/BryceZhuo/PolyCom.

Natural Language Processing · Introduced 2000 · 1 paper

ReasonBERT

ReasonBERT is a pre-training method that augments language models with the ability to reason over long-range relations and multiple, possibly hybrid, contexts. It utilizes distant supervision to automatically connect multiple pieces of text and tables to create pre-training examples that require long-range reasoning. Different types of reasoning are simulated, including intersecting multiple pieces of evidence, bridging from one piece of evidence to another, and detecting unanswerable cases. Specifically, given a query sentence containing an entity pair, if we mask one of the entities, another sentence or table that contains the same pair of entities can likely be used as evidence to recover the masked entity. Moreover, to encourage deeper reasoning, multiple pieces of evidence are collected that are jointly used to recover the masked entities in the query sentence, allowing for the scattering of the masked entities among different pieces of evidence to mimic different types of reasoning. The Figure illustrates several examples using such distant supervision. In Ex. 1, a model needs to check multiple constraints (i.e., intersection reasoning type) and find “the beach soccer competition that is established in 1998.” In Ex. 2, a model needs to find “the type of the band that released Awaken the Guardian,” by first inferring the name of the band “Fates Warning” (i.e., bridging reasoning type). The masked entities in a query sentence are replaced with the [QUESTION] tokens. The new pre-training objective, span reasoning, then extracts the masked entities from the provided evidence. Existing LMs like BERT and RoBERTa are augmented by continuing to train them with the new objective, which leads to ReasonBERT. Then query sentence and textual evidence are encoded via the LM. When tabular evidence is present, the structure-aware transformer TAPAS is used as the encoder to capture the table structure.

Natural Language Processing · Introduced 2000 · 1 paper

SynthesizRR

Synthesize by Retrieval and Refinement

It is often desirable to distill the capabilities of large language models (LLMs) into smaller student models due to compute and memory constraints. One way to do this for classification tasks is via dataset synthesis. Prior approaches to synthesis use few-shot prompting, which relies on the LLM's parametric knowledge to generate usable examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. We propose Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is seeded with different content to generate its examples. We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
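The retrieve-then-refine loop can be sketched schematically. Both `retrieve` and `generate` below are hypothetical stand-ins (a trivial retriever and an echoing LLM stub), not APIs from the SynthesizRR paper; the point is only that each retrieved passage seeds a different synthetic example.

```python
# Schematic sketch of retrieval-augmented dataset synthesis.

def retrieve(task_description, corpus, k=3):
    # placeholder retriever: rank passages by length as a stand-in
    # for a real relevance model
    return sorted(corpus, key=len, reverse=True)[:k]

def generate(prompt):
    # placeholder for an LLM call; a real system would sample a
    # refined, task-specific example here
    return f"synthetic example grounded in: {prompt}"

def synthesize_dataset(task_description, corpus, label, k=3):
    examples = []
    for passage in retrieve(task_description, corpus, k):
        # refinement: ask the LLM to rewrite the passage into an
        # example matching the target label
        prompt = f"{task_description} [label={label}] {passage}"
        examples.append({"text": generate(prompt), "label": label})
    return examples

corpus = [
    "short doc",
    "a somewhat longer document",
    "the longest document of all three",
]
data = synthesize_dataset("Write a product review.", corpus, label="positive", k=2)
```

Because each synthetic example is grounded in a distinct retrieved passage, the resulting dataset avoids the repetition and entity bias of purely parametric few-shot generation.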

Natural Language Processing · Introduced 2000 · 1 paper

Deep LSTM Reader

The Deep LSTM Reader is a neural network for reading comprehension. We feed documents one word at a time into a Deep LSTM encoder; after a delimiter, we then also feed the query into the encoder. The model therefore processes each document-query pair as a single long sequence. Given the embedded document and query, the network predicts which token in the document answers the query. The model consists of a Deep LSTM cell with skip connections from each input x(t) to every hidden layer, and from every hidden layer to the output y(t):

x'(t, k) = x(t) || y'(t, k−1),   y(t) = y'(t, 1) || … || y'(t, K)
i(t, k) = σ(W_kxi x'(t, k) + W_khi h(t−1, k) + W_kci c(t−1, k) + b_ki)
f(t, k) = σ(W_kxf x(t) + W_khf h(t−1, k) + W_kcf c(t−1, k) + b_kf)
c(t, k) = f(t, k) c(t−1, k) + i(t, k) tanh(W_kxc x'(t, k) + W_khc h(t−1, k) + b_kc)
o(t, k) = σ(W_kxo x'(t, k) + W_kho h(t−1, k) + W_kco c(t−1, k) + b_ko)
h(t, k) = o(t, k) tanh(c(t, k))
y'(t, k) = W_ky h(t, k) + b_ky

where || indicates vector concatenation, h(t, k) is the hidden state for layer k at time t, and i, f, o are the input, forget, and output gates respectively. Thus the Deep LSTM Reader is defined by g^LSTM(d, q) = y(|d| + |q|), with input x(t) the concatenation of d and q separated by the delimiter |||.
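The single-sequence setup can be sketched with a standard multi-layer LSTM. This is a simplified illustration with assumed hyperparameters: the per-layer skip connections of the full Deep LSTM cell are omitted, and the answer-scoring head is a generic choice, not the paper's exact readout.

```python
import torch
import torch.nn as nn

class DeepLSTMReader(nn.Module):
    """Sketch: document, delimiter, and query as one long sequence."""

    def __init__(self, vocab_size, emb_dim=32, hidden=64, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=layers, batch_first=True)
        self.score = nn.Linear(hidden, hidden)

    def forward(self, doc_ids, delim_id, query_ids):
        # process the document-query pair as a single long sequence
        seq = torch.cat([doc_ids, delim_id, query_ids], dim=1)
        out, _ = self.lstm(self.embed(seq))
        g = out[:, -1]                       # encoding of the whole pair
        doc_states = out[:, : doc_ids.size(1)]
        # score each document token as the candidate answer
        return (doc_states @ self.score(g).unsqueeze(-1)).squeeze(-1)

reader = DeepLSTMReader(vocab_size=100)
doc = torch.randint(0, 99, (1, 10))
delim = torch.full((1, 1), 99)               # reserved delimiter id
query = torch.randint(0, 99, (1, 4))
logits = reader(doc, delim, query)           # one score per document token
```

Feeding the pair as one sequence lets the query conditioning happen inside the recurrence itself, with no separate matching module.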

Natural Language Processing · Introduced 2000 · 1 paper

Factorized Dense Synthesized Attention

Factorized Dense Synthesized Attention is a synthesized attention mechanism, similar to dense synthesized attention, but with the outputs factorized to reduce parameters and prevent overfitting. It was proposed as part of the Synthesizer architecture. The factorized variant of the dense synthesizer can be expressed as follows:

A, B = F_A(X_i), F_B(X_i)

where F_A projects the input X_i into a dimensions, F_B projects X_i to b dimensions, and a · b = l. The output of the factorized module is now written as:

Y = softmax(C) G(X),   where C = H_A(A) * H_B(B)

where H_A, H_B are tiling functions and C ∈ R^(l×l). The tiling function simply duplicates the vector k times, i.e., R^l → R^(l·k). In this case, H_A is a projection of R^a → R^(a·b) and H_B is a projection of R^b → R^(b·a). To avoid having similar values within the same block, we compose the outputs of F_A and F_B.
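A compact sketch of the factorized module, assuming the sequence length factorizes as l = a · b; module and variable names are illustrative, with G playing the value-projection role as in Synthesizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedDenseSynthesizer(nn.Module):
    def __init__(self, d_model, seq_len, a, b):
        super().__init__()
        assert a * b == seq_len, "factorization must satisfy a * b = l"
        self.a, self.b = a, b
        self.f_a = nn.Linear(d_model, a)        # F_A: project to a dims
        self.f_b = nn.Linear(d_model, b)        # F_B: project to b dims
        self.g = nn.Linear(d_model, d_model)    # G: value projection

    def forward(self, x):                       # x: (batch, l, d_model)
        A = self.f_a(x)                         # (batch, l, a)
        B = self.f_b(x)                         # (batch, l, b)
        # tiling H_A, H_B: expand both to length l, combine multiplicatively
        C = A.repeat_interleave(self.b, dim=-1) * B.repeat(1, 1, self.a)
        return F.softmax(C, dim=-1) @ self.g(x)  # (batch, l, d_model)

attn = FactorizedDenseSynthesizer(d_model=16, seq_len=12, a=3, b=4)
y = attn(torch.randn(2, 12, 16))
```

The synthesized attention matrix C needs only a + b parameters per position instead of l, which is the entire point of the factorization.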

Natural Language Processing · Introduced 2000 · 1 paper

K3M

K3M is a multi-modal pretraining method for e-commerce product data that introduces a knowledge modality to correct noise in, and supplement missing information of, the image and text modalities. The modal-encoding layer extracts the features of each modality. The modal-interaction layer effectively models the interaction of multiple modalities: an initial-interactive feature fusion model maintains the independence of the image and text modalities, and a structure aggregation module fuses the information of the image, text, and knowledge modalities. K3M is pre-trained with three tasks: masked object modeling (MOM), masked language modeling (MLM), and link prediction modeling (LPM).

Natural Language Processing · Introduced 2000 · 1 paper

MT-PET

MT-PET is a multi-task version of Pattern Exploiting Training (PET) for exaggeration detection, which leverages knowledge from complementary cloze-style QA tasks to improve few-shot learning. It defines complementary pattern-verbalizer pairs (PVPs) for a main task and an auxiliary task. These PVPs are then used to train PET on data from both tasks. PET uses the masked language modeling objective of pretrained language models to transform a task into one or more cloze-style question answering tasks. In the original PET implementation, PVPs are defined for a single target task. MT-PET extends this by allowing auxiliary PVPs from related tasks, adding complementary cloze-style QA tasks during training. The motivation for the multi-task approach is two-fold: 1) complementary cloze-style tasks can potentially help the model learn different aspects of the main task, e.g., the related tasks of exaggeration detection and claim strength prediction; 2) data from related tasks can be utilized during training, which is important in situations where data for the main task is limited.
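A pattern-verbalizer pair is easy to show concretely: the pattern turns an input into a cloze question for a masked language model, and the verbalizer maps each label to the token that should fill the mask. The pattern text and label names below are illustrative, not the ones used in the MT-PET paper.

```python
# Toy sketch of a pattern-verbalizer pair (PVP).
MASK = "[MASK]"

def pattern(sentence):
    # cloze-style template wrapping the input sentence
    return f'"{sentence}" The claim above is {MASK}.'

# verbalizer: label -> single token the MLM should predict at the mask
verbalizer = {"exaggeration": "exaggerated", "faithful": "accurate"}

def cloze_example(sentence, label):
    # training pair: cloze input plus the verbalized target token
    return pattern(sentence), verbalizer[label]

x, y = cloze_example("This drug cures all known diseases!", "exaggeration")
```

In MT-PET, a second PVP of the same shape would be defined for the auxiliary task (e.g., claim strength), and both feed the shared masked-LM objective.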

Natural Language Processing · Introduced 2000 · 1 paper

DeLighT

DeLighT is a transformer architecture that improves parameter efficiency by (1) using DExTra, a deep and light-weight transformation, within each DeLighT block, which allows the use of single-headed attention and bottleneck FFN layers, and (2) using block-wise scaling across blocks, which allows shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output.

Natural Language Processing · Introduced 2000 · 1 paper

SqueezeBERT

SqueezeBERT is an efficient architectural variant of BERT for natural language processing that uses grouped convolutions. It is much like BERT-base, but with the position-wise fully-connected layers implemented as convolutions, and with grouped convolutions used for many of the layers.
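The substitution is easy to see in code: a position-wise fully-connected layer is equivalent to a 1×1 convolution over the sequence, and setting `groups > 1` divides the weight count. The dimensions below are illustrative, not SqueezeBERT's actual configuration.

```python
import torch
import torch.nn as nn

d_model, seq_len, batch = 64, 10, 2

dense = nn.Linear(d_model, d_model)                # standard position-wise FC
conv = nn.Conv1d(d_model, d_model, kernel_size=1)  # equivalent 1x1 convolution
grouped = nn.Conv1d(d_model, d_model, kernel_size=1, groups=4)  # grouped variant

# Conv1d expects (batch, channels, length), so transpose the sequence axis
x = torch.randn(batch, seq_len, d_model)
y = grouped(x.transpose(1, 2)).transpose(1, 2)     # (batch, seq_len, d_model)

def n_params(m):
    return sum(p.numel() for p in m.parameters())
# with groups=4, the grouped conv holds roughly 1/4 of the conv's weights
```

The grouped layer restricts each output channel to a quarter of the input channels, trading a little expressivity for a large reduction in parameters and FLOPs.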

Natural Language Processing · Introduced 2000 · 1 paper
Page 3 of 4