Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

8,725 machine learning methods and techniques

Filter: All · Audio · Computer Vision · General · Graphs · Natural Language Processing · Reinforcement Learning · Sequential

AutoTinyBERT

AutoTinyBERT is an efficient BERT variant found through neural architecture search. Specifically, one-shot learning is used to obtain a big Super Pretrained Language Model (SuperPLM), trained with the objectives of pre-training or task-agnostic BERT distillation. Then, given a specific latency constraint, an evolutionary algorithm is run on the SuperPLM to search for optimal architectures. Finally, the corresponding sub-models are extracted based on the optimal architectures and further trained.

Natural Language ProcessingIntroduced 20002 papers

CV-MIM

Contrastive Cross-View Mutual Information Maximization

CV-MIM, or Contrastive Cross-View Mutual Information Maximization, is a representation learning method to disentangle pose-dependent as well as view-dependent factors from 2D human poses. The method trains a network using cross-view mutual information maximization, which maximizes mutual information of the same pose performed from different viewpoints in a contrastive learning manner. It further utilizes two regularization terms to ensure disentanglement and smoothness of the learned representations.

GeneralIntroduced 20002 papers

Handwritten OCR

Handwritten OCR augmentation

Handwritten OCR augmentation is a language-agnostic augmentation method for handwritten images: it can be applied to handwritten images in any language. Four augmentation methods are provided: ThickOCR, ThinOCR, Elongate OCR, and Line Erase OCR.

Computer VisionIntroduced 20002 papers

RegNetX

RegNetX is a convolutional network design space of simple, regular models parameterized by a depth d, an initial width w_0 > 0, and a slope w_a > 0, which generate a different width u_j = w_0 + w_a * j for each block j < d. The key restriction of the RegNet design space is this linear parameterisation of block widths (the design space only contains models with this linear structure). For RegNetX there are additional restrictions: b = 1 (the bottleneck ratio), 12 <= d <= 28, and w_m >= 2 (the width multiplier).
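A short sketch of how the linear parameterization above generates quantized per-block widths: each linear width u_j is snapped to a power of the width multiplier w_m times w_0, then rounded to a multiple of q. The parameter values below are illustrative defaults, not taken from the paper.

```python
import numpy as np

def regnet_widths(d=16, w0=48, wa=36.0, wm=2.5, q=8):
    """Generate per-block widths for a RegNet-style model.

    u_j = w0 + wa * j gives a linear width for each block j < d; each u_j is
    then quantized to w0 times a rounded power of wm, and finally rounded to
    a multiple of q, so consecutive blocks share widths and form stages.
    """
    j = np.arange(d)
    u = w0 + wa * j                              # linear parameterization
    s = np.round(np.log(u / w0) / np.log(wm))    # per-block exponent of wm
    w = w0 * np.power(wm, s)                     # quantized widths
    return (np.round(w / q) * q).astype(int).tolist()

widths = regnet_widths()
print(widths)
```

Because the exponent s is a rounded, nondecreasing function of j, equal widths cluster into stages, which is exactly what makes the resulting models "regular".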

Computer VisionIntroduced 20002 papers

CCAC

Confidence Calibration with an Auxiliary Class

Confidence Calibration with an Auxiliary Class, or CCAC, is a post-hoc confidence calibration method for DNN classifiers on OOD datasets. The key feature of CCAC is an auxiliary class in the calibration model which separates mis-classified samples from correctly classified ones, thus effectively mitigating the target DNN's being confidently wrong. It also reduces the number of free parameters to facilitate transfer to a new unseen dataset.

GeneralIntroduced 20002 papers

SIRM

Skim and Intensive Reading Model

Skim and Intensive Reading Model, or SIRM, is a deep neural network for identifying implied textual meaning. It consists of two main components: a skim reading component and an intensive reading component. The skim reading component, a combination of several convolutional neural networks, quickly extracts N-gram features as skim (entire) information. The intensive reading component enables a hierarchical investigation of both local (sentence) and global (paragraph) representations, encapsulating the current embedding and the contextual information with a dense connection.

Natural Language ProcessingIntroduced 20002 papers

CAMoE

CAMoE is a multi-stream Corpus Alignment network with single gate Mixture-of-Experts (MoE) for video-text retrieval. The CAMoE employs Mixture-of-Experts (MoE) to extract multi-perspective video representations, including action, entity, scene, etc., then align them with the corresponding part of the text. A Dual Softmax Loss (DSL) is used to avoid the one-way optimum-match which occurs in previous contrastive methods. Introducing the intrinsic prior of each pair in a batch, DSL serves as a reviser to correct the similarity matrix and achieves the dual optimal match.

Computer VisionIntroduced 20002 papers

RE-NET

Recurrent Event Network

Recurrent Event Network (RE-NET) is an autoregressive architecture for predicting future interactions. The occurrence of a fact (event) is modeled as a probability distribution conditioned on temporal sequences of past knowledge graphs. RE-NET employs a recurrent event encoder to encode past facts and uses a neighborhood aggregator to model the connection of facts at the same timestamp. Future facts can then be inferred in a sequential manner based on the two modules.

GraphsIntroduced 20002 papers

RFB Net

RFB Net is a one-stage object detector that utilises a receptive field block module. It utilises a VGG16 backbone, and is otherwise quite similar to the SSD architecture.

Computer VisionIntroduced 20002 papers

MLFPN

Multi-Level Feature Pyramid Network, or MLFPN, is a feature pyramid block used in object detection models, notably M2Det. We first fuse multi-level features (i.e. multiple layers) extracted by a backbone as a base feature, and then feed it into a block of alternating joint Thinned U-shape Modules (TUM) and Feature Fusion Modules (FFM) to extract more representative, multi-level multi-scale features. Finally, we gather up the feature maps with equivalent scales to construct the final feature pyramid for object detection. Decoder layers that form the final feature pyramid are much deeper than the layers in the backbone, namely, they are more representative. Moreover, each feature map in the final feature pyramid consists of the decoder layers from multiple levels. Hence, the feature pyramid block is called Multi-Level Feature Pyramid Network (MLFPN).

Computer VisionIntroduced 20002 papers

AltDiffusion

In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flickr30k-CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding. Our models and code are available at https://github.com/FlagAI-Open/FlagAI.

Computer VisionIntroduced 20002 papers

Kaleido-BERT

Kaleido-BERT (CVPR 2021) is a pioneering work that focuses on pre-trained models (PTM) in the e-commerce field. It achieves SOTA performance compared with many models published in the general domain.

Computer VisionIntroduced 20002 papers

DIoU-NMS

DIoU-NMS is a type of non-maximum suppression where Distance-IoU is used rather than regular IoU: the overlap area and the distance between the central points of two bounding boxes are simultaneously considered when suppressing redundant boxes. In original NMS, the IoU metric suppresses redundant detection boxes with the overlap area as the only factor, often yielding false suppression in cases with occlusion. With DIoU-NMS, not only the overlap area but also the central point distance between two boxes is considered.
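A minimal NumPy sketch of the idea, assuming (x1, y1, x2, y2) boxes: DIoU = IoU − d²/c², where d is the center distance and c the diagonal of the smallest enclosing box, and the greedy loop suppresses on DIoU instead of IoU. The threshold and box values are illustrative.

```python
import numpy as np

def diou(box, boxes):
    """Distance-IoU between one box and an array of boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    iou = inter / (area_a + area_b - inter)
    # squared distance d^2 between box centers
    cxa, cya = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    cxb, cyb = (boxes[:, 0] + boxes[:, 2]) / 2, (boxes[:, 1] + boxes[:, 3]) / 2
    d2 = (cxa - cxb) ** 2 + (cya - cyb) ** 2
    # squared diagonal c^2 of the smallest box enclosing both
    ex1 = np.minimum(box[0], boxes[:, 0]); ey1 = np.minimum(box[1], boxes[:, 1])
    ex2 = np.maximum(box[2], boxes[:, 2]); ey2 = np.maximum(box[3], boxes[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    return iou - d2 / np.maximum(c2, 1e-9)

def diou_nms(boxes, scores, thresh=0.5):
    """Greedy NMS that suppresses a box when its DIoU with a kept box > thresh."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        order = rest[diou(boxes[i], boxes[rest]) <= thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
keep = diou_nms(boxes, scores)
print(keep)
```

Two well-separated boxes with the same overlap would survive under DIoU-NMS where plain IoU-NMS might still merge occluded objects, since the center-distance term lowers the suppression score.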

Computer VisionIntroduced 20002 papers

MaxUp

MaxUp is an adversarial data augmentation technique for improving the generalization performance of machine learning models. The idea is to generate a set of augmented data with random perturbations or transforms, and minimize the maximum, or worst-case, loss over the augmented data. By doing so, we implicitly introduce a smoothness or robustness regularization against the random perturbations, and hence improve the generalization performance. For example, in the case of Gaussian perturbation, MaxUp is asymptotically equivalent to using the gradient norm of the loss as a penalty to encourage smoothness.
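A toy sketch of the min-max loop on a one-parameter model with Gaussian input perturbations: draw m perturbed copies, pick the one with the worst loss, and take a gradient step on that copy only. The model, step size, and m are illustrative, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, x, y):
    # squared-error loss for a scalar linear model, illustration only
    return (w * x - y) ** 2

def maxup_grad(w, x, y, m=4, sigma=0.1):
    """MaxUp step: draw m Gaussian-perturbed copies of the input and return
    the gradient of the maximum (worst-case) loss among them."""
    xs = x + sigma * rng.normal(size=m)            # m perturbed views of x
    worst = np.argmax([loss(w, xi, y) for xi in xs])
    xw = xs[worst]
    return 2 * (w * xw - y) * xw                   # d/dw of loss at worst view

w = 0.0
for _ in range(200):                               # SGD on the worst-case loss
    w -= 0.05 * maxup_grad(w, x=1.0, y=2.0)
print(w)
```

With only the worst of the m perturbed copies contributing a gradient, the solution is implicitly regularized to be flat in a neighborhood of the clean input.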

Computer VisionIntroduced 20002 papers

Low Rank Tensor Learning Paradigms

Time-homogeneous Top-K Ranking

SequentialIntroduced 20002 papers

G3D

G3D is a unified spatial-temporal graph convolutional operator that directly models cross-spacetime joint dependencies. It leverages dense cross-spacetime edges as skip connections for direct information propagation across the 3D spatial-temporal graph.

GeneralIntroduced 20002 papers

FINCH Clustering

First Integer Neighbor Clustering Hierarchy (FINCH)

FINCH is a parameter-free, fast, and scalable clustering algorithm. It stands out for its speed and clustering quality.
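The core of FINCH is a first-neighbor rule: link every point to its nearest neighbor and take connected components as clusters. The sketch below implements one such partition step with NumPy (the full algorithm recursively re-applies this to cluster means; that part and the helper names here are our simplification).

```python
import numpy as np

def finch_step(X):
    """One FINCH-style partition: link every point to its first nearest
    neighbor and take connected components of the resulting graph."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nn = D.argmin(axis=1)                 # first neighbor of each point
    # union-find over the edges (i, nn[i])
    parent = list(range(len(X)))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for i, j in enumerate(nn):
        parent[find(i)] = find(j)
    roots = [find(i) for i in range(len(X))]
    _, labels = np.unique(roots, return_inverse=True)
    return labels

X = np.array([[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]])
labels = finch_step(X)
print(labels)
```

Note there is no distance threshold or cluster count anywhere in the step, which is what makes the method parameter-free.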

GeneralIntroduced 20002 papers

Deformable ConvNets

Deformable Convolutional Networks

Deformable ConvNets do not learn an affine transformation. They divide convolution into two steps: first sampling features on a regular grid $\mathcal{R}$ from the input feature map, then aggregating the sampled features by weighted summation with a convolution kernel. The process can be written as: \begin{align} Y(p_{0}) &= \sum_{p_{i} \in \mathcal{R}} w(p_{i}) X(p_{0} + p_{i}) \end{align} \begin{align} \mathcal{R} &= \{(-1,-1), (-1, 0), \dots, (1, 1)\} \end{align} The deformable convolution augments the sampling process by introducing a group of learnable offsets $\Delta p_{i}$, which can be generated by a lightweight CNN. Using the offsets $\Delta p_{i}$, the deformable convolution can be formulated as: \begin{align} Y(p_{0}) &= \sum_{p_{i} \in \mathcal{R}} w(p_{i}) X(p_{0} + p_{i} + \Delta p_{i}). \end{align} Through the above method, adaptive sampling is achieved. However, $\Delta p_{i}$ is a floating-point value unsuited to grid sampling; to address this problem, bilinear interpolation is used. Deformable RoI pooling is also used, which greatly improves object detection. Deformable ConvNets adaptively select the important regions and enlarge the valid receptive field of convolutional neural networks; this is important in object detection and semantic segmentation tasks.
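A sketch of the deformable sampling step for a single output location, with bilinear interpolation handling the fractional offsets (3×3 grid, single channel; the helper names are ours, not the paper's):

```python
import numpy as np

def bilinear_sample(X, y, x):
    """Sample feature map X at a fractional location (y, x) by bilinear
    interpolation, as needed for the fractional offsets."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, X.shape[0] - 1), min(x0 + 1, X.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * X[y0, x0] + (1 - wy) * wx * X[y0, x1]
            + wy * (1 - wx) * X[y1, x0] + wy * wx * X[y1, x1])

def deformable_conv_at(X, W, offsets, p0=(1, 1)):
    """Y(p0) = sum_i w(p_i) * X(p0 + p_i + dp_i) over a 3x3 grid R."""
    R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for (dy, dx), (ody, odx), w in zip(R, offsets, W.ravel()):
        out += w * bilinear_sample(X, p0[0] + dy + ody, p0[1] + dx + odx)
    return out

X = np.arange(9, dtype=float).reshape(3, 3)
W = np.ones((3, 3))
y = deformable_conv_at(X, W, offsets=[(0.0, 0.0)] * 9)
print(y)
```

With all offsets zero this reduces exactly to an ordinary convolution at p0; nonzero fractional offsets shift each tap off the grid, and bilinear interpolation keeps the output differentiable with respect to them.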

GeneralIntroduced 20002 papers

pixel2style2pixel

Pixel2Style2Pixel, or pSp, is an image-to-image translation framework based on a novel encoder that directly generates a series of style vectors which are fed into a pretrained StyleGAN generator, forming the extended latent space. Feature maps are first extracted using a standard feature pyramid over a ResNet backbone. Then, for each target style, a small mapping network is trained to extract the learned style from the corresponding feature map: coarse styles are generated from the small feature map, medium styles from the medium feature map, and fine styles from the largest feature map. The mapping network, map2style, is a small fully convolutional network, which gradually reduces spatial size using a set of 2-strided convolutions followed by LeakyReLU activations. Each generated 512-dimensional vector is fed into StyleGAN, starting from its matching affine transformation.

Computer VisionIntroduced 20002 papers

TernaryBERT

TernaryBERT is a Transformer-based model which ternarizes the weights of a pretrained BERT model to {-1, 0, +1}, with different granularities for the word embeddings and the weights in the Transformer layers. Instead of directly using knowledge distillation to compress a model, it is used to improve the performance of the ternarized student model, which has the same size as the teacher model. In this way, the knowledge is transferred from the highly accurate teacher model to the ternarized student model with smaller capacity.
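A generic threshold-based ternarization (TWN-style) can illustrate the quantization step; TernaryBERT's own approximation- and loss-aware variants differ in how the threshold and scale are chosen, so treat this as a sketch, not the paper's method.

```python
import numpy as np

def ternarize(w, t=0.7):
    """Ternarize weights to alpha * {-1, 0, +1}.

    The threshold delta is a fraction t of the mean |w| (a common TWN-style
    heuristic); the scale alpha is the mean |w| over surviving weights.
    """
    delta = t * np.abs(w).mean()
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

w = np.array([0.9, -0.05, 0.4, -0.8, 0.02])
wt = ternarize(w)
print(wt)
```

Only the full-precision scale alpha and a 2-bit code per weight need to be stored, which is where the compression comes from.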

Natural Language ProcessingIntroduced 20002 papers

SABL

Side-Aware Boundary Localization

Side-Aware Boundary Localization (SABL) is a methodology for precise localization in object detection where each side of the bounding box is respectively localized with a dedicated network branch. Empirically, the authors observe that when they manually annotate a bounding box for an object, it is often much easier to align each side of the box to the object boundary than to move the box as a whole while tuning the size. Inspired by this observation, in SABL each side of the bounding box is respectively positioned based on its surrounding context. As shown in the Figure, the authors devise a bucketing scheme to improve the localization precision. For each side of a bounding box, this scheme divides the target space into multiple buckets, then determines the bounding box via two steps. Specifically, it first searches for the correct bucket, i.e., the one in which the boundary resides. Leveraging the centerline of the selected buckets as a coarse estimate, fine regression is then performed by predicting the offsets. This scheme allows very precise localization even in the presence of displacements with large variance. Moreover, to preserve precisely localized bounding boxes in the non-maximal suppression procedure, the authors also propose to adjust the classification score based on the bucketing confidences, which leads to further performance gains.
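The bucketing-then-regression idea above can be sketched in one dimension: coarsely classify which bucket a boundary falls in, then regress a small offset from that bucket's centerline. The bucket count and helper names below are illustrative, not the paper's.

```python
import numpy as np

def encode_boundary(boundary, lo, hi, n_buckets=7):
    """Encode a 1-D boundary coordinate as (bucket index, offset from the
    bucket centerline, in units of bucket width) - the two-step target."""
    edges = np.linspace(lo, hi, n_buckets + 1)
    idx = int(np.clip(np.searchsorted(edges, boundary) - 1, 0, n_buckets - 1))
    center = (edges[idx] + edges[idx + 1]) / 2
    width = (hi - lo) / n_buckets
    return idx, (boundary - center) / width

def decode_boundary(idx, offset, lo, hi, n_buckets=7):
    """Invert the encoding: bucket centerline plus scaled offset."""
    edges = np.linspace(lo, hi, n_buckets + 1)
    center = (edges[idx] + edges[idx + 1]) / 2
    return center + offset * (hi - lo) / n_buckets

idx, off = encode_boundary(3.2, lo=0.0, hi=7.0, n_buckets=7)
print(idx, off)
```

The classification step only has to get the bucket right, and the regression target stays small and well-conditioned, which is what makes large-variance displacements tractable.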

Computer VisionIntroduced 20002 papers

GaAN

Gated Attention Networks

Gated Attention Networks (GaAN) is a new architecture for learning on graphs. Unlike the traditional multi-head attention mechanism, which equally consumes all attention heads, GaAN uses a convolutional sub-network to control each attention head’s importance. Image credit: GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs

GraphsIntroduced 20002 papers

ShapeConv

ShapeConv, or Shape-aware Convolutional layer, is a convolutional layer for processing the depth feature in indoor RGB-D semantic segmentation. The depth feature is firstly decomposed into a shape-component and a base-component, next two learnable weights are introduced to cooperate with them independently, and finally a convolution is applied on the re-weighted combination of these two components.

Computer VisionIntroduced 20002 papers

Adam-mini

Adaptive Moment Estimation - Mini

Adam-mini is a memory-efficient Adam variant that achieves on-par or better performance than AdamW with a 45% to 50% smaller memory footprint. Adam-mini reduces the memory footprint by cutting down the learning rate resources in Adam (i.e., 1/sqrt(v)). The authors find that >= 90% of these learning rates in v could be harmlessly removed if they (1) carefully partition the parameters into blocks following their proposed principle on Hessian structure; and (2) assign a single but good learning rate to each parameter block. They further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out.

GeneralIntroduced 20002 papers

CTAL

CTAL is a pre-training framework for strong audio-and-language representations with a Transformer, which aims to learn the intra-modality and inter-modality connections between audio and language through two proxy tasks on a large number of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. The pre-trained model is a Transformer for Audio and Language, i.e., CTAL, which consists of two modules: a language-stream encoding module that takes words as input elements, and a text-referred audio-stream encoder module that accepts both frame-level Mel-spectrograms and token-level output embeddings from the language stream.

Computer VisionIntroduced 20002 papers

RAHP

Review-guided Answer Helpfulness Prediction

Review-guided Answer Helpfulness Prediction (RAHP) is a textual inference model for identifying helpful answers in e-commerce. It not only considers the interactions between QA pairs, but also investigates the opinion coherence between the answer and crowds' opinions reflected in the reviews, which is another important factor to identify helpful answers.

Natural Language ProcessingIntroduced 20002 papers

SmeLU

Smooth ReLU

GeneralIntroduced 20002 papers

DynaBERT

DynaBERT is a BERT-variant which can flexibly adjust the size and latency by selecting adaptive width and depth. The training process of DynaBERT includes first training a width-adaptive BERT and then allowing both adaptive width and depth, by distilling knowledge from the full-sized model to small sub-networks. Network rewiring is also used to keep the more important attention heads and neurons shared by more sub-networks. A two-stage procedure is used to train DynaBERT. First, using knowledge distillation (dashed lines) to transfer the knowledge from a fixed teacher model to student sub-networks with adaptive width in DynaBERTW. Then, using knowledge distillation (dashed lines) to transfer the knowledge from a trained DynaBERTW to student sub-networks with adaptive width and depth in DynaBERT.

Natural Language ProcessingIntroduced 20002 papers

SMOT

Single-Shot Multi-Object Tracker

Single-Shot Multi-Object Tracker, or SMOT, is a tracking framework that converts any single-shot detector (SSD) model into an online multiple-object tracker, emphasizing simultaneous detection and tracking of object paths. Contrary to existing tracking-by-detection approaches, which suffer from errors made by the object detectors, SMOT adopts the recently proposed scheme of tracking by re-detection. SMOT consists of two stages. The first stage generates temporally consecutive tracklets by exploiting the temporal and spatial correlations from the previous frame. The second stage performs online linking of the tracklets to generate a face track for each person.

Computer VisionIntroduced 20002 papers

Policy Similarity Metric

Policy Similarity Metric, or PSM, is a similarity metric for measuring behavioral similarity between states in reinforcement learning. It assigns high similarity to states for which the optimal policies in those states as well as in future states are similar. PSM is reward-agnostic, making it more robust for generalization compared to approaches that rely on reward information.

Reinforcement LearningIntroduced 20002 papers

Feedback Transformer

A Feedback Transformer is a type of sequential transformer that exposes all previous representations to all future representations, meaning the lowest representation of the current timestep is formed from the highest-level abstract representation of the past. This feedback nature allows this architecture to perform recursive computation, building stronger representations iteratively upon previous states. To achieve this, the self-attention mechanism of the standard Transformer is modified so it attends to higher level representations rather than lower ones.

Natural Language ProcessingIntroduced 20002 papers

Multi Loss (BCE Loss + Focal Loss) + Dice Loss

The proposed loss function is a combination of BCE Loss, Focal Loss, and Dice Loss; each contributes individually to improved performance. Further details of the loss functions are given below. (1) BCE Loss compares each actual class label with the predicted probability, which can be either 0 or 1; it is based on the Bernoulli distribution and is typically used when exactly two classes are available, as is the case here: background and foreground. In the proposed method it is used for pixel-level classification. (2) Focal Loss is a variant of BCE that lets the model focus on learning hard examples by down-weighting easy examples; it works well when the data is highly imbalanced. We use 0.25 as the value of alpha and 2.0 as the value of gamma. (3) Dice Loss is inspired by the Dice coefficient score, an evaluation metric for image segmentation tasks that measures the similarity between two images; since the Dice coefficient is non-convex, it has been modified to make it more tractable: \begin{equation} Dice Loss = 1 - \frac{2\,|A \cap B|}{|A| + |B|} \end{equation} We propose a loss function that combines all three to benefit from each: BCE is used for pixel-wise classification, Focal Loss for learning hard examples, and Dice Loss for learning a better boundary representation: \begin{equation} Loss = \left( BCE Loss + Focal Loss \right) + Dice Loss \end{equation}
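A minimal NumPy sketch of the three terms and their sum, assuming sigmoid probabilities p and binary labels y (function names and the toy predictions are ours, not the source's):

```python
import numpy as np

def bce(p, y, eps=1e-7):
    # binary cross-entropy, averaged over pixels
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def focal(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    # focal loss: BCE down-weighted for easy examples via (1 - pt)^gamma
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y == 1, p, 1 - p)            # probability of the true class
    at = np.where(y == 1, alpha, 1 - alpha)
    return (-at * (1 - pt) ** gamma * np.log(pt)).mean()

def dice(p, y, eps=1e-7):
    # soft Dice loss: 1 - 2|A.B| / (|A| + |B|)
    inter = (p * y).sum()
    return 1 - (2 * inter + eps) / (p.sum() + y.sum() + eps)

def combined_loss(p, y):
    return (bce(p, y) + focal(p, y)) + dice(p, y)

y = np.array([1.0, 1.0, 0.0, 0.0])
good = np.array([0.9, 0.8, 0.1, 0.2])
bad = np.array([0.3, 0.4, 0.7, 0.6])
print(combined_loss(good, y), combined_loss(bad, y))
```

Confident correct predictions score much lower than poor ones under the combined objective, with the Dice term driven by region overlap rather than per-pixel counts.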

GeneralIntroduced 20002 papers

SyCoCa

Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment

Multimodal alignment between language and vision is the fundamental topic in current vision-language model research. Contrastive Captioners (CoCa), as a representative method, integrates Contrastive Language-Image Pretraining (CLIP) and Image Captioning (IC) into a unified framework, resulting in impressive results. CLIP imposes a bidirectional constraint on the global representations of entire images and sentences. Although IC conducts a unidirectional image-to-text generation on local representation, it lacks any constraint on local text-to-image reconstruction, which limits the ability to understand images at a fine-grained level when aligned with texts. To achieve multimodal alignment from both global and local perspectives, this paper proposes Symmetrizing Contrastive Captioners (SyCoCa), which introduces bidirectional interactions on images and texts across the global and local representation levels. Specifically, we expand a Text-Guided Masked Image Modeling (TG-MIM) head based on the ITC and IC heads. The improved SyCoCa can further leverage textual cues to reconstruct contextual images and visual cues to predict textual contents. When implementing bidirectional local interactions, the local contents of images tend to be cluttered or unrelated to their textual descriptions. Thus, we employ an attentive masking strategy to select effective image patches for interaction. Extensive experiments on five vision-language tasks, including image-text retrieval, image captioning, visual question answering, and zero-shot/finetuned image classification, validate the effectiveness of our proposed method.

Computer VisionIntroduced 20002 papers

M3L

Multi-modal Teacher for Masked Modality Learning

GeneralIntroduced 20002 papers

CLASSP

Continual Learning through Adjustment Suppression and Sparsity Promotion

GeneralIntroduced 20002 papers

Conditional DBlock

Conditional DBlock is a residual based block used in the discriminator of the GAN-TTS architecture. They are similar to the GBlocks used in the generator, but without batch normalization. Unlike the DBlock, the Conditional DBlock adds the embedding of the linguistic features after the first convolution.

GeneralIntroduced 20002 papers

Viewmaker Network

Viewmaker Network is a type of generative model that learns to produce input-dependent views for contrastive learning. This network is trained jointly with an encoder network: the viewmaker is trained adversarially to create views which increase the contrastive loss of the encoder. Rather than directly outputting views for an image, the viewmaker outputs a stochastic perturbation that is added to the input. This perturbation is projected onto an L1 sphere, controlling the effective strength of the view, similar to methods in adversarial robustness. This constrained adversarial training enables the model to reduce the mutual information between different views while preserving useful input features for the encoder to learn from. Specifically, the encoder and viewmaker are optimized in alternating steps to minimize and maximize the contrastive loss, respectively. An image-to-image neural network is used as the viewmaker, with an architecture adapted from work on style transfer. This network ingests the input image and outputs a perturbation that is constrained to an L1 sphere. The sphere's radius is determined by the volume of the input tensor times a hyperparameter, the distortion budget, which determines the strength of the applied perturbation. This perturbation is added to the input image and optionally clamped, in the case of images, to ensure all pixels are in [0, 1].
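A sketch of the perturbation constraint only (not the adversarial training loop): scale a raw perturbation onto an L1 sphere whose radius is the distortion budget times the number of input elements, add it, and clamp to [0, 1]. The budget value and L1 choice here are assumptions for illustration.

```python
import numpy as np

def apply_view(x, delta, budget=0.05):
    """Project a raw perturbation onto an L1 sphere of radius
    budget * x.size, add it to the input, and clamp pixels to [0, 1]."""
    radius = budget * x.size
    delta = delta * radius / (np.abs(delta).sum() + 1e-12)  # scale to sphere
    return np.clip(x + delta, 0.0, 1.0)

rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, size=(8, 8))            # toy "image"
view = apply_view(x, rng.normal(size=(8, 8)))
print(np.abs(view - x).sum())
```

The budget caps the total distortion regardless of what the (here random, in the real method learned) perturbation network outputs, so the adversary can never destroy the input outright.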

Computer VisionIntroduced 20002 papers

FFMv2

Feature Fusion Module v2

Feature Fusion Module v2 is a feature fusion module from the M2Det object detection model, and is crucial for constructing the final multi-level feature pyramid. It uses 1x1 convolution layers to compress the channels of the input features and a concatenation operation to aggregate these feature maps. FFMv2 takes the base feature and the largest output feature map of the previous Thinned U-Shape Module (TUM) – these two are of the same scale – as input, and produces the fused feature for the next TUM.

Computer VisionIntroduced 20002 papers

Pixel-BERT

Pixel-BERT is a pre-trained model trained to align image pixels with text. The end-to-end framework includes a CNN-based visual encoder and cross-modal transformers for visual and language embedding learning. This model has three parts: one fully convolutional neural network that takes pixels of an image as input, one word-level token embedding based on BERT, and a multimodal transformer for jointly learning visual and language embedding. For language, it uses other pretraining works to use Masked Language Modeling (MLM) to predict masked tokens with surrounding text and images. For vision, it uses the random pixel sampling mechanism that makes up for the challenge of predicting pixel-level features. This mechanism is also suitable for solving overfitting issues and improving the robustness of visual features. It applies Image-Text Matching (ITM) to classify whether an image and a sentence pair match for vision and language interaction. Image captioning is required to understand language and visual semantics for cross-modality tasks like VQA. Region-based visual features extracted from object detection models like Faster RCNN are used for better performance in the newer version of the model.

Computer VisionIntroduced 20002 papers

myGym

MyGym: Modular Toolkit for Visuomotor Robotic Tasks

We introduce myGym, a toolkit suitable for fast prototyping of neural networks in the area of robotic manipulation and navigation. Our toolbox is fully modular, enabling users to train their algorithms on different robots, environments, and tasks. We also include pretrained neural network modules for the real-time vision that allows training visuomotor tasks with sim2real transfer. The visual modules can be easily retrained using the dataset generation pipeline with domain augmentation and randomization. Moreover, myGym provides automatic evaluation methods and baselines that help the user to directly compare their trained model with the state-of-the-art algorithms. We additionally present a novel metric, called learnability, to compare the general learning capability of algorithms in different settings, where the complexity of the environment, robot, and the task is systematically manipulated. The learnability score tracks differences between the performance of algorithms in increasingly challenging setup conditions, and thus allows the user to compare different models in a more systematic fashion. The code is accessible at https://github.com/incognite-lab/myGym

Reinforcement LearningIntroduced 20002 papers

AdaSmooth

Adaptive Smooth Optimizer

AdaSmooth is a stochastic optimization technique providing a per-dimension learning rate method for SGD. It is an extension of AdaGrad and AdaDelta that seeks to reduce their aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to a fixed size, while AdaSmooth adaptively selects the size of the window: an effective ratio is calculated from the window, a scaled smoothing constant is obtained from the effective ratio, and the running average at each time step then depends only on the previous average and the current gradient, which is incorporated into the final update. The slow smoothing constant is usually set to around 0.99. The main advantages of AdaSmooth are its faster convergence rate and insensitivity to hyperparameters.

GeneralIntroduced 20002 papers

PELU

Parametric Exponential Linear Unit

Parameterized Exponential Linear Units, or PELU, is an activation function for neural networks. It involves learning a parameterization of ELU in order to learn the proper activation shape at each layer in a CNN. PELU has two additional parameters over the ELU: f(x) = (a/b) x for x >= 0 and f(x) = a (exp(x/b) - 1) for x < 0, where a > 0 and b > 0. Here a/b causes a change in the slope in the positive quadrant, b controls the scale of the exponential decay, and a controls the saturation in the negative quadrant. Source: Activation Functions
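The piecewise definition can be written directly; a minimal NumPy sketch (parameter values in the call are illustrative):

```python
import numpy as np

def pelu(x, a=1.0, b=1.0):
    """Parametric ELU with learnable a, b > 0:
       f(x) = (a / b) * x            for x >= 0
       f(x) = a * (exp(x / b) - 1)   for x <  0
    a/b sets the positive-quadrant slope, b the exponential decay scale,
    and -a the negative saturation level."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, (a / b) * x, a * (np.exp(x / b) - 1))

out = pelu([-2.0, 0.0, 2.0], a=2.0, b=1.0)
print(out)
```

With a = b the positive slope is 1, recovering the ordinary ELU shape; during training, a and b would be updated by gradient descent along with the weights.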

GeneralIntroduced 20002 papers

SRU++

SRU++ is a self-attentive recurrent unit that combines fast recurrence and attention for sequence modeling, extending the SRU unit. The key modification of SRU++ is to incorporate more expressive non-linear operations into the recurrent network. Specifically, given the input sequence represented as a matrix, the attention component computes query, key and value representations with learned projections, where the attention dimension is typically much smaller than the model dimension. Note that the keys and values are computed from the query representation instead of directly from the input, so that their weight matrices are significantly smaller. Next, a weighted average output is computed using scaled dot-product attention. The final output required by the elementwise recurrence is obtained by another linear projection, combining the query with the attention output scaled by a learned scalar; the query term acts as a residual connection which improves gradient propagation and stabilizes training. The scalar is initialized to zero, so the unit initially falls back to a linear transformation of the input, skipping the attention transformation. Intuitively, skipping attention encourages leveraging recurrence to capture sequential patterns during the early stage of training; as the scalar grows, the attention mechanism can learn long-range dependencies for the model. In addition, computing keys and values from the query can be interpreted as applying a matrix factorization trick with a small inner dimension, reducing the total number of parameters. The figure compares SRU, SRU with this factorization trick (but without attention), and SRU++. The last modification is adding layer normalization to each SRU++ layer, applied after the attention operation and before the final matrix multiplication. This is post-layer normalization, in which the normalization is added after the residual connection.

SequentialIntroduced 20002 papers

ConGater

Controllable Gate Adapter

ConGater uses blocks similar to adapters but changes the way adapter activation works by adding a novel activation function. This allows the ConGater block to manually control the activation of the gates, which results in continuous control of any desired attributes inside the model.

GeneralIntroduced 20002 papers

InPlace-ABN

In-Place Activated Batch Normalization

In-Place Activated Batch Normalization, or InPlace-ABN, substitutes the conventionally used succession of BatchNorm + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. It approximately halves the memory requirements during training of modern deep learning models.

GeneralIntroduced 20002 papers

DELU

DELU is an activation function with trainable parameters; it uses a combination of linear and exponential functions in the positive domain and uses SiLU in the negative domain.

GeneralIntroduced 20002 papers

PocketNet

PocketNet is a face recognition model family discovered through neural architecture search. The training is based on multi-step knowledge distillation.

Computer VisionIntroduced 20002 papers

FFMv1

Feature Fusion Module v1

Feature Fusion Module v1 is a feature fusion module from the M2Det object detection model; feature fusion modules are crucial for constructing the final multi-level feature pyramid. It uses 1x1 convolution layers to compress the channels of the input features and a concatenation operation to aggregate these feature maps. FFMv1 takes two feature maps with different scales in the backbone as input; it adopts one upsample operation to rescale the deep features to the same scale before the concatenation operation.

Computer VisionIntroduced 20002 papers