Given a pattern that is more complicated than those whose exact counts are known, we fragment it into simpler subpatterns whose exact counts are known. The subgraph GNN proposed earlier looks into subgraphs of the host graph. We have seen that this technique is scalable to large graphs, and that subgraph GNNs are more expressive and efficient than traditional GNNs. We therefore explore the expressive power obtained when the pattern is fragmented into smaller subpatterns.
GoogLeNet is a type of convolutional neural network based on the Inception architecture. It utilises Inception modules, which allow the network to choose between multiple convolutional filter sizes in each block. An Inception network stacks these modules on top of each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid.
VOS is a type of video object segmentation model consisting of two network components. The target appearance model consists of a light-weight module, which is learned during the inference stage using fast optimization techniques to predict a coarse but robust target segmentation. The segmentation model is exclusively trained offline, designed to process the coarse scores into high quality segmentation masks.
PnP, or Poll and Pool, is a sampling module extension for DETR-type architectures that adaptively allocates computation spatially to be more efficient. Concretely, the PnP module abstracts the image feature map into fine foreground object feature vectors and a small number of coarse background contextual feature vectors. The transformer models information interaction within the fine-coarse feature space and translates the features into the detection result.
BigGAN is a type of generative adversarial network that was designed for scaling generation to high-resolution, high-fidelity images. It includes a number of incremental changes and innovations. The baseline and incremental changes are:
- Using SAGAN as a baseline with spectral norm for G and D, and using TTUR.
- Using a Hinge Loss GAN objective.
- Using class-conditional batch normalization to provide class information to G (but with a linear projection rather than an MLP).
- Using a projection discriminator to provide class information to D.
- Evaluating with an EWMA of G's weights, similar to ProGAN.

The innovations are:
- Increasing batch sizes, which has a big effect on the Inception Score of the model.
- Increasing the width in each layer, which leads to a further Inception Score improvement.
- Adding skip connections from the latent variable to deeper layers, which helps performance.
- A new variant of Orthogonal Regularization.
Grid Sensitive is a trick for object detection introduced by YOLOv4. In the original YOLOv3, the coordinates of the bounding box center $x$ and $y$ are decoded as $x = s \cdot (g_x + \sigma(p_x))$ and $y = s \cdot (g_y + \sigma(p_y))$, where $\sigma$ is the sigmoid function, $g_x$ and $g_y$ are integers and $s$ is a scale factor. Obviously, $x$ and $y$ cannot be exactly equal to $s \cdot g_x$ or $s \cdot (g_x + 1)$. This makes it difficult to predict the centers of bounding boxes located exactly on the grid boundary. We can address this problem by changing the equations to $x = s \cdot (g_x + \alpha \cdot \sigma(p_x) - (\alpha - 1)/2)$ and $y = s \cdot (g_y + \alpha \cdot \sigma(p_y) - (\alpha - 1)/2)$, where $\alpha$ is a hyper-parameter slightly larger than 1. This makes it easier for the model to predict a bounding box center located exactly on the grid boundary. The extra FLOPs added by Grid Sensitive are very small and can be totally ignored.
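A minimal PyTorch sketch of the Grid Sensitive decoding under these assumptions (function and variable names, and the value of alpha, are illustrative):

```python
import torch

def decode_center(p_x, p_y, g_x, g_y, s, alpha=1.05):
    """Grid Sensitive decoding of a bounding-box center.

    p_x, p_y : raw network outputs for the grid cell
    g_x, g_y : integer grid-cell coordinates
    s        : stride / scale factor of the feature map
    alpha    : factor slightly larger than 1 (illustrative value)
    """
    # Stretch the sigmoid range to slightly more than [0, 1] so that
    # centers lying exactly on a grid boundary become reachable.
    x = s * (g_x + alpha * torch.sigmoid(p_x) - (alpha - 1) / 2)
    y = s * (g_y + alpha * torch.sigmoid(p_y) - (alpha - 1) / 2)
    return x, y
```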
Conditional Variational Auto Encoder
YOLOv4 is a one-stage object detection model that improves on YOLOv3 with several bags of tricks and modules introduced in the literature. The components section below details the tricks and modules used.
Inception-v3 is a convolutional neural network architecture from the Inception family that makes several improvements including using Label Smoothing, Factorized 7 x 7 convolutions, and the use of an auxiliary classifier to propagate label information lower down the network (along with the use of batch normalization for layers in the side head).
SqueezeNet is a convolutional neural network that employs design strategies to reduce the number of parameters, notably with the use of fire modules that "squeeze" parameters using 1x1 convolutions.
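A minimal PyTorch sketch of a fire module, assuming illustrative channel sizes rather than the exact SqueezeNet configuration:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Sketch of a SqueezeNet fire module."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        # Squeeze: 1x1 convolutions reduce the channel count, saving parameters.
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        # Expand: a mix of 1x1 and 3x3 convolutions restores capacity.
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Example: 96 input channels squeezed to 16, expanded back to 64 + 64.
fire = Fire(96, 16, 64, 64)
out = fire(torch.randn(1, 96, 55, 55))  # -> (1, 128, 55, 55)
```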
Inception-v3 Module is an image block used in the Inception-v3 architecture. This architecture is used on the coarsest (8 × 8) grids to promote high dimensional representations.
Wasserstein GAN
Wasserstein GAN, or WGAN, is a type of generative adversarial network that minimizes an approximation of the Earth-Mover's distance (EM) rather than the Jensen-Shannon divergence as in the original GAN formulation. It leads to more stable training than original GANs with less evidence of mode collapse, as well as meaningful loss curves that can be used for debugging and searching hyperparameters.
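A minimal sketch of the WGAN losses and weight clipping, assuming a PyTorch `critic` module; the clipping constant is illustrative:

```python
import torch

def critic_loss(critic, real, fake):
    # The critic maximizes E[f(real)] - E[f(fake)], i.e. minimizes the negative.
    return -(critic(real).mean() - critic(fake).mean())

def generator_loss(critic, fake):
    # The generator tries to maximize the critic's score on its samples.
    return -critic(fake).mean()

def clip_critic_weights(critic, c=0.01):
    # Weight clipping keeps the critic (approximately) Lipschitz-constrained,
    # as in the original WGAN formulation.
    for p in critic.parameters():
        p.data.clamp_(-c, c)
```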
BLIP: Bootstrapping Language-Image Pre-training
Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at https://github.com/salesforce/BLIP.
1-Dimensional Convolutional Neural Networks
1D Convolutional Neural Networks are similar to the well known and more established 2D Convolutional Neural Networks. 1D Convolutional Neural Networks are mainly used on text and 1D signals.
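A minimal PyTorch sketch of a 1D CNN for a univariate signal; layer sizes and the number of classes are illustrative:

```python
import torch
import torch.nn as nn

# A tiny 1D CNN for sequence data, e.g. a univariate signal of length 128.
model = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=16, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool1d(2),
    nn.Conv1d(16, 32, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(32, 10),  # e.g. 10 output classes
)

signal = torch.randn(8, 1, 128)  # (batch, channels, length)
logits = model(signal)           # -> (8, 10)
```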
Visual Geometry Group 19 Layer CNN
Atrous Spatial Pyramid Pooling
Atrous Spatial Pyramid Pooling (ASPP) is a semantic segmentation module for resampling a given feature layer at multiple rates prior to convolution. This amounts to probing the original image with multiple filters that have complementary effective fields of view, thus capturing objects as well as useful image context at multiple scales. Rather than actually resampling features, the mapping is implemented using multiple parallel atrous convolutional layers with different sampling rates.
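A minimal PyTorch sketch of the parallel atrous branches; dilation rates and the sum-fusion are illustrative choices (concatenation is also commonly used):

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Sketch of Atrous Spatial Pyramid Pooling."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18, 24)):
        super().__init__()
        # Parallel atrous (dilated) 3x3 convolutions with different sampling rates,
        # each probing the feature map with a different effective field of view.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, x):
        # Fuse the multi-rate responses (here by summation).
        return sum(branch(x) for branch in self.branches)

aspp = ASPP(512, 256)
y = aspp(torch.randn(1, 512, 32, 32))  # -> (1, 256, 32, 32)
```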
SegNet is a semantic segmentation model. This core trainable segmentation architecture consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature maps. Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling.
FCOS is an anchor-box free, proposal free, single-stage object detection model. By eliminating the predefined set of anchor boxes, FCOS avoids computation related to anchor boxes such as calculating overlapping during training. It also avoids all hyper-parameters related to anchor boxes, which are often very sensitive to the final detection performance.
The Invertible 1x1 Convolution is a type of convolution used in flow-based generative models that generalizes a channel permutation (such as reversing the ordering of channels). The weight matrix is initialized as a random rotation matrix. The log-determinant of an invertible 1 × 1 convolution of an $h \times w \times c$ tensor $\mathbf{h}$ with weight matrix $\mathbf{W}$ is straightforward to compute:

$$\log \left| \det\left( \frac{d\,\text{conv2d}(\mathbf{h}; \mathbf{W})}{d\,\mathbf{h}} \right) \right| = h \cdot w \cdot \log \left| \det(\mathbf{W}) \right|$$
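A minimal PyTorch sketch of this log-determinant computation, assuming illustrative channel and spatial sizes:

```python
import torch

# Log-determinant contribution of an invertible 1x1 convolution (Glow-style sketch).
c = 8                                   # number of channels
Q, _ = torch.linalg.qr(torch.randn(c, c))
W = Q                                   # random rotation (orthogonal) initialization
h, w = 32, 32                           # spatial size of the feature map

# Applying the 1x1 convolution is a matrix multiply over the channel dimension,
# so its Jacobian log-determinant is simply h * w * log|det(W)|.
logdet = h * w * torch.slogdet(W)[1]
print(logdet)  # ~0 at initialization, since a rotation matrix has |det| = 1
```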
RandomHorizontalFlip is a type of image data augmentation which horizontally flips a given image with a given probability. Image Credit: Apache MXNet
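A minimal usage sketch with torchvision's implementation of this transform (assuming a recent torchvision where transforms accept tensors):

```python
import torch
from torchvision import transforms

# Flip an image horizontally with probability 0.5.
augment = transforms.RandomHorizontalFlip(p=0.5)

img = torch.rand(3, 224, 224)   # a random "image" tensor (C, H, W)
maybe_flipped = augment(img)
```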
A Pyramid Pooling Module is a module for semantic segmentation which acts as an effective global contextual prior. The motivation is that a problem with using a convolutional network like a ResNet is that, while the theoretical receptive field is already larger than the input image, the empirical receptive field is much smaller, especially in high-level layers. This means many networks do not sufficiently incorporate the important global scene prior. The PPM is an effective global prior representation that addresses this problem. It contains information at different scales, varying among different sub-regions. Using the 4-level pyramid, the pooling kernels cover the whole, half of, and small portions of the image. They are fused as the global prior. Then we concatenate the prior with the original feature map in the final part.
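A minimal PyTorch sketch of a 4-level pyramid pooling module; bin sizes and channel counts follow the common (1, 2, 3, 6) setting but are illustrative here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    """Sketch of a 4-level pyramid pooling module."""
    def __init__(self, in_ch, out_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(in_ch, out_ch, kernel_size=1))
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[-2:]
        # Pool at several bin sizes, upsample back, and concatenate with the input.
        priors = [F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
                  for stage in self.stages]
        return torch.cat([x] + priors, dim=1)

ppm = PyramidPoolingModule(2048, 512)
y = ppm(torch.randn(1, 2048, 60, 60))  # -> (1, 2048 + 4*512, 60, 60)
```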
HRNet, or High-Resolution Net, is a general purpose convolutional neural network for tasks like semantic segmentation, object detection and image classification. It is able to maintain high resolution representations through the whole process. We start from a high-resolution convolution stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network consists of several (four in the paper) stages, and the $n$th stage contains $n$ streams corresponding to $n$ resolutions. The authors conduct repeated multi-resolution fusions by exchanging the information across the parallel streams over and over.
MobileNet is a type of convolutional neural network designed for mobile and embedded vision applications. They are based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks that can have low latency for mobile and embedded devices.
A Deep Belief Network (DBN) is a multi-layer generative graphical model. DBNs have bi-directional connections (RBM-type connections) on the top layer while the bottom layers only have top-down connections. They are trained using layerwise pre-training. Pre-training occurs by training the network component by component bottom up: treating the first two layers as an RBM and training, then treating the second layer and third layer as another RBM and training for those parameters. Source: Origins of Deep Learning Image Source: Wikipedia
Adaptive Discriminator Augmentation
Cutout is an image augmentation and regularization technique that randomly masks out square regions of the input during training, and can be used to improve the robustness and overall performance of convolutional neural networks. The main motivation for cutout comes from the problem of object occlusion, which is commonly encountered in many computer vision tasks, such as object recognition, tracking, or human pose estimation. By generating new images which simulate occluded examples, we not only better prepare the model for encounters with occlusions in the real world, but the model also learns to take more of the image context into consideration when making decisions.
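A minimal PyTorch sketch of the masking operation; the patch size and zero fill value are illustrative choices:

```python
import torch

def cutout(img, size=16):
    """Zero out a random square patch of an image tensor of shape (C, H, W)."""
    c, h, w = img.shape
    # Pick a random center; the patch may partially fall outside the image.
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(0, cy - size // 2), min(h, cy + size // 2)
    x1, x2 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = img.clone()
    out[:, y1:y2, x1:x2] = 0.0   # mask the square region
    return out

augmented = cutout(torch.rand(3, 32, 32), size=8)
```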
Wizard: Unsupervised goats tracking algorithm
Computer vision is an interesting tool for animal behavior monitoring, mainly because it limits animal handling and can be used to record various traits with only one sensor. Previous studies have shown this technique to be suitable for various species and behaviors. However, it remains challenging to collect individual information, i.e. not only to detect animals and behaviors on the video frames, but also to identify them. Animal identification is a prerequisite for gathering individual information in order to characterize individuals and compare them. A common solution to this problem, known as multiple object tracking, consists in detecting the animals on each video frame and then associating detections with a unique animal ID. Associations of detections between two consecutive frames are generally made to maintain coherence of the detection locations and appearances. To extract appearance information, a common solution is to use a convolutional neural network (CNN), trained on a large dataset before running the tracking algorithm. For farmed animals, designing such a network is challenging as large training datasets are still lacking. In this article, we propose an innovative solution in which the CNN used to extract appearance information is parameterized using offline unsupervised training. The algorithm, named Wizard, was evaluated for the purpose of goat monitoring in outdoor conditions. 17 annotated videos were used, for a total of 4 h 30 min, with varying numbers of animals per video (from 3 to 8) and different levels of color difference between animals. First, the ability of the algorithm to track the detected animals was evaluated. When animals were detected, the algorithm found the correct animal ID in 94.82% of the frames. When tracking and detection were evaluated together, we found that Wizard found the correct animal ID over 86.18% of the video length. In situations where the animal detection rate is high, Wizard appears to be a suitable solution for individual behavior analysis experiments based on computer vision.
Hierarchical Feature Fusion (HFF) is a feature fusion method employed in ESP and EESP image model blocks for degridding. In the ESP module, concatenating the outputs of dilated convolutions gives the ESP module a large effective receptive field, but it introduces unwanted checkerboard or gridding artifacts. To address the gridding artifact in ESP, the feature maps obtained using kernels of different dilation rates are hierarchically added before concatenating them (HFF). This solution is simple and effective and does not increase the complexity of the ESP module.
UNet++ is an architecture for semantic segmentation based on the U-Net. Through the use of densely connected nested decoder sub-networks, it enhances extracted feature processing and was reported by its authors to outperform the U-Net in Electron Microscopy (EM), Cell, Nuclei, Brain Tumor, Liver and Lung Nodule medical image segmentation tasks.
Training a denoiser on a class of signals gives you a powerful prior over that class, which can then be used to sample new examples of the signal.
A Sparse Autoencoder is a type of autoencoder that employs sparsity to achieve an information bottleneck. Specifically, the loss function is constructed so that activations are penalized within a layer. The sparsity constraint can be imposed with L1 regularization or with a KL divergence between the expected average neuron activation and a target activation level $\rho$. Image: Jeff Jordan. Read his blog post for a detailed summary of autoencoders.
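A minimal PyTorch sketch of the KL-divergence sparsity penalty; the target activation and the helper name are illustrative:

```python
import torch

def kl_sparsity_penalty(activations, rho=0.05, eps=1e-8):
    """KL-divergence sparsity penalty for a sparse autoencoder.

    activations: hidden-layer activations in [0, 1], shape (batch, hidden)
    rho:         target average activation (illustrative value)
    """
    rho_hat = activations.mean(dim=0).clamp(eps, 1 - eps)  # average activation per unit
    # KL(rho || rho_hat), summed over hidden units
    kl = rho * torch.log(rho / rho_hat) + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))
    return kl.sum()

# total_loss = reconstruction_loss + beta * kl_sparsity_penalty(hidden_activations)
```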
YOLOv2, or YOLO9000, is a single-stage real-time object detection model. It improves upon YOLOv1 in several ways, including the use of Darknet-19 as a backbone, batch normalization, a high-resolution classifier, and anchor boxes to predict bounding boxes, among other changes.
Part-based Convolutional Baseline
ShuffleNet is a convolutional neural network designed especially for mobile devices with very limited computing power. The architecture utilizes two new operations, pointwise group convolution and channel shuffle, to reduce computation cost while maintaining accuracy.
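A minimal PyTorch sketch of the channel shuffle operation; tensor shapes and the number of groups are illustrative:

```python
import torch

def channel_shuffle(x, groups):
    """ShuffleNet-style channel shuffle for a feature map of shape (N, C, H, W),
    with C divisible by `groups`."""
    n, c, h, w = x.shape
    # Reshape channels into (groups, channels_per_group), transpose, and flatten,
    # so information can flow between groups in the next group convolution.
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

y = channel_shuffle(torch.randn(2, 12, 8, 8), groups=3)
```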
OASIS is a GAN-based model to translate semantic label maps into realistic-looking images. The model builds on preceding work such as Pix2Pix and SPADE. OASIS introduces the following innovations:

1. The method is not dependent on the perceptual loss, which is commonly used for the semantic image synthesis task. A VGG network trained on ImageNet is routinely employed as the perceptual loss to strongly improve synthesis quality. The authors show that this perceptual loss also has negative effects: first, it reduces the diversity of the generated images; second, it biases the color distribution of the generated images towards ImageNet. OASIS eliminates the dependence on the perceptual loss by changing the common discriminator design: the OASIS discriminator segments an image, assigning each pixel either to one of the real semantic classes or to an additional fake class. In doing so, it makes more efficient use of the label maps that the discriminator normally receives. This distinguishes it from the commonly used encoder-shaped discriminators, which concatenate the label maps to the input image and predict a single score per image. With the more fine-grained supervision provided by the OASIS discriminator's loss, the perceptual loss is shown to become unnecessary.

2. A user can generate a diverse set of images per label map by simply resampling noise. This is achieved by conditioning the spatially-adaptive denormalization module in each layer of the GAN generator directly on spatially replicated input noise. A side effect of this conditioning is that, at inference time, an image can be resampled either globally or locally (either the complete image changes or only a restricted region of it).
UNet Transformer
UNETR, or UNet Transformer, is a Transformer-based architecture for medical image segmentation that utilizes a pure transformer as the encoder to learn sequence representations of the input volume -- effectively capturing the global multi-scale information. The transformer encoder is directly connected to a decoder via skip connections at different resolutions like a U-Net to compute the final semantic segmentation output.
StyleGAN2 is a generative adversarial network that builds on StyleGAN with several improvements. First, adaptive instance normalization is redesigned and replaced with a normalization technique called weight demodulation. Secondly, an improved training scheme upon progressively growing is introduced, which achieves the same goal - training starts by focusing on low-resolution images and then progressively shifts focus to higher and higher resolutions - without changing the network topology during training. Additionally, new types of regularization like lazy regularization and path length regularization are proposed.
A BiFPN, or Weighted Bi-directional Feature Pyramid Network, is a type of feature pyramid network which allows easy and fast multi-scale feature fusion. It incorporates the multi-level feature fusion idea from FPN, PANet and NAS-FPN that enables information to flow in both the top-down and bottom-up directions, while using regular and efficient connections. It also utilizes a fast normalized fusion technique. Traditional approaches usually treat all features input to the FPN equally, even those with different resolutions. However, input features at different resolutions often have unequal contributions to the output features. Thus, the BiFPN adds an additional weight for each input feature, allowing the network to learn the importance of each. All regular convolutions are also replaced with less expensive depthwise separable convolutions. Compared with PANet, which added an extra bottom-up path for information flow at the expense of more computational cost, BiFPN optimizes these cross-scale connections by removing nodes with a single input edge, adding an extra edge from the original input to the output node if they are on the same level, and treating each bidirectional path as one feature network layer (repeating it several times for more high-level feature fusion).
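A minimal PyTorch sketch of the fast normalized fusion of several input features; the class name and epsilon value are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    """Sketch of BiFPN-style fast normalized fusion for a fixed number of inputs."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        # One learnable scalar weight per input feature, kept non-negative via ReLU.
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.weights)
        w = w / (w.sum() + self.eps)          # normalize without a costly softmax
        return sum(wi * x for wi, x in zip(w, inputs))

fuse = FastNormalizedFusion(2)
out = fuse([torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)])
```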
A SENet is a convolutional neural network architecture that employs squeeze-and-excitation blocks to enable the network to perform dynamic channel-wise feature recalibration.
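A minimal PyTorch sketch of a squeeze-and-excitation block; the reduction ratio is illustrative:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Sketch of a squeeze-and-excitation block."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))            # squeeze: global average pooling
        w = self.fc(s).view(n, c, 1, 1)   # excitation: per-channel weights in (0, 1)
        return x * w                      # channel-wise recalibration

y = SEBlock(64)(torch.randn(1, 64, 32, 32))
```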
simple Copy-Paste
SegFormer is a Transformer-based framework for semantic segmentation that unifies Transformers with lightweight multilayer perceptron (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes, which leads to decreased performance when the testing resolution differs from training. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, thus combining both local attention and global attention to render powerful representations.
PSPNet, or Pyramid Scene Parsing Network, is a semantic segmentation model that utilises a pyramid parsing module that exploits global context information through different-region-based context aggregation. The local and global clues together make the final prediction more reliable. The authors also propose an optimization strategy with an auxiliary (deeply supervised) loss. Given an input image, PSPNet uses a pretrained CNN with the dilated network strategy to extract the feature map; the final feature map size is 1/8 of the input image. On top of the map, the pyramid pooling module is used to gather context information. Using the 4-level pyramid, the pooling kernels cover the whole, half of, and small portions of the image. They are fused as the global prior. Then the prior is concatenated with the original feature map in the final part, followed by a convolution layer to generate the final prediction map.
CenterNet is a one-stage object detector that detects each object as a triplet, rather than a pair, of keypoints. It utilizes two customized modules named cascade corner pooling and center pooling, which play the roles of enriching information collected by both top-left and bottom-right corners and providing more recognizable information at the central regions, respectively. The intuition is that, if a predicted bounding box has a high IoU with the ground-truth box, then the probability that the center keypoint in its central region is predicted as the same class is high, and vice versa. Thus, during inference, after a proposal is generated as a pair of corner keypoints, we determine if the proposal is indeed an object by checking if there is a center keypoint of the same class falling within its central region.
Cascade Corner Pooling is a pooling layer for object detection that builds upon the corner pooling operation. Corners are often outside the objects and thus lack local appearance features. CornerNet uses corner pooling to address this issue, finding the maximum values along the boundary directions so as to determine corners. However, this makes corners sensitive to the edges. To address this problem, we need to let corners see the visual patterns of objects. Cascade corner pooling first looks along a boundary to find a boundary maximum value, then looks inside along the location of the boundary maximum value to find an internal maximum value, and finally adds the two maximum values together. By doing this, the corners obtain both the boundary information and the visual patterns of objects.
Center Pooling is a pooling technique for object detection that aims to capture richer and more recognizable visual patterns. The geometric centers of objects do not necessarily convey very recognizable visual patterns (e.g., the human head contains strong visual patterns, but the center keypoint is often in the middle of the human body). The detailed process of center pooling is as follows: the backbone outputs a feature map, and to determine if a pixel in the feature map is a center keypoint, we find the maximum value in both its horizontal and vertical directions and add them together. By doing this, center pooling helps better detect center keypoints.
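A minimal PyTorch sketch of this idea as described above (taking the row and column maxima for each location and adding them); real implementations typically compose directional corner-pooling operations, so this is a simplified illustration:

```python
import torch

def center_pooling(feat):
    """For each location, add the maximum response along its row (horizontal
    direction) and along its column (vertical direction). feat: (N, C, H, W)."""
    horiz_max = feat.max(dim=-1, keepdim=True).values   # max over width  -> (N, C, H, 1)
    vert_max = feat.max(dim=-2, keepdim=True).values    # max over height -> (N, C, 1, W)
    return horiz_max + vert_max                          # broadcasts back to (N, C, H, W)

scores = center_pooling(torch.randn(1, 80, 64, 64))
```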
A PixelCNN is a generative model that uses autoregressive connections to model images pixel by pixel, decomposing the joint image distribution as a product of conditionals. PixelCNNs are much faster to train than PixelRNNs because convolutions are inherently easier to parallelize; given the vast number of pixels present in large image datasets this is an important advantage.
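A minimal PyTorch sketch of the masked convolution that enforces the autoregressive ordering; the class name and layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Masked convolution used in PixelCNN-style models.

    The mask hides the current pixel (type 'A', used in the first layer) and all
    pixels below / to the right, so each output depends only on previously
    generated pixels in raster-scan order.
    """
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2 + (mask_type == "B"):] = 0  # zero center (for 'A') and everything to its right
        mask[kh // 2 + 1:, :] = 0                         # zero all rows below the center
        self.register_buffer("mask", mask)

    def forward(self, x):
        return nn.functional.conv2d(x, self.weight * self.mask, self.bias,
                                    self.stride, self.padding, self.dilation, self.groups)

layer = MaskedConv2d("A", in_channels=1, out_channels=16, kernel_size=7, padding=3)
out = layer(torch.randn(1, 1, 28, 28))
```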
LeNet is a classic convolutional neural network employing the use of convolutions, pooling and fully connected layers. It was used for the handwritten digit recognition task with the MNIST dataset. The architectural design served as inspiration for future networks such as AlexNet and VGG.