Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

2,776 machine learning methods and techniques


InternVideo

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our methods obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.

Computer Vision · Introduced 2000 · 5 papers

HyperDenseNet

Recently, dense connections have attracted substantial attention in computer vision because they facilitate gradient flow and implicit deep supervision during training. In particular, DenseNet, which connects each layer to every other layer in a feed-forward fashion, has shown impressive performance in natural image classification tasks. We propose HyperDenseNet, a 3-D fully convolutional neural network that extends the definition of dense connectivity to multi-modal segmentation problems. Each imaging modality has a path, and dense connections occur not only between pairs of layers within the same path but also between those across different paths. This contrasts with existing multi-modal CNN approaches, in which modeling several modalities relies entirely on a single joint layer (or level of abstraction) for fusion, typically either at the input or at the output of the network. The proposed network therefore has total freedom to learn more complex combinations between the modalities, within and in-between all levels of abstraction, which significantly improves the learned representation. We report extensive evaluations over two different and highly competitive multi-modal brain tissue segmentation challenges, iSEG 2017 and MRBrainS 2013, with the former focusing on six-month infant data and the latter on adult images. HyperDenseNet yielded significant improvements over many state-of-the-art segmentation networks, ranking at the top on both benchmarks. We further provide a comprehensive experimental analysis of feature re-use, which confirms the importance of hyper-dense connections in multi-modal representation learning.

Computer Vision · Introduced 2000 · 5 papers

2D DWT

2D Discrete Wavelet Transform

Computer Vision · Introduced 2000 · 4 papers

XCiT Layer

An XCiT Layer is the main building block of the XCiT architecture which uses a cross-covariance attention operator as its principal operation. The XCiT layer consists of three main blocks, each preceded by LayerNorm and followed by a residual connection: (i) the core cross-covariance attention (XCA) operation, (ii) the local patch interaction (LPI) module, and (iii) a feed-forward network (FFN). By transposing the query-key interaction, the computational complexity of XCA is linear in the number of data elements N, rather than quadratic as in conventional self-attention.
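
The transposed query-key interaction can be sketched in a few lines of numpy. This is a minimal single-head illustration, not the paper's implementation; the exact normalization and temperature handling are simplified:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def xca(q, k, v, tau=1.0):
    # q, k, v: (N, d) token matrices. The attention map is (d, d),
    # built over feature channels, so the cost is linear in N.
    qh = q / np.linalg.norm(q, axis=0, keepdims=True)  # illustrative L2 normalization
    kh = k / np.linalg.norm(k, axis=0, keepdims=True)
    attn = softmax(kh.T @ qh / tau, axis=-1)           # (d, d) channel-channel map
    return v @ attn                                    # (N, d) output tokens
```

Note that the (d, d) attention map never materializes an (N, N) matrix, which is the source of the linear complexity claim.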

Computer Vision · Introduced 2000 · 4 papers

Siamese U-Net

Siamese U-Net is a model with a pre-trained ResNet34 architecture as an encoder, designed for data-efficient change detection.

Computer Vision · Introduced 2000 · 4 papers

Social-STGCNN

Social-STGCNN is a method for human trajectory prediction. Pedestrian trajectories are influenced not only by the pedestrians themselves but also by their interactions with surrounding objects.

Computer Vision · Introduced 2000 · 4 papers

CoVR

Composed Video Retrieval

The composed video retrieval (CoVR) task is a task where the goal is to find a video that matches both a query image and a query text. The query image represents a visual concept that the user is interested in, and the query text specifies how the concept should be modified or refined. For example, given an image of a fountain and the text "during show at night", the CoVR task is to retrieve a video that shows the fountain at night with a show.

Computer Vision · Introduced 2000 · 4 papers

Big-Little Module

Big-Little Modules are blocks for image models with two branches, one representing a block from a deep model and the other from a less deep counterpart. They were proposed as part of the BigLittle-Net architecture. The two branches are fused with a linear combination and unit weights. These two branches are known as the Big-Branch (more layers and channels at low resolution) and the Little-Branch (fewer layers and channels at high resolution).

Computer Vision · Introduced 2000 · 4 papers

PixelRNN

Pixel Recurrent Neural Network

PixelRNNs are generative neural networks that sequentially predict the pixels in an image along the two spatial dimensions. They model the discrete probability of the raw pixel values and encode the complete set of dependencies in the image. Variants include the Row LSTM and the Diagonal BiLSTM, which scale more easily to larger datasets. Pixel values are treated as discrete random variables by using a softmax layer in the conditional distributions. Masked convolutions are employed to allow PixelRNNs to model full dependencies between the color channels.
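
The spatial part of the masking is easy to visualize. A hedged sketch of the raster-scan mask used in masked convolutions (the additional per-channel A/B masking over color channels is omitted here for brevity):

```python
import numpy as np

def causal_mask(kh, kw):
    # Spatial mask for a masked convolution: the centre pixel may only see
    # pixels above it, or to its left in the same row (raster-scan order).
    m = np.zeros((kh, kw), dtype=int)
    m[:kh // 2, :] = 1          # all rows strictly above the centre
    m[kh // 2, :kw // 2] = 1    # same row, strictly left of the centre
    return m
```

Multiplying a convolution kernel elementwise by this mask zeroes out any weight that would look at a not-yet-generated pixel.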

Computer Vision · Introduced 2000 · 4 papers

XCiT

Cross-Covariance Image Transformers, or XCiT, is a type of vision transformer that aims to combine the accuracy of conventional transformers with the scalability of convolutional architectures. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. The authors propose a "transposed" version of self-attention called cross-covariance attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries.

Computer Vision · Introduced 2000 · 4 papers

Revision Network

Revision Network is a style transfer module that revises a rough stylized image by generating a residual details image; the final stylized image is obtained by combining the residual details image with the rough stylized image. This procedure ensures that the distribution of global style patterns in the rough stylized image is properly kept, while learning to revise only local style patterns via the residual details image is an easier task for the Revision Network. The Revision Network is designed as a simple yet effective encoder-decoder architecture, with only one down-sampling and one up-sampling layer. Further, a patch discriminator is used to help the Revision Network capture fine patch textures in an adversarial learning setting. The patch discriminator is defined following SinGAN, with 5 convolution layers and 32 hidden channels. A relatively shallow discriminator is chosen to (1) avoid overfitting, since only one style image is available, and (2) control the receptive field so that the discriminator can only capture local patterns.

Computer Vision · Introduced 2000 · 4 papers

EmbraceNet

EmbraceNet: A robust deep learning architecture for multimodal classification

Computer Vision · Introduced 2000 · 4 papers

Deformable RoI Pooling

Deformable RoI Pooling adds an offset to each bin position in the regular bin partition of the RoI Pooling. Similarly, the offsets are learned from the preceding feature maps and the RoIs, enabling adaptive part localization for objects with different shapes.

Computer Vision · Introduced 2000 · 4 papers

DeepLabv2

DeepLabv2 is an architecture for semantic segmentation that builds on DeepLab with an atrous spatial pyramid pooling (ASPP) scheme. Parallel dilated convolutions with different rates are applied to the input feature map and then fused together. As objects of the same class can have different sizes in the image, ASPP helps to account for different object sizes.
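
A minimal 1-D numpy sketch of a dilated (atrous) convolution, the building block ASPP runs in parallel at several rates; this is illustrative only, not the DeepLab implementation:

```python
import numpy as np

def dilated_conv1d(x, k, rate):
    # 'same'-style dilated convolution with zero padding; ASPP applies
    # several such convolutions with different rates and fuses the outputs.
    r = rate * (len(k) // 2)
    xp = np.pad(np.asarray(x, dtype=float), r)
    return np.array([sum(k[m] * xp[i + m * rate] for m in range(len(k)))
                     for i in range(len(x))])
```

Increasing `rate` enlarges the receptive field without adding parameters, which is exactly why different rates capture different object sizes.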

Computer Vision · Introduced 2000 · 4 papers

FoveaBox

FoveaBox is an anchor-free framework for object detection. Instead of using predefined anchors to enumerate possible locations, scales and aspect ratios in the search for objects, FoveaBox directly learns the object existing possibility and the bounding box coordinates without anchor reference. This is achieved by: (a) predicting category-sensitive semantic maps for the object existing possibility, and (b) producing a category-agnostic bounding box for each position that potentially contains an object. The scales of target boxes are naturally associated with feature pyramid representations for each input image. It is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs per-pixel classification on the backbone's output; the second subnet performs bounding box prediction for the corresponding position.

Computer Vision · Introduced 2000 · 4 papers

PP-OCR

PP-OCR is an OCR system that consists of three parts: text detection, detected-box rectification, and text recognition. The purpose of text detection is to locate the text area in the image. In PP-OCR, Differentiable Binarization (DB), which is based on a simple segmentation network, is used as the text detector. The recognizer integrates feature extraction and sequence modeling, and adopts the Connectionist Temporal Classification (CTC) loss to avoid the inconsistency between prediction and label.

Computer Vision · Introduced 2000 · 4 papers

Anti-Alias Downsampling

Anti-Alias Downsampling (AA) aims to improve the shift-equivariance of deep networks. Max-pooling is inherently composed of two operations: the first is to densely evaluate the max operator, and the second is naive subsampling. AA is proposed as a low-pass filter between them to achieve practical anti-aliasing in any existing strided layer such as strided convolution. The smoothing factor can be adjusted by changing the blur kernel filter size, where a larger filter size results in increased blur.
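
The three-step decomposition (dense max, blur, subsample) can be sketched in 1-D numpy; a hedged illustration using a binomial [1, 2, 1] blur kernel, not the exact implementation:

```python
import numpy as np

def blurpool1d(x, stride=2):
    x = np.asarray(x, dtype=float)
    # 1) densely evaluate the max operator (stride-1 max over pairs)
    dense_max = np.maximum(x[:-1], x[1:])
    # 2) low-pass blur (binomial [1, 2, 1] / 4) inserted before
    # 3) naive subsampling
    pad = np.pad(dense_max, 1, mode="edge")
    blurred = (pad[:-2] + 2.0 * pad[1:-1] + pad[2:]) / 4.0
    return blurred[::stride]
```

Without step 2, a one-pixel shift of the input can change the subsampled output drastically; the blur makes the result far less sensitive to such shifts.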

Computer Vision · Introduced 2000 · 4 papers

SimAug

Simulation as Augmentation

SimAug, or Simulation as Augmentation, is a data augmentation method for trajectory prediction. It augments the representation such that it is robust to variances in semantic scenes and camera views. First, to deal with the gap between real and synthetic semantic scenes, it represents each training trajectory by high-level scene semantic segmentation features, and defends the model from adversarial examples generated by white-box attack methods. Second, to overcome changes in camera views, it generates multiple views for the same trajectory and encourages the model to focus on the "hardest" view. The classification loss is adopted, and the view with the highest loss is favored during training. Finally, the augmented trajectory is computed as a convex combination of the trajectories generated in the previous steps. The trajectory prediction model is built on a multi-scale representation, and the final model is trained to minimize the empirical vicinal risk over the distribution of augmented trajectories.

Computer Vision · Introduced 2000 · 4 papers

VL-BERT

Visual-Linguistic BERT

VL-BERT is pre-trained on a large-scale image-captions dataset together with a text-only corpus. The inputs to the model are either words from the input sentences or regions-of-interest (RoIs) from input images. It can be fine-tuned to fit most visual-linguistic downstream tasks. Its backbone is a multi-layer bidirectional Transformer encoder, modified to accommodate visual content, with a new type of visual feature embedding added to the input feature embeddings. VL-BERT takes both visual and linguistic elements as input, represented as RoIs in images and subwords in input sentences. Four different types of embeddings are used to represent each input: token embedding, visual feature embedding, segment embedding, and sequence position embedding. VL-BERT is pre-trained using Conceptual Captions and text-only datasets. Two pre-training tasks are used: masked language modeling with visual clues, and masked RoI classification with linguistic clues.

Computer Vision · Introduced 2000 · 4 papers

Precise RoI Pooling

Precise RoI Pooling, or PrRoI Pooling, is a region-of-interest feature extractor that avoids any quantization of coordinates and has a continuous gradient on bounding box coordinates. Given the feature map before RoI/PrRoI Pooling (e.g. from Conv4 in ResNet-50), let w_{i,j} be the feature at one discrete location (i, j) on the feature map. Using bilinear interpolation, the discrete feature map can be considered continuous at any continuous coordinates (x, y): f(x, y) = Σ_{i,j} IC(x, y; i, j) × w_{i,j}, where IC(x, y; i, j) = max(0, 1 − |x − i|) × max(0, 1 − |y − j|) is the interpolation coefficient. Then denote a bin of an RoI as bin = {(x1, y1), (x2, y2)}, where (x1, y1) and (x2, y2) are the continuous coordinates of the top-left and bottom-right points, respectively. We perform pooling (e.g. average pooling) given bin and feature map F by computing a two-order integral: PrPool(bin, F) = (∫_{y1}^{y2} ∫_{x1}^{x2} f(x, y) dx dy) / ((x2 − x1) × (y2 − y1)).
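
A hedged numpy sketch of the idea: bilinear interpolation gives a continuous feature surface, and the average over a bin is approximated here by dense sampling (the actual method evaluates the integral in closed form):

```python
import numpy as np

def bilinear(F, x, y):
    # f(x, y) = sum over the four neighbours of IC(x, y; i, j) * w_{i,j}
    h, w = F.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    val = 0.0
    for i in (x0, x0 + 1):
        for j in (y0, y0 + 1):
            if 0 <= i < w and 0 <= j < h:
                val += max(0.0, 1 - abs(x - i)) * max(0.0, 1 - abs(y - j)) * F[j, i]
    return val

def prroi_avg(F, x1, y1, x2, y2, n=64):
    # approximate the two-order integral over the bin by dense sampling
    xs = np.linspace(x1, x2, n)
    ys = np.linspace(y1, y2, n)
    return float(np.mean([[bilinear(F, x, y) for x in xs] for y in ys]))
```

Because the bin corners are real-valued, no coordinate is ever rounded, which is what gives PrRoI Pooling its continuous gradient with respect to the box.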

Computer Vision · Introduced 2000 · 4 papers

EBC

Enhanced Blockwise Classification

Traditional methods are based on block-wise regression. Enhanced Blockwise Classification (EBC), in contrast, classifies the count value within each block into several pre-defined bins. The enhancement comes from three aspects: the discretization policy, label correction, and the loss function. Note that the original block-wise classification concept was introduced by Liu et al. in Counting Objects by Blockwise Classification.
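
The core classification step can be sketched simply. The bin edges below are a hypothetical discretization policy for illustration, not the one from the paper:

```python
import numpy as np

def count_to_bin(count, edges):
    # Classify a per-block count into one of the pre-defined bins.
    # `edges` is an increasing list of bin lower edges, e.g. [0, 1, 2, 4, 8]:
    # bin 0 covers [0, 1), bin 1 covers [1, 2), ..., the last bin is open-ended.
    return int(np.searchsorted(edges, count, side="right")) - 1
```

Turning regression into classification over such bins lets the loss penalize being in the wrong bin rather than an exact count error.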

Computer Vision · Introduced 2000 · 4 papers

MODNet

MODNet is a light-weight matting objective decomposition network that can process portrait matting from a single input image in real time. The design of MODNet benefits from optimizing a series of correlated sub-objectives simultaneously via explicit constraints. To overcome the domain shift problem, MODNet introduces a self-supervised strategy based on sub-objective consistency (SOC) and a one-frame delay trick to smooth the results when applying MODNet to portrait video sequences. Given an input image I, MODNet predicts human semantics s_p, boundary details d_p, and the final alpha matte α_p through three interdependent branches S, D, and F, which are constrained by specific supervisions generated from the ground-truth matte α_g. Since the decomposed sub-objectives are correlated and help strengthen each other, MODNet can be optimized end-to-end.

Computer Vision · Introduced 2000 · 4 papers

Grid R-CNN

Grid R-CNN is an object detection framework where the traditional regression formulation is replaced by a grid point guided localization mechanism. Grid R-CNN divides the object bounding box region into grids and employs a fully convolutional network (FCN) to predict the locations of grid points. Owing to the position-sensitive property of fully convolutional architectures, Grid R-CNN maintains explicit spatial information, and grid point locations can be obtained at the pixel level. When a certain number of grid points at specified locations are known, the corresponding bounding box is definitely determined. Guided by the grid points, Grid R-CNN can determine a more accurate object bounding box than regression methods, which lack the guidance of explicit spatial information.

Computer Vision · Introduced 2000 · 4 papers

Spatial Attention Module (ThunderNet)

Spatial Attention Module (SAM) is a feature extraction module for object detection used in ThunderNet. The ThunderNet SAM explicitly re-weights the feature map before RoI warping over the spatial dimensions. The key idea of SAM is to use the knowledge from RPN to refine the feature distribution of the feature map. RPN is trained to recognize foreground regions under the supervision of ground truths. Therefore, the intermediate features in RPN can be used to distinguish foreground features from background features. SAM accepts two inputs: the intermediate feature map F_RPN from RPN and the thin feature map F_CEM from the Context Enhancement Module. The output of SAM is defined as F_SAM = F_CEM · sigmoid(T(F_RPN)). Here T(·) is a dimension transformation to match the number of channels in both feature maps, and the sigmoid function constrains the values within [0, 1]. At last, F_CEM is re-weighted by the generated attention map for a better feature distribution. For computational efficiency, a 1×1 convolution is used as T, so the computational cost of SAM is negligible. SAM has two functions. The first is to refine the feature distribution by strengthening foreground features and suppressing background features. The second is to stabilize the training of RPN, as SAM enables extra gradient flow from the R-CNN subnet to RPN. As a result, RPN receives additional supervision from the R-CNN subnet, which helps the training of RPN.
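
A minimal numpy sketch of the re-weighting F_SAM = F_CEM · sigmoid(T(F_RPN)), with the 1×1 convolution T modelled as a channel-mixing matrix (shapes and names are illustrative assumptions):

```python
import numpy as np

def sam(f_cem, f_rpn, w):
    # f_cem: (C_cem, H, W); f_rpn: (C_rpn, H, W)
    # w: (C_cem, C_rpn) models the 1x1 convolution T that matches channels
    t = np.einsum('oc,chw->ohw', w, f_rpn)
    # sigmoid constrains the attention map to [0, 1], then re-weights F_CEM
    return f_cem * (1.0 / (1.0 + np.exp(-t)))
```

Foreground positions, where RPN features are strongly activated, end up with attention values near 1, while background positions are suppressed toward 0.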

Computer Vision · Introduced 2000 · 4 papers

TridentNet Block

A TridentNet Block is a feature extractor used in object detection models. Instead of feeding in multi-scale inputs like the image pyramid, in a TridentNet block we adapt the backbone network for different scales. These blocks create multiple scale-specific feature maps. With the help of dilated convolutions, different branches of trident blocks have the same network structure and share the same parameters yet have different receptive fields. Furthermore, to avoid training objects with extreme scales, a scale-aware training scheme is employed to make each branch specific to a given scale range matching its receptive field. Weight sharing is used to prevent overfitting.

Computer Vision · Introduced 2000 · 4 papers

WenLan

WenLan proposes a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. A cross-modal pre-training model is defined based on the image-text retrieval task. The main goal is thus to learn two encoders that can embed image and text samples into the same space for effective image-text retrieval. To enforce such cross-modal embedding learning, contrastive learning with the InfoNCE loss is introduced into the BriVL model. Given a text embedding, the learning objective aims to find the best image embedding from a batch of image embeddings. Similarly, for a given image embedding, the learning objective is to find the best text embedding from a batch of text embeddings. The pre-training model learns a cross-modal embedding space by jointly training the image and text encoders to maximize the cosine similarity of the image and text embeddings of the true pair for each sample in the batch, while minimizing the cosine similarity of the embeddings of the other, incorrect pairs.
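
The contrastive objective described above can be sketched as a standard InfoNCE loss over a batch of matched image-text pairs; a minimal numpy illustration, not the BriVL implementation (which also uses a momentum-based negative queue):

```python
import numpy as np

def infonce(img, txt, tau=0.07):
    # img, txt: (B, d) embeddings; matched pairs share the same row index
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau                        # (B, B) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))         # true pair on the diagonal
```

The loss is minimized when each image is most similar to its own caption and dissimilar to every other caption in the batch; in practice it is applied symmetrically in both retrieval directions.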

Computer Vision · Introduced 2000 · 4 papers

STDC

Short-Term Dense Concatenate

STDC, or Short-Term Dense Concatenate, is a module for semantic segmentation that extracts deep features with a scalable receptive field and multi-scale information. It aims to remove structural redundancy in the BiSeNet architecture; specifically, BiSeNet adds an extra path to encode spatial information, which can be time-consuming. Instead, STDC gradually reduces the dimension of the feature maps and uses their aggregation for image representation. Response maps from multiple continuous layers are concatenated, each encoding the input image/feature at a different scale and receptive field, leading to a multi-scale feature representation. To speed up, the filter size of the layers is gradually reduced, with negligible loss in segmentation performance.

Computer Vision · Introduced 2000 · 4 papers

Focal Transformers

Focal self-attention is built to make Transformer layers scalable to high-resolution inputs. Instead of attending to all tokens at fine granularity, the approach attends to fine-grained tokens only locally and to summarized tokens globally. As such, it can cover as many regions as standard self-attention but at much lower cost. An image is first partitioned into patches, resulting in visual tokens. A patch embedding layer, consisting of a convolutional layer whose filter and stride have the same size, then projects the patches into hidden features. This spatial feature map is then passed to four stages of focal Transformer blocks, each consisting of several focal Transformer layers. Patch embedding layers are used in between to reduce the spatial size of the feature map by a factor of 2, while the feature dimension is doubled.

Computer Vision · Introduced 2000 · 4 papers

DE-GAN

DE-GAN: A Conditional Generative Adversarial Network for Document Enhancement

Documents often exhibit various forms of degradation, which make them hard to read and substantially deteriorate the performance of an OCR system. In this paper, we propose an effective end-to-end framework named Document Enhancement Generative Adversarial Networks (DE-GAN) that uses conditional GANs (cGANs) to restore severely degraded document images. To the best of our knowledge, this practice has not been studied within the context of generative adversarial deep networks. We demonstrate that, in different tasks (document clean-up, binarization, deblurring and watermark removal), DE-GAN can produce an enhanced version of the degraded document with high quality. In addition, our approach provides consistent improvements compared to state-of-the-art methods over the widely used DIBCO 2013, DIBCO 2017 and H-DIBCO 2018 datasets, proving its ability to restore a degraded document image to its ideal condition. The results obtained on a wide variety of degradations reveal the flexibility of the proposed model to be exploited in other document enhancement problems.

Computer Vision · Introduced 2000 · 4 papers

ThunderNet

ThunderNet is a two-stage object detection model. Its design targets the computationally expensive structures in state-of-the-art two-stage detectors. The backbone utilises a ShuffleNetV2-inspired network called SNet designed for object detection. In the detection part, ThunderNet follows the detection head design of Light-Head R-CNN, and further compresses the RPN and R-CNN subnet. To eliminate the performance degradation induced by small backbones and small feature maps, ThunderNet uses two new efficient architecture blocks, the Context Enhancement Module (CEM) and the Spatial Attention Module (SAM). CEM combines the feature maps from multiple scales to leverage local and global context information, while SAM uses the information learned in RPN to refine the feature distribution in RoI warping.

Computer Vision · Introduced 2000 · 4 papers

MAVL

Multiscale Attention ViT with Late fusion

Multiscale Attention ViT with Late fusion (MAVL) is a multi-modal network, trained with aligned image-text pairs, capable of performing targeted detection from human-understandable natural language text queries. It utilizes multi-scale image features and uses deformable convolutions with late multi-modal fusion. The authors demonstrate the excellent ability of MAVL as a class-agnostic object detector when queried with general natural language commands, such as "all objects", "all entities", etc.

Computer Vision · Introduced 2000 · 4 papers

SKNet

SKNet is a type of convolutional neural network that employs selective kernel units, with selective kernel convolutions, in its architecture. This allows for a type of attention where the network can learn to attend to different receptive fields.

Computer Vision · Introduced 2000 · 4 papers

LRNet

Local Relation Network

The Local Relation Network (LR-Net) is a network built with local relation layers, which together form an image feature extractor. This feature extractor adaptively determines aggregation weights based on the compositional relationship of local pixel pairs.

Computer Vision · Introduced 2000 · 4 papers

Feature-Centric Voting

Computer Vision · Introduced 2000 · 4 papers

PanNet

Pansharpening Network

We propose a deep network architecture for the pansharpening problem called PanNet. We incorporate domain-specific knowledge to design our PanNet architecture by focusing on the two aims of the pan-sharpening problem: spectral and spatial preservation. For spectral preservation, we add up-sampled multispectral images to the network output, which directly propagates the spectral information to the reconstructed image. To preserve the spatial structure, we train our network parameters in the high-pass filtering domain rather than the image domain. We show that the trained network generalizes well to images from different satellites without needing retraining. Experiments show significant improvement over state-of-the-art methods visually and in terms of standard quality metrics.

Computer Vision · Introduced 2000 · 4 papers

Position-Sensitive RoIAlign

Position-Sensitive RoIAlign is a position-sensitive version of RoIAlign: it performs selective alignment, allowing for the learning of position-sensitive region-of-interest aligning.

Computer Vision · Introduced 2000 · 4 papers

MUSIQ

MUSIQ, or Multi-scale Image Quality Transformer, is a Transformer-based model for multi-scale image quality assessment. It processes native-resolution images with varying sizes and aspect ratios. In MUSIQ, a multi-scale image representation is constructed as input, including the native-resolution image and its aspect-ratio-preserving (ARP) resized variants. Each image is split into fixed-size patches, which are embedded by a patch encoding module. To capture the 2D structure of the image and handle images of varying aspect ratios, the spatial embedding is encoded by hashing the patch position into a grid of learnable embeddings. A scale embedding is introduced to capture scale information. The Transformer encoder takes the input tokens and performs multi-head self-attention. To predict the image quality, MUSIQ follows a common strategy in Transformers and adds a [CLS] token to the sequence to represent the whole multi-scale input; the corresponding Transformer output is used as the final representation.
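
The hashed spatial embedding can be sketched as a simple position-to-grid mapping; grid size and names here are illustrative assumptions, not the paper's exact scheme:

```python
def spatial_hash(i, j, h, w, g):
    # Hash patch position (i, j) in an h x w patch grid into a g x g table
    # of learnable spatial embeddings; patches that land in the same cell
    # share an embedding, so any aspect ratio maps onto a fixed table.
    ti = min(i * g // h, g - 1)
    tj = min(j * g // w, g - 1)
    return ti * g + tj
```

Because the hash depends only on the relative position within the patch grid, images of any size or aspect ratio index into the same fixed-size embedding table.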

Computer Vision · Introduced 2000 · 4 papers

UNIMO

UNIMO is a multi-modal pre-training architecture that can effectively adapt to both single-modal and multi-modal understanding and generation tasks. UNIMO learns visual representations and textual representations simultaneously, and unifies them into the same semantic space via cross-modal contrastive learning (CMCL) based on a large-scale corpus of image collections, text corpora, and image-text pairs.

Computer Vision · Introduced 2000 · 4 papers

DG-Net

Discriminative and Generative Network

Computer Vision · Introduced 2000 · 4 papers

BASNet

Boundary-Aware Segmentation Network

BASNet, or Boundary-Aware Segmentation Network, is an image segmentation architecture for highly accurate segmentation that consists of a predict-refine architecture and a hybrid loss. The predict-refine architecture consists of a densely supervised encoder-decoder network and a residual refinement module, which are respectively used to predict and refine a segmentation probability map. The hybrid loss is a combination of binary cross-entropy, structural similarity, and intersection-over-union losses, which guide the network to learn three-level (i.e., pixel-, patch- and map-level) hierarchy representations.

Computer Vision · Introduced 2000 · 4 papers

HTCN

Hierarchical Transferability Calibration Network

Hierarchical Transferability Calibration Network (HTCN) is an adaptive object detector that hierarchically (local-region/image/instance) calibrates the transferability of feature representations for harmonizing transferability and discriminability. The proposed model consists of three components: (1) Importance Weighted Adversarial Training with input Interpolation (IWAT-I), which strengthens the global discriminability by re-weighting the interpolated image-level features; (2) Context-aware Instance-Level Alignment (CILA) module, which enhances the local discriminability by capturing the complementary effect between the instance-level feature and the global context information for the instance-level feature alignment; (3) local feature masks that calibrate the local transferability to provide semantic guidance for the following discriminative pattern alignment.

Computer Vision · Introduced 2000 · 4 papers

PVTv2

Pyramid Vision Transformer v2

Pyramid Vision Transformer v2 (PVTv2) is a type of Vision Transformer for detection and segmentation tasks. It improves on PVTv1 through several design improvements: (1) overlapping patch embedding, (2) convolutional feed-forward networks, and (3) linear complexity attention layers that are orthogonal to the PVTv1 framework.

Computer Vision · Introduced 2000 · 4 papers

Local Patch Interaction

Local Patch Interaction, or LPI, is a module used for the XCiT layer to enable explicit communication across patches. LPI consists of two depth-wise 3×3 convolutional layers with Batch Normalization and GELU non-linearity in between. Due to its depth-wise structure, the LPI block has a negligible overhead in terms of parameters, as well as a limited overhead in terms of throughput and memory usage during inference.
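
The depth-wise structure that keeps LPI cheap can be sketched in numpy; a naive, loop-based illustration (one 3×3 filter per channel, so parameters scale with C rather than C²):

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    # x: (C, H, W); kernels: (C, 3, 3) -- one filter per channel,
    # zero-padded so the spatial size is preserved
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x, dtype=float)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + 3, j:j + 3] * kernels[c])
    return out
```

Stacking two such layers with a non-linearity in between, as LPI does, lets neighbouring patches exchange information at negligible parameter cost.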

Computer Vision · Introduced 2000 · 4 papers

PP-YOLO

PP-YOLO is an object detector based on YOLOv3. It mainly combines various existing tricks that add almost no model parameters or FLOPs, with the goal of improving detector accuracy as much as possible while keeping the speed almost unchanged. Some of these changes include:

- Replacing the DarkNet-53 backbone with ResNet50-vd; some of the convolutional layers in ResNet50-vd are also replaced with deformable convolutional layers.
- A larger batch size, changed from 64 to 192.
- An exponential moving average of the parameters.
- DropBlock applied to the FPN.
- An IoU loss.
- An IoU prediction branch to measure the accuracy of localization.
- Grid Sensitive, similar to YOLOv4.
- Matrix NMS.
- CoordConv for the FPN, replacing the 1×1 convolution layer, and also for the first convolution layer in the detection head.
- Spatial Pyramid Pooling on the top feature map.
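
One of the cheapest tricks in the list, the exponential moving average of the parameters, can be sketched in a few lines; a generic illustration, not PP-YOLO's code (the decay value is a typical assumption):

```python
import numpy as np

class EMA:
    # keeps "shadow" copies of parameters, updated as an exponential
    # moving average of the live training values
    def __init__(self, params, decay=0.9998):
        self.decay = decay
        self.shadow = {k: np.array(v, dtype=float) for k, v in params.items()}

    def update(self, params):
        for k, v in params.items():
            self.shadow[k] = (self.decay * self.shadow[k]
                              + (1 - self.decay) * np.asarray(v, dtype=float))
```

At evaluation time, the shadow parameters replace the live ones; averaging smooths out the noise of the final optimization steps at no inference cost.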

Computer Vision · Introduced 2000 · 4 papers

FreeAnchor

FreeAnchor is an anchor supervision method for object detection. Many CNN-based object detectors assign anchors for ground-truth objects under the restriction of object-anchor Intersection-over-Union (IoU). In contrast, FreeAnchor is a learning-to-match approach that breaks the IoU restriction, allowing objects to match anchors in a flexible manner. It updates hand-crafted anchor assignment to free anchor matching by formulating detector training as a maximum likelihood estimation (MLE) procedure. FreeAnchor targets learning features which best explain a class of objects in terms of both classification and localization.

Computer Vision · Introduced 2000 · 3 papers

Vokenization

Vokenization is an approach for extrapolating multimodal alignments to language-only data by contextually mapping language tokens to their related images ("vokens") by retrieval. Instead of directly supervising the language model with visually grounded language datasets (e.g., MS COCO), these relatively small datasets are used to train the vokenization processor (i.e. the vokenizer). Vokens are generated for large language corpora (e.g., English Wikipedia), and the visually supervised language model takes its input supervision from these large datasets, thus bridging the gap between different data sources.
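
The retrieval step at the heart of the vokenizer can be sketched as nearest-image lookup by cosine similarity; a minimal numpy illustration with assumed pre-computed embeddings, not the paper's trained model:

```python
import numpy as np

def vokenize(token_embs, image_embs):
    # token_embs: (T, d) contextual token embeddings
    # image_embs: (V, d) embeddings of a fixed image set
    # returns, for each token, the index of its nearest image ("voken")
    t = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return np.argmax(t @ v.T, axis=1)
```

The retrieved indices then serve as classification targets when training the visually supervised language model on text-only corpora.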

Computer Vision · Introduced 2000 · 3 papers

Composite Fields

Composite Fields represent and associate semantic entities with a composite of primitive fields.

Computer Vision · Introduced 2000 · 3 papers

PFGM

Poisson Flow Generative Models

Computer Vision · Introduced 2000 · 3 papers

Blended Diffusion

Blended Diffusion enables zero-shot, local, text-guided editing of natural images. Given an input image, an input mask, and a target guiding text, the method changes the masked area within the image to correspond to the guiding text such that the unmasked area is left unchanged.

Computer Vision · Introduced 2000 · 3 papers

TridentNet

TridentNet is an object detection architecture that aims to generate scale-specific feature maps with a uniform representational power. A parallel multi-branch architecture is constructed in which each branch shares the same transformation parameters but with different receptive fields. A scale-aware training scheme is used to specialize each branch by sampling object instances of proper scales for training.

Computer Vision · Introduced 2000 · 3 papers
Page 7 of 56