Jiang-Xin Shi, Tong Wei, Zhi Zhou, Jie-Jing Shao, Xin-Yan Han, Yu-Feng Li
The fine-tuning paradigm for addressing long-tail learning tasks has attracted significant interest since the emergence of foundation models. Nonetheless, how fine-tuning affects performance in long-tail learning has not been explicitly quantified. In this paper, we show that heavy fine-tuning can even cause non-negligible performance deterioration on tail classes, whereas lightweight fine-tuning is more effective. We attribute this to the inconsistent class conditional distributions induced by heavy fine-tuning. Building on this observation, we develop LIFT, a low-complexity and accurate long-tail learning algorithm that enables fast prediction and compact models through adaptive lightweight fine-tuning. Experiments clearly verify that both the training time and the number of learned parameters are substantially reduced, while predictive performance surpasses state-of-the-art approaches. The implementation code is available at https://github.com/shijxcs/LIFT.
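The abstract does not spell out LIFT's implementation, but the core idea it names — lightweight fine-tuning, i.e. keeping the pre-trained backbone frozen and training only a small set of parameters — can be illustrated with a minimal sketch. The example below is hypothetical and not the paper's method: it trains only a linear classifier head (softmax regression via gradient descent) on top of fixed "backbone" features, which here are simulated by synthetic clusters.

```python
import numpy as np

def train_linear_head(features, labels, num_classes, lr=0.1, epochs=200):
    """Train only a linear classifier head on frozen backbone features.

    The backbone is never updated -- this mimics lightweight fine-tuning,
    where only a small fraction of parameters is learned.
    """
    n, d = features.shape
    W = np.zeros((d, num_classes))  # the only trainable parameters
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # softmax cross-entropy gradient
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Toy stand-in for frozen features: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(-2, 0.5, (20, 4)), rng.normal(2, 0.5, (20, 4))])
labels = np.array([0] * 20 + [1] * 20)
W, b = train_linear_head(feats, labels, num_classes=2)
preds = (feats @ W + b).argmax(axis=1)
accuracy = (preds == labels).mean()
```

Because the backbone features are fixed, only `d × num_classes + num_classes` parameters are optimized, which is what makes this style of fine-tuning fast to train and cheap to store.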
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Classification | ImageNet-LT | Top-1 Accuracy | 82.9 | LIFT (ViT-L/14) |
| Image Classification | ImageNet-LT | Top-1 Accuracy | 78.3 | LIFT (ViT-B/16) |
| Image Classification | Places-LT | Top-1 Accuracy | 53.7 | LIFT (ViT-L/14) |
| Image Classification | Places-LT | Top-1 Accuracy | 52.2 | LIFT (ViT-B/16) |
| Image Classification | CIFAR-100-LT (ρ=10) | Error Rate | 8.7 | LIFT (ViT-B/16, ImageNet-21K pre-training) |
| Image Classification | CIFAR-100-LT (ρ=10) | Error Rate | 15.1 | LIFT (ViT-B/16, CLIP) |
| Image Classification | CIFAR-100-LT (ρ=50) | Error Rate | 9.8 | LIFT (ViT-B/16, ImageNet-21K pre-training) |
| Image Classification | CIFAR-100-LT (ρ=50) | Error Rate | 16.9 | LIFT (ViT-B/16, CLIP) |
| Image Classification | CIFAR-100-LT (ρ=100) | Error Rate | 10.9 | LIFT (ViT-B/16, ImageNet-21K pre-training) |
| Image Classification | CIFAR-100-LT (ρ=100) | Error Rate | 18.3 | LIFT (ViT-B/16, CLIP) |