Xin Wang, Wenhu Chen, Yuan-Fang Wang, William Yang Wang
Though impressive results have been achieved in visual captioning, the task of generating abstract stories from photo streams is still a little-tapped problem. Unlike captions, stories have more expressive language styles and contain many imaginary concepts that do not appear in the images. This poses challenges to behavioral cloning algorithms. Furthermore, due to the limitations of automatic metrics in evaluating story quality, reinforcement learning methods with hand-crafted rewards also have difficulty achieving an overall performance boost. Therefore, we propose an Adversarial REward Learning (AREL) framework to learn an implicit reward function from human demonstrations, and then optimize policy search with the learned reward function. Though automatic evaluation indicates only a slight performance boost over state-of-the-art (SOTA) methods in cloning expert behaviors, human evaluation shows that our approach achieves significant improvement in generating more human-like stories than SOTA systems.
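The alternation described in the abstract (learn a reward from human demonstrations, then optimize the policy against that learned reward) can be illustrated with a toy sketch. This is not the AREL model itself: the set of four candidate "stories", the per-story tabular reward, the `tanh` bounding, the learning rate, and the function name `toy_adversarial_reward_learning` are all assumptions made purely for illustration, and the real system works with neural networks over image sequences and generated text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_adversarial_reward_learning(human_dist, steps=200, lr=0.5):
    """Alternate reward and policy updates on a tiny discrete story space.

    human_dist[i] is the (assumed) frequency of story i among human
    demonstrations. Both the reward and the policy are simple tables.
    """
    k = len(human_dist)
    reward_params = [0.0] * k   # hypothetical: one scalar per candidate story
    policy_logits = [0.0] * k   # policy starts uniform

    for _ in range(steps):
        policy = softmax(policy_logits)

        # Reward step: push reward up on stories humans wrote and down on
        # stories the current policy produces (tanh keeps it in [-1, 1]).
        for i in range(k):
            slope = 1.0 - math.tanh(reward_params[i]) ** 2
            reward_params[i] += lr * (human_dist[i] - policy[i]) * slope

        rewards = [math.tanh(w) for w in reward_params]

        # Policy step: exact gradient of the expected learned reward with
        # respect to the softmax logits, using the mean reward as baseline.
        baseline = sum(p * r for p, r in zip(policy, rewards))
        for i in range(k):
            policy_logits[i] += lr * policy[i] * (rewards[i] - baseline)

    return softmax(policy_logits), [math.tanh(w) for w in reward_params]


if __name__ == "__main__":
    # Assumed demonstration frequencies; story 3 is never written by humans.
    policy, rewards = toy_adversarial_reward_learning([0.7, 0.2, 0.1, 0.0])
    print("policy :", [round(p, 3) for p in policy])
    print("rewards:", [round(r, 3) for r in rewards])
```

In this sketch the reward table is never given a hand-crafted metric; it is shaped only by the gap between human and policy behavior, which is the general idea the abstract attributes to learning an implicit reward function.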
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Text Generation | VIST | BLEU-1 | 63.8 | AREL-t-100 |
| Text Generation | VIST | BLEU-2 | 39.1 | AREL-t-100 |
| Text Generation | VIST | BLEU-3 | 23.2 | AREL-t-100 |
| Text Generation | VIST | BLEU-4 | 14.1 | AREL-t-100 |
| Text Generation | VIST | CIDEr | 9.4 | AREL-t-100 |
| Text Generation | VIST | METEOR | 35 | AREL-t-100 |
| Text Generation | VIST | ROUGE-L | 29.5 | AREL-t-100 |
| Text Generation | VIST | BLEU-1 | 62.8 | GAN |
| Text Generation | VIST | BLEU-2 | 38.8 | GAN |
| Text Generation | VIST | BLEU-3 | 23 | GAN |
| Text Generation | VIST | BLEU-4 | 14 | GAN |
| Text Generation | VIST | CIDEr | 9 | GAN |
| Text Generation | VIST | METEOR | 35 | GAN |
| Text Generation | VIST | ROUGE-L | 29.5 | GAN |
| Text Generation | VIST | BLEU-1 | 62.3 | XE-ss |
| Text Generation | VIST | BLEU-2 | 38.2 | XE-ss |
| Text Generation | VIST | BLEU-3 | 22.5 | XE-ss |
| Text Generation | VIST | BLEU-4 | 13.7 | XE-ss |
| Text Generation | VIST | CIDEr | 8.7 | XE-ss |
| Text Generation | VIST | METEOR | 34.8 | XE-ss |
| Text Generation | VIST | ROUGE-L | 29.7 | XE-ss |
| Data-to-Text Generation | VIST | BLEU-1 | 63.8 | AREL-t-100 |
| Data-to-Text Generation | VIST | BLEU-2 | 39.1 | AREL-t-100 |
| Data-to-Text Generation | VIST | BLEU-3 | 23.2 | AREL-t-100 |
| Data-to-Text Generation | VIST | BLEU-4 | 14.1 | AREL-t-100 |
| Data-to-Text Generation | VIST | CIDEr | 9.4 | AREL-t-100 |
| Data-to-Text Generation | VIST | METEOR | 35 | AREL-t-100 |
| Data-to-Text Generation | VIST | ROUGE-L | 29.5 | AREL-t-100 |
| Data-to-Text Generation | VIST | BLEU-1 | 62.8 | GAN |
| Data-to-Text Generation | VIST | BLEU-2 | 38.8 | GAN |
| Data-to-Text Generation | VIST | BLEU-3 | 23 | GAN |
| Data-to-Text Generation | VIST | BLEU-4 | 14 | GAN |
| Data-to-Text Generation | VIST | CIDEr | 9 | GAN |
| Data-to-Text Generation | VIST | METEOR | 35 | GAN |
| Data-to-Text Generation | VIST | ROUGE-L | 29.5 | GAN |
| Data-to-Text Generation | VIST | BLEU-1 | 62.3 | XE-ss |
| Data-to-Text Generation | VIST | BLEU-2 | 38.2 | XE-ss |
| Data-to-Text Generation | VIST | BLEU-3 | 22.5 | XE-ss |
| Data-to-Text Generation | VIST | BLEU-4 | 13.7 | XE-ss |
| Data-to-Text Generation | VIST | CIDEr | 8.7 | XE-ss |
| Data-to-Text Generation | VIST | METEOR | 34.8 | XE-ss |
| Data-to-Text Generation | VIST | ROUGE-L | 29.7 | XE-ss |
| Visual Storytelling | VIST | BLEU-1 | 63.8 | AREL-t-100 |
| Visual Storytelling | VIST | BLEU-2 | 39.1 | AREL-t-100 |
| Visual Storytelling | VIST | BLEU-3 | 23.2 | AREL-t-100 |
| Visual Storytelling | VIST | BLEU-4 | 14.1 | AREL-t-100 |
| Visual Storytelling | VIST | CIDEr | 9.4 | AREL-t-100 |
| Visual Storytelling | VIST | METEOR | 35 | AREL-t-100 |
| Visual Storytelling | VIST | ROUGE-L | 29.5 | AREL-t-100 |
| Visual Storytelling | VIST | BLEU-1 | 62.8 | GAN |
| Visual Storytelling | VIST | BLEU-2 | 38.8 | GAN |
| Visual Storytelling | VIST | BLEU-3 | 23 | GAN |
| Visual Storytelling | VIST | BLEU-4 | 14 | GAN |
| Visual Storytelling | VIST | CIDEr | 9 | GAN |
| Visual Storytelling | VIST | METEOR | 35 | GAN |
| Visual Storytelling | VIST | ROUGE-L | 29.5 | GAN |
| Visual Storytelling | VIST | BLEU-1 | 62.3 | XE-ss |
| Visual Storytelling | VIST | BLEU-2 | 38.2 | XE-ss |
| Visual Storytelling | VIST | BLEU-3 | 22.5 | XE-ss |
| Visual Storytelling | VIST | BLEU-4 | 13.7 | XE-ss |
| Visual Storytelling | VIST | CIDEr | 8.7 | XE-ss |
| Visual Storytelling | VIST | METEOR | 34.8 | XE-ss |
| Visual Storytelling | VIST | ROUGE-L | 29.7 | XE-ss |
| Story Generation | VIST | BLEU-1 | 63.8 | AREL-t-100 |
| Story Generation | VIST | BLEU-2 | 39.1 | AREL-t-100 |
| Story Generation | VIST | BLEU-3 | 23.2 | AREL-t-100 |
| Story Generation | VIST | BLEU-4 | 14.1 | AREL-t-100 |
| Story Generation | VIST | CIDEr | 9.4 | AREL-t-100 |
| Story Generation | VIST | METEOR | 35 | AREL-t-100 |
| Story Generation | VIST | ROUGE-L | 29.5 | AREL-t-100 |
| Story Generation | VIST | BLEU-1 | 62.8 | GAN |
| Story Generation | VIST | BLEU-2 | 38.8 | GAN |
| Story Generation | VIST | BLEU-3 | 23 | GAN |
| Story Generation | VIST | BLEU-4 | 14 | GAN |
| Story Generation | VIST | CIDEr | 9 | GAN |
| Story Generation | VIST | METEOR | 35 | GAN |
| Story Generation | VIST | ROUGE-L | 29.5 | GAN |
| Story Generation | VIST | BLEU-1 | 62.3 | XE-ss |
| Story Generation | VIST | BLEU-2 | 38.2 | XE-ss |
| Story Generation | VIST | BLEU-3 | 22.5 | XE-ss |
| Story Generation | VIST | BLEU-4 | 13.7 | XE-ss |
| Story Generation | VIST | CIDEr | 8.7 | XE-ss |
| Story Generation | VIST | METEOR | 34.8 | XE-ss |
| Story Generation | VIST | ROUGE-L | 29.7 | XE-ss |