Rylan Conway, Lambert Mathias
In a spoken dialogue system, dialogue state tracker (DST) components track the state of the conversation by updating a distribution over the values of each tracked slot at the current user turn, using the interactions up to that point. Much of the previous work has relied on modeling the natural order of the conversation, using distance-based offsets as an approximation of time. In this work, we hypothesize that leveraging the wall-clock temporal difference between turns is crucial for finer-grained control of dialogue scenarios. We develop a novel approach that applies a *time mask*, based on the wall-clock time difference, to the associated slot embeddings, and empirically demonstrate that it outperforms existing approaches that leverage distance offsets on both an internal benchmark dataset and DSTC2.
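The abstract does not specify the functional form of the time mask, so the following is only a minimal sketch of the general idea: a per-dimension gate computed from the wall-clock gap between turns and multiplied element-wise into a slot embedding. The sigmoid gate, the log compression of the gap, and the names `time_mask`, `apply_time_mask`, `weights`, and `biases` are all assumptions made for illustration, not the authors' formulation.

```python
import math

def time_mask(delta_seconds, weights, biases):
    """Per-dimension gate in (0, 1) computed from a wall-clock gap.

    Assumed form (not from the paper): sigmoid(w * log(1 + delta) + b).
    """
    # Log-compress the gap so that seconds and hours fall on a comparable scale.
    x = math.log1p(max(delta_seconds, 0.0))
    return [1.0 / (1.0 + math.exp(-(w * x + b))) for w, b in zip(weights, biases)]

def apply_time_mask(slot_embedding, mask):
    """Element-wise product of a slot embedding with its time mask."""
    if len(slot_embedding) != len(mask):
        raise ValueError("embedding and mask must have the same dimension")
    return [e * m for e, m in zip(slot_embedding, mask)]

# Hand-picked illustrative parameters; in a trained model these would be learned.
weights = [-1.0, -0.5, 0.0]
biases = [4.0, 2.0, 0.0]
embedding = [1.0, 1.0, 1.0]

# The same slot embedding, seen 5 seconds ago versus one hour ago.
recent = apply_time_mask(embedding, time_mask(5.0, weights, biases))
stale = apply_time_mask(embedding, time_mask(3600.0, weights, biases))
```

With negative weights, a dimension is attenuated as the gap grows, so a slot mentioned an hour ago contributes less than one mentioned seconds ago; a zero weight leaves that dimension insensitive to elapsed time. A distance-offset baseline would instead key the gate on the number of intervening turns, which cannot distinguish a five-second pause from an hour-long one.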
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | SegTrack v2 | Average MAE | 0.116 | TIMP |
| Video | SegTrack v2 | S-measure | 0.644 | TIMP |
| Video | SegTrack v2 | Max E-measure | 0.768 | TIMP |
| Video | MCL | Average MAE | 0.113 | TIMP |
| Video | MCL | Max E-measure | 0.76 | TIMP |
| Video | MCL | S-measure | 0.642 | TIMP |
| Object Detection | SegTrack v2 | Average MAE | 0.116 | TIMP |
| Object Detection | SegTrack v2 | S-measure | 0.644 | TIMP |
| Object Detection | SegTrack v2 | Max E-measure | 0.768 | TIMP |
| Object Detection | MCL | Average MAE | 0.113 | TIMP |
| Object Detection | MCL | Max E-measure | 0.76 | TIMP |
| Object Detection | MCL | S-measure | 0.642 | TIMP |
| 3D | SegTrack v2 | Average MAE | 0.116 | TIMP |
| 3D | SegTrack v2 | S-measure | 0.644 | TIMP |
| 3D | SegTrack v2 | Max E-measure | 0.768 | TIMP |
| 3D | MCL | Average MAE | 0.113 | TIMP |
| 3D | MCL | Max E-measure | 0.76 | TIMP |
| 3D | MCL | S-measure | 0.642 | TIMP |
| Video Object Segmentation | SegTrack v2 | Average MAE | 0.116 | TIMP |
| Video Object Segmentation | SegTrack v2 | S-measure | 0.644 | TIMP |
| Video Object Segmentation | SegTrack v2 | Max E-measure | 0.768 | TIMP |
| Video Object Segmentation | MCL | Average MAE | 0.113 | TIMP |
| Video Object Segmentation | MCL | Max E-measure | 0.76 | TIMP |
| Video Object Segmentation | MCL | S-measure | 0.642 | TIMP |
| RGB Salient Object Detection | SegTrack v2 | Average MAE | 0.116 | TIMP |
| RGB Salient Object Detection | SegTrack v2 | S-measure | 0.644 | TIMP |
| RGB Salient Object Detection | SegTrack v2 | Max E-measure | 0.768 | TIMP |
| RGB Salient Object Detection | MCL | Average MAE | 0.113 | TIMP |
| RGB Salient Object Detection | MCL | Max E-measure | 0.76 | TIMP |
| RGB Salient Object Detection | MCL | S-measure | 0.642 | TIMP |
| 2D Classification | SegTrack v2 | Average MAE | 0.116 | TIMP |
| 2D Classification | SegTrack v2 | S-measure | 0.644 | TIMP |
| 2D Classification | SegTrack v2 | Max E-measure | 0.768 | TIMP |
| 2D Classification | MCL | Average MAE | 0.113 | TIMP |
| 2D Classification | MCL | Max E-measure | 0.76 | TIMP |
| 2D Classification | MCL | S-measure | 0.642 | TIMP |
| 2D Object Detection | SegTrack v2 | Average MAE | 0.116 | TIMP |
| 2D Object Detection | SegTrack v2 | S-measure | 0.644 | TIMP |
| 2D Object Detection | SegTrack v2 | Max E-measure | 0.768 | TIMP |
| 2D Object Detection | MCL | Average MAE | 0.113 | TIMP |
| 2D Object Detection | MCL | Max E-measure | 0.76 | TIMP |
| 2D Object Detection | MCL | S-measure | 0.642 | TIMP |
| 16k | SegTrack v2 | Average MAE | 0.116 | TIMP |
| 16k | SegTrack v2 | S-measure | 0.644 | TIMP |
| 16k | SegTrack v2 | Max E-measure | 0.768 | TIMP |
| 16k | MCL | Average MAE | 0.113 | TIMP |
| 16k | MCL | Max E-measure | 0.76 | TIMP |
| 16k | MCL | S-measure | 0.642 | TIMP |