SoftSPICE

Reported on 1 benchmark across 1 task · 1 paper

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Reasoning · 1 result

Human Judgment Correlation on Flickr8k-Expert
Kendall's Tau-c · 2023-05-27
54.2
best: 54.9 (MID)
FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing arXiv:2305.17497