RoboRefIt
Introduced 2023-08-01
We contribute RoboRefIt, a new and challenging visual grounding dataset for robotic perception and reasoning in indoor environments. RoboRefIt collects 10,872 real-world RGB and depth images from cluttered daily-life scenes and provides 50,758 referring expressions phrased as robot language instructions. Moreover, nearly half of the images involve ambiguous object recognition. We hope that RoboRefIt serves as a distinctive training bed for visual grounding tasks in robotic interactive grasping.
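As a rough illustration of how such a dataset is typically organized, one sample pairs an RGB image and its aligned depth map with a referring expression that localizes the target object. This is only a hypothetical sketch: the class, field names, and paths below are assumptions for illustration, not RoboRefIt's actual schema or loader.

```python
from dataclasses import dataclass

@dataclass
class RGBDGroundingSample:
    """Hypothetical record layout for one RGB-D referring-expression sample.

    All field names here are illustrative assumptions, not the
    dataset's actual annotation format.
    """
    rgb_path: str    # path to the real-world RGB image
    depth_path: str  # path to the aligned depth image
    expression: str  # robot language instruction referring to one object
    bbox: tuple      # (x, y, w, h) box of the referred object, in pixels

# Example instance with made-up paths and values.
sample = RGBDGroundingSample(
    rgb_path="scenes/kitchen_0001/rgb.png",
    depth_path="scenes/kitchen_0001/depth.png",
    expression="grasp the red cup next to the kettle",
    bbox=(120, 85, 64, 72),
)
print(sample.expression)
```

A grounding model trained on such samples takes the image pair and the expression as input and predicts the box (or mask) of the referred object, which a robot can then use to plan a grasp.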