RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments

1 Institute of Information Science, Beijing Jiaotong University
2 Beijing Key Laboratory of Advanced Information Science and Network Technology
3 Wuhan University
4 JD Explore Academy
5 Sun Yat-sen University
6 MBZUAI
7 University of Electronic Science and Technology of China

Abstract

Intention-oriented object detection aims to detect desired objects based on specific intentions or requirements. For instance, when we desire to "lie down and rest", we instinctively seek out a suitable option such as a "bed" or a "sofa" that can fulfill our needs. Previous work in this area is limited either by the number of intention descriptions or by the affordance vocabulary available for intention objects. These limitations make it difficult to handle intentions effectively in open environments. To facilitate research on this problem, we construct a comprehensive dataset called Reasoning Intention-Oriented Objects (RIO).

RIO is designed to cover diverse real-world scenarios and a wide range of object categories. It offers the following key features: 1) intention descriptions in RIO are natural sentences rather than a single word or verb phrase, making them more practical and meaningful; 2) the intention descriptions are contextually relevant to the scene, enabling a broader range of potential functionalities associated with the objects; 3) the dataset comprises a total of 40,214 images and 130,585 intention-object pairs.

With the proposed RIO, we evaluate the ability of several existing models to reason about intention-oriented objects in open environments. We hope RIO can promote deeper research in intention-oriented object detection, bridging the gap between traditional detection methods and real-world user intentions. By offering a rich dataset that emphasizes natural sentences and contextual relevance, RIO challenges models not only to detect objects, but also to understand the broader implications of a scene.

Intention Sentence

Intention descriptions in RIO are represented as natural sentences rather than a word or phrase, making them more practical and meaningful.

Scene-Related

Descriptions are contextually relevant to the scene, enabling a broader range of potential functionalities associated with the objects.

Open Environment

We collect intention descriptions for the associated objects in open environments, with no restriction on syntax.
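
To make the notion of an intention-object pair concrete, here is a minimal Python sketch of how one such pair might be held in memory and grouped by intention. The class `IntentionObjectPair`, its field names, the image ID, and the box coordinates are all illustrative assumptions; they are not the released RIO annotation schema. Only the "lie down and rest" example and the dataset totals come from the text above.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical record layout: field names are assumptions for illustration,
# not the actual RIO annotation format.
@dataclass(frozen=True)
class IntentionObjectPair:
    image_id: str
    intention: str      # a full natural-language sentence, not a verb phrase
    object_label: str   # category name of an object that fulfils the intention
    bbox: tuple[float, float, float, float]  # assumed (x, y, width, height)


def group_by_intention(pairs):
    """Map (image_id, intention) to every object annotated for that intention."""
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[(pair.image_id, pair.intention)].append(pair)
    return dict(grouped)


# Made-up sample records built around the abstract's own example.
pairs = [
    IntentionObjectPair("img_0001", "I want to lie down and rest.",
                        "bed", (40.0, 120.0, 300.0, 180.0)),
    IntentionObjectPair("img_0001", "I want to lie down and rest.",
                        "sofa", (380.0, 150.0, 220.0, 140.0)),
]

grouped = group_by_intention(pairs)
labels = [p.object_label
          for p in grouped[("img_0001", "I want to lie down and rest.")]]
print(labels)  # ['bed', 'sofa']

# Average number of intention-object pairs per image, from the stated totals.
print(round(130_585 / 40_214, 2))  # 3.25
```

Grouping by `(image_id, intention)` reflects the fact that a single intention can be satisfied by several objects in the same scene, which is why the number of pairs exceeds the number of images by a factor of roughly three.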