An Extensible Multimodal Multi-task Object Dataset with Materials
Trevor Standley, Ruohan Gao, Dawn Chen, Jiajun Wu, Silvio Savarese
arXiv.org Artificial Intelligence
We present EMMa, an Extensible, Multimodal dataset of Amazon product listings that contains rich Material annotations. It contains more than 2.8 million objects, each with image(s), listing text, mass, price, product ratings, and a position in Amazon's product-category taxonomy. Each object is annotated with one or more materials from a materials taxonomy. With the numerous attributes available for each object, we develop a Smart Labeling framework to quickly add new binary labels to all objects with very little manual labeling effort, making the dataset extensible. Each object attribute in our dataset can be included in either the model inputs or outputs, leading to combinatorial possibilities in task configurations. For example, we can train a model to predict the object category from the listing text, or the mass and price from the product listing image. EMMa offers a new benchmark for multi-task learning in computer vision and NLP, and allows practitioners to efficiently add new tasks and object attributes at scale.

Perhaps the biggest problem faced by machine learning practitioners today is that of producing labeled datasets for their specific needs. Manually labeling large amounts of data is time-consuming and costly (Deng et al., 2009; Lin et al., 2014; Kuznetsova et al., 2020). Furthermore, it is often not possible to communicate to the human annotators we typically rely on how the many ambiguous corner cases should be handled (e.g., is a hole puncher "sharp"?).

Could we solve this problem with the aid of machine learning? We hypothesized that, given a rich dataset with substantial information about every instance, we could accurately add new properties to every instance in a semi-automated fashion. Consequently, we developed EMMa, a large, object-centric, multimodal, and multi-task dataset. We show that EMMa can easily be extended with any number of new object labels using a Smart Labeling technique we developed for large multi-task and multimodal datasets. Multi-task datasets contain labels for more than one attribute per instance, whereas multimodal datasets contain data from more than one modality, such as images, text, audio, and tabular data. Derived from Amazon product listings, EMMa contains images, text, and a number of useful attributes, such as materials, mass, price, product category, and product ratings. Each attribute can be used as either a model input or a model output.
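The abstract describes Smart Labeling only at a high level: a new binary attribute (e.g., "sharp") is propagated to millions of objects from a small amount of manual effort. One plausible realization is an uncertainty-sampling active-learning loop over per-object feature vectors. The sketch below is illustrative only; `features`, `ask_human`, the logistic-regression learner, and all parameter names are assumptions for the sake of the example, not the authors' implementation.

```python
# Hypothetical sketch of a Smart-Labeling-style loop (NOT the authors' code).
# A simple classifier over per-object feature vectors is seeded with a few
# manual labels; each round, a human labels only the most uncertain objects,
# and the final model's predictions label the rest of the dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

def smart_label(features, seed_idx, seed_labels, ask_human,
                rounds=5, budget=100):
    """Propagate a new binary label to every object.

    features    : (n_objects, d) array of per-object embeddings/attributes
    seed_idx    : indices of the initially hand-labeled objects
    seed_labels : their 0/1 labels
    ask_human   : callback taking a list of indices, returning their labels
    """
    labeled_idx = list(seed_idx)
    labels = list(seed_labels)
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(features[labeled_idx], labels)
        proba = clf.predict_proba(features)[:, 1]
        # Uncertainty sampling: query objects nearest the decision boundary.
        order = np.argsort(np.abs(proba - 0.5))  # most uncertain first
        seen = set(labeled_idx)
        batch = [int(i) for i in order if i not in seen][:budget]
        labels += list(ask_human(batch))
        labeled_idx += batch
    # Model predictions become the label for every remaining object.
    return clf.predict(features)
```

In EMMa's setting, the per-object features could plausibly be drawn from the images, listing text, and tabular attributes already present for every object, which is what makes a small manual labeling budget credible.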
Apr-29-2023