referential
GUMBridge: a Corpus for Varieties of Bridging Anaphora
Bridging is an anaphoric phenomenon where the referent of an entity in a discourse is dependent on a previous, non-identical entity for interpretation, such as in "There is 'a house'. 'The door' is red," where the door is specifically understood to be the door of the aforementioned house. While there are several existing resources in English for bridging anaphora, most are small, provide limited coverage of the phenomenon, and/or provide limited genre coverage. In this paper, we introduce GUMBridge, a new resource for bridging, which includes 16 diverse genres of English, providing both broad coverage for the phenomenon and granular annotations for the subtype categorization of bridging varieties. We also present an evaluation of annotation quality and report on baseline performance using open and closed source contemporary LLMs on three tasks underlying our data, showing that bridging resolution and subtype classification remain difficult NLP tasks in the age of LLMs.
SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models
Zantout, Nader, Zhang, Haochen, Kachana, Pujith, Qiu, Jinkai, Chen, Guofei, Zhang, Ji, Wang, Wenshan
Interpreting object-referential language and grounding objects in 3D with spatial relations and attributes is essential for robots operating alongside humans. However, this task is often challenging due to the diversity of scenes, large number of fine-grained objects, and complex free-form nature of language references. Furthermore, in the 3D domain, obtaining large amounts of natural language training data is difficult. Thus, it is important for methods to learn from little data and zero-shot generalize to new environments. To address these challenges, we propose SORT3D, an approach that utilizes rich object attributes from 2D data and merges a heuristics-based spatial reasoning toolbox with the ability of large language models (LLMs) to perform sequential reasoning. Importantly, our method does not require text-to-3D data for training and can be applied zero-shot to unseen environments. We show that SORT3D achieves state-of-the-art zero-shot performance on complex view-dependent grounding tasks on two benchmarks. We also implement the pipeline to run real-time on two autonomous vehicles and demonstrate that our approach can be used for object-goal navigation on previously unseen real-world environments. All source code for the system pipeline is publicly released at https://github.com/nzantout/SORT3D.
IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes
Zhang, Haochen, Zantout, Nader, Kachana, Pujith, Zhang, Ji, Wang, Wenshan
With the recent rise of large language models, vision-language models, and other general foundation models, there is growing potential for multimodal, multi-task robotics that can operate in diverse environments given natural language input. One such application is indoor navigation using natural language instructions. However, despite recent progress, this problem remains challenging due to the 3D spatial reasoning and semantic understanding required. Additionally, the language used may be imperfect or misaligned with the scene, further complicating the task. To address this challenge, we curate a benchmark dataset, IRef-VLA, for Interactive Referential Vision and Language-guided Action in 3D Scenes with imperfect references. IRef-VLA is the largest real-world dataset for the referential grounding task, consisting of over 11.5K scanned 3D rooms from existing datasets, 7.6M heuristically generated semantic relations, and 4.7M referential statements. Our dataset also contains semantic object and room annotations, scene graphs, and navigable free space annotations, and is augmented with statements where the language has imperfections or ambiguities. We verify the generalizability of our dataset by evaluating with state-of-the-art models to obtain a performance baseline and also develop a graph-search baseline to demonstrate the performance bound and generation of alternatives using scene-graph knowledge. With this benchmark, we aim to provide a resource for 3D scene understanding that aids the development of robust, interactive navigation systems. The dataset and all source code are publicly released at https://github.com/HaochenZ11/IRef-VLA.
Indication Finding: a novel use case for representation learning
Eckhoff, Maren, Selimi, Valmir, Aranovitch, Alexander, Lyons, Ian, Briggs, Emily, Hou, Jennifer, Devereson, Alex, Macak, Matej, Champagne, David, Anagnostopoulos, Chris
Many therapies are effective in treating multiple diseases. We present an approach that leverages methods developed in natural language processing and real-world data to prioritize potential, new indications for a mechanism of action (MoA). We specifically use representation learning to generate embeddings of indications and prioritize them based on their proximity to the indications with the strongest available evidence for the MoA. We demonstrate the successful deployment of our approach for anti-IL-17A using embeddings generated with SPPMI and present an evaluation framework to determine the quality of indication finding results and the derived embeddings.
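The prioritization step described above can be illustrated with a minimal sketch. It assumes SPPMI means shifted positive pointwise mutual information, max(PMI - log k, 0), computed over co-occurrence counts. The input format (sets of indication codes per record), the use of raw SPPMI rows as embeddings, cosine similarity as the proximity measure, and all names below are assumptions for illustration, not the authors' pipeline.

```python
import math
from collections import Counter
from itertools import combinations


def sppmi_vectors(records, shift_k=1.0):
    """Build sparse SPPMI vectors from co-occurrence within records.

    records: iterable of sets of indication labels (hypothetical toy input).
    Returns {indication: {context_indication: sppmi_value}}.
    """
    pair_counts = Counter()
    for record in records:
        # Count each unordered pair once per record, in both directions.
        for a, b in combinations(sorted(set(record)), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    total = sum(pair_counts.values())
    marginals = Counter()
    for (a, _), count in pair_counts.items():
        marginals[a] += count
    vectors = {}
    for (a, b), count in pair_counts.items():
        pmi = math.log(count * total / (marginals[a] * marginals[b]))
        shifted = max(pmi - math.log(shift_k), 0.0)  # shift, then clip at zero
        if shifted > 0.0:
            vectors.setdefault(a, {})[b] = shifted
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(vectors, anchor, candidates):
    """Rank candidate indications by proximity to a well-evidenced anchor."""
    anchor_vec = vectors.get(anchor, {})
    scored = [(c, cosine(anchor_vec, vectors.get(c, {}))) for c in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy records; the labels carry no clinical meaning.
records = [{"anchor_ind", "related_ind", "shared_ctx"}] * 3
records += [{"unrelated_ind", "other_ctx"}] * 3
ranking = rank_candidates(
    sppmi_vectors(records), "anchor_ind", ["unrelated_ind", "related_ind"]
)
```

In this toy setup, `related_ind` ranks first because it shares a co-occurrence context with the anchor, while `unrelated_ind` shares none. A real pipeline would likely factorize the SPPMI matrix into dense embeddings before comparing indications; that step is omitted here.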
Fuzzy Temporal Protoforms for the Quantitative Description of Processes in Natural Language
Fontenla-Seco, Yago, Bugarín-Diz, Alberto, Lama, Manuel
In this paper, we propose a series of fuzzy temporal protoforms in the framework of the automatic generation of quantitative and qualitative natural language descriptions of processes. The model includes temporal and causal information from processes and attributes, quantifies attributes in time during the process life-span and recalls causal relations and temporal distances between events, among other features. Through integrating process mining techniques and fuzzy sets within the usual Data-to-Text architecture, our framework is able to extract relevant quantitative temporal as well as structural information from a process and describe it in natural language involving uncertain terms. A real use-case in the cardiology domain is presented, showing the potential of our model for providing natural language explanations addressed to domain experts.
Bottom-up top-down detection transformers for open vocabulary object detection
We perform open vocabulary detection of the objects mentioned in the sentence using both bottom-up and top-down feedback. Object detection is the fundamental computer vision task of finding all "objects" that are present in a visual scene. However, this raises the question, what is an object? Typically, this question is side-stepped by defining a vocabulary of categories and then training a model to detect instances of this vocabulary. This means that if "apple" is not in this vocabulary, the model does not consider it as an object.
10 Best Machine Learning Textbooks that All Data Scientists Should Read
Machine learning is an intimidating subject. Knowing where to develop mastery around such a massive subject that encompasses so many fields, research topics, and applications can be the hardest part of the journey. Anyone with a background in programming will attest to the value of a good textbook, especially when it comes to a subject as technical as machine learning. Whether you're a complete novice or a distinguished mastermind in this field, we at iMerit have compiled the best field guides, icebreakers, and referential machine learning textbooks that will suit both newcomers and veterans alike who are looking to improve their understanding of machine learning.
Automating the Generation of High School Geometry Proofs using Prolog in an Educational Context
Font, Ludovic, Cyr, Sébastien, Richard, Philippe R., Gagnon, Michel
When working on intelligent tutor systems designed for mathematics education and its specificities, an interesting objective is to provide relevant help to the students by anticipating their next steps. This can only be done by knowing, beforehand, the possible ways to solve a problem. Hence the need for an automated theorem prover that provides proofs as they would be written by a student. To achieve this objective, logic programming is a natural tool due to the similarity of its reasoning with a mathematical proof by inference. In this paper, we present the core ideas we used to implement such a prover, from its encoding in Prolog to the generation of the complete set of proofs. However, when dealing with educational aspects, there are many challenges to overcome. We also present the main issues we encountered, as well as the chosen solutions. The QED-Tutrix software [15, 19] provides an environment where a high school student can solve geometry proof problems. One of its key features is that it allows the student to provide proof elements in any order, not limiting them to forward- or backward-chaining. For instance, when solving the simple problem "prove that a quadrilateral with three right angles is a rectangle", the student can provide any element of any possible proof, such as a direct consequence of the hypotheses ("if two lines are perpendicular to a third, they are parallel"), a necessary premise for the conclusion ("a rectangle is a quadrilateral that has four right angles"), or anything in between ("the quadrilateral ABCD is a parallelogram"). A second key feature is the tutoring aspect. When the student is stuck in the resolution, the software is able to provide them with relevant messages.
In the previous example, if the student entered "the quadrilateral ABCD is a parallelogram" and is stuck afterwards, the software identifies that they are working on a proof using parallelogram properties, and will provide them messages such as "what is the definition of a parallelogram?" or "is there a relation between parallelogram and rectangle?" These features, the flexibility in exploration and the tutoring, are very interesting from a mathematics education perspective, but come with a cost.
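To make the idea of proof by inference concrete, here is a minimal Python sketch of forward chaining on the quadrilateral example. It is not the authors' Prolog encoding: the propositional fact strings, the rule names, and the `forward_chain` helper are all hypothetical, and the sketch records only the first justification found for each fact, not the complete set of proofs.

```python
def forward_chain(facts, rules):
    """Saturate a set of facts under propositional Horn rules.

    rules: list of (rule_name, premises, conclusion) tuples.
    Returns (known_facts, justification), where justification maps each
    derived fact to the (rule_name, premises) that first produced it.
    """
    known = set(facts)
    justification = {}
    changed = True
    while changed:
        changed = False
        for name, premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                justification[conclusion] = (name, tuple(premises))
                changed = True
    return known, justification


# Hypotheses: quadrilateral ABCD with right angles at B, C and D.
FACTS = ["AB perp BC", "BC perp CD", "CD perp DA"]

# Hypothetical rule instances, already grounded for this one problem.
RULES = [
    ("perp_to_same_line", ["AB perp BC", "BC perp CD"], "AB par CD"),
    ("perp_to_same_line", ["BC perp CD", "CD perp DA"], "BC par DA"),
    ("parallelogram_def", ["AB par CD", "BC par DA"], "ABCD is a parallelogram"),
    ("parallelogram_with_right_angle",
     ["ABCD is a parallelogram", "AB perp BC"], "ABCD is a rectangle"),
]

known, why = forward_chain(FACTS, RULES)
```

Every entry in `why` corresponds to an intermediate statement a student might enter, such as "ABCD is a parallelogram", which is the kind of information a tutor needs in order to recognize which proof the student is pursuing.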
That and There: Judging the Intent of Pointing Actions with Robotic Arms
Alikhani, Malihe, Khalid, Baber, Shome, Rahul, Mitash, Chaitanya, Bekris, Kostas, Stone, Matthew
Collaborative robotics requires effective communication between a robot and a human partner. This work proposes a set of interpretive principles for how a robotic arm can use pointing actions to communicate task information to people by extending existing models from the related literature. These principles are evaluated through studies where English-speaking human subjects view animations of simulated robots instructing pick-and-place tasks. The evaluation distinguishes two classes of pointing actions that arise in pick-and-place tasks: referential pointing (identifying objects) and locating pointing (identifying locations). The study indicates that human subjects show greater flexibility in interpreting the intent of referential pointing compared to locating pointing, which needs to be more deliberate. The results also demonstrate the effects of variation in the environment and task context on the interpretation of pointing. Our corpus, experiments and design principles advance models of context, common sense reasoning and communication in embodied communication.
Learning to Mediate Perceptual Differences in Situated Human-Robot Dialogue
Liu, Changsong (Michigan State University) | Chai, Joyce Yue (Michigan State University)
In human-robot dialogue, although a robot and its human partner are co-present in a shared environment, they have significantly mismatched perceptual capabilities (e.g., recognizing objects in the surroundings). When a shared perceptual basis is missing, it becomes difficult for the robot to identify referents in the physical world that are referred to by the human (i.e., a problem of referential grounding). To overcome this problem, we have developed an optimization-based approach that allows the robot to detect and adapt to perceptual differences. Through online interaction with the human, the robot can learn a set of weights indicating how reliably/unreliably each dimension (e.g., object type, object color, etc.) of its perception of the environment maps to the human's linguistic descriptors and thus adjust its word models accordingly. Our empirical evaluation has shown that this weight-learning approach can successfully adjust the weights to reflect the robot's perceptual limitations. The learned weights, together with updated word models, can lead to a significant improvement for referential grounding in future dialogues.