retrieve
completion
Algorithm 2 describes the prompt completion algorithm introduced in Section 2.2. It implicitly considers a single action, which takes the next sequence element.
Algorithm 2 - Prompt completion
Input: Grounded schema $\{T, C, E_{rb}\}$ with rebound CSCG emission matrix $E_{rb}$, delimiter token $x$, prompt $x^{(prompt)} = (x_1, \ldots, x_m)$
Output: A completed prompt $x^{(prompt\ completed)} = (x_1, \ldots, x_m, x_{m+1}, \ldots, x_{m+p} = x)$
1: Run max-product for MAP inference and return $z^{MAP} = (z_1, \ldots, z_m) = \arg\max_z p(z \mid x^{(prompt)})$.
Algorithm 3 is a variant of the rebinding Algorithm 1 that does not use EM. Instead, it first searches for "surprising observations": a surprise has a low probability of being emitted by its decoded clone.
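Step 1 of Algorithm 2 (max-product MAP inference) can be illustrated with a generic Viterbi decoder for a discrete hidden Markov model. This is a minimal sketch under stated assumptions, not the CSCG implementation: the function name `viterbi_map`, the initial distribution `pi`, and the toy matrices are hypothetical, and the later completion steps of the algorithm are not shown because they are not part of the excerpt.

```python
import numpy as np

def viterbi_map(T, E, pi, obs):
    """Max-product (Viterbi) decoding for a discrete HMM (illustrative only).

    T[i, j] = p(z_{t+1} = j | z_t = i), E[i, k] = p(x = k | z = i),
    pi[i] = p(z_1 = i). All three are assumptions of this sketch.
    """
    with np.errstate(divide="ignore"):  # allow log(0) = -inf
        logT, logE, logpi = np.log(T), np.log(E), np.log(pi)
    m, n = len(obs), T.shape[0]
    score = np.empty((m, n))           # best log-probability ending in each state
    back = np.zeros((m, n), dtype=int)  # argmax predecessor for backtracking
    score[0] = logpi + logE[:, obs[0]]
    for t in range(1, m):
        cand = score[t - 1][:, None] + logT  # cand[i, j]: come from i, go to j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + logE[:, obs[t]]
    # Backtrack from the best final state.
    z = [int(score[-1].argmax())]
    for t in range(m - 1, 0, -1):
        z.append(int(back[t][z[-1]]))
    return z[::-1]

# Toy example: two hidden states that tend to alternate.
T = np.array([[0.1, 0.9], [0.9, 0.1]])
E = np.array([[0.9, 0.1], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
z_map = viterbi_map(T, E, pi, [0, 1, 0])
```

Working in log space avoids numerical underflow on long prompts, which is the usual reason to prefer it over multiplying raw probabilities.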
derivation of Eqs. 3 and 5
A.1 Derivation of Eq. (3) By expanding Eq. (2) with the definition $\varepsilon^l_{i,t} = x^l_{i,t} - \mu^l_{i,t}$, we obtain the expression for $E_t$ given in Eq. (6). We note that each $x^l_{i,t}$ influences $E_t$ in two ways: (i) it occurs in Eq. (6) explicitly, but (ii) it also determines the values of $\mu^{l-1}_{k,t}$ via Eq. Considering also the special cases of $l = L$ and $l = 0$, we obtain Eq. (3). We note that $\theta^{l+1}_{i,j}$ affects the value of the function $E_t$ of Eq. (6) by influencing $\mu^l_{i,t}$ via Eq. Here, we provide further details about training PCNs, useful for reproducing them. Furthermore, we applied a decay factor of 0.9 to $\gamma$ each time the energy failed to decrease.
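The decay rule in the last sentence can be sketched as follows. This is only an illustration: the function name `decay_on_plateau` is hypothetical, $\gamma$ is assumed to be a scalar step size, and "failed to decrease" is interpreted here as the energy being greater than or equal to its previous value.

```python
def decay_on_plateau(energies, gamma0, decay=0.9):
    """Return the value of gamma in effect after each energy observation.

    gamma is multiplied by `decay` whenever the energy does not decrease
    relative to the previous observation (an assumed reading of the rule).
    """
    gamma = gamma0
    prev = float("inf")  # the first observation always counts as a decrease
    history = []
    for e in energies:
        if e >= prev:
            gamma *= decay
        history.append(gamma)
        prev = e
    return history

# Hypothetical energy trace: the third and fifth values fail to decrease.
gammas = decay_on_plateau([1.0, 0.8, 0.9, 0.7, 0.7], gamma0=1.0)
```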
Retrieve, Reason, and Refine: Appendix of Generating Accurate and Faithful Patient Instructions
For the constructed knowledge graph, we use randomly initialized embeddings $H^{(0)} = \{v_1, v_2, \ldots, v_{N_{KG}}\} \in \mathbb{R}^{N_{KG} \times d}$ to represent all node features. Table 2 shows that all variants with different numbers of retrieved instructions $N_P$ consistently outperform the baseline model, which demonstrates the effectiveness of our approach in retrieving working experience to boost Patient Instruction generation. As a result, given a new male/female patient aged 61, we match male/female patients in the age group 55 <= Age < 70 in the training data to generate the PIs.
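The demographic matching described above can be sketched as a simple lookup. Only the 55 <= Age < 70 group comes from the text; the other bin edges, the record fields (`sex`, `age`), and the function names are hypothetical placeholders.

```python
from bisect import bisect_right

# Lower edges of the age groups. Only the [55, 70) group is taken from the
# text; the edges 0 and 18 are hypothetical and used for illustration.
AGE_EDGES = [0, 18, 55, 70]

def age_group(age):
    """Index of the half-open bin [edge_i, edge_{i+1}) that contains `age`."""
    return bisect_right(AGE_EDGES, age) - 1

def match_patients(new_patient, training):
    """Return training patients with the same sex and age group."""
    key = (new_patient["sex"], age_group(new_patient["age"]))
    return [p for p in training if (p["sex"], age_group(p["age"])) == key]

# Hypothetical records: a 61-year-old male matches males aged 55 to 69.
training = [
    {"id": 1, "sex": "M", "age": 58},
    {"id": 2, "sex": "F", "age": 60},
    {"id": 3, "sex": "M", "age": 70},
    {"id": 4, "sex": "M", "age": 69},
]
matches = match_patients({"sex": "M", "age": 61}, training)
```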
RETRIEVE: Coreset Selection for Efficient and Robust Semi-Supervised Learning
Semi-supervised learning (SSL) algorithms have had great success in recent years in limited labeled data regimes. However, the current state-of-the-art SSL algorithms are computationally expensive and entail significant compute time and energy requirements. This can prove to be a huge limitation for many smaller companies and academic groups. Our main insight is that training on a subset of unlabeled data instead of entire unlabeled data enables the current SSL algorithms to converge faster, significantly reducing computational costs. In this work, we propose RETRIEVE, a coreset selection framework for efficient and robust semi-supervised learning. RETRIEVE selects the coreset by solving a mixed discrete-continuous bi-level optimization problem such that the selected coreset minimizes the labeled set loss. We use a one-step gradient approximation and show that the discrete optimization problem is approximately submodular, enabling simple greedy algorithms to obtain the coreset. We empirically demonstrate on several real-world datasets that existing SSL algorithms like VAT, Mean-Teacher, FixMatch, when used with RETRIEVE, achieve a) faster training times, b) better performance when unlabeled data consists of Out-of-Distribution (OOD) data and imbalance. More specifically, we show that with minimal accuracy degradation, RETRIEVE achieves a speedup of around $3\times$ in the traditional SSL setting and achieves a speedup of $5\times$ compared to state-of-the-art (SOTA) robust SSL algorithms in the case of imbalance and OOD data.
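The idea of greedily choosing a subset whose one-step update lowers the labeled-set loss can be illustrated with a toy sketch. This is not the RETRIEVE algorithm: the function `greedy_coreset`, the per-example gradients, the quadratic labeled loss, and the learning rate are all assumptions made for illustration.

```python
import numpy as np

def greedy_coreset(unlabeled_grads, labeled_loss, theta, lr, k):
    """Greedily pick k unlabeled examples (k must not exceed their number).

    At each round, add the example whose gradient, combined with those
    already selected, gives the lowest labeled loss after one gradient
    step from theta. Illustrative only.
    """
    selected = []
    step = np.zeros_like(theta)  # sum of gradients of the selected examples
    for _ in range(k):
        best, best_loss = None, np.inf
        for i, g in enumerate(unlabeled_grads):
            if i in selected:
                continue
            loss = labeled_loss(theta - lr * (step + g))
            if loss < best_loss:
                best, best_loss = i, loss
        selected.append(best)
        step += unlabeled_grads[best]
    return selected

# Toy setup: the labeled loss is the squared distance to a target parameter.
target = np.array([1.0, 0.0])
labeled_loss = lambda th: float(np.sum((th - target) ** 2))
grads = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-0.5, 0.0]])
subset = greedy_coreset(grads, labeled_loss, np.array([0.0, 0.0]), lr=0.5, k=2)
```

Each round costs one loss evaluation per remaining candidate, so the sketch scales as O(nk) evaluations. That is acceptable for a toy example but would need batching or approximation at realistic dataset sizes.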
Type-to-Track: Retrieve Any Object via Prompt-based Tracking
One of the recent trends in vision problems is to use natural language captions to describe the objects of interest. This approach can overcome some limitations of traditional methods that rely on bounding boxes or category annotations. This paper introduces a novel paradigm for Multiple Object Tracking called Type-to-Track, which allows users to track objects in videos by typing natural language descriptions. We present a new dataset for that Grounded Multiple Object Tracking task, called GroOT, that contains videos with various types of objects and their corresponding textual captions describing their appearance and action in detail. Additionally, we introduce two new evaluation protocols and formulate evaluation metrics specifically for this task. We develop a new efficient method that models a transformer-based eMbed-ENcoDE-extRact framework (MENDER) using the third-order tensor decomposition. The experiments in five scenarios show that our MENDER approach outperforms another two-stage design in terms of accuracy and efficiency, with up to 14.7\% higher accuracy and up to $4\times$ faster speed.
Searching in Space and Time: Unified Memory-Action Loops for Open-World Object Retrieval
Chen, Taijing, Kumar, Sateesh, Xu, Junhong, Pavlakos, Georgios, Biswas, Joydeep, Martín-Martín, Roberto
Service robots must retrieve objects in dynamic, open-world settings where requests may reference attributes ("the red mug"), spatial context ("the mug on the table"), or past states ("the mug that was here yesterday"). Existing approaches capture only parts of this problem: scene graphs capture spatial relations but ignore temporal grounding, temporal reasoning methods model dynamics but do not support embodied interaction, and dynamic scene graphs handle both but remain closed-world with fixed vocabularies. We present STAR (SpatioTemporal Active Retrieval), a framework that unifies memory queries and embodied actions within a single decision loop. STAR leverages non-parametric long-term memory and a working memory to support efficient recall, and uses a vision-language model to select either temporal or spatial actions at each step. We introduce STARBench, a benchmark of spatiotemporal object search tasks across simulated and real environments. Experiments in STARBench and on a Tiago robot show that STAR consistently outperforms scene-graph and memory-only baselines, demonstrating the benefits of treating search in time and search in space as a unified problem. For more information: https://amrl.cs.utexas.edu/STAR.