simple 0
ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models
Merler, Matteo, Dainese, Nicola, Alakuijala, Minttu, Bonetta, Giovanni, Ferrazzi, Pietro, Tian, Yu, Magnini, Bernardo, Marttinen, Pekka
Integrating Large Language Models with symbolic planners is a promising direction for obtaining verifiable and grounded plans compared to planning in natural language, with recent works extending this idea to visual domains using Vision-Language Models (VLMs). However, rigorous comparison between VLM-grounded symbolic approaches and methods that plan directly with a VLM has been hindered by a lack of common environments, evaluation protocols and model coverage. We introduce ViPlan, the first open-source benchmark for Visual Planning with symbolic predicates and VLMs. ViPlan features a series of increasingly challenging tasks in two domains: a visual variant of the classic Blocksworld planning problem and a simulated household robotics environment. We benchmark nine open-source VLM families across multiple sizes, along with selected closed models, evaluating both VLM-grounded symbolic planning and using the models directly to propose actions. We find symbolic planning to outperform direct VLM planning in Blocksworld, where accurate image grounding is crucial, whereas the opposite is true in the household robotics tasks, where commonsense knowledge and the ability to recover from errors are beneficial. Finally, we show that across most models and methods, there is no significant benefit to using Chain-of-Thought prompting, suggesting that current VLMs still struggle with visual reasoning.
Bayesian vs. PAC-Bayesian Deep Neural Network Ensembles
Hauptvogel, Nick, Igel, Christian
Bayesian neural networks address epistemic uncertainty by learning a posterior distribution over model parameters. Sampling and weighting networks according to this posterior yields an ensemble model referred to as Bayes ensemble. Ensembles of neural networks (deep ensembles) can profit from the cancellation of errors effect: Errors by ensemble members may average out and the deep ensemble achieves better predictive performance than each individual network. We argue that neither the sampling nor the weighting in a Bayes ensemble are particularly well-suited for increasing generalization performance, as they do not support the cancellation of errors effect, which is evident in the limit from the Bernstein-von~Mises theorem for misspecified models. In contrast, a weighted average of models where the weights are optimized by minimizing a PAC-Bayesian generalization bound can improve generalization performance. This requires that the optimization takes correlations between models into account, which can be achieved by minimizing the tandem loss at the cost that hold-out data for estimating error correlations need to be available. The PAC-Bayesian weighting increases the robustness against correlated models and models with lower performance in an ensemble. This allows us to safely add several models from the same learning process to an ensemble, instead of using early-stopping for selecting a single weight configuration. Our study presents empirical results supporting these conceptual considerations on four different classification datasets. We show that state-of-the-art Bayes ensembles from the literature, despite being computationally demanding, do not improve over simple uniformly weighted deep ensembles and cannot match the performance of deep ensembles weighted by optimizing the tandem loss, which additionally come with non-vacuous generalization guarantees.
Value Prediction for Spatiotemporal Gait Data Using Deep Learning
Cavanagh, Ryan, Trajkovic, Jelena, Zhang, Wenlu, Khoo, I-Hung, Krishnan, Vennila
Human gait has been commonly used for the diagnosis and evaluation of medical conditions and for monitoring the progress during treatment and rehabilitation. The use of wearable sensors that capture pressure or motion has yielded techniques that analyze the gait data to aid recovery, identify activity performed, or identify individuals. Deep learning, usually employing classification, has been successfully utilized in a variety of applications such as computer vision, biomedical imaging analysis, and natural language processing. We expand the application of deep learning to value prediction of time-series of spatiotemporal gait data. Moreover, we explore several deep learning architectures (Recurrent Neural Networks (RNN) and RNN combined with Convolutional Neural Networks (CNN)) to make short- and long-distance predictions using two different experimental setups. Our results show that short-distance prediction has an RMSE as low as 0.060675, and long-distance prediction RMSE as low as 0.106365. Additionally, the results show that the proposed deep learning models are capable of predicting the entire trial when trained and validated using the trials from the same participant. The proposed, customized models, used with value prediction open possibilities for additional applications, such as fall prediction, in-home progress monitoring, aiding of exoskeleton movement, and authentication.
Overview of the TREC 2023 Product Product Search Track
Campos, Daniel, Kallumadi, Surya, Rosset, Corby, Zhai, Cheng Xiang, Magnani, Alessandro
At TREC 2023, we hosted the first TREC Product Search Track, looking to create a reusable general benchmark for evaluating the performance of retrieval methods in the product search domain. We focus on providing a benchmark similar in scale and format to NQ Kwiatkowski et al. [2019], or the Deep Learning Track Craswell et al. [2021] but focused on product search. In providing a simple-to-use dataset, we believe broad experimentation using popular retrieval libraries Lin et al. [2021] Gao et al. [2022] can lead to broad improvements in retrieval performance. In this first year of the track, we created a novel collection based on the ESCI Product Re-ranking dataset Reddy et al. [2022], sampled novel queries, created enriched metadata in the form of additional text and images along with seeded evaluation results with a broad range of baseline runs to aid in collection reusability and to allow iteration and experimentation on the use of additional context. Unlike previous product search corpora, the Product Search Track is multi-modal and has a large enough scale to explore the usage of neural retrieval methods.