South America
A Picture is Worth A Thousand Numbers: Enabling LLMs Reason about Time Series via Visualization
Liu, Haoxin, Liu, Chenghao, Prakash, B. Aditya
Large language models (LLMs), with demonstrated reasoning abilities across multiple domains, are largely underexplored for time-series reasoning (TsR), which is ubiquitous in the real world. In this work, we propose TimerBed, the first comprehensive testbed for evaluating LLMs' TsR performance. Specifically, TimerBed includes stratified reasoning patterns with real-world tasks, comprehensive combinations of LLMs and reasoning strategies, and various supervised models as comparison anchors. We perform extensive experiments with TimerBed, test multiple current beliefs, and verify the initial failures of LLMs in TsR, evidenced by the ineffectiveness of zero shot (ZST) and performance degradation of few shot in-context learning (ICL). Further, we identify one possible root cause: the numerical modeling of data. To address this, we propose a prompt-based solution VL-Time, using visualization-modeled data and language-guided reasoning. Experimental results demonstrate that Vl-Time enables multimodal LLMs to be non-trivial ZST and powerful ICL reasoners for time series, achieving about 140% average performance improvement and 99% average token costs reduction.
Autoregressive Models in Vision: A Survey
Xiong, Jing, Liu, Gongye, Huang, Lun, Wu, Chengyue, Wu, Taiqiang, Mu, Yao, Yao, Yuan, Shen, Hui, Wan, Zhongwei, Huang, Jinfa, Tao, Chaofan, Yan, Shen, Yao, Huaxiu, Kong, Lingpeng, Yang, Hongxia, Zhang, Mi, Sapiro, Guillermo, Luo, Jiebo, Luo, Ping, Wong, Ngai
Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary in different levels, \textit{i.e.}, pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse research backgrounds, we start with preliminary sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models based on the strategy of representation. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multi-faceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multi-modal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges to autoregressive models in vision with suggestions about potential research directions. We have also set up a Github repository to organize the papers included in this survey at: \url{https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey}.
Analysis and Forecasting of the Dynamics of a Floating Wind Turbine Using Dynamic Mode Decomposition
Palma, Giorgio, Bardazzi, Andrea, Lucarelli, Alessia, Pilloton, Chiara, Serani, Andrea, Lugni, Claudio, Diez, Matteo
This article presents a data-driven equation-free modeling of the dynamics of a hexafloat floating offshore wind turbine based on the Dynamic Mode Decomposition (DMD). The DMD is here used to provide a modal analysis and extract knowledge from the dynamic system. A forecasting algorithm for the motions, accelerations, and forces acting on the floating system, as well as the height of the incoming waves, the wind speed, and the power extracted by the wind turbine, is developed by using a methodological extension called Hankel-DMD, that includes time-delayed copies of the states in an augmented state vector. All the analyses are performed on experimental data collected from an operating prototype. The quality of the forecasts obtained varying two main hyperparameters of the algorithm, namely the number of delayed copies and the length of the observation time, is assessed using three different error metrics, each analyzing complementary aspects of the prediction. A statistical analysis exposed the existence of optimal values for the algorithm hyperparameters. Results show the approach's capability for short-term future estimates of the system's state, which can be used for real-time prediction and control. Furthermore, a novel Stochastic Hankel-DMD formulation is introduced by considering hyperparameters as stochastic variables. The stochastic version of the method not only enriches the prediction with its related uncertainty but is also found to improve the normalized root mean square error up to 10% on a statistical basis compared to the deterministic counterpart.
Word reuse and combination support efficient communication of emerging concepts
Xu, Aotao, Kemp, Charles, Frermann, Lea, Xu, Yang
A key function of the lexicon is to express novel concepts as they emerge over time through a process known as lexicalization. The most common lexicalization strategies are the reuse and combination of existing words, but they have typically been studied separately in the areas of word meaning extension and word formation. Here we offer an information-theoretic account of how both strategies are constrained by a fundamental tradeoff between competing communicative pressures: word reuse tends to preserve the average length of word forms at the cost of less precision, while word combination tends to produce more informative words at the expense of greater word length. We test our proposal against a large dataset of reuse items and compounds that appeared in English, French and Finnish over the past century. We find that these historically emerging items achieve higher levels of communicative efficiency than hypothetical ways of constructing the lexicon, and both literal reuse items and compounds tend to be more efficient than their non-literal counterparts. These results suggest that reuse and combination are both consistent with a unified account of lexicalization grounded in the theory of efficient communication.
Longitudinal Ensemble Integration for sequential classification with multimodal data
Susman, Aviad, Krishnamurthy, Rupak, Li, Yan Chak, Olaimat, Mohammad, Bozdag, Serdar, Varghese, Bino, Sheikh-Bahaei, Nasim, Pandey, Gaurav
A BSTRACT Effectively modeling multimodal longitudinal data is a pressing need in various application areas, especially biomedicine. Despite this, few approaches exist in the literature for this problem, with most not adequately taking into account the multimodality of the data. In this study, we developed multiple configurations of a novel multimodal and longitudinal learning framework, Longitudinal Ensemble Integration (LEI), for sequential classification. We evaluated LEI's performance, and compared it against existing approaches, for the early detection of dementia, which is among the most studied multimodal sequential classification tasks. LEI outperformed these approaches due to its use of intermediate base predictions arising from the individual data modalities, which enabled their better integration over time. LEI's design also enabled the identification of features that were consistently important across time for the effective prediction of dementia-related diagnoses. Overall, our work demonstrates the potential of LEI for sequential classification from longitudinal multimodal data. 1 I NTRODUCTION Data that are both longitudinal/temporal and multimodal are increasingly being used in combination with machine learning for forecasting, especially in medical diagnosis (Brand et al., 2019; Zhang & Shen, 2012; Feis et al., 2019; Li et al., 2023). Recently, a number of promising approaches for sequential classification from such data have been introduced (Eslami et al., 2023; Zhang et al., 2011; Wang et al., 2016; Zhang et al., 2024). For instance, some approaches have used recurrent neural network (RNN)-based models applied to data sequences where the modalities at each time point have been concatenated into a long feature vector, sometimes referred to as early fusion (Nguyen et al., 2020; Olaimat et al., 2023; Maheux et al., 2023).
EROAS: 3D Efficient Reactive Obstacle Avoidance System for Autonomous Underwater Vehicles using 2.5D Forward-Looking Sonar
Mane, Pruthviraj, George, Allen Jacob, Makam, Rajini, Majumder, Rudrashis, Sundaram, Suresh
Advances in Autonomous Underwater Vehicles (AUVs) have evolved vastly in short period of time. While advancements in sonar and camera technology with deep learning aid the obstacle detection and path planning to a great extent, achieving the right balance between computational resources , precision and safety maintained remains a challenge. Finding optimal solutions for real-time navigation in cluttered environments becomes pivotal as systems have to process large amounts of data efficiently. In this work, we propose a novel obstacle avoidance method for navigating 3D underwater environments. This approach utilizes a standard multibeam forward-looking sonar to detect and map obstacle in 3D environment. Instead of using computationally expensive 3D sensors, we pivot the 2D sonar to get 3D heuristic data effectively transforming the sensor into a 2.5D sonar for real-time 3D navigation decisions. This approach enhances obstacle detection and navigation by leveraging the simplicity of 2D sonar with the depth perception typically associated with 3D systems. We have further incorporated Control Barrier Function (CBF) as a filter to ensure safety of the AUV. The effectiveness of this algorithm was tested on a six degrees of freedom (DOF) rover in various simulation scenarios. The results demonstrate that the system successfully avoids obstacles and navigates toward predefined goals, showcasing its capability to manage complex underwater environments with precision. This paper highlights the potential of 2.5D sonar for improving AUV navigation and offers insights into future enhancements and applications of this technology in underwater autonomous systems. \url{https://github.com/AIRLabIISc/EROAS}
Aioli: A Unified Optimization Framework for Language Model Data Mixing
Chen, Mayee F., Hu, Michael Y., Lourie, Nicholas, Cho, Kyunghyun, Rรฉ, Christopher
Language model performance depends on identifying the optimal mixture of data groups to train on (e.g., law, code, math). Prior work has proposed a diverse set of methods to efficiently learn mixture proportions, ranging from fitting regression models over training runs to dynamically updating proportions throughout training. Surprisingly, we find that no existing method consistently outperforms a simple stratified sampling baseline in terms of average test perplexity per group. In this paper, we study the cause of this inconsistency by unifying existing methods into a standard optimization framework. We show that all methods set proportions to minimize total loss, subject to a method-specific mixing law -- an assumption on how loss is a function of mixture proportions. We find that existing parameterizations of mixing laws can express the true loss-proportion relationship empirically, but the methods themselves often set the mixing law parameters inaccurately, resulting in poor and inconsistent performance. Finally, we leverage the insights from our framework to derive a new online method named Aioli, which directly estimates the mixing law parameters throughout training and uses them to dynamically adjust proportions. Empirically, Aioli outperforms stratified sampling on 6 out of 6 datasets by an average of 0.28 test perplexity points, whereas existing methods fail to consistently beat stratified sampling, doing up to 6.9 points worse. Moreover, in a practical setting where proportions are learned on shorter runs due to computational constraints, Aioli can dynamically adjust these proportions over the full training run, consistently improving performance over existing methods by up to 12.01 test perplexity points.
Green My LLM: Studying the key factors affecting the energy consumption of code assistants
Coignion, Tristan, Quinton, Clรฉment, Rouvoy, Romain
In recent years,Large Language Models (LLMs) have significantly improved in generating high-quality code, enabling their integration into developers' Integrated Development Environments (IDEs) as code assistants. These assistants, such as GitHub Copilot, deliver real-time code suggestions and can greatly enhance developers' productivity. However, the environmental impact of these tools, in particular their energy consumption, remains a key concern. This paper investigates the energy consumption of LLM-based code assistants by simulating developer interactions with GitHub Copilot and analyzing various configuration factors. We collected a dataset of development traces from 20 developers and conducted extensive software project development simulations to measure energy usage under different scenarios. Our findings reveal that the energy consumption and performance of code assistants are influenced by various factors, such as the number of concurrent developers, model size, quantization methods, and the use of streaming. Notably, a substantial portion of generation requests made by GitHub Copilot is either canceled or rejected by developers, indicating a potential area for reducing wasted computations. Based on these findings, we share actionable insights into optimizing configurations for different use cases, demonstrating that careful adjustments can lead to significant energy savings.
Generative Adapter: Contextualizing Language Models in Parameters with A Single Forward Pass
Chen, Tong, Fang, Hao, Xia, Patrick, Liu, Xiaodong, Van Durme, Benjamin, Zettlemoyer, Luke, Gao, Jianfeng, Cheng, Hao
Large language models (LMs) are typically adapted to improve performance on new contexts (e.g., text prompts that define new tasks or domains) through finetuning or prompting. However, there is an accuracy compute tradeoff--finetuning incurs significant training cost and prompting increases inference overhead. We introduce GenerativeAdapter, an effective and efficient adaptation method that directly maps new contexts to low-rank LM adapters, thereby significantly reducing inference overhead with no need for finetuning. The adapter generator is trained via self-supervised learning, and can be used to adapt a single frozen LM for any new task simply by mapping the associated task or domain context to a new adapter. We apply GenerativeAdapter to two pretrained LMs (Mistral-7B-Instruct and Llama2-7B-Chat) and evaluate the adapted models in three adaption scenarios: knowledge acquisition from documents, learning from demonstrations, and personalization for users. In StreamingQA, our approach is effective in injecting knowledge into the LM's parameters, achieving a 63.5% improvement in F1 score over the model with supervised fine-tuning (from 19.5 to 31.5) for contexts as long as 32K tokens. In the MetaICL in-context learning evaluation, our method achieves an average accuracy of 44.9 across 26 tasks, outperforming the base model. On MSC, our method proves to be highly competitive in memorizing user information from conversations with a 4x reduction in computation and memory costs compared to prompting with full conversation history. Together, these results suggest that GenerativeAdapter should allow for general adaption to a wide range of different contexts. Adaptation is essential for language models (LMs) to acquire new world knowledge (Jiang et al., 2024; Hu et al., 2023; Mecklenburg et al., 2024), learn new tasks (Min et al., 2022), and personalize to individual users (Salemi et al., 2024).
Exploring the Alignment Landscape: LLMs and Geometric Deep Models in Protein Representation
Shu, Dong, Duan, Bingbing, Guo, Kai, Zhou, Kaixiong, Tang, Jiliang, Du, Mengnan
Latent representation alignment has become a foundational technique for constructing multimodal large language models (MLLM) by mapping embeddings from different modalities into a shared space, often aligned with the embedding space of large language models (LLMs) to enable effective cross-modal understanding. While preliminary protein-focused MLLMs have emerged, they have predominantly relied on heuristic approaches, lacking a fundamental understanding of optimal alignment practices across representations. In this study, we explore the alignment of multimodal representations between LLMs and Geometric Deep Models (GDMs) in the protein domain. We comprehensively evaluate three state-of-the-art LLMs (Gemma2-2B, LLaMa3.1-8B, and LLaMa3.1-70B) with four protein-specialized GDMs (GearNet, GVP, ScanNet, GAT). Our work examines alignment factors from both model and protein perspectives, identifying challenges in current alignment methodologies and proposing strategies to improve the alignment process. Our key findings reveal that GDMs incorporating both graph and 3D structural information align better with LLMs, larger LLMs demonstrate improved alignment capabilities, and protein rarity significantly impacts alignment performance. We also find that increasing GDM embedding dimensions, using two-layer projection heads, and fine-tuning LLMs on protein-specific data substantially enhance alignment quality. These strategies offer potential enhancements to the performance of protein-related multimodal models. Our code and data are available at https://github.com/Tizzzzy/LLM-GDM-alignment.