Problem Solving
A Comprehensive Survey on Root Cause Analysis in (Micro) Services: Methodologies, Challenges, and Trends
Initially, IT operations were predominantly manual, relying heavily on human intervention for system monitoring, troubleshooting, and problem resolution. However, with the escalating scale and complexity of systems, the efficacy and precision of manual operations have been increasingly challenged. Subsequently, DevOps was introduced, building upon manual operations and fostering a synergistic collaboration between development and operations. Through automated deployment and continuous integration, DevOps has the capability to expedite the release of new features and rectify issues with greater speed and reliability. Nonetheless, DevOps still necessitates manual involvement in certain complex decision-making processes and tasks. To further mitigate this challenge and enhance cost-effectiveness and efficiency, AIOps leverages machine learning and data analysis to automatically collect and scrutinize vast amounts of IT operation data, enabling real-time monitoring, anomaly detection, fault localization, and automated processing of IT systems. AIOps not only augments the efficiency and accuracy of IT operations but also equips IT operations with the capacity to adapt more effectively to complex and dynamic IT environments, utilizing artificial intelligence and big data technologies.
DOPRA: Decoding Over-accumulation Penalization and Re-allocation in Specific Weighting Layer
In this work, we introduce DOPRA, a novel approach designed to mitigate hallucinations in multi-modal large language models (MLLMs). Unlike existing solutions that typically involve costly supplementary training data or the integration of external knowledge sources, DOPRA innovatively addresses hallucinations by decoding specific weighted layer penalties and redistribution, offering an economical and effective solution without additional resources. DOPRA is grounded in unique insights into the intrinsic mechanisms controlling hallucinations within MLLMs, especially the models' tendency to over-rely on a subset of summary tokens in the self-attention matrix, neglecting critical image-related information. This phenomenon is particularly pronounced in certain strata. To counteract this over-reliance, DOPRA employs a strategy of weighted overlay penalties and redistribution in specific layers, such as the 12th layer, during the decoding process. Furthermore, DOPRA includes a retrospective allocation process that re-examines the sequence of generated tokens, allowing the algorithm to reallocate token selection to better align with the actual image content, thereby reducing the incidence of hallucinatory descriptions in auto-generated captions. Overall, DOPRA represents a significant step forward in improving the output quality of MLLMs by systematically reducing hallucinations through targeted adjustments during the decoding process.
Improving Context-Aware Preference Modeling for Language Models
Pitis, Silviu, Xiao, Ziang, Roux, Nicolas Le, Sordoni, Alessandro
While finetuning language models from pairwise preferences has proven remarkably effective, the underspecified nature of natural language presents critical challenges. Direct preference feedback is uninterpretable, difficult to provide where multidimensional criteria may apply, and often inconsistent, either because it is based on incomplete instructions or provided by diverse principals. To address these challenges, we consider the two-step preference modeling procedure that first resolves the under-specification by selecting a context, and then evaluates preference with respect to the chosen context. We decompose reward modeling error according to these two steps, which suggests that supervising context in addition to context-specific preference may be a viable approach to aligning models with diverse human preferences. For this to work, the ability of models to evaluate context-specific preference is critical. To this end, we contribute context-conditioned preference datasets and accompanying experiments that investigate the ability of language models to evaluate context-specific preference. We use our datasets to (1) show that existing preference models benefit from, but fail to fully consider, added context, (2) finetune a context-aware reward model with context-specific performance exceeding that of GPT-4 and Llama 3 70B on tested datasets, and (3) investigate the value of context-aware preference modeling.
How to Blend Concepts in Diffusion Models
Longari, Giorgio, Olearo, Lorenzo, Melzi, Simone, Peñaloza, Rafael, Raganato, Alessandro
For the last decade, there has been a push to use multi-dimensional (latent) spaces to represent concepts; and yet how to manipulate these concepts or reason with them remains largely unclear. Some recent methods exploit multiple latent representations and their connection, making this research question even more entangled. Our goal is to understand how operations in the latent space affect the underlying concepts. To that end, we explore the task of concept blending through diffusion models. Diffusion models are based on a connection between a latent representation of textual prompts and a latent space that enables image reconstruction and generation. This task allows us to try different text-based combination strategies, and evaluate easily through a visual analysis. Our conclusion is that concept blending through space manipulation is possible, although the best strategy depends on the context of the blend.
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI
Liu, Yang, Chen, Weixing, Bai, Yongjie, Luo, Jingzhou, Song, Xinshuai, Jiang, Kaixuan, Li, Zhida, Zhao, Ganlong, Lin, Junyi, Li, Guanbin, Gao, Wen, Lin, Liang
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General Intelligence (AGI) and serves as a foundation for various applications that bridge cyberspace and the physical world. Recently, the emergence of Multi-modal Large Models (MLMs) and World Models (WMs) have attracted significant attention due to their remarkable perception, interaction, and reasoning capabilities, making them a promising architecture for the brain of embodied agents. However, there is no comprehensive survey for Embodied AI in the era of MLMs. In this survey, we give a comprehensive exploration of the latest advancements in Embodied AI. Our analysis firstly navigates through the forefront of representative works of embodied robots and simulators, to fully understand the research focuses and their limitations. Then, we analyze four main research targets: 1) embodied perception, 2) embodied interaction, 3) embodied agent, and 4) sim-to-real adaptation, covering the state-of-the-art methods, essential paradigms, and comprehensive datasets. Additionally, we explore the complexities of MLMs in virtual and real embodied agents, highlighting their significance in facilitating interactions in dynamic digital and physical environments. Finally, we summarize the challenges and limitations of embodied AI and discuss their potential future directions. We hope this survey will serve as a foundational reference for the research community and inspire continued innovation. The associated project can be found at https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List.
Learning Goal-Conditioned Representations for Language Reward Models
Nath, Vaskar, Slack, Dylan, Da, Jeff, Ma, Yuntao, Zhang, Hugh, Whitehead, Spencer, Hendryx, Sean
Techniques that learn improved representations via offline data or self-supervised objectives have shown impressive results in traditional reinforcement learning (RL). Nevertheless, it is unclear how improved representation learning can benefit reinforcement learning from human feedback (RLHF) on language models (LMs). In this work, we propose training reward models (RMs) in a contrastive, $\textit{goal-conditioned}$ fashion by increasing the representation similarity of future states along sampled preferred trajectories and decreasing the similarity along randomly sampled dispreferred trajectories. This objective significantly improves RM performance by up to 0.09 AUROC across challenging benchmarks, such as MATH and GSM8k. These findings extend to general alignment as well -- on the Helpful-Harmless dataset, we observe $2.3\%$ increase in accuracy. Beyond improving reward model performance, we show this way of training RM representations enables improved $\textit{steerability}$ because it allows us to evaluate the likelihood of an action achieving a particular goal-state (e.g., whether a solution is correct or helpful). Leveraging this insight, we find that we can filter up to $55\%$ of generated tokens during majority voting by discarding trajectories likely to end up in an "incorrect" state, which leads to significant cost savings. We additionally find that these representations can perform fine-grained control by conditioning on desired future goal-states. For example, we show that steering a Llama 3 model towards helpful generations with our approach improves helpfulness by $9.6\%$ over a supervised-fine-tuning trained baseline. Similarly, steering the model towards complex generations improves complexity by $21.6\%$ over the baseline. Overall, we find that training RMs in this contrastive, goal-conditioned fashion significantly improves performance and enables model steerability.
LIMT: Language-Informed Multi-Task Visual World Models
Aljalbout, Elie, Sotirakis, Nikolaos, van der Smagt, Patrick, Karl, Maximilian, Chen, Nutan
Most recent successes in robot reinforcement learning involve learning a specialized single-task agent. However, robots capable of performing multiple tasks can be much more valuable in real-world applications. Multi-task reinforcement learning can be very challenging due to the increased sample complexity and the potentially conflicting task objectives. Previous work on this topic is dominated by model-free approaches. The latter can be very sample inefficient even when learning specialized single-task agents. In this work, we focus on model-based multi-task reinforcement learning. We propose a method for learning multi-task visual world models, leveraging pre-trained language models to extract semantically meaningful task representations. These representations are used by the world model and policy to reason about task similarity in dynamics and behavior. Our results highlight the benefits of using language-driven task representations for world models and a clear advantage of model-based multi-task learning over the more common model-free paradigm.
Steamroller Problems: An Evaluation of LLM Reasoning Capability with Automated Theorem Prover Strategies
McGinness, Lachlan, Baumgartner, Peter
This study presents the first examination of the ability of Large Language Models (LLMs) to follow reasoning strategies that are used to guide Automated Theorem Provers (ATPs). We evaluate the performance of GPT4, GPT3.5 Turbo and Google's recent Gemini model on problems from a steamroller domain. In addition to determining accuracy we make use of the Natural Language Processing library spaCy to explore new methods of investigating LLM's reasoning capabilities. This led to one alarming result, the low correlation between correct reasoning and correct answers for any of the tested models. We found that the models' performance when using the ATP reasoning strategies was comparable to one-shot chain of thought and observe that attention to uncertainty in the accuracy results is critical when drawing conclusions about model performance. Consistent with previous speculation we confirm that LLMs have a preference for, and are best able to follow, bottom up reasoning processes. However, the reasoning strategies can still be beneficial for deriving small and relevant sets of formulas for external processing by a trusted inference engine.
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
Zhu, Chenming, Wang, Tai, Zhang, Wenwei, Chen, Kai, Liu, Xihui
Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason human intentions from implicit instructions. We propose a new task called 3D reasoning grounding and introduce a new benchmark ScanReason which provides over 10K question-answer-location pairs from five reasoning types that require the synerization of reasoning and grounding. We further design our approach, ReGround3D, composed of the visual-centric reasoning module empowered by Multi-modal Large Language Model (MLLM) and the 3D grounding module to obtain accurate object locations by looking back to the enhanced geometry and fine-grained details from the 3D scenes. A chain-of-grounding mechanism is proposed to further boost the performance with interleaved reasoning and grounding steps during inference. Extensive experiments on the proposed benchmark validate the effectiveness of our proposed approach.
Knowledge-based Drug Samples' Comparison
Guillemin, Sébastien, Roxin, Ana, Dujourdy, Laurence, Journaux, Ludovic
-- Drug sample comparison is a process used by the French National Police to identify drug distribution networks. The current approach is based on a manual comparison done by forensic experts. In this article, we present our approach to acquire, formalise, and specify expert knowledge to improve the current process. We use an ontology coupled with logical rules to model the underlying knowledge. The different steps of our approach are designed to be reused in other application domains. The results obtained are explainable making them usable by experts in different fields. The fight against drug trafficking has been one of the French government's priorities since the end of 2019 and has led to the creation of the National Stup plan. This plan comprises 55 measures, including the use of new indicators to understand consumer habits and dealers' methods. The work described in this article is part of this plan and aims to support scientific experts in the decision-making process for narcotic profiling. As part of the fight against drug trafficking, several arrests may be made, often accompanied by seizures. Forensic experts perform several analyses on samples from a seizure. They aim to correlate different samples from different seizures to identify trafficking networks best. To do so, experts use sample matching to pair samples according to their characteristics. Paired samples constitute an ensemble called a batch. The sample characteristics used are represented by different data, namely: macroscopic data (e.g., sample dimension, drug logos), qualitative data (e.g., list of active substances), quantitative data (e.g., dosage of substances) or non-confidential seizure data (e.g., date, place of seizure). In France, such data is stored in the national STUPS database.