Malacca
Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation
Jiang, Zhouyu, Sun, Mengshu, Zhang, Zhiqiang, Liang, Lei
Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in Large Language Models (LLMs) but can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection due to its implementation simplicity, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce \textbf{Bi'an}, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experimental evaluations on Bi'anBench show our 14B model outperforms baseline models with over five times larger parameter scales and rivals state-of-the-art closed-source LLMs. We will release our data and models soon at https://github.com/OpenSPG/KAG.
Task Offloading in Vehicular Edge Computing using Deep Reinforcement Learning: A Survey
Uddin, Ashab, Sakr, Ahmed Hamdi, Zhang, Ning
The increasing demand for Intelligent Transportation Systems (ITS) has introduced significant challenges in managing the complex, computation-intensive tasks generated by modern vehicles while offloading tasks to external computing infrastructures such as edge computing (EC), nearby vehicular , and UAVs has become influential solution to these challenges. However, traditional computational offloading strategies often struggle to adapt to the dynamic and heterogeneous nature of vehicular environments. In this study, we explored the potential of Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) frameworks to optimize computational offloading through adaptive, real-time decision-making, and we have thoroughly investigated the Markov Decision Process (MDP) approaches on the existing literature. The paper focuses on key aspects such as standardized learning models, optimized reward structures, and collaborative multi-agent systems, aiming to advance the understanding and application of DRL in vehicular networks. Our findings offer insights into enhancing the efficiency, scalability, and robustness of ITS, setting the stage for future innovations in this rapidly evolving field.
The Shipwreck Detective
The wreck was like a bug on the wall, a jumbly shape splayed on the abyssal plain. It was noticed by a team of autonomous-underwater-vehicle operators on board a subsea exploration vessel, working at an undisclosed location in the Atlantic Ocean, about a thousand miles from the nearest shore. The analysts belonged to a small private company that specializes in deep-sea search operations; I have been asked not to name it. They were looking for something else. In the past decade, the company has helped to transform the exploration of the seabed by deploying fleets of A.U.V.s--underwater drones--which cruise in formation, mapping large areas of the ocean floor with high-definition imagery.
Personal Intelligence System UniLM: Hybrid On-Device Small Language Model and Server-Based Large Language Model for Malay Nusantara
Nazri, Azree, Agbolade, Olalekan, Aziz, Faisal
In contexts with limited computational and data resources, high-resource language models often prove inadequate, particularly when addressing the specific needs of Malay languages. This paper introduces a Personal Intelligence System designed to efficiently integrate both on-device and server-based models. The system incorporates SLiM-34M for on-device processing, optimized for low memory and power usage, and MANYAK-1.3B for server-based tasks, allowing for scalable, high-performance language processing. The models achieve significant results across various tasks, such as machine translation, question-answering, and translate IndoMMLU. Particularly noteworthy is SLiM-34M's ability to achieve a high improvement in accuracy compared to other LLMs while using 2 times fewer pre-training tokens. This work challenges the prevailing assumption that large-scale computational resources are necessary to build effective language models, contributing to the development of resource-efficient models for the Malay language with the unique orchestration between SLiM-34M and MANYAK-1.3B.
DAAD: Dynamic Analysis and Adaptive Discriminator for Fake News Detection
Su, Xinqi, Cui, Yawen, Liu, Ajian, Lin, Xun, Wang, Yuhao, Liang, Haochen, Li, Wenhui, Yu, Zitong
In current web environment, fake news spreads rapidly across online social networks, posing serious threats to society. Existing multimodal fake news detection (MFND) methods can be classified into knowledge-based and semantic-based approaches. However, these methods are overly dependent on human expertise and feedback, lacking flexibility. To address this challenge, we propose a Dynamic Analysis and Adaptive Discriminator (DAAD) approach for fake news detection. For knowledge-based methods, we introduce the Monte Carlo Tree Search (MCTS) algorithm to leverage the self-reflective capabilities of large language models (LLMs) for prompt optimization, providing richer, domain-specific details and guidance to the LLMs, while enabling more flexible integration of LLM comment on news content. For semantic-based methods, we define four typical deceit patterns: emotional exaggeration, logical inconsistency, image manipulation, and semantic inconsistency, to reveal the mechanisms behind fake news creation. To detect these patterns, we carefully design four discriminators and expand them in depth and breadth, using the soft-routing mechanism to explore optimal detection models. Experimental results on three real-world datasets demonstrate the superiority of our approach. The code will be available at: https://github.com/SuXinqi/DAAD.
Are LLMs Good Annotators for Discourse-level Event Relation Extraction?
Wei, Kangda, Gautam, Aayush, Huang, Ruihong
Large Language Models (LLMs) have demonstrated proficiency in a wide array of natural language processing tasks. However, its effectiveness over discourse-level event relation extraction (ERE) tasks remains unexplored. In this paper, we assess the effectiveness of LLMs in addressing discourse-level ERE tasks characterized by lengthy documents and intricate relations encompassing coreference, temporal, causal, and subevent types. Evaluation is conducted using an commercial model, GPT-3.5, and an open-source model, LLaMA-2. Our study reveals a notable underperformance of LLMs compared to the baseline established through supervised learning. Although Supervised Fine-Tuning (SFT) can improve LLMs performance, it does not scale well compared to the smaller supervised baseline model. Our quantitative and qualitative analysis shows that LLMs have several weaknesses when applied for extracting event relations, including a tendency to fabricate event mentions, and failures to capture transitivity rules among relations, detect long distance relations, or comprehend contexts with dense event mentions.
Truth is Universal: Robust Detection of Lies in LLMs
Bรผrger, Lennart, Hamprecht, Fred A., Nadler, Boaz
Large Language Models (LLMs) have revolutionised natural language processing, exhibiting impressive human-like capabilities. In particular, LLMs are capable of "lying", knowingly outputting false statements. Hence, it is of interest and importance to develop methods to detect when LLMs lie. Indeed, several authors trained classifiers to detect LLM lies based on their internal model activations. However, other researchers showed that these classifiers may fail to generalise, for example to negated statements. In this work, we aim to develop a robust method to detect when an LLM is lying. To this end, we make the following key contributions: (i) We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated. Notably, this finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B and LLaMA3-8B. Our analysis explains the generalisation failures observed in previous studies and sets the stage for more robust lie detection; (ii) Building upon (i), we construct an accurate LLM lie detector. Empirically, our proposed classifier achieves state-of-the-art performance, distinguishing simple true and false statements with 94% accuracy and detecting more complex real-world lies with 95% accuracy.
PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers
Lee, Myeonghwa, An, Seonho, Kim, Min-Soo
In this paper, we conduct a study to utilize LLMs as a solution for decision making that requires complex data analysis. We define Decision QA as the task of answering the best decision, $d_{best}$, for a decision-making question $Q$, business rules $R$ and a database $D$. Since there is no benchmark that can examine Decision QA, we propose Decision QA benchmark, DQA. It has two scenarios, Locating and Building, constructed from two video games (Europa Universalis IV and Victoria 3) that have almost the same goal as Decision QA. To address Decision QA effectively, we also propose a new RAG technique called the iterative plan-then-retrieval augmented generation (PlanRAG). Our PlanRAG-based LM generates the plan for decision making as the first step, and the retriever generates the queries for data analysis as the second step. The proposed method outperforms the state-of-the-art iterative RAG method by 15.8% in the Locating scenario and by 7.4% in the Building scenario, respectively. We release our code and benchmark at https://github.com/myeon9h/PlanRAG.