Personal
Don't Let It Hallucinate: Premise Verification via Retrieval-Augmented Logical Reasoning
Qin, Yuehan, Li, Shawn, Nian, Yi, Yu, Xinyan Velocity, Zhao, Yue, Ma, Xuezhe
Large language models (LLMs) have shown substantial capacity for generating fluent, contextually appropriate responses. However, they can produce hallucinated outputs, especially when a user query includes one or more false premises, i.e., claims that contradict established facts. Such premises can mislead LLMs into offering fabricated or misleading details. Existing approaches include pretraining, fine-tuning, and inference-time techniques that often rely on access to logits or address hallucinations after they occur. These methods tend to be computationally expensive, require extensive training data, or lack proactive mechanisms to prevent hallucination before generation, limiting their efficiency in real-time applications. We propose a retrieval-based framework that identifies and addresses false premises before generation. Our method first transforms a user's query into a logical representation, then applies retrieval-augmented generation (RAG) to assess the validity of each premise using factual sources. Finally, we incorporate the verification results into the LLM's prompt to maintain factual consistency in the final output. Experiments show that this approach effectively reduces hallucinations, improves factual accuracy, and does not require access to model logits or large-scale fine-tuning.
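The three steps named in the abstract (logical representation, retrieval-based premise checking, prompt augmentation) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Premise` triple, the in-memory `FACTS` table standing in for a retrieval corpus, and the functions `verify` and `build_prompt` are all hypothetical names chosen for illustration, and the step that parses a query into premises is assumed to have already happened.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Premise:
    """One premise from a query as a (subject, relation, object) triple (assumed form)."""
    subject: str
    relation: str
    obj: str


# Hypothetical stand-in for a retrieval corpus: (subject, relation) -> object.
FACTS: Dict[Tuple[str, str], str] = {
    ("Eiffel Tower", "located_in"): "Paris",
}


def verify(premise: Premise) -> str:
    """Label a premise as supported, contradicted, or unknown against FACTS."""
    retrieved = FACTS.get((premise.subject, premise.relation))
    if retrieved is None:
        return "unknown"
    return "supported" if retrieved == premise.obj else "contradicted"


def build_prompt(query: str, premises: List[Premise]) -> str:
    """Prepend verification notes to the query before it is sent to an LLM."""
    notes = []
    for p in premises:
        if verify(p) == "contradicted":
            fact = FACTS[(p.subject, p.relation)]
            notes.append(
                f"False premise: '{p.subject} {p.relation} {p.obj}' "
                f"(retrieved: {fact})."
            )
    header = "\n".join(notes) if notes else "No false premises detected."
    return f"{header}\nQuestion: {query}"
```

For a query such as "When was the Eiffel Tower in Berlin built?", passing the premise `Premise("Eiffel Tower", "located_in", "Berlin")` to `build_prompt` yields a prompt that flags the contradiction before the question, so the model sees the correction prior to generation.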
Dynamic Evaluation Framework for Personalized and Trustworthy Agents: A Multi-Session Approach to Preference Adaptability
Shah, Chirag, Joho, Hideo, Kaur, Kirandeep, Dammu, Preetam Prabhu Srikar
Recent advancements in generative AI have significantly increased interest in personalized agents. With increased personalization, there is also a greater need for being able to trust decision-making and action taking capabilities of these agents. However, the evaluation methods for these agents remain outdated and inadequate, often failing to capture the dynamic and evolving nature of user interactions. In this conceptual article, we argue for a paradigm shift in evaluating personalized and adaptive agents. We propose a comprehensive novel framework that models user personas with unique attributes and preferences. In this framework, agents interact with these simulated users through structured interviews to gather their preferences and offer customized recommendations. These recommendations are then assessed dynamically using simulations driven by Large Language Models (LLMs), enabling an adaptive and iterative evaluation process. Our flexible framework is designed to support a variety of agents and applications, ensuring a comprehensive and versatile evaluation of recommendation strategies that focus on proactive, personalized, and trustworthy aspects.
The Hall of AI Fears and Hopes: Comparing the Views of AI Influencers and those of Members of the U.S. Public Through an Interactive Platform
Moreira, Gustavo, Bogucka, Edyta Paulina, Constantinides, Marios, Quercia, Daniele
AI development is shaped by academics and industry leaders (let us call them "influencers"), but it is unclear how their views align with those of the public. To address this gap, we developed an interactive platform that served as a data collection tool for exploring public views on AI, including their fears, hopes, and overall sense of hopefulness. We made the platform available to 330 participants representative of the U.S. population in terms of age, sex, ethnicity, and political leaning, and compared their views with those of 100 AI influencers identified by Time magazine. The public fears AI getting out of control, while influencers emphasize regulation, seemingly to deflect attention from their alleged focus on monetizing AI's potential. Interestingly, the views of AI influencers from underrepresented groups such as women and people of color often differ from the views of underrepresented groups in the public.
Towards Smarter Hiring: Are Zero-Shot and Few-Shot Pre-trained LLMs Ready for HR Spoken Interview Transcript Analysis?
Maity, Subhankar, Deroy, Aniket, Sarkar, Sudeshna
This research paper presents a comprehensive analysis of the performance of prominent pre-trained large language models (LLMs), including GPT-4 Turbo, GPT-3.5 Turbo, text-davinci-003, text-babbage-001, text-curie-001, text-ada-001, llama-2-7b-chat, llama-2-13b-chat, and llama-2-70b-chat, in comparison to expert human evaluators in providing scores, identifying errors, and offering feedback and improvement suggestions to candidates during mock HR (Human Resources) interviews. We introduce a dataset called HURIT (Human Resource Interview Transcripts), which comprises 3,890 HR interview transcripts sourced from real-world HR interview scenarios. Our findings reveal that pre-trained LLMs, particularly GPT-4 Turbo and GPT-3.5 Turbo, exhibit commendable performance and are capable of producing evaluations comparable to those of expert human evaluators. Although these LLMs demonstrate proficiency in providing scores comparable to human experts in terms of human evaluation metrics, they frequently fail to identify errors and offer specific actionable advice for candidate performance improvement in HR interviews. Our research suggests that the current state-of-the-art pre-trained LLMs are not yet suitable for fully automatic deployment in HR interview assessment. Instead, our findings advocate for a human-in-the-loop approach that incorporates manual checks for inconsistencies and provisions for improving feedback quality.
2025 Hugo Award game finalists include Zelda: Echoes of Wisdom and Dragon Age: The Veilguard
The Hugo Awards began honoring video games for the first time back in 2021. This week, the organization revealed the list of six finalists for the 2025 awards ceremony. Let's go over the nominations. Two AAA titles are up for the award. The gameplay involves summoning monsters and items to solve puzzles and do battle.
Humanoid robot stuns with perfect side-flip acrobatics
A robotics company has advanced from a backflipping robot to a side-flipping robot. Robots aren't just efficient machines anymore; they are now agile performers that can flip and jog. Take, for instance, Unitree, a Chinese robotics company that has been making headlines with its incredible G1 humanoid robot. You might have seen it dancing alongside humans or remember its predecessor, the H1, which stunned us with a backflip using electric motors. But now, the G1 has taken things to a whole new level.
Yuval Noah Harari: 'How Do We Share the Planet With This New Superintelligence?'
Israeli historian and philosopher Yuval Noah Harari's book Sapiens became an international bestseller by presenting a view of history driven by the fictions created by mankind. His later work Homo Deus then depicted a future for mankind brought about by the emergence of superintelligence. His latest book, Nexus: A Brief History of Information Networks From the Stone Age to AI, is a warning against the unparalleled threat of AI. A rising trend of techno-fascism driven by populism and artificial intelligence has been visible since the US presidential election in November. Nexus, which was published just a few months earlier, is a timely explainer of the potential consequences of AI on democracy and totalitarianism.
AI can be a powerful tool for scientists. But it can also fuel research misconduct
In February this year, Google announced it was launching "a new AI system for scientists". It said this system was a collaborative tool designed to help scientists "in creating novel hypotheses and research plans". It's too early to tell just how useful this particular tool will be to scientists. But what is clear is that artificial intelligence (AI) more generally is already transforming science. Last year, for example, computer scientists won the Nobel Prize for Chemistry for developing an AI model to predict the shape of every protein known to mankind.
Bridget Phillipson eyes AI's potential to free up teachers' time
AI tools will soon be in use in classrooms across England, but the education secretary, Bridget Phillipson, has one big question she wants answered: will they save time? Attending a Department for Education-sponsored hackathon in central London last week, Phillipson listened as developers explained how their tools could compile pupil reports, improve writing samples and even assess the quality of soldering done by trainee electrical engineers. After listening to one developer extol their AI writing analysis tool as "superhuman", able to aggregate all the writing a pupil had ever done, Phillipson asked bluntly: "Do you know how much time it will have saved?" "That will be our next step," the developer admitted, less confidently. In an interview with the Guardian, Phillipson said her interest in AI was less futuristic and more practical.
EventWeave: A Dynamic Framework for Capturing Core and Supporting Events in Dialogue Systems
Zhao, Zhengyi, Zhang, Shubo, Du, Yiming, Liang, Bin, Wang, Baojun, Li, Zhongyang, Li, Binyang, Wong, Kam-Fai
Existing large language models (LLMs) have shown remarkable progress in dialogue systems. However, many approaches still overlook the fundamental role of events throughout multi-turn interactions, leading to incomplete context tracking. Without tracking these events, dialogue systems often lose coherence and miss subtle shifts in user intent, causing disjointed responses. To bridge this gap, we present EventWeave, an event-centric framework that identifies and updates both core and supporting events as the conversation unfolds. Specifically, we organize these events into a dynamic event graph, which represents the interplay between core events that shape the primary idea and supporting events that provide critical context during the whole dialogue. By leveraging this dynamic graph, EventWeave helps models focus on the most relevant events when generating responses, thus avoiding repeated visits of the entire dialogue history. Experimental results on two benchmark datasets show that EventWeave improves response quality and event relevance without fine-tuning.
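As a rough illustration of the dynamic event graph idea, the sketch below keeps core events as keys and attaches supporting events to them, so that a response generator can pull only the events tied to the current core event instead of re-reading the whole dialogue. The class `EventGraph` and its methods are hypothetical names, not EventWeave's actual data structures, and event extraction from dialogue turns is assumed to happen elsewhere.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class EventGraph:
    """Toy event graph: each core event maps to its set of supporting events."""
    core: Dict[str, Set[str]] = field(default_factory=dict)

    def add_core(self, event: str) -> None:
        # Register a core event if it is not already present.
        self.core.setdefault(event, set())

    def add_supporting(self, core_event: str, event: str) -> None:
        # Attach a supporting event, creating the core event if needed.
        self.core.setdefault(core_event, set()).add(event)

    def context_for(self, core_event: str) -> List[str]:
        # Return the core event followed by its supporting events (sorted),
        # or an empty list if the core event is unknown.
        if core_event not in self.core:
            return []
        return [core_event] + sorted(self.core[core_event])


graph = EventGraph()
graph.add_core("book flight")
graph.add_supporting("book flight", "prefers window seat")
graph.add_supporting("book flight", "budget under $500")
context = graph.context_for("book flight")
```

Here `context` holds the core event plus its two supporting events, which is the compact slice of history a model would condition on in this simplified picture.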