Media
India's IT sector faces seismic shift as AI displaces call-center workers
India's IT sector faces seismic shift as AI displaces call-center workers Developers work inside the office of AI startup LimeChat in Bengaluru on Aug. 19. BENGALURU - At a startup office in this Indian city, developers are fine-tuning artificial-intelligence chatbots that talk and message like humans. The company, LimeChat, has an audacious goal: to make customer-service jobs almost obsolete. It says its generative AI agents enable clients to slash by 80% the number of workers needed to handle 10,000 monthly queries. Once you hire a LimeChat agent, you never have to hire again, said Nikhil Gupta, the company's 28-year-old co-founder.
NATO and EU scramble to boost drone defenses to counter Russia
BRUSSELS - NATO and the EU on Wednesday sought ways to boost anti-drone defenses, as Europe scrambles to counter the threat from Russia after a series of air incursions. High-profile incidents in Poland and Estonia have set off a flurry of activity from European officials to plug gaps in the continent's defenses. NATO has launched a new mission and beefed up forces on its eastern border, but it is playing catch-up as it tries to tap Ukraine's experience and get to grips with the drone threat from Moscow. In a time of both misinformation and too much information, quality journalism is more crucial than ever. By subscribing, you can help us get the story right.
Evaluating Style-Personalized Text Generation: Challenges and Directions
Jangra, Anubhav, Sarrafzadeh, Bahareh, Cucerzan, Silviu, de Wynter, Adrian, Jauhar, Sujay Kumar
With the surge of large language models (LLMs) and their ability to produce customized output, style-personalized text generation--"write like me"--has become a rapidly growing area of interest. However, style personalization is highly specific, relative to every user, and depends strongly on the pragmatic context, which makes it uniquely challenging. Although prior research has introduced benchmarks and metrics for this area, they tend to be non-standardized and have known limitations (e.g., poor correlation with human subjects). LLMs have been found to not capture author-specific style well, it follows that the metrics themselves must be scrutinized carefully. In this work we critically examine the effectiveness of the most common metrics used in the field, such as BLEU, embeddings, and LLMs-as-judges. We evaluate these metrics using our proposed style discrimination benchmark, which spans eight diverse writing tasks across three evaluation settings: domain discrimination, authorship attribution, and LLM-generated personalized vs non-personalized discrimination. We find strong evidence that employing ensembles of diverse evaluation metrics consistently outperforms single-evaluator methods, and conclude by providing guidance on how to reliably assess style-personalized text generation.
ReasoningShield: Safety Detection over Reasoning Traces of Large Reasoning Models
Li, Changyi, Wang, Jiayi, Pan, Xudong, Hong, Geng, Yang, Min
Large Reasoning Models (LRMs) leverage transparent reasoning traces, known as Chain-of-Thoughts (CoTs), to break down complex problems into intermediate steps and derive final answers. However, these reasoning traces introduce unique safety challenges: harmful content can be embedded in intermediate steps even when final answers appear benign. Existing moderation tools, designed to handle generated answers, struggle to effectively detect hidden risks within CoTs. To address these challenges, we introduce ReasoningShield, a lightweight yet robust framework for moderating CoTs in LRMs. Our key contributions include: (1) formalizing the task of CoT moderation with a multi-level taxonomy of 10 risk categories across 3 safety levels, (2) creating the first CoT moderation benchmark which contains 9.2K pairs of queries and reasoning traces, including a 7K-sample training set annotated via a human-AI framework and a rigorously curated 2.2K human-annotated test set, and (3) developing a two-stage training strategy that combines stepwise risk analysis and contrastive learning to enhance robustness. Experiments show that ReasoningShield achieves state-of-the-art performance, outperforming task-specific tools like LlamaGuard-4 by 35.6% and general-purpose commercial models like GPT-4o by 15.8% on benchmarks, while also generalizing effectively across diverse reasoning paradigms, tasks, and unseen scenarios. All resources are released at https://github.com/CosmosYi/ReasoningShield.
Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning
Zhang, Fanrui, Li, Dian, Zhang, Qiang, Chen, Jun, Liu, Gang, Lin, Junxiong, Yan, Jiahong, Liu, Jiawei, Zha, Zheng-Jun
The rapid spread of multimodal misinformation on social media has raised growing concerns, while research on video misinformation detection remains limited due to the lack of large-scale, diverse datasets. Existing methods often overfit to rigid templates and lack deep reasoning over deceptive content. To address these challenges, we introduce FakeVV, a large-scale benchmark comprising over 100,000 video-text pairs with fine-grained, interpretable annotations. In addition, we further propose Fact-R1, a novel framework that integrates deep reasoning with collaborative rule-based reinforcement learning. Fact-R1 is trained through a three-stage process: (1) misinformation long-Chain-of-Thought (CoT) instruction tuning, (2) preference alignment via Direct Preference Optimization (DPO), and (3) Group Relative Policy Optimization (GRPO) using a novel verifiable reward function. This enables Fact-R1 to exhibit emergent reasoning behaviors comparable to those observed in advanced text-based reinforcement learning systems, but in the more complex multimodal misinformation setting. Our work establishes a new paradigm for misinformation detection, bridging large-scale video understanding, reasoning-guided alignment, and interpretable verification.
Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
Wang, Jingyao, Qiang, Wenwen, Song, Zeen, Zheng, Changwen, Xiong, Hui
Large language models (LLMs) excel at complex tasks thanks to advances in their reasoning abilities. However, existing methods overlook the trade-off between reasoning effectiveness and efficiency, often encouraging unnecessarily long reasoning chains and wasting tokens. To address this, we propose Learning to Think (L2T), an information-theoretic reinforcement fine-tuning framework for LLMs to make the models achieve optimal reasoning with fewer tokens. Specifically, L2T treats each query-response interaction as a hierarchical session of multiple episodes and proposes a universal dense process reward, i.e., quantifies the episode-wise information gain in parameters, requiring no extra annotations or task-specific evaluators. We propose a method to quickly estimate this reward based on PAC-Bayes bounds and the Fisher information matrix. Theoretical analyses show that it significantly reduces computational complexity with high estimation accuracy. By immediately rewarding each episode's contribution and penalizing excessive updates, L2T optimizes the model via reinforcement learning to maximize the use of each episode and achieve effective updates. Empirical results on various reasoning benchmarks and base models demonstrate the advantage of L2T across different tasks, boosting both reasoning effectiveness and efficiency.
TIFIN India at SemEval-2025: Harnessing Translation to Overcome Multilingual IR Challenges in Fact-Checked Claim Retrieval
Devadiga, Prasanna, Suneesh, Arya, Rajpoot, Pawan Kumar, Hazarika, Bharatdeep, Baliga, Aditya U
We address the challenge of retrieving previously fact-checked claims in monolingual and crosslingual settings - a critical task given the global prevalence of disinformation. Our approach follows a two-stage strategy: a reliable baseline retrieval system using a fine-tuned embedding model and an LLM-based reranker. Our key contribution is demonstrating how LLM-based translation can overcome the hurdles of multilingual information retrieval. Additionally, we focus on ensuring that the bulk of the pipeline can be replicated on a consumer GPU. Our final integrated system achieved a success@10 score of 0.938 and 0.81025 on the monolingual and crosslingual test sets, respectively.
BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning
Gu, Jia-Chen, Zhang, Junyi, Wu, Di, Li, Yuankai, Chang, Kai-Wei, Peng, Nanyun
As retrieval-augmented generation (RAG) tackles complex tasks, increasingly expanded contexts offer richer information, but at the cost of higher latency and increased cognitive load on the model. To mitigate this bottleneck, especially for intricate multi-hop questions, we introduce BRIEF-Pro. It is a universal, lightweight compressor that distills relevant evidence for a given query from retrieved documents into a concise summary for seamless integration into in-context RAG. Using seed data consisting of relatively short contexts (fewer than 1k words), BRIEF-Pro is trained to perform abstractive compression of extended contexts exceeding 10k words across a wide range of scenarios. Furthermore, BRIEF-Pro offers flexible user control over summary length by allowing users to specify the desired number of sentences. Experiments on four open-domain multi-hop question-answering datasets show that BRIEF-Pro generates more concise and relevant summaries, enhancing performance across small, large, and proprietary language models. With the 70B reader model, 32x compression by BRIEF-Pro improves QA performance by 4.67% on average over LongLLMLingua's 9x, while requiring only 23% of its computational overhead.
Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Vykopal, Ivan, Pikuliak, Matรบลก, Ostermann, Simon, ล imko, Mariรกn
Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also raises the risk of amplifying misinformation from low-credibility sources. In this paper, we introduce a novel methodology for evaluating assistants' web search behavior, focusing on source credibility and the groundedness of responses with respect to cited sources. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat. Our findings reveal differences between the assistants, with Perplexity achieving the highest source credibility, whereas GPT-4o exhibits elevated citation of non-credibility sources on sensitive topics. This work provides the first systematic comparison of commonly used chat assistants for fact-checking behavior, offering a foundation for evaluating AI systems in high-stakes information environments.
When Embedding Models Meet: Procrustes Bounds and Applications
Maystre, Lucas, Gonzalez, Alvaro Ortega, Park, Charles, Dolga, Rares, Berariu, Tudor, Zhao, Yu, Ciosek, Kamil
Embedding models trained separately on similar data often produce representations that encode stable information but are not directly interchangeable. This lack of interoperability raises challenges in several practical applications, such as model retraining, partial model upgrades, and multimodal search. Driven by these challenges, we study when two sets of embeddings can be aligned by an orthogonal transformation. We show that if pairwise dot products are approximately preserved, then there exists an isometry that closely aligns the two sets, and we provide a tight bound on the alignment error. This insight yields a simple alignment recipe, Procrustes post-processing, that makes two embedding models interoperable while preserving the geometry of each embedding space. Empirically, we demonstrate its effectiveness in three applications: maintaining compatibility across retrainings, combining different models for text retrieval, and improving mixed-modality search, where it achieves state-of-the-art performance.