Goto

Collaborating Authors

 ccp



MUCH: A Multilingual Claim Hallucination Benchmark

Dentan, Jérémie, Canesse, Alexi, Buscaldi, Davide, Shabou, Aymen, Vanier, Sonia

arXiv.org Artificial Intelligence

Claim-level Uncertainty Quantification (UQ) is a promising approach to mitigate the lack of reliability in Large Language Models (LLMs). We introduce MUCH, the first claim-level UQ benchmark designed for fair and reproducible evaluation of future methods under realistic conditions. It includes 4,873 samples across four European languages (English, French, Spanish, and German) and four instruction-tuned open-weight LLMs. Unlike prior claim-level benchmarks, we release 24 generation logits per token, facilitating the development of future white-box methods without re-generating data. Moreover, in contrast to previous benchmarks that rely on manual or LLM-based segmentation, we propose a new deterministic algorithm capable of segmenting claims using as little as 0.2% of the LLM generation time. This makes our segmentation approach suitable for real-time monitoring of LLM outputs, ensuring that MUCH evaluates UQ methods under realistic deployment constraints. Finally, our evaluations show that current methods still have substantial room for improvement in both performance and efficiency.


Efficient Convex Completion of Coupled Tensors using Coupled Nuclear Norms

Kishan Wimalawarne, Hiroshi Mamitsuka

Neural Information Processing Systems

Coupled norms have emerged as a convex method to solve coupled tensor completion. A limitation with coupled norms is that they only induce low-rankness using the multilinear rank of coupled tensors. In this paper, we introduce a new set of coupled norms known as coupled nuclear norms by constraining the CP rank of coupled tensors. We propose new coupled completion models using the coupled nuclear norms as regularizers, which can be optimized using computationally efficient optimization methods. We derive excess risk bounds for proposed coupled completion models and show that proposed norms lead to better performance. Through simulation and real-data experiments, we demonstrate that proposed norms achieve better performance for coupled completion compared to existing coupled norms.


Few-Shot and Training-Free Review Generation via Conversational Prompting

Kusano, Genki

arXiv.org Artificial Intelligence

Personalized review generation helps businesses understand user preferences, yet most existing approaches assume extensive review histories of the target user or require additional model training. Real-world applications often face few-shot and training-free situations, where only a few user reviews are available and fine-tuning is infeasible. It is well known that large language models (LLMs) can address such low-resource settings, but their effectiveness depends on prompt engineering. In this paper, we propose Conversational Prompting, a lightweight method that reformulates user reviews as multi-turn conversations. Its simple variant, Simple Conversational Prompting (SCP), relies solely on the user's own reviews, while the contrastive variant, Contrastive Conversational Prompting (CCP), inserts reviews from other users or LLMs as incorrect replies and then asks the model to correct them, encouraging the model to produce text in the user's style. Experiments on eight product domains and five LLMs showed that the conventional non-conversational prompt often produced reviews similar to those written by random users, based on text-based metrics such as ROUGE-L and BERTScore, and application-oriented tasks like user identity matching and sentiment analysis. In contrast, both SCP and CCP produced reviews much closer to those of the target user, even when each user had only two reviews. CCP brings further improvements when high-quality negative examples are available, whereas SCP remains competitive when such data cannot be collected. These results suggest that conversational prompting offers a practical solution for review generation under few-shot and training-free constraints.


Calibrating LLM Confidence by Probing Perturbed Representation Stability

Khanmohammadi, Reza, Miahi, Erfan, Mardikoraem, Mehrsa, Kaur, Simerjot, Brugere, Ivan, Smiley, Charese H., Thind, Kundan, Ghassemi, Mohammad M.

arXiv.org Artificial Intelligence

Miscalibration in Large Language Models (LLMs) undermines their reliability, highlighting the need for accurate confidence estimation. We introduce CCPS (Calibrating LLM Confidence by Probing Perturbed Representation Stability), a novel method analyzing internal representational stability in LLMs. CCPS applies targeted adversarial perturbations to final hidden states, extracts features reflecting the model's response to these perturbations, and uses a lightweight classifier to predict answer correctness. CCPS was evaluated on LLMs from 8B to 32B parameters (covering Llama, Qwen, and Mistral architectures) using MMLU and MMLU-Pro benchmarks in both multiple-choice and open-ended formats. Our results show that CCPS significantly outperforms current approaches. Across four LLMs and three MMLU variants, CCPS reduces Expected Calibration Error by approximately 55% and Brier score by 21%, while increasing accuracy by 5 percentage points, Area Under the Precision-Recall Curve by 4 percentage points, and Area Under the Receiver Operating Characteristic Curve by 6 percentage points, all relative to the strongest prior method. CCPS delivers an efficient, broadly applicable, and more accurate solution for estimating LLM confidence, thereby improving their trustworthiness.


Trump's AI plan is a bulwark against the rising threat from China

FOX News

In July, some of the brightest minds in American technology descended on Washington to celebrate a major milestone: the launch of President Donald Trump's bold initiative to ensure the United States remains the world's unrivaled leader in artificial intelligence (AI). Let me be blunt: the AI arms race is no longer theoretical. And we cannot afford to come in second place. In business, if you don't constantly adapt and innovate, you lose. If we fail to lead in AI, we risk surrendering our economic and national security edge to the Chinese Communist Party (CCP) -- a regime that seeks to challenge American technological supremacy and reshape the global order in its authoritarian image.


Sentinel: Scheduling Live Streams with Proactive Anomaly Detection in Crowdsourced Cloud-Edge Platforms

Li, Yuting, Huang, Shaoyuan, Zhang, Tengwen, Zhang, Cheng, Wang, Xiaofei, Leung, Victor C. M.

arXiv.org Artificial Intelligence

With the rapid growth of live streaming services, Crowdsourced Cloud-edge service Platforms (CCPs) are playing an increasingly important role in meeting the increasing demand. Although stream scheduling plays a critical role in optimizing CCPs' revenue, most optimization strategies struggle to achieve practical results due to various anomalies in unstable CCPs. Additionally, the substantial scale of CCPs magnifies the difficulties of anomaly detection in time-sensitive scheduling. To tackle these challenges, this paper proposes Sentinel, a proactive anomaly detection-based scheduling framework. Sentinel models the scheduling process as a two-stage Pre-Post-Scheduling paradigm: in the pre-scheduling stage, Sentinel conducts anomaly detection and constructs a strategy pool; in the post-scheduling stage, upon request arrival, it triggers an appropriate scheduling based on a pre-generated strategy to implement the scheduling process. Extensive experiments on realistic datasets show that Sentinel significantly reduces anomaly frequency by 70%, improves revenue by 74%, and doubles the scheduling speed.


Was this the week DeepSeek started the slow unwinding of the AI bet?

The Guardian

At 2.16pm California time last Sunday, the US billionaire tech investor Marc Andreessen called it. "DeepSeek R1 is AI's Sputnik moment," he posted on X. A Chinese startup, operating since 2023 and helmed by a millennial mathematician, had unveiled a new chatbot that seemed to equal the performance of America's leading models at a fraction of the cost. Never mind that its answers on everything from the status of Taiwan to the 1989 Tiananmen Square massacre were curbed by Chinese Communist party (CCP) censors. To Andreessen, a veteran of decades of technology booms and busts, it was like the Soviet Union getting the first satellite into orbit in 1957 and shocking America. The next day, shares in several of the world's biggest companies plunged – including the biggest fall in US market history for microchip maker Nvidia, which lost nearly 600bn.


Hyperspectral Imaging-Based Grain Quality Assessment With Limited Labelled Data

Karmakar, Priyabrata, Murshed, Manzur, Teng, Shyh Wei

arXiv.org Artificial Intelligence

Recently hyperspectral imaging (HSI)-based grain quality assessment has gained research attention. However, unlike other imaging modalities, HSI data lacks sufficient labelled samples required to effectively train deep convolutional neural network (DCNN)-based classifiers. In this paper, we present a novel approach to grain quality assessment using HSI combined with few-shot learning (FSL) techniques. Traditional methods for grain quality evaluation, while reliable, are invasive, time-consuming, and costly. HSI offers a non-invasive, real-time alternative by capturing both spatial and spectral information. However, a significant challenge in applying DCNNs for HSI-based grain classification is the need for large labelled databases, which are often difficult to obtain. To address this, we explore the use of FSL, which enables models to perform well with limited labelled data, making it a practical solution for real-world applications where rapid deployment is required. We also explored the application of FSL for the classification of hyperspectral images of bulk grains to enable rapid quality assessment at various receival points in the grain supply chain. We evaluated the performance of few-shot classifiers in two scenarios: first, classification of grain types seen during training, and second, generalisation to unseen grain types, a crucial feature for real-world applications. In the first scenario, we introduce a novel approach using pre-computed collective class prototypes (CCPs) to enhance inference efficiency and robustness. In the second scenario, we assess the model's ability to classify novel grain types using limited support examples. Our experimental results show that despite using very limited labelled data for training, our FSL classifiers accuracy is comparable to that of a fully trained classifier trained using a significantly larger labelled database.


Former House China hawk warns Americans about the dangers of the CCP's growing technological dominance

FOX News

The former chairman of the House Select Committee on the Chinese Communist Party warned about a fast-moving software and technology race between the United States and China, arguing the weaponization of supply chains could force a showdown between the free world and its totalitarian rivals. Former Rep. Mike Gallagher, R-Wis., told Fox News chief political anchor Bret Baier about a Wall Street Journal (WSJ) op-ed he wrote Sunday, outlining his concerns about China's growing technological dominance. "On the modern battlefield, we need to not only know our adversary but know ourselves and map our supply chain in great detail," he said Monday on "Special Report." Gallagher, the head of defense for Palantir Technologies, a Denver-based software company, highlighted how China could use its manufactured port cranes across the world to disrupt international commerce if the United States were to get into a conflict with China over Taiwan. "The Biden administration recently warned that Chinese-made port cranes could be'controlled... from remote locations.' European companies found that Chinese groups may have gained access to the systems that control cargo ships. Billions of endpoints connect to the internet, including sensors and devices that physically interact with critical infrastructure. Anyone with control over a portion of the technology stack such as semiconductors, cellular modules, or hardware devices, can use it to snoop, incapacitate or kill," he wrote in the WSJ.