ParaScopes: What do Language Models Activations Encode About Future Text?

Pochinkov, Nicky, Volkova, Yulia, Vasileva, Anna, Chereddy, Sai V R

arXiv.org Artificial Intelligence

Interpretability studies in language models often investigate forward-looking representations in activations. However, as language models become capable of ever longer time-horizon tasks, methods for understanding activations often remain limited to testing specific concepts or tokens. We develop a framework of Residual Stream Decoders as a method of probing model activations for paragraph-scale and document-scale plans. We test several methods and find that information equivalent to 5+ tokens of future context can be decoded in small models. These results lay the groundwork for better monitoring of language models and a better understanding of how they might encode longer-term planning information.
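The decoding idea in this abstract can be made concrete with a toy linear probe. This is not the paper's implementation: the sketch below invents synthetic "activation" vectors and "future-context embeddings" (all dimensions, names, and data are illustrative), fits a least-squares map from one to the other, and checks decoding quality on held-out examples.

```python
# Toy sketch of a "residual stream decoder": fit a linear map from
# hidden activations to embeddings of upcoming text, then check how
# well held-out activations are decoded. Synthetic data stands in for
# real model activations; dimensions are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_embed, n = 64, 32, 500

# Assume a ground-truth linear relationship plus noise: activations
# partially encode an embedding of the upcoming paragraph.
W_true = rng.normal(size=(d_model, d_embed))
acts = rng.normal(size=(n, d_model))                 # residual stream states
future = acts @ W_true + 0.1 * rng.normal(size=(n, d_embed))

# Fit the decoder on the first 400 examples by least squares.
W_hat, *_ = np.linalg.lstsq(acts[:400], future[:400], rcond=None)

# Evaluate with cosine similarity on the held-out examples.
pred = acts[400:] @ W_hat
cos = np.sum(pred * future[400:], axis=1) / (
    np.linalg.norm(pred, axis=1) * np.linalg.norm(future[400:], axis=1))
mean_cos = float(cos.mean())
```

In this synthetic setup the probe recovers the mapping almost exactly; with real activations, the interesting question (which the paper studies) is how much future-text information survives such a probe.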


PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

Wang, Yiming, Zhang, Pei, Tang, Jialong, Wei, Haoran, Yang, Baosong, Wang, Rui, Sun, Chenshu, Sun, Feitong, Zhang, Jiran, Wu, Junxuan, Cang, Qiqian, Zhang, Yichang, Huang, Fei, Lin, Junyang, Huang, Fei, Zhou, Jingren

arXiv.org Artificial Intelligence

In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation of advanced LLMs and find that even Qwen-3-235B-A22B-Thinking and Gemini-2.5-pro achieve benchmark scores of only 54.6 and 52.2, with about 40% accuracy at the highest difficulty level. From a language perspective, our benchmark reveals several key challenges for LLMs in multilingual reasoning: (1) reasoning performance varies widely across languages for current LLMs; (2) input-output language consistency is low in reasoning LLMs and may be correlated with performance; (3) thinking length differs significantly by language for current LLMs. Additionally, we demonstrate that controlling the output language in the instructions can affect reasoning performance, especially for some low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.


Large Language Model Prompt Datasets: An In-depth Analysis and Insights

Zhang, Yuanming, Lin, Yan, Khan, Arijit, Wan, Huaiyu

arXiv.org Artificial Intelligence

A prompt is a natural language instruction that defines a specific task for a large language model (LLM) and serves as the primary interface for human-LLM interaction. With the growing deployment of LLMs, diverse prompt datasets are emerging from platforms such as GitHub and social media. These datasets span a wide array of applications and content types, facilitating both broader LLM utilization and improved prompt engineering. In this work, we compile, for the first time, an extensive list of prompt datasets sourced from various channels, representing a spectrum of downstream tasks, languages, engineering techniques, attributes, and modalities. We select key representative datasets for systematic analysis, revealing commonalities and differences in prompt construction across categories and distinguishing prompts from other text corpora such as literature and web text. We further propose a prompt optimization approach that leverages syntactic embeddings of part-of-speech and dependency structures. By identifying a centroid representation of prompts and guiding LLMs to rewrite prompts toward this centroid, our method improves the meaningfulness of model outputs. We have made our datasets and code available.
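The centroid step described here can be sketched in miniature. The code below is not the authors' method: it replaces real syntactic embeddings with invented part-of-speech count vectors, averages them into a centroid, and picks the prompt nearest that centroid as a rewriting target; all prompts and feature values are made up for illustration.

```python
# Minimal sketch of the prompt-centroid idea: represent each prompt by
# a syntactic feature vector (toy POS counts here, not real parser
# output), average the vectors into a centroid, and find the prompt
# closest to the centroid as a candidate rewrite target.
import math

# Toy POS-count features: (nouns, verbs, adjectives) per prompt.
prompts = {
    "Summarize the article in three sentences.": (2, 1, 0),
    "Write code.": (1, 1, 0),
    "Explain the main idea of the passage clearly and concisely.": (3, 1, 2),
}

def centroid(vectors):
    """Component-wise mean of a list of equal-length tuples."""
    dims = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dims))

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

c = centroid(list(prompts.values()))
nearest = min(prompts, key=lambda p: dist(prompts[p], c))
```

In the paper's setting the features come from part-of-speech and dependency structures and an LLM rewrites prompts toward the centroid; this sketch only shows how a centroid singles out a structurally "typical" prompt.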


How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension

Li, Hao, Lv, Liuzhenghao, Cao, He, Liu, Zijing, Yan, Zhiyuan, Wang, Yu, Tian, Yonghong, Li, Yu, Yuan, Li

arXiv.org Artificial Intelligence

Large language models are increasingly used in scientific domains, especially for molecular understanding and analysis. However, existing models are affected by hallucination issues, resulting in errors in drug design and utilization. In this paper, we first analyze the sources of hallucination in LLMs for molecular comprehension tasks, specifically the knowledge-shortcut phenomenon observed in the PubChem dataset. To evaluate hallucination in molecular comprehension tasks with computational efficiency, we introduce Mol-Hallu, a novel free-form evaluation metric that quantifies the degree of hallucination based on the scientific entailment relationship between generated text and actual molecular properties. Using the Mol-Hallu metric, we reassess and analyze the extent of hallucination in various LLMs performing molecular comprehension tasks. Furthermore, we propose a Hallucination Reduction Post-processing stage (HRPP) to alleviate molecular hallucinations. Experiments show the effectiveness of HRPP on decoder-only and encoder-decoder molecular LLMs. Our findings provide critical insights into mitigating hallucination and improving the reliability of LLMs in scientific applications.
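The general shape of an entailment-based hallucination score can be illustrated with a toy stand-in. This is not the Mol-Hallu metric itself: Mol-Hallu uses scientific entailment between generated text and actual molecular properties, while the sketch below substitutes simple keyword overlap, and all claims and properties are invented.

```python
# Toy hallucination score in the spirit of entailment-based metrics:
# count the fraction of "property claims" in a generated answer that
# are unsupported by the reference properties. Keyword overlap is a
# crude stand-in for scientific entailment.
def hallucination_score(generated_claims, reference_props):
    """Return the fraction of generated claims with no support
    in the reference properties (0.0 = none hallucinated)."""
    generated = set(generated_claims)
    if not generated:
        return 0.0
    unsupported = generated - set(reference_props)
    return len(unsupported) / len(generated)

score = hallucination_score(
    ["soluble", "toxic", "aromatic"],       # claims in the model output
    ["soluble", "aromatic", "flammable"],   # ground-truth properties
)
# One of the three claims ("toxic") is unsupported by the reference.
```

A real metric must also handle paraphrase and partial entailment, which is exactly where keyword overlap breaks down and an entailment model earns its keep.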


Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents

Poupart, Yoann

arXiv.org Artificial Intelligence

AI has brought chess systems to a superhuman level, yet these systems rely heavily on black-box algorithms. This is unsustainable for ensuring transparency to the end user, particularly when such systems are responsible for sensitive decision-making. Recent interpretability work has shown that the inner representations of Deep Neural Networks (DNNs) are fathomable and contain human-understandable concepts. Yet these methods are seldom contextualised and are often based on a single hidden state, which makes them unable to interpret multi-step reasoning, e.g. planning. To address this, we propose contrastive sparse autoencoders (CSAE), a novel framework for studying pairs of game trajectories. Using CSAE, we are able to extract and interpret concepts that are meaningful to the chess agent's plans. We primarily focus on a qualitative analysis of the CSAE features before proposing an automated feature taxonomy. Furthermore, to evaluate the quality of our trained CSAE, we devise sanity checks to rule out spurious correlations in our results.


Top GPUs For Deep Learning and Machine Learning in 2022

#artificialintelligence

As we enter the age of AI, demand for GPUs is rising exponentially. GPUs process computations through parallel computing, and with their very large numbers of ALUs, or processing units, they are well suited to the heavy computations of AI. With the rise of Deep Learning over the past decade, most Deep Learning frameworks, including the vastly popular TensorFlow, PyTorch, and Theano, support GPU-accelerated computation. A vast number of GPUs are currently available, differing in features such as the number of processing units, memory capacity, and clock frequency.


6 Papers Every Modern Data Scientist Must Read

#artificialintelligence

Data Scientist, Machine Learning Expert, Algorithm Engineer, Deep Learning Researcher -- whatever your title might be, if using advanced concepts of Machine Learning is part of your career, then keeping up to date with the latest innovations is also part of your everyday tasks. But in order to stay on top of all the latest ingenuities and truly understand how they work, we must also be familiar with the building blocks and foundations they rely on. The field of Deep Learning is moving fast, breaking and setting new records in every possible metric. And as it evolves, it creates new fundamental concepts, enabling architectures never seen before. While I tend to assume all modern ML practitioners are familiar with the fundamentals, such as CNNs, RNNs, LSTMs, and GANs, some of the newer ones are occasionally missed or left out.


Deep Netts v2.0 Has Been Released - Deep Netts Blog

#artificialintelligence

Deep Netts 2.0.0 is out! With the 2.0 release, Deep Netts has reached an important milestone after testing through real-world use cases and pilot projects. Deep Netts 2.0 provides ease of use with competitive performance and simplified integration. Deep Netts is now free for development, and we also offer free low-volume production licenses. All examples can be used as starter projects for the corresponding problems.


OpenAI GPT-3 Waiting List Dropped as GPT-3 Is Fully Released for Developer and Enterprise Use

#artificialintelligence

When OpenAI first debuted its powerful GPT-3 natural language model in June of 2020, it debuted in a limited beta capacity and featured a waiting list where developers could sign up to use its infrastructure and capabilities. Now, the waiting list has been dropped and GPT-3's capabilities are immediately available to developers and enterprises to work on their most challenging language problems, according to a Nov. 18 (Thursday) announcement by OpenAI, an independent AI research and deployment company. But there are some caveats – the general release adds conditions to prevent GPT-3 from being used to harm people, as well as conditions that only allow its use in certain nations around the world. That means that developers in some nations, including Cuba, Iran and Russia, cannot currently access it. "OpenAI is committed to the safe deployment of AI," the organization said in a statement.


AI Beyond the Bottom Line: Artificial Intelligence for Global Impact Report is Released

#artificialintelligence

A report on the thoughtful development and use of AI to solve some of the world's most challenging problems, by Roger Spitz (Techistential) and Charles Warnock. In popular culture, Artificial Intelligence (AI) is often portrayed as a dark force ushering in an apocalyptic future in which humans are pitted against menacing machines. Today's headlines are full of AI-powered drones, backflipping robot dogs, and language models that can write passable poetry and press releases. Still, others envision beneficial AI applications that help us tackle the world's most pressing challenges, including poverty and hunger, health, education, and climate change. For those concerned about the impact of AI beyond the bottom line, Roger Spitz and Charles Warnock have assembled a resource that provides a balanced context to the challenges and opportunities of leveraging AI for social good.