AITopics | Materials

Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment

Vemula, Saketh Reddy, Dandapat, Sandipan, Sharma, Dipti Misra, Krishnamurthy, Parameswari

arXiv.org Artificial Intelligence, Nov-11-2025

The relationship between tokenizer algorithm (e.g., Byte-Pair Encoding (BPE), Unigram), morphological alignment, tokenization quality (e.g., compression efficiency), and downstream performance remains largely unclear, particularly for languages with complex morphology. In this paper, we conduct a comprehensive evaluation of tokenizers using small-sized BERT models -- from pre-training through fine-tuning -- for Telugu (agglutinative), along with preliminary evaluation in Hindi (primarily fusional with some agglutination) and English (fusional). To evaluate morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms. Our experiments reveal two key findings for Telugu. First, the choice of tokenizer algorithm is the most significant factor influencing performance, with Unigram-based tokenizers consistently outperforming BPE across most settings. Second, while better morphological alignment shows a moderate, positive correlation with performance on text classification and structure prediction tasks, its impact is secondary to the tokenizer algorithm. Notably, hybrid approaches that use morphological information for pre-segmentation significantly boost the performance of BPE, though not Unigram. Our results further showcase the need for comprehensive intrinsic evaluation metrics for tokenizers that could explain downstream performance trends consistently.
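
One common way to quantify morphological alignment is to compare the cut points a tokenizer produces with the cut points of a gold morpheme segmentation. The sketch below is only an illustration of that idea: the abstract does not say which alignment metric the authors use, and the function names `boundaries` and `boundary_f1` and the English example word are assumptions made here.

```python
def boundaries(segments):
    """Return the internal cut positions (character offsets) of a segmentation."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:  # the end of the last segment is not a cut
        pos += len(seg)
        cuts.add(pos)
    return cuts


def boundary_f1(predicted, gold):
    """Boundary precision, recall and F1 of a tokenizer output against gold morphemes."""
    if "".join(predicted) != "".join(gold):
        raise ValueError("both segmentations must spell the same word")
    p, g = boundaries(predicted), boundaries(gold)
    hits = len(p & g)
    # An empty boundary set makes no wrong cuts (precision) or misses none (recall).
    precision = hits / len(p) if p else 1.0
    recall = hits / len(g) if g else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: gold morphemes vs. a tokenizer that misplaces one cut.
gold = ["un", "break", "able"]
pred = ["unb", "reak", "able"]
print(boundary_f1(pred, gold))
```

Averaging such scores over a gold dataset gives a single alignment number per tokenizer, which can then be correlated with downstream task scores.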

computational linguistic, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2508.08424

Country:

Europe (1.00)
Asia (1.00)
North America > United States (0.67)

Genre: Research Report > New Finding (1.00)

Industry: Materials > Metals & Mining > Gold (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Compressing Chemistry Reveals Functional Groups

Sharma, Ruben, King, Ross D.

arXiv.org Artificial Intelligence, Nov-11-2025

We introduce the first formal large-scale assessment of the utility of traditional chemical functional groups as used in chemical explanations. Our assessment employs a fundamental principle from computational learning theory: a good explanation of data should also compress the data. We introduce an unsupervised learning algorithm based on the Minimum Message Length (MML) principle that searches for substructures that compress around three million biologically relevant molecules. We demonstrate that the discovered substructures contain most human-curated functional groups as well as novel larger patterns with more specific functions. We also run our algorithm on 24 specific bioactivity prediction datasets to discover dataset-specific functional groups. Fingerprints constructed from dataset-specific functional groups are shown to significantly outperform other fingerprint representations, including the MACCS and Morgan fingerprint, when training ridge regression models on bioactivity regression tasks.
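
The principle that a good explanation should compress the data can be illustrated with a crude two-part code: first pay to state a candidate substructure, then pay to state the data rewritten with that substructure as a single symbol. The sketch below is a toy under strong assumptions and is not the authors' algorithm: it uses SMILES-like strings rather than molecular graphs, a uniform code over the alphabet, and hypothetical function names.

```python
import math


def baseline_bits(strings):
    """Bits needed to send every character under a uniform code over the alphabet."""
    alphabet = set("".join(strings))
    return sum(len(s) for s in strings) * math.log2(len(alphabet))


def bits_with_motif(strings, motif, placeholder="\u0000"):
    """Two-part message length: state the motif, then the data rewritten with it."""
    k = len(set("".join(strings)))
    model_bits = len(motif) * math.log2(k)  # part 1: spell out the motif once
    rewritten = [s.replace(motif, placeholder) for s in strings]
    # Part 2: the rewritten data uses one extra symbol, so each symbol costs log2(k + 1).
    data_bits = sum(len(s) for s in rewritten) * math.log2(k + 1)
    return model_bits + data_bits


# Hypothetical toy data set of SMILES-like strings.
data = ["CC(=O)O", "CCC(=O)O", "OC(=O)C", "CCO"]
print(baseline_bits(data))
print(bits_with_motif(data, "C(=O)O"))  # frequent motif: shorter total message
print(bits_with_motif(data, "CCO"))     # rare motif: longer total message
```

A motif is worth keeping only when the total message gets shorter, which is what penalizes patterns that are long to state but rarely used.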

artificial intelligence, machine learning, substructure, (19 more...)

arXiv.org Artificial Intelligence

2511.05728

Country: Europe > United Kingdom > England (0.28)

Genre: Research Report > Experimental Study (0.68)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Materials > Chemicals (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory > Minimum Complexity Machines (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

Bhalla, Usha, Oesterling, Alex, Verdun, Claudio Mayrink, Lakkaraju, Himabindu, Calmon, Flavio P.

arXiv.org Artificial Intelligence, Nov-11-2025

Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising route to discover human-interpretable features, they suffer from a variety of problems, including a systematic failure to capture the rich conceptual information that drives linguistic understanding. Instead, they exhibit a bias towards shallow, token-specific, or noisy features, such as "the phrase 'The' at the start of sentences". In this work, we propose that this is due to a fundamental issue with how dictionary learning methods for LLMs are trained. Language itself has a rich, well-studied structure spanning syntax, semantics, and pragmatics; however, current unsupervised methods largely ignore this linguistic knowledge, leading to poor feature discovery that favors superficial patterns over meaningful concepts. We focus on a simple but important aspect of language: semantic content has long-range dependencies and tends to be smooth over a sequence, whereas syntactic information is much more local. Building on this insight, we introduce Temporal Sparse Autoencoders (T-SAEs), which incorporate a novel contrastive loss encouraging consistent activations of high-level features over adjacent tokens. This simple yet powerful modification enables SAEs to disentangle semantic from syntactic features in a self-supervised manner. Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacrificing reconstruction quality. Strikingly, they exhibit clear semantic structure despite being trained without explicit semantic signal, offering a new pathway for unsupervised interpretability in language models.
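
The idea of rewarding consistent activations over adjacent tokens can be sketched with an InfoNCE-style loss in which each token's positive example is the next token in the same sequence. This is a minimal illustration, not the T-SAE objective: the abstract does not give the loss, so the cosine similarity, the temperature, the choice of negatives, and the function names are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def temporal_contrastive_loss(seq, negatives, temperature=0.1):
    """Mean InfoNCE loss where the positive for token t is token t + 1.

    seq: list of feature-activation vectors for consecutive tokens (length >= 2).
    negatives: activation vectors assumed to come from unrelated contexts.
    """
    losses = []
    for t in range(len(seq) - 1):
        pos = math.exp(cosine(seq[t], seq[t + 1]) / temperature)
        neg = sum(math.exp(cosine(seq[t], n) / temperature) for n in negatives)
        losses.append(-math.log(pos / (pos + neg)))
    return sum(losses) / len(losses)


# Hypothetical activations: one sequence drifts slowly, the other flips every token.
smooth = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
jumpy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
print(temporal_contrastive_loss(smooth, negatives))
print(temporal_contrastive_loss(jumpy, negatives))
```

In this toy the slowly drifting sequence receives the lower loss, which is the behavior a smoothness-encouraging term is meant to reward.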

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.05541

Country:

Europe (0.93)
North America > United States (0.93)

Genre: Research Report > Experimental Study (0.46)

Industry:

Materials (0.68)
Government > Immigration & Customs (0.68)
Health & Medicine > Pharmaceuticals & Biotechnology (0.67)
Health & Medicine > Therapeutic Area > Nephrology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

How the US overtook China as Africa's biggest foreign investor

BBC News, Nov-10-2025, 00:17:37 GMT

You probably don't give much thought to the device that you're reading this article on, as long as it looks good and keeps working. But the elements that power and run it are the subject of an escalating struggle between the world's two biggest economies - the US and China - with African countries in the eye of the storm. The African continent is rich in critical minerals and metals - like lithium, rare earths, cobalt and tungsten - which are vital to making and running our personal tech. Such materials are also essential for everything from electric vehicles to AI data centres and weapon systems. China has long been the biggest player in the global market for critical minerals and metals.

africa, china, us overtook china, (14 more...)

BBC News

Country:

Africa > Rwanda (0.15)
North America > Central America (0.15)
Oceania > Australia (0.06)
(21 more...)

Industry:

Materials > Metals & Mining (1.00)
Government > Regional Government (1.00)
Banking & Finance (1.00)
Transportation > Ground > Road (0.55)

Technology:

Information Technology > Cloud Computing (0.55)
Information Technology > Artificial Intelligence (0.49)

Expert Evaluation of LLM World Models: A High-$T_c$ Superconductivity Case Study

Guo, Haoyu, Tikhanovskaya, Maria, Raccuglia, Paul, Vlaskin, Alexey, Co, Chris, Liebling, Daniel J., Ellsworth, Scott, Abraham, Matthew, Dorfman, Elizabeth, Armitage, N. P., Feng, Chunhan, Georges, Antoine, Gingras, Olivier, Kiese, Dominik, Kivelson, Steven A., Oganesyan, Vadim, Ramshaw, B. J., Sachdev, Subir, Senthil, T., Tranquada, J. M., Brenner, Michael P., Venugopalan, Subhashini, Kim, Eun-Ah

arXiv.org Artificial Intelligence, Nov-7-2025

Large Language Models (LLMs) show great promise as a powerful tool for scientific literature exploration. However, their effectiveness in providing scientifically accurate and comprehensive answers to complex questions within specialized domains remains an active area of research. Using the field of high-temperature cuprates as an exemplar, we evaluate the ability of LLM systems to understand the literature at the level of an expert. We construct an expert-curated database of 1,726 scientific papers that covers the history of the field, and a set of 67 expert-formulated questions that probe deep understanding of the literature. We then evaluate six different LLM-based systems for answering these questions, including both commercially available closed models and a custom retrieval-augmented generation (RAG) system capable of retrieving images alongside text. Experts then evaluate the answers of these systems against a rubric that assesses balanced perspectives, factual comprehensiveness, succinctness, and evidentiary support. Among the six systems, two using RAG on curated literature outperformed existing closed models across key metrics, particularly in providing comprehensive and well-supported answers. We discuss promising aspects of LLM performance as well as critical shortcomings of all the models. The set of expert-formulated questions and the rubric will be valuable for assessing expert-level performance of LLM-based reasoning systems.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.03782

Country: North America > United States (0.95)

Genre: Research Report (1.00)

Industry:

Energy (0.68)
Materials (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards

Tamber, Manveer Singh, Bao, Forrest Sheng, Xu, Chenyu, Luo, Ge, Kazi, Suleman, Bae, Minseok, Li, Miaoran, Mendelevitch, Ofer, Qu, Renyi, Lin, Jimmy

arXiv.org Artificial Intelligence, Nov-7-2025

Retrieval-augmented generation (RAG) aims to reduce hallucinations by grounding responses in external context, yet large language models (LLMs) still frequently introduce unsupported information or contradictions even when provided with relevant context. This paper presents two complementary efforts at Vectara to measure and benchmark LLM faithfulness in RAG. First, we describe our original hallucination leaderboard, which has tracked hallucination rates for LLMs since 2023 using our HHEM hallucination detection model. Motivated by limitations observed in current hallucination detection methods, we introduce FaithJudge, an LLM-as-a-judge framework that leverages a pool of diverse human-annotated hallucination examples to substantially improve the automated hallucination evaluation of LLMs. We introduce an enhanced hallucination leaderboard centered on FaithJudge that benchmarks LLMs on RAG faithfulness in summarization, question-answering, and data-to-text generation tasks. FaithJudge enables a more reliable benchmarking of LLM hallucinations in RAG and supports the development of more trustworthy generative AI systems: https://github.com/vectara/FaithJudge.

computational linguistic, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2505.04847

Country:

North America > United States (1.00)
Asia > China > Fujian Province (0.15)

Genre: Research Report (0.40)

Industry:

Materials > Chemicals (0.69)
Law Enforcement & Public Safety (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

A convolutional neural network deep learning method for model class selection

Impraimakis, Marios

arXiv.org Artificial Intelligence, Nov-7-2025

The response-only model class selection capability of a novel deep convolutional neural network method is examined herein in a simple, yet effective, manner. Specifically, the responses from a unique degree of freedom, along with their class information, train and validate a one-dimensional convolutional neural network. In doing so, the network selects the model class of new and unlabeled signals without needing the system input information or full system identification. An optional physics-based algorithm enhancement is also examined, using the Kalman filter to fuse the system response signals through the kinematic constraints of the acceleration and displacement data. Importantly, the method is shown to select the model class under slight signal variations attributed to the damping behavior or hysteresis behavior on both linear and nonlinear dynamic systems, as well as on a 3D building finite element model, providing a powerful tool for structural health monitoring applications.

artificial intelligence, machine learning, model class, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1002/eqe.4045

2511.03743

Country: Europe > United Kingdom (0.28)

Genre: Research Report (0.64)

Industry:

Health & Medicine > Consumer Health (0.69)
Materials (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.69)

LGM: Enhancing Large Language Models with Conceptual Meta-Relations and Iterative Retrieval

Lei, Wenchang, Zou, Ping, Wang, Yue, Sun, Feng, Zhao, Lei

arXiv.org Artificial Intelligence, Nov-6-2025

Large language models (LLMs) exhibit strong semantic understanding, yet struggle when user instructions involve ambiguous or conceptually misaligned terms. We propose the Language Graph Model (LGM) to enhance conceptual clarity by extracting meta-relations (inheritance, alias, and composition) from natural language. The model further employs a reflection mechanism to validate these meta-relations. Leveraging a Concept Iterative Retrieval Algorithm, these relations and related descriptions are dynamically supplied to the LLM, improving its ability to interpret concepts and generate accurate responses. Unlike conventional Retrieval-Augmented Generation (RAG) approaches that rely on extended context windows, our method enables large language models to process texts of any length without the need for truncation. Experiments on standard benchmarks demonstrate that the LGM consistently outperforms existing RAG baselines.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.03214

Country:

Asia > China (0.28)
Europe > United Kingdom (0.28)

Genre: Personal > Honors (0.93)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Consumer Health (1.00)
Education > Health & Safety > School Nutrition (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

EGMOF: Efficient Generation of Metal-Organic Frameworks Using a Hybrid Diffusion-Transformer Architecture

Han, Seunghee, Kang, Yeonghun, Bae, Taeun, Bernales, Varinia, Aspuru-Guzik, Alan, Kim, Jihan

arXiv.org Artificial Intelligence, Nov-6-2025

Designing materials with targeted properties remains challenging due to the vastness of chemical space and the scarcity of property-labeled data. While recent advances in generative models offer a promising way for inverse design, most approaches require large datasets and must be retrained for every new target property. Here, we introduce EGMOF (Efficient Generation of MOFs), a hybrid diffusion-transformer framework that overcomes these limitations through a modular, descriptor-mediated workflow. EGMOF decomposes inverse design into two steps: (1) a one-dimensional diffusion model (Prop2Desc) that maps desired properties to chemically meaningful descriptors, followed by (2) a transformer model (Desc2MOF) that generates structures from these descriptors. This modular hybrid design enables minimal retraining and maintains high accuracy even under small-data conditions. On a hydrogen uptake dataset, EGMOF achieved over 95% validity and 84% hit rate, representing significant improvements of up to 57% in validity and 14% in hit rate compared to existing methods, while remaining effective with only 1,000 training samples. Moreover, our model successfully performed conditional generation across 29 diverse property datasets, including CoREMOF, QMOF, and text-mined experimental datasets, whereas previous models have not. This work presents a data-efficient, generalizable approach to the inverse design of diverse MOFs and highlights the potential of modular inverse design workflows for broader materials discovery.

descriptor, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.03122

Country: North America > Canada > Ontario > Toronto (0.17)

Genre:

Workflow (1.00)
Research Report > New Finding (0.46)

Industry: Materials > Chemicals (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

OrdShap: Feature Position Importance for Sequential Black-Box Models

Hill, Davin, Hill, Brian L., Masoomi, Aria, Nori, Vijay S., Tillman, Robert E., Dy, Jennifer

arXiv.org Artificial Intelligence, Nov-6-2025

Sequential deep learning models excel in domains with temporal or sequential dependencies, but their complexity necessitates post-hoc feature attribution methods for understanding their predictions. While existing techniques quantify feature importance, they inherently assume fixed feature ordering - conflating the effects of (1) feature values and (2) their positions within input sequences. To address this gap, we introduce OrdShap, a novel attribution method that disentangles these effects by quantifying how a model's predictions change in response to permuting feature position. We establish a game-theoretic connection between OrdShap and Sanchez-Bergantiños values, providing a theoretically grounded approach to position-sensitive attribution. Empirical results from health, natural language, and synthetic datasets highlight OrdShap's effectiveness in capturing feature value and feature position attributions, and provide deeper insight into model behavior.
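
The distinction between a feature's value and its position can be made concrete with a simple probe: hold the values fixed, move one element to every other position, and record how much the model output changes. The sketch below is only a naive sensitivity measure built on that idea; it is not the OrdShap estimator and does not compute Sanchez-Bergantiños values, and the helper names and toy models are assumptions. Moving one element also shifts its neighbors, so the score mixes those effects.

```python
def move(seq, src, dst):
    """Return a copy of seq with the element at index src moved to index dst."""
    items = list(seq)
    item = items.pop(src)
    items.insert(dst, item)
    return items


def position_sensitivity(model, seq):
    """Mean absolute change in model output when each element is moved elsewhere."""
    base = model(seq)
    scores = []
    for i in range(len(seq)):
        deltas = [
            abs(model(move(seq, i, j)) - base)
            for j in range(len(seq))
            if j != i
        ]
        scores.append(sum(deltas) / len(deltas) if deltas else 0.0)
    return scores


# Hypothetical black-box models: one ignores order, one weights later positions more.
def order_free(seq):
    return sum(seq)


def order_aware(seq):
    return sum((k + 1) * x for k, x in enumerate(seq))


print(position_sensitivity(order_free, [3, 1, 2]))
print(position_sensitivity(order_aware, [3, 1, 2]))
```

An order-invariant model scores zero everywhere under this probe, while a position-weighted model does not, which is the separation between value effects and position effects that the abstract describes.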

attribution, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2507.11855

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Health Care Providers & Services (0.93)
Information Technology (0.92)
Materials > Chemicals (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
