AITopics | Oceania

Lossless Vocabulary Reduction for Auto-Regressive Language Models

Chijiwa, Daiki, Hasegawa, Taku, Nishida, Kyosuke, Yamaguchi, Shin'ya, Ohba, Tomoya, Sakao, Tamao, Takeuchi, Susumu

arXiv.org Machine Learning | Oct-10-2025

Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. In particular, auto-regressive language models generate text token by token, i.e., by predicting the next-token distribution given the previous tokens, and thus tokenization directly affects their efficiency in text generation. Since each language model has its own vocabulary as a set of possible tokens, models struggle to cooperate with each other at the level of next-token distributions, as in model ensembling. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenizations can cooperate with each other efficiently through their maximal common vocabulary.
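
A rough way to picture mapping a next-token distribution onto a smaller vocabulary is to sum the probabilities of all tokens that begin with the same character. The sketch below does this for a single prediction step only. It is not the framework from the paper, whose details the abstract does not give; the function name and the toy distribution are hypothetical.

```python
from collections import defaultdict


def first_char_distribution(next_token_probs):
    """Marginalize a next-token distribution onto first characters.

    next_token_probs maps each token string to its probability. The result
    maps each character to the total probability of tokens starting with it.
    """
    reduced = defaultdict(float)
    for token, prob in next_token_probs.items():
        if not token:
            raise ValueError("tokens must be non-empty strings")
        reduced[token[0]] += prob
    return dict(reduced)


# Hypothetical toy distribution over a three-token vocabulary.
probs = {"the": 0.5, "to": 0.25, "a": 0.25}
print(first_char_distribution(probs))
```

Two models with different token vocabularies could, in this toy setting, be compared on the shared character-level distribution, which is the kind of cooperation the abstract refers to.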

language model, sub, vocabulary reduction, (15 more...)

arXiv.org Machine Learning

2510.08102

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
Europe > Austria > Vienna (0.14)
Asia > Middle East > Jordan (0.04)
(10 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Beyond Real Data: Synthetic Data through the Lens of Regularization

Shidani, Amitis, Farghly, Tyler, Sun, Yang, Ganjgahi, Habib, Deligiannidis, George

arXiv.org Machine Learning | Oct-10-2025

Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance. In this paper, we present a learning-theoretic framework to quantify the trade-off between synthetic and real data. Our approach leverages algorithmic stability to derive generalization error bounds, characterizing the optimal synthetic-to-real data ratio that minimizes expected test error as a function of the Wasserstein distance between the real and synthetic distributions. We motivate our framework in the setting of kernel ridge regression with mixed data, offering a detailed analysis that may be of independent interest. Our theory predicts the existence of an optimal ratio, leading to a U-shaped behavior of test error with respect to the proportion of synthetic data. Empirically, we validate this prediction on CIFAR-10 and a clinical brain MRI dataset. Our theory extends to the important scenario of domain adaptation, showing that carefully blending synthetic target data with limited source data can mitigate domain shift and enhance generalization. We conclude with practical guidance for applying our results to both in-domain and out-of-domain scenarios.

generalization, george deligiannidis 13, synthetic data, (11 more...)

arXiv.org Machine Learning

2510.08095

Country:

Europe > Austria > Vienna (0.14)
Europe > Sweden > Stockholm > Stockholm (0.04)
North America > United States > New York > New York County > New York City (0.04)
(17 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.92)

Industry:

Health & Medicine > Diagnostic Medicine > Imaging (0.93)
Health & Medicine > Therapeutic Area > Neurology (0.66)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

Computationally-efficient Graph Modeling with Refined Graph Random Features

Choromanski, Krzysztof, Dubey, Avinava, Sehanobish, Arijit, Reid, Isaac

arXiv.org Artificial Intelligence | Oct-10-2025

We propose refined GRFs (GRFs++), a new class of Graph Random Features (GRFs) for efficient and accurate computations involving kernels defined on the nodes of a graph. GRFs++ resolve some of the long-standing limitations of regular GRFs, including difficulty modeling relationships between more distant nodes. They reduce dependence on sampling long graph random walks via a novel walk-stitching technique, concatenating several shorter walks without breaking unbiasedness. By applying these techniques, GRFs++ inherit the approximation quality provided by longer walks but with greater efficiency, trading sequential, inefficient sampling of a long walk for parallel computation of short walks and matrix-matrix multiplication. Furthermore, GRFs++ extend the simplistic GRFs walk termination mechanism (Bernoulli schemes with fixed halting probabilities) to a broader class of strategies, applying general distributions on the walks' lengths. This improves the approximation accuracy of graph kernels, without incurring extra computational cost. We provide empirical evaluations to showcase all our claims and complement our results with theoretical analysis.
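
To make the two termination schemes mentioned in the abstract concrete, the sketch below separates drawing a walk length from taking the walk. With a fixed halting probability the length is geometric; any other length sampler can be swapped in. This is an illustrative assumption about how such walks might be sampled, not the GRFs++ algorithm itself. All function names and the toy graph are hypothetical, and walk-stitching is not shown.

```python
import random


def bernoulli_length(p_halt, rng):
    """Steps taken when the walk halts with probability p_halt before each step.

    Assumes 0 < p_halt <= 1; the result is geometrically distributed.
    """
    steps = 0
    while rng.random() >= p_halt:
        steps += 1
    return steps


def sample_walk(adj, start, length, rng):
    """Take up to `length` uniform random steps on an adjacency-list graph."""
    walk = [start]
    for _ in range(length):
        neighbors = adj[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk


# Hypothetical toy graph: a directed 3-cycle.
adj = {0: [1], 1: [2], 2: [0]}
rng = random.Random(0)
length = bernoulli_length(0.3, rng)  # Bernoulli scheme with fixed halting probability
print(sample_walk(adj, 0, length, rng))
```

Replacing `bernoulli_length` with a sampler for another distribution over lengths gives the more general termination strategy that the abstract describes.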

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2510.07716

Country:

North America > United States (1.00)
Oceania > Australia > New South Wales (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

OpenDebateEvidence: A Massive-Scale Argument Mining and Summarization Dataset

Neural Information Processing Systems | Oct-9-2025, 23:53:49 GMT

This dataset includes over 3.5 million documents with rich metadata, making it one of the most extensive collections of debate evidence.

argument, dataset, opendebateevidence, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Texas (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
Oceania > Australia > New South Wales (0.04)
(11 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.99)

MG-Net: Learn to Customize QAOA with Circuit Depth Awareness

Qian, Yang

Neural Information Processing Systems | Oct-9-2025, 23:45:28 GMT

Despite these advancements, QAOA's practical efficacy is challenged by the quantum coherence limits of modern quantum devices, as there is a ceiling on the allowable maximum circuit depth.

hamiltonian, mixer hamiltonian, qaoa, (14 more...)

Neural Information Processing Systems

Country:

Asia > Singapore (0.04)
Asia > China > Hubei Province > Wuhan (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
Asia > British Indian Ocean Territory > Diego Garcia (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(3 more...)

MatrixNet: Learning over symmetry groups using learned group representations

Neural Information Processing Systems | Oct-9-2025, 23:34:12 GMT

We also show that MatrixNet respects group relations, allowing generalization to group elements of greater word length than those in the training set.

experiment, matrixnet, representation, (16 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.06)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Oceania > Australia > Australian Capital Territory > Canberra (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.67)

Industry:

Education (0.46)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

38cc5cba8e513547b96bc326e25610dc-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems | Oct-9-2025, 23:33:32 GMT

absent reasoning and evidence, inferred absent, knowledge, (14 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > Washington > King County > Seattle (0.14)
Europe > Austria > Vienna (0.14)
(31 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Government (0.92)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

3848856978da28639d2057094a1287a5-Paper-Conference.pdf

Neural Information Processing Systems | Oct-9-2025, 23:25:36 GMT

dataset, experiment, prediction, (17 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.14)
Europe > Germany > Baden-Württemberg > Freiburg (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(9 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine (0.67)
Information Technology (0.46)
Banking & Finance (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(5 more...)

Enhancing Robustness of Graph Neural Networks on Social Media with Explainable Inverse Reinforcement Learning

Neural Information Processing Systems | Oct-9-2025, 23:25:28 GMT

Social media platforms capture diverse attack sequence samples through both machine and manual screening processes. Investigating effective ways to leverage these adversarial samples to enhance robustness is imperative.

learning, node, reward function, (14 more...)

Neural Information Processing Systems

Country: