

TDK ready to step up investment to ride AI wave

The Japan Times

Electronics component linchpin TDK is prepared to add to what is already its biggest capital spending campaign ever in a push to ride the global boom in generative artificial intelligence. The company has added ¥100 billion ($640 million) to its multiyear investment plan each year since it rolled it out in 2024, and now CEO Noboru Saito says the effort may accelerate to match an expected surge in orders and demand. "Should promising prospects arise, our commitment is to make timely and opportunistic investments," Saito, 59, said in an interview. "If we don't sow the seeds for medium- to long-term growth now, we won't be able to reap the harvest later."


Japan megabanks set to win Mythos access after Bessent visit

The Japan Times

Japan's three megabanks are set to secure access to Anthropic's artificial intelligence model, Mythos, according to a person familiar with the matter, after its limited release last month sparked fears of a new age of cybersecurity risks. MUFG Bank, Sumitomo Mitsui Banking Corp. and Mizuho Bank are all likely to gain access to the artificial intelligence model developed by the U.S. firm, the person said, asking not to be identified because the information is private. The planned access was earlier reported by Nikkei. The move comes as financial institutions around the world grow alarmed about the risks created by Mythos, which has an unprecedented ability to detect software vulnerabilities. That has raised concerns that hackers could use Mythos to disrupt critical infrastructure, and access has so far been limited to a small number of U.S. companies and organizations.


Musk's xAI races to get Wall Street firms to use Grok chatbot

The Japan Times

Musk's artificial intelligence venture, xAI, is moving with urgency to boost revenue by selling chatbot subscriptions and access to its computing resources before SpaceX's expected IPO next month. Billionaire Elon Musk's xAI has recruited multiple Wall Street firms with ties to his business empire to test its Grok chatbot, according to people familiar with the matter, part of a push to bolster revenue ahead of parent company SpaceX's initial public offering. Apollo Global Management and Morgan Stanley have begun using Grok internally alongside software from other AI model makers, said the people, who spoke on condition of anonymity as the information is not public. Valor Equity Partners is also using Grok, the people said. Despite some banks signing up for Grok, financiers are rarely using the chatbot for work, some of the people said.


From Generalist to Specialist Representation

arXiv.org Machine Learning

Given a generalist model, learning a task-relevant specialist representation is fundamental for downstream applications. Identifiability, the asymptotic guarantee of recovering the ground-truth representation, is critical because it sets the ultimate limit of any model, even with infinite data and computation. We study this problem in a completely nonparametric setting, without relying on interventions, parametric forms, or structural constraints. We first prove that the structure between time steps and tasks is identifiable in a fully unsupervised manner, even when sequences lack strict temporal dependence and may exhibit disconnections, and task assignments can follow arbitrarily complex and interleaving structures. We then prove that, within each time step, the task-relevant latent representation can be disentangled from the irrelevant part under a simple sparsity regularization, without any additional information or parametric constraints. Together, these results establish a hierarchical foundation: task structure is identifiable across time steps, and task-relevant latent representations are identifiable within each step. To our knowledge, each result provides a first general nonparametric identifiability guarantee, and together they mark a step toward provably moving from generalist to specialist models.


Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization

arXiv.org Machine Learning

Large language models (LLMs) are pretrained by minimizing the cross-entropy loss for next-token prediction. In this paper, we study whether this optimization strategy can induce geometric structure in the learned model weights and context embeddings. We approach this problem by analyzing a constrained layer-peeled optimization program, which serves as a mathematically tractable surrogate for LLMs by treating the output projection matrix and last-layer context embeddings as optimization variables. Our analysis of this nonconvex optimization program demonstrates that symmetries in the target next-token distributions are transferred to the global minimizers of the layer-peeled model in a precise group-theoretic sense. Specifically, we prove that when the target tokens exhibit a cyclic-shift symmetry (such as the seven days of the week or the twelve months of the year), the optimal logit matrix is exactly circulant, and the Gram matrices of both the output projections and the context embeddings form circulant geometries as well. Next, for exchangeable target distributions invariant under the symmetric group and, more generally, under two-transitive group actions, we show that the global optimal output projection matrix forms a simplex equiangular tight frame, while the optimal logit matrix and context embeddings inherit the permutation symmetries present in the input data. A key technical step is to reduce the constrained nonconvex factorized problem to an explicit logit-level convex characterization for cyclic symmetry and to a symmetry-based lower bound for permutation symmetry, together with a sharp characterization of the optimal factorization. Finally, we empirically demonstrate that open-source LLMs naturally exhibit symmetries consistent with our theoretical predictions, despite being trained without any explicit regularization promoting such geometric structure.
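The cyclic-shift case can be made concrete with a small sketch. It builds a circulant matrix for a seven-element cycle (such as the days of the week) and checks the defining property that shifting rows and columns together leaves the matrix unchanged. The logit profile values and the helper names `circulant` and `is_circulant` are illustrative assumptions, not taken from the paper.

```python
def circulant(first_row):
    """Return the circulant matrix whose row i is first_row shifted right by i."""
    n = len(first_row)
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]


def is_circulant(matrix):
    """Check M[i][j] == M[(i+1) % n][(j+1) % n] for every entry."""
    n = len(matrix)
    return all(
        matrix[i][j] == matrix[(i + 1) % n][(j + 1) % n]
        for i in range(n)
        for j in range(n)
    )


# Hypothetical logit profile for one context over a 7-element cycle:
# the largest value sits one step ahead of the context (the "next day").
profile = [0.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.5]
logits = circulant(profile)

# Every context sees the same profile, only rotated to its own position.
for row in logits:
    print(row)
print(is_circulant(logits))
```

A circulant matrix is fully determined by its first row, which is why a cyclic symmetry in the targets reduces the number of free logit values from n² to n in this picture.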


Robust Sequential Experimental Design for A/B Testing

arXiv.org Machine Learning

Experimental design has emerged as a powerful approach for improving the sample efficiency of A/B testing, yet existing designs rely critically on correctly specified models. We study robust sequential experimental design under model misspecification and develop a unified framework that covers both contextual bandit and dynamic settings. Theoretically, we prove that our design bounds the worst-case mean squared error of the estimated treatment effect. Empirically, we demonstrate the effectiveness of the proposed approach using synthetic and real-world datasets from a leading technology company.


The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge

arXiv.org Machine Learning

Weak-to-strong (W2S) generalization, in which a strong model is fine-tuned on outputs of a weaker, task-specialized model, has been proposed as an approach to aligning superhuman AI systems. Existing theoretical analyses either fix the student's representations or operate in restricted settings. Whether multi-step SGD can succeed in feature learning while preserving diverse pre-trained capabilities remains open. We study W2S in the setting of reward-model learning with two-layer neural networks. The strong model has pre-trained representations organized into low-dimensional subspaces $V_k$, and is fine-tuned under the supervision of a weak model specialized on task $\kappa$. We prove that the strong model efficiently learns task $\kappa$, eliciting its pre-trained knowledge while retaining general capabilities. This establishes W2S generalization in the feature-learning regime, in the sense that the strong model acquires the target feature direction through W2S training, rather than having it given a priori. Moreover, W2S preserves pre-trained off-target features, whereas standard supervised fine-tuning causes catastrophic forgetting when off-target feature directions are correlated with the target's. Numerical experiments on synthetic data confirm our theoretical results.


Adaptive Kernel Density Estimation with Pre-training

arXiv.org Machine Learning

Density estimation in high-dimensional settings is an important and challenging statistical problem. Traditional methods based on kernel smoothing are inefficient in high dimensions due to the difficulties in specifying appropriate location-adaptive kernels. In this work, we introduce pre-training, a key idea behind many cutting-edge AI technologies, to the context of non-parametric density estimation. By establishing a pre-trained neural network that can recommend an appropriate location-adaptive kernel for each sample point, efficient density estimation with adaptive kernels is achieved in high dimensions. A wide range of numerical experiments show that this strategy is highly effective for improving density-estimation accuracy when the target distribution is close to the distribution family used for pre-training. When the target distribution is substantially different from the pre-training distribution family, the benefit of the proposed pre-training strategy may be diluted, but can be reactivated by an additional fine-tuning procedure.
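To illustrate what a location-adaptive kernel means, here is a minimal one-dimensional sketch of a sample-point adaptive Gaussian KDE, in which each sample carries its own bandwidth. The paper's pre-trained network is not reproduced: the function `knn_bandwidths` is a hypothetical stand-in that uses the distance to the k-th nearest neighbour as the bandwidth, and all names and data values are assumptions made for illustration.

```python
import math


def knn_bandwidths(samples, k=3):
    """Stand-in for a learned recommender (an assumption, not the paper's method):
    the bandwidth of each sample is the distance to its k-th nearest neighbour.
    Requires len(samples) > k."""
    bandwidths = []
    for x in samples:
        dists = sorted(abs(x - y) for y in samples)
        # dists[0] is the distance from the point to itself, i.e. 0.
        bandwidths.append(max(dists[k], 1e-6))
    return bandwidths


def adaptive_kde(x, samples, bandwidths):
    """Sample-point adaptive Gaussian KDE: average of per-sample Gaussian kernels."""
    total = 0.0
    for xi, h in zip(samples, bandwidths):
        z = (x - xi) / h
        total += math.exp(-0.5 * z * z) / (h * math.sqrt(2.0 * math.pi))
    return total / len(samples)


# Illustrative data: a tight cluster near -2 and a few scattered points.
samples = [-2.1, -1.9, -2.0, -1.8, 0.0, 3.0, 3.5, 5.0]
bandwidths = knn_bandwidths(samples, k=3)
density_at_cluster = adaptive_kde(-2.0, samples, bandwidths)
density_in_gap = adaptive_kde(1.5, samples, bandwidths)
```

Points in the dense cluster receive narrow kernels and isolated points receive wide ones, which is the behaviour a fixed-bandwidth estimator cannot provide. Swapping `knn_bandwidths` for a learned recommender would leave `adaptive_kde` unchanged.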


Amortized Neural Clustering of Time Series based on Statistical Features

arXiv.org Machine Learning

This paper introduces an algorithm-agnostic approach to feature-based time series clustering via amortized neural inference. By training neural networks to approximate the optimal partitioning rule from simulated data, the proposed framework reduces reliance on conventional clustering methods, such as $K$-means, $K$-medoids, or hierarchical clustering, and their associated objective functions and heuristics. Leveraging statistical features, such as autocorrelations and quantile autocorrelations, the approach learns a data-driven affinity structure from which clustering partitions can be recovered, without requiring explicit prior specification of cluster shapes or structures. In addition, one version of the method can automatically determine the number of clusters, avoiding ad-hoc selection procedures. Comprehensive empirical studies show that the proposed framework achieves competitive or superior clustering accuracy relative to traditional methods, even in challenging scenarios where competing techniques are provided with the true number of clusters. An application to financial time series of stock returns illustrates its practical utility. By reducing the need for algorithm selection and calibration, the proposed framework opens new possibilities for automated, adaptive, and data-driven clustering of temporal data across scientific and industrial domains.
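As a rough illustration of the feature-extraction step only, the sketch below maps each series to a short vector of sample autocorrelations. The learned affinity structure and the neural partitioning rule are not reproduced here, quantile autocorrelations are omitted, and the function names and `max_lag` parameter are assumptions made for the example.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    if var == 0.0:
        return 0.0  # constant series: define the autocorrelation as zero
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var


def acf_features(series, max_lag=3):
    """Feature vector of autocorrelations at lags 1..max_lag."""
    return [autocorrelation(series, lag) for lag in range(1, max_lag + 1)]


# Two toy series with very different dependence structure.
alternating = [1.0, -1.0] * 10
trending = [float(t) for t in range(20)]

print(acf_features(alternating))
print(acf_features(trending))
```

Series with different dynamics land far apart in this feature space even when their raw values are on similar scales, which is what makes such features a reasonable input for a downstream clustering rule.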


Learning Perturbations to Extrapolate Your LLM

arXiv.org Machine Learning

Training large language models (LLMs) such as GPT-5 and Qwen-3 (Singh et al., 2025; Yang et al., 2025) on massive text corpora aims at capturing the underlying distribution of natural language. Yet, it remains challenging for the trained model to extrapolate to out-of-distribution or out-of-domain settings beyond the support of its training data. The literature has seen the development of various data perturbation techniques, such as synonym replacement, random insertion, deletion, and swap, that modify training instances into semantically similar variants to expose LLMs to a broader range of inputs and improve their ability to generalize beyond the training data (Feng et al., 2019, 2020; Li et al., 2024; Cen et al., 2026). However, these approaches remain grounded in discrete, word-level augmentation procedures, which may restrict their adaptivity across diverse domains. While discrete perturbations are simple to use, they can be too coarse and hard to refine due to the complexity of natural language (Park et al., 2022; Li et al., 2023). Meanwhile, fixed perturbations apply the same transformations to the data regardless of context, and thus fail to generalize appropriately (Ismailov and Asanova, 2025).
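For readers unfamiliar with the discrete perturbations named above, the following is a minimal sketch of the four word-level operations. It is not the cited authors' implementation: the function names, the tiny `synonyms` table, and the deletion probability are all hypothetical choices made for illustration.

```python
import random


def synonym_replacement(tokens, synonyms):
    """Replace every token that has an entry in the (hypothetical) synonym table."""
    return [synonyms.get(t, t) for t in tokens]


def random_insertion(tokens, rng, synonyms):
    """Insert a synonym of one randomly chosen token at a random position."""
    out = list(tokens)
    candidates = [t for t in out if t in synonyms]
    if candidates:
        out.insert(rng.randrange(len(out) + 1), synonyms[rng.choice(candidates)])
    return out


def random_deletion(tokens, rng, p=0.2):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, rng):
    """Swap two distinct random positions (no-op for fewer than two tokens)."""
    out = list(tokens)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


rng = random.Random(0)  # seeded so repeated runs give the same variants
tokens = "the quick brown fox".split()
synonyms = {"quick": "fast"}  # illustrative table, not a real lexicon

print(synonym_replacement(tokens, synonyms))
print(random_insertion(tokens, rng, synonyms))
print(random_deletion(tokens, rng, p=0.5))
print(random_swap(tokens, rng))
```

Each operation acts on whole tokens and uses the same rule for every sentence, which makes concrete the two limitations the passage raises: the edits are coarse, and they do not depend on context.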