Goto

Collaborating Authors

 plateau


What Happens During the Loss Plateau Understanding Abrupt Learning in Transformers

Neural Information Processing Systems

Training Transformers on algorithmic tasks frequently demonstrates an intriguing abrupt learning phenomenon: an extended performance plateau followed by a sudden, sharp improvement. This work investigates the underlying mechanisms for such dynamics, primarily in shallow Transformers. We reveal that during the plateau, the model often develops an interpretable partial solution while simultaneously exhibiting a strong repetition bias in their outputs. This output degeneracy is accompanied by internal representation collapse, where hidden states across different tokens become nearly parallel. We further identify the slow learning of optimal attention maps as a key bottleneck. Hidden progress in attention configuration during the plateau precedes the eventual rapid convergence, and directly intervening on attention significantly alters plateau duration and the severity of repetition bias and representational collapse. We validate that these identified phenomena--repetition bias and representation collapse--are not artifacts of toy setups but also manifest in the early pre-training stage of large language models like Pythia and OLMo.





Exploring Cumulative Effects in Survival Data Using Deep Learning Networks

arXiv.org Machine Learning

In epidemiological research, modeling the cumulative effects of time-dependent exposures on survival outcomes presents a challenge due to their intricate temporal dynamics. Conventional spline-based statistical methods, though effective, require repeated data transformation for each spline parameter tuning, with survival analysis computations relying on the entire dataset, posing difficulties for large datasets. Meanwhile, existing neural network-based survival analysis methods focus on accuracy but often overlook the interpretability of cumulative exposure patterns. To bridge this gap, we introduce CENNSurv, a novel deep learning approach that captures dynamic risk relationships from time-dependent data. Evaluated on two diverse real-world datasets, CENNSurv revealed a multi-year lagged association between chronic environmental exposure and a critical survival outcome, as well as a critical short-term behavioral shift prior to subscription lapse. This demonstrates CENNSurv's ability to model complex temporal patterns with improved scalability. CENNSurv provides researchers studying cumulative effects a practical tool with interpretable insights.


A machine learning approach to automation and uncertainty evaluation for self-validating thermocouples

arXiv.org Machine Learning

Thermocouples are in widespread use in industry, but they are particularly susceptible to calibration drift in harsh environments. Self-validating thermocouples aim to address this issue by using a miniature phase-change cell (fixed-point) in close proximity to the measurement junction (tip) of the thermocouple. The fixed point is a crucible containing an ingot of metal with a known melting temperature. When the process temperature being monitored passes through the melting temperature of the ingot, the thermocouple output exhibits a "plateau" during melting. Since the melting temperature of the ingot is known, the thermocouple can be recalibrated in situ. Identifying the melting plateau to determine the onset of melting is reasonably well established but requires manual intervention involving zooming in on the region around the actual melting temperature, a process which can depend on the shape of the melting plateau. For the first time, we present a novel machine learning approach to recognize and identify the characteristic shape of the melting plateau and once identified, to quantity the point at which melting begins, along with its associated uncertainty. This removes the need for human intervention in locating and characterizing the melting point. Results from test data provided by CCPI Europe show 100% accuracy of melting plateau detection. They also show a cross-validated R2 of 0.99 on predictions of calibration drift.


Reevaluating Self-Consistency Scaling in Multi-Agent Systems

arXiv.org Artificial Intelligence

This study examines the trade-offs of increasing sampled reasoning paths in self-consistency for modern large language models (LLMs). Earlier research with older models showed that combining multiple reasoning chains improves results before reaching a plateau. Using Gemini 2.5 models on HotpotQA and Math-500, we revisit those claims under current model conditions. Each configuration pooled outputs from varying sampled reasoning paths and compared them to a single chain-of-thought (CoT) baseline. Larger models exhibited a more stable and consistent improvement curve. The results confirm that performance gains taper off after moderate sampling, aligning with past findings. This plateau suggests diminishing returns driven by overlap among reasoning paths. Self-consistency remains useful, but high-sample configurations offer little benefit relative to their computational cost.


Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias

arXiv.org Machine Learning

For overparameterized linear regression with isotropic Gaussian design and minimum-$\ell_p$ interpolator $p\in(1,2]$, we give a unified, high-probability characterization for the scaling of the family of parameter norms $ \\{ \lVert \widehat{w_p} \rVert_r \\}_{r \in [1,p]} $ with sample size. We solve this basic, but unresolved question through a simple dual-ray analysis, which reveals a competition between a signal *spike* and a *bulk* of null coordinates in $X^\top Y$, yielding closed-form predictions for (i) a data-dependent transition $n_\star$ (the "elbow"), and (ii) a universal threshold $r_\star=2(p-1)$ that separates $\lVert \widehat{w_p} \rVert_r$'s which plateau from those that continue to grow with an explicit exponent. This unified solution resolves the scaling of *all* $\ell_r$ norms within the family $r\in [1,p]$ under $\ell_p$-biased interpolation, and explains in one picture which norms saturate and which increase as $n$ grows. We then study diagonal linear networks (DLNs) trained by gradient descent. By calibrating the initialization scale $α$ to an effective $p_{\mathrm{eff}}(α)$ via the DLN separable potential, we show empirically that DLNs inherit the same elbow/threshold laws, providing a predictive bridge between explicit and implicit bias. Given that many generalization proxies depend on $\lVert \widehat {w_p} \rVert_r$, our results suggest that their predictive power will depend sensitively on which $l_r$ norm is used.


Dear Reviewers R1, R2, and R3: Thank you for your comments and suggestions to improve our paper

Neural Information Processing Systems

Dear Reviewers R1, R2, and R3: Thank you for your comments and suggestions to improve our paper. In contrast, BRN's statistical estimates are based on batches After tuning its hyperparameters, we observed that it performs worse (Figure 2). ON removes the batch size parameter and introduces two decay rate parameters. We will include this figure in the paper's appendix. Note, this is not the best value observed in the sweep.