Some Theoretical Limitations of t-SNE
t-SNE has gained popularity as a dimension reduction technique, especially for visualizing data. It is well known that any dimension reduction technique may lose important features of the data. We provide a mathematical framework for understanding this loss in the case of t-SNE, establishing a number of results in different scenarios that show how important features of the data are lost when t-SNE is used.
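For reference, the objects t-SNE manipulates can be written out directly. The sketch below computes the high-dimensional affinities, the low-dimensional Student-t similarities, and the KL objective that t-SNE minimizes; it uses a single fixed Gaussian bandwidth for simplicity, whereas real t-SNE calibrates a per-point bandwidth via perplexity (assumes NumPy):

```python
import numpy as np

def tsne_affinities(X, sigma=1.0):
    """Symmetrized high-dimensional affinities p_ij from a Gaussian
    kernel (fixed bandwidth here; t-SNE tunes sigma per point)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def student_t_similarities(Y):
    """Low-dimensional similarities q_ij from the heavy-tailed
    Student-t kernel 1 / (1 + ||y_i - y_j||^2)."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + d2)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """The t-SNE objective: KL(P || Q) over pairwise distributions."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))
```

The mismatch between the two kernels is the source of the distortions the paper studies: even a faithful layout in low dimensions generally cannot drive KL(P || Q) to zero.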
The Order Is The Message
In a controlled experiment on modular arithmetic ($p = 9973$), varying only example ordering while holding all else constant, two fixed-ordering strategies achieve 99.5\% test accuracy by epochs 487 and 659 respectively from a training set comprising 0.3\% of the input space, well below established sample complexity lower bounds for this task under IID ordering. The IID baseline achieves 0.30\% after 5{,}000 epochs from identical data. An adversarially structured ordering suppresses learning entirely. The generalizing model reliably constructs a Fourier representation whose fundamental frequency is the Fourier dual of the ordering structure, encoding information present in no individual training example, with the same fundamental emerging across all seeds tested regardless of initialization or training set composition. We discuss implications for training efficiency, the reinterpretation of grokking, and the safety risks of a channel that evades all content-level auditing.
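The experimental setup can be sketched at toy scale. The helpers below are hypothetical names, and the label-sorted ordering shown is only an illustrative fixed ordering; the paper's Fourier-dual and adversarial orderings are more specific constructions than this:

```python
import random

def modular_addition_dataset(p, frac, seed=0):
    """All pairs (a, b) with label (a + b) % p; keep a random `frac`
    fraction as the training set (the paper uses p = 9973 and a
    training fraction of 0.003)."""
    rng = random.Random(seed)
    pairs = [(a, b) for a in range(p) for b in range(p)]
    rng.shuffle(pairs)
    n_train = int(frac * len(pairs))
    return [((a, b), (a + b) % p) for a, b in pairs[:n_train]]

def iid_ordering(train, seed=0):
    """IID baseline: present examples in a fresh random order."""
    rng = random.Random(seed)
    epoch = list(train)
    rng.shuffle(epoch)
    return epoch

def structured_ordering(train):
    """One illustrative fixed ordering: sort by the residue label, so
    consecutive examples step through residues monotonically.  The
    ordering itself carries structure present in no single example."""
    return sorted(train, key=lambda ex: ex[1])
```

The point of the sketch is the contrast: both functions present identical data, so any difference in what the model learns is attributable purely to the presentation order.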
The Rules-and-Facts Model for Simultaneous Generalization and Memorization in Neural Networks
Gabriele Farné, Fabrizio Boncoraglio, Lenka Zdeborová
A key capability of modern neural networks is their capacity to simultaneously learn underlying rules and memorize specific facts or exceptions. Yet, theoretical understanding of this dual capability remains limited. We introduce the Rules-and-Facts (RAF) model, a minimal solvable setting that enables precise characterization of this phenomenon by bridging two classical lines of work in the statistical physics of learning: the teacher-student framework for generalization and Gardner-style capacity analysis for memorization. In the RAF model, a fraction $1 - \varepsilon$ of training labels is generated by a structured teacher rule, while a fraction $\varepsilon$ consists of unstructured facts with random labels. We characterize when the learner can simultaneously recover the underlying rule, allowing generalization to new data, and memorize the unstructured examples. Our results quantify how overparameterization enables the simultaneous realization of these two objectives: sufficient excess capacity supports memorization, while regularization and the choice of kernel or nonlinearity control the allocation of capacity between rule learning and memorization. The RAF model provides a theoretical foundation for understanding how modern neural networks can infer structure while storing rare or non-compressible information.
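The label-generation process of the RAF setting can be sketched directly. The snippet below is a minimal instance with a linear sign teacher; the paper's exact teacher ensemble and learner may differ:

```python
import random

def raf_dataset(n, d, eps, seed=0):
    """RAF-style training set: a fraction 1 - eps of labels follow a
    linear teacher sign(w . x) (the "rules"); a fraction eps receive
    independent random labels (the "facts")."""
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(d)]
    X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    y = [1.0 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1.0
         for x in X]
    facts = rng.sample(range(n), int(eps * n))
    for i in facts:
        y[i] = rng.choice([-1.0, 1.0])  # unstructured, random label
    return X, y, w, facts
```

A learner that recovers `w` generalizes on the rule-labeled portion, while fitting the `facts` indices requires raw capacity; the model's analysis asks when both are possible at once.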
Consistency of the $k$-Nearest Neighbor Regressor under Complex Survey Designs
We study the consistency of the $k$-nearest neighbor regressor under complex survey designs. While consistency results for this algorithm are well established for independent and identically distributed data, corresponding results for complex survey data are lacking. We show that the $k$-nearest neighbor regressor is consistent under regularity conditions on the sampling design and the distribution of the data. We derive lower bounds for the rate of convergence and show that these bounds exhibit the curse of dimensionality, as in the independent and identically distributed setting. Empirical studies based on simulated and real data illustrate our theoretical findings.
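The estimator under study is simple to state. Below is a minimal sketch, with an optional design-weighted average shown as one hypothetical way survey weights could enter the estimate; the paper's consistency conditions concern the sampling design and data distribution, not this particular weighting:

```python
import math

def knn_regress(X, y, x, k, weights=None):
    """k-nearest-neighbor regression estimate at query point x.
    If design weights are supplied, the k nearest responses are
    averaged with those weights (a hypothetical survey adaptation)."""
    dist = [math.dist(xi, x) for xi in X]
    idx = sorted(range(len(X)), key=dist.__getitem__)[:k]
    if weights is None:
        return sum(y[i] for i in idx) / k
    num = sum(weights[i] * y[i] for i in idx)
    den = sum(weights[i] for i in idx)
    return num / den
```

Consistency asks that this local average converge to the regression function as the sample grows with a suitably chosen k, which is where the curse of dimensionality enters the rates.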
Multiple Instance Learning for Efficient Sequential Data Classification on Resource-constrained Devices
We study the problem of fast and efficient classification of sequential data (such as time series) on tiny devices, which is critical for various IoT applications like audio keyword detection or gesture detection. Such tasks are cast as a standard classification task by sliding windows over the data stream to construct data points. Deploying such classification modules on tiny devices is challenging, as predictions over sliding windows of data need to be invoked continuously at a high frequency. Each such predictor instance is itself expensive, as it evaluates large models over long windows of data. In this paper, we address this challenge by exploiting two observations about classification tasks arising in typical IoT applications: (a) the signature of a particular class (e.g., an audio keyword) typically occupies a small fraction of the overall data, and (b) class signatures tend to be discernible early on in the data. We propose a method, EMI-RNN, that exploits these observations through a multiple instance learning formulation combined with an early prediction technique, learning a model that achieves better accuracy than baseline models while simultaneously reducing computation by a large fraction. For instance, on a gesture detection benchmark [25], EMI-RNN improves the standard LSTM model's accuracy by up to 1% while requiring 72x less computation. This enables us to deploy such models for continuous real-time prediction on small devices such as the Raspberry Pi0 and Arduino variants, a task the baseline LSTM could not achieve. Finally, we provide an analysis of our multiple instance learning algorithm in a simple setting and show that it converges to the global optimum at a linear rate, one of the first such results in this domain.
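The two observations translate into a windowing-plus-early-exit structure, sketched below. `score_prefix` is a hypothetical per-prefix confidence function standing in for the trained classifier; EMI-RNN's actual instance-level training and thresholding are more involved:

```python
def sliding_windows(stream, width, stride):
    """Slice a data stream into overlapping windows, the instances
    of the multiple-instance formulation."""
    return [stream[i:i + width]
            for i in range(0, len(stream) - width + 1, stride)]

def early_predict(window, score_prefix, threshold):
    """Early prediction: score growing prefixes of the window and
    stop as soon as confidence clears the threshold, saving the
    computation on the remainder of the window."""
    for t in range(1, len(window) + 1):
        if score_prefix(window[:t]) >= threshold:
            return True, t          # positive, decided after t steps
    return False, len(window)       # scanned the whole window
```

Because class signatures occupy only a small fraction of each window and show up early, most invocations exit after a short prefix, which is where the reported computation savings come from.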