structured data
Gradient-Based Feature Learning under Structured Data
Recent works have demonstrated that the sample complexity of gradient-based learning of single index models, i.e. functions that depend on a 1-dimensional projection of the input data, is governed by their information exponent. However, these results are only concerned with isotropic data, while in practice the input often contains additional structure which can implicitly guide the algorithm. In this work, we investigate the effect of a spiked covariance structure and reveal several interesting phenomena. First, we show that in the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction, even when the spike is perfectly aligned with the target direction. Next, we show that appropriate weight normalization that is reminiscent of batch normalization can alleviate this issue. Further, by exploiting the alignment between the (spiked) input covariance and the target, we obtain improved sample complexity compared to the isotropic case. In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent while also outperforming lower bounds for rotationally invariant kernel methods.
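The data model described in this abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes the spiked covariance has the form I + kappa * theta theta^T with a unit-norm spike direction theta, and it uses the second Hermite polynomial as an example link function (its information exponent is 2 for a standard Gaussian projection). The names `sample_spiked_inputs`, `single_index_labels`, `he2` and the parameter `kappa` are hypothetical.

```python
import numpy as np


def sample_spiked_inputs(n, d, theta, kappa, rng):
    """Draw n samples from N(0, I + kappa * theta theta^T); theta must be unit-norm."""
    z = rng.standard_normal((n, d))
    # With a = sqrt(1 + kappa) - 1 we have (I + a theta theta^T)^2 = I + kappa theta theta^T,
    # so multiplying isotropic noise by (I + a theta theta^T) gives the spiked covariance.
    a = np.sqrt(1.0 + kappa) - 1.0
    return z + a * np.outer(z @ theta, theta)


def he2(t):
    """Second (probabilists') Hermite polynomial, used here as an example link function."""
    return t**2 - 1.0


def single_index_labels(x, w_star, link):
    """Labels that depend on x only through the 1-dimensional projection <w_star, x>."""
    return link(x @ w_star)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 10
    theta = np.zeros(d)
    theta[0] = 1.0  # spike perfectly aligned with the target direction (assumed setup)
    x = sample_spiked_inputs(20000, d, theta, kappa=4.0, rng=rng)
    y = single_index_labels(x, theta, he2)
    print("variance along spike:", np.var(x @ theta))
    print("variance off spike:  ", np.var(x[:, 1]))
```

The variance along the spike should come out near 1 + kappa, while every orthogonal direction keeps unit variance, which is the anisotropy the abstract refers to.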
Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
Neural networks have many successful applications, but much less theoretical understanding has been gained. Towards bridging this gap, we study the problem of learning a two-layer overparameterized ReLU neural network for multi-class classification via stochastic gradient descent (SGD) from random initialization. In the overparameterized setting, when the data comes from mixtures of well-separated distributions, we prove that SGD learns a network with a small generalization error, even though the network has enough capacity to fit arbitrary labels. Furthermore, the analysis provides interesting insights into several aspects of learning neural networks and can be verified based on empirical studies on synthetic data and on the MNIST dataset.
AI Answer Engine Citation Behavior: An Empirical Analysis of the GEO16 Framework
Kumar, Arlen, Palkhouski, Leanid
AI answer engines increasingly mediate access to domain knowledge by generating responses and citing web sources. We introduce GEO-16, a 16 pillar auditing framework that converts on page quality signals into banded pillar scores and a normalized GEO score G that ranges from 0 to 1. Using 70 product intent prompts, we collected 1,702 citations across three engines (Brave Summary, Google AI Overviews, and Perplexity) and audited 1,100 unique URLs. In our corpus, the engines differed in the GEO quality of the pages they cited, and pillars related to Metadata and Freshness, Semantic HTML, and Structured Data showed the strongest associations with citation. Logistic models with domain clustered standard errors indicate that overall page quality is a strong predictor of citation, and simple operating points (for example, G at least 0.70 combined with at least 12 pillar hits) align with substantially higher citation rates in our data. We report per engine contrasts, vertical effects, threshold analysis, and diagnostics, then translate findings into a practical playbook for publishers. The study is observational and focuses on English language B2B SaaS pages; we discuss limitations, threats to validity, and reproducibility considerations.
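The operating point quoted in this abstract (G at least 0.70 combined with at least 12 pillar hits) can be written as a simple check. The sketch below is illustrative only: the abstract does not specify how pillar bands are defined or what counts as a hit, so the band range 0 to 3, the hit threshold of band 2, and the function names `geo_score`, `pillar_hits` and `meets_operating_point` are all assumptions.

```python
def geo_score(pillar_scores, max_band=3):
    """Normalized score in [0, 1]: sum of banded pillar scores over the maximum possible.

    Assumption: each of the 16 pillars is scored on an integer band 0..max_band.
    """
    if len(pillar_scores) != 16:
        raise ValueError("expected 16 pillar scores")
    return sum(pillar_scores) / (max_band * len(pillar_scores))


def pillar_hits(pillar_scores, hit_band=2):
    """Number of pillars at or above the (assumed) band that counts as a hit."""
    return sum(1 for s in pillar_scores if s >= hit_band)


def meets_operating_point(pillar_scores, g_min=0.70, hits_min=12):
    """True when both thresholds quoted in the abstract are met."""
    return geo_score(pillar_scores) >= g_min and pillar_hits(pillar_scores) >= hits_min


if __name__ == "__main__":
    page = [3] * 12 + [0] * 4  # hypothetical audit result
    print(geo_score(page), pillar_hits(page), meets_operating_point(page))
```

Requiring both conditions means a page cannot pass on a few very strong pillars alone; it also needs broad coverage across pillars.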
Assessing Generative Models for Structured Data
Cannon, Reilly, Laird, Nicolette M., Vazquez, Caesar, Lin, Andy, Wagler, Amy, Chiang, Tony
Synthetic tabular data generation has emerged as a promising method to address limited data availability and privacy concerns. With the sharp increase in the performance of large language models in recent years, researchers have been interested in applying these models to the generation of tabular data. However, little is known about the quality of the generated tabular data from large language models. The predominant method for assessing the quality of synthetic tabular data is the train-synthetic-test-real approach, where the artificial examples are compared to the original by how well machine learning models, trained separately on the real and synthetic sets, perform in some downstream tasks. This method does not directly measure how closely the distribution of generated data approximates that of the original. This paper introduces rigorous methods for directly assessing synthetic tabular data against real data by looking at inter-column dependencies within the data. We find that large language models (GPT-2), both when queried via few-shot prompting and when fine-tuned, and GAN (CTGAN) models do not produce data with dependencies that mirror the original real data. Results from this study can inform future practice in synthetic data generation to improve data quality.
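One simple way to compare inter-column dependencies directly, rather than through downstream model accuracy, is to compare pairwise correlation matrices. The sketch below is a stand-in and not the method introduced in the paper: it uses Pearson correlation, which captures only linear dependence between numeric columns, and the function name `correlation_gap` is hypothetical.

```python
import numpy as np


def correlation_gap(real, synthetic):
    """Mean absolute difference between pairwise Pearson correlations of two tables.

    Both inputs are 2-D arrays with rows as records and the same numeric columns.
    A value near 0 means the synthetic table reproduces the linear dependencies.
    """
    r = np.corrcoef(real, rowvar=False)
    s = np.corrcoef(synthetic, rowvar=False)
    upper = np.triu_indices_from(r, k=1)  # each column pair counted once
    return float(np.mean(np.abs(r[upper] - s[upper])))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal(5000)
    b = a + 0.5 * rng.standard_normal(5000)  # strongly dependent on a
    real = np.column_stack([a, b])
    # A "generator" that matches each marginal exactly but destroys the dependence.
    fake = np.column_stack([rng.permutation(a), rng.permutation(b)])
    print("gap real vs real:", correlation_gap(real, real))
    print("gap real vs fake:", correlation_gap(real, fake))
```

The shuffled table has exactly the same per-column distributions as the real one, yet the gap is large, which illustrates why marginal or downstream-task checks alone can miss broken dependencies.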
Reviews: Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
This paper studies learning over-parametrized single hidden layer ReLU neural networks for multi-class classification via SGD and the corresponding generalization error. The authors consider a mixture data distribution where each class has well-separated and compact support. They show that SGD applied to the considered learning model achieves good prediction error with high probability under suitable assumptions. As a result, even in severely over-parametrized models, SGD can generalize well, although the network has enough capacity to fit arbitrary labels. The main insight in the theoretical analysis appears to be the observation that in the over-parametrized case, many ReLU neurons don't change their activation pattern when initialized randomly.
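The observation about stable activation patterns can be checked numerically. The following is a minimal sketch under assumed settings, not the paper's analysis: hidden weights are drawn from a standard Gaussian, each weight vector is moved by a small perturbation of fixed norm, and we count how many ReLU units switch between active and inactive on a fixed input. The function name `activation_flip_fraction` and the perturbation radius are hypothetical.

```python
import numpy as np


def activation_flip_fraction(w0, w1, x):
    """Fraction of hidden ReLU units whose on/off state on input x differs between w0 and w1.

    w0, w1: arrays of shape (num_units, dim); x: array of shape (dim,).
    """
    return float(np.mean((w0 @ x > 0) != (w1 @ x > 0)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_units, dim, radius = 5000, 50, 0.01  # assumed sizes and perturbation norm
    w0 = rng.standard_normal((num_units, dim))
    x = rng.standard_normal(dim)
    x /= np.linalg.norm(x)
    delta = rng.standard_normal((num_units, dim))
    delta *= radius / np.linalg.norm(delta, axis=1, keepdims=True)
    print("flipped fraction:", activation_flip_fraction(w0, w0 + delta, x))
```

A unit can only flip if its pre-activation at initialization is within the perturbation radius of zero, so with small per-unit movement only a small fraction of units change state.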
Representation Learning of Structured Data for Medical Foundation Models
Dwivedi, Vijay Prakash, Schlegel, Viktor, Liu, Andy T., Nguyen, Thanh-Tung, Kashyap, Abhinav Ramesh, Wei, Jeng, Yin, Wei-Hsian, Winkler, Stefan, Tan, Robby T.
Large Language Models (LLMs) have demonstrated remarkable performance across various domains, including healthcare. However, their ability to effectively represent structured non-textual data, such as the alphanumeric medical codes used in records like ICD-10 or SNOMED-CT, is limited and has been particularly exposed in recent research. This paper examines the challenges LLMs face in processing medical codes due to the shortcomings of current tokenization methods. As a result, we introduce the UniStruct architecture to design a multimodal medical foundation model of unstructured text and structured data, which addresses these challenges by adapting subword tokenization techniques specifically for the structured medical codes. Our approach is validated through model pre-training on both an extensive internal medical database and a public repository of structured medical records. Trained on over 1 billion tokens on the internal medical database, the proposed model achieves up to a 23% improvement in evaluation metrics, with around 2% gain attributed to our proposed tokenization. Additionally, when evaluated on the EHRSHOT public benchmark with a 1/1000 fraction of the pre-training data, the UniStruct model improves performance on over 42% of the downstream tasks. Our approach not only enhances the representation and generalization capabilities of patient-centric models but also bridges a critical gap in representation learning models' ability to handle complex structured medical data, alongside unstructured text.
Compression of Structured Data with Autoencoders: Provable Benefit of Nonlinearities and Depth
Kögler, Kevin, Shevchenko, Alexander, Hassani, Hamed, Mondelli, Marco
Autoencoders are a prominent model in many empirical branches of machine learning and lossy data compression. However, basic theoretical questions remain unanswered even in a shallow two-layer setting. In particular, to what degree does a shallow autoencoder capture the structure of the underlying data distribution? For the prototypical case of the 1-bit compression of sparse Gaussian data, we prove that gradient descent converges to a solution that completely disregards the sparse structure of the input. Namely, the performance of the algorithm is the same as if it were compressing a Gaussian source with no sparsity. For general data distributions, we give evidence of a phase transition phenomenon in the shape of the gradient descent minimizer, as a function of the data sparsity: below the critical sparsity level, the minimizer is a rotation taken uniformly at random (just like in the compression of non-sparse data); above the critical sparsity, the minimizer is the identity (up to a permutation). Finally, by exploiting a connection with approximate message passing algorithms, we show how to improve upon Gaussian performance for the compression of sparse data: adding a denoising function to a shallow architecture already reduces the loss provably, and a suitable multi-layer decoder leads to a further improvement. We validate our findings on image datasets, such as CIFAR-10 and MNIST.
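The claim that a rotation-based 1-bit encoder treats sparse data like Gaussian data can be illustrated with a small simulation. This is a sketch under several assumptions rather than the paper's setup: the source is Bernoulli-Gaussian with unit variance per coordinate, the encoder is sign(Bx) for a random orthogonal B with as many bits as coordinates, and the decoder is B^T z times one fitted scalar. All function names and parameter values are hypothetical.

```python
import numpy as np


def sparse_gaussian(n, d, p, rng):
    """Bernoulli-Gaussian samples: each entry is nonzero with probability p, unit variance."""
    mask = rng.random((n, d)) < p
    return mask * rng.standard_normal((n, d)) / np.sqrt(p)


def random_rotation(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # column sign fix so the draw is uniform (Haar)


def one_bit_mse(x, b):
    """Per-coordinate MSE of x -> sign(Bx) -> c * B^T sign(Bx), with c fitted by least squares."""
    z = np.where(x @ b.T >= 0, 1.0, -1.0)  # one bit per encoded coordinate
    back = z @ b                           # rows hold B^T z for each sample
    c = np.sum(back * x) / np.sum(back * back)
    return float(np.mean((x - c * back) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 128
    b = random_rotation(d, rng)
    print("Gaussian source:", one_bit_mse(rng.standard_normal((2000, d)), b))
    print("sparse source:  ", one_bit_mse(sparse_gaussian(2000, d, 0.3, rng), b))
    print("1 - 2/pi:       ", 1 - 2 / np.pi)
```

For this encoder and decoder pair, both errors should land close to 1 - 2/pi: the rotation mixes the coordinates, so the encoder sees nearly Gaussian inputs whether or not the source is sparse.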
Efficient Asynchronize Stochastic Gradient Algorithm with Structured Data
Deep learning has achieved impressive success in a variety of fields because of its good generalization. However, quickly training a neural network with a large number of layers remains a challenging problem. Existing works use locality-sensitive hashing or space-partitioning data structures to reduce the training cost of each iteration. In this work, we aim to accelerate the computations in each iteration from the perspective of the input data points. Specifically, for a two-layer fully connected neural network, when the training data have some special properties, e.g., Kronecker structure, each iteration can be completed in sublinear time in the data dimension.
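To see why Kronecker structure can make sublinear-time computation plausible, consider inner products. The sketch below shows only the standard identity for inner products of Kronecker products, not the paper's algorithm: if two vectors of dimension k*k are each a Kronecker product of two length-k factors, their inner product can be computed from the factors in O(k) time instead of O(k*k). The function name `kron_inner` is hypothetical.

```python
import numpy as np


def kron_inner(a, b, c, d):
    """Inner product of kron(a, b) and kron(c, d), computed without forming either product.

    Uses the identity <kron(a, b), kron(c, d)> = <a, c> * <b, d>.
    """
    return float(np.dot(a, c) * np.dot(b, d))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = 8
    a, b, c, d = (rng.standard_normal(k) for _ in range(4))
    fast = kron_inner(a, b, c, d)                      # touches 4 * k numbers
    slow = float(np.kron(a, b) @ np.kron(c, d))        # touches 2 * k * k numbers
    print(fast, slow)
```

The factored computation never materializes the full-dimensional vectors, so its cost grows with the square root of the data dimension in this two-factor case.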