Small Data Explainer -- The impact of small data methods in everyday life
Hackenberg, Maren, Connor, Sophia G., Kabus, Fabian, Brawner, June, Markham, Ella, Hardalupas, Mahi, Chowdhury, Areeq, Backofen, Rolf, Köttgen, Anna, Rohde, Angelika, Binder, Nadine, Binder, Harald, and the Collaborative Research Center 1597 Small Data
The emergence of breakthrough artificial intelligence (AI) techniques has led to a renewed focus on how small data settings, i.e., settings with limited information, can benefit from such developments. This includes societal issues such as how best to include under-represented groups in data-driven policy and decision making, or the health benefits of assistive technologies such as wearables. We provide a conceptual overview, in particular contrasting small data with big data, and identify common themes from exemplary case studies and application areas. Potential solutions are described in a more detailed technical overview of current data analysis and modelling techniques, highlighting contributions from different disciplines, such as knowledge-driven modelling from statistics and data-driven modelling from computer science. By linking application settings, conceptual contributions and specific techniques, we highlight what is already feasible and suggest what an agenda for fully leveraging small data might look like.
On Multivariate Financial Time Series Classification
This article investigates the use of machine learning and deep learning models for multivariate time series analysis in financial markets. It compares small and big data approaches, focusing on their distinct challenges and the benefits of scaling. Traditional methods such as SVMs are contrasted with modern architectures like ConvTimeNet. The results underscore the importance of using and understanding big data in depth when analysing and predicting financial time series.
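As a concrete illustration of the kind of classical small-data baseline the article contrasts with deep architectures, a one-nearest-neighbour classifier over flattened multivariate series fits in a few lines. The toy series, labels, and function names below are invented for illustration and are not taken from the article:

```python
# Minimal small-data baseline for multivariate time series classification:
# 1-nearest-neighbour with Euclidean distance on flattened series (stdlib only).
import math

def flatten(series):
    # series: list of timesteps, each a list of channel values
    return [v for step in series for v in step]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn1_predict(train, labels, query):
    # assign the label of the closest training series
    dists = [euclidean(flatten(s), flatten(query)) for s in train]
    return labels[dists.index(min(dists))]

train = [
    [[0.1, 1.0], [0.2, 1.1], [0.3, 1.2]],   # toy series, class "up"
    [[1.0, 0.1], [0.9, 0.0], [0.8, -0.1]],  # toy series, class "down"
]
labels = ["up", "down"]
query = [[0.15, 1.05], [0.25, 1.15], [0.35, 1.25]]
print(knn1_predict(train, labels, query))  # → up
```

With only a handful of labelled series, such distance-based rules are often competitive, which is exactly the regime the article's small-versus-big-data comparison probes.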
Reviews: Generative Neural Machine Translation
Summary: This paper proposes a generative latent variable model for neural machine translation, where inference is performed with variational inference. This extends the work of Zhang et al., 2016, who proposed a conditional model with variational inference. The advantage of the generative model is that it forces the latent variable to capture more of the semantics of the sentence than the conditional model was able to. The main disadvantage of this approach is that the value of the latent variable has to be inferred during decoding (based on candidate generations). The paper also shows that a version of this model can be trained in a multilingual setting, that monolingual data can be used for semi-supervised training, and that the inference algorithm can be extended to perform translation with missing words.
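Variational inference of the kind used here typically samples the latent variable with the reparameterization trick, z = mu + sigma * eps, so gradients flow through mu and sigma. A scalar sketch with made-up numbers, not the paper's architecture:

```python
# Reparameterization trick for a scalar Gaussian latent variable:
# sample noise from N(0, 1) and shift/scale it deterministically.
import random

def sample_latent(mu, sigma, rng):
    eps = rng.gauss(0.0, 1.0)   # noise from the standard normal
    return mu + sigma * eps     # differentiable w.r.t. mu and sigma

rng = random.Random(0)
zs = [sample_latent(mu=1.0, sigma=0.1, rng=rng) for _ in range(1000)]
mean_z = sum(zs) / len(zs)
print(round(mean_z, 1))  # sample mean is close to mu = 1.0
```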
Data Augmentation is Dead, Long Live Data Augmentation
Piedboeuf, Frédéric, Langlais, Philippe
Textual data augmentation (DA) is a prolific field of study in which novel techniques for creating artificial data are regularly proposed, and it has demonstrated great efficiency in small data settings, at least for text classification tasks. In this paper, we challenge those results, showing that classical data augmentation is simply a way of performing better fine-tuning, and that spending more time fine-tuning before applying data augmentation negates its effect. This is a significant contribution, as it answers several questions that were left open in recent years, namely: which DA technique performs best (all of them, as long as they generate data close enough to the training set so as not to impair training) and why DA showed positive results (it facilitates training of the network). We furthermore show that zero- and few-shot data generation via conversational agents such as ChatGPT or LLama2 can increase performance, concluding that this form of data augmentation does still work, even if classical methods do not.
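One example of the "classical DA" family the paper evaluates is random word swap, which perturbs word order while keeping the vocabulary of the example intact. This is a hedged sketch of the general technique, not the authors' exact implementation:

```python
# Classical textual data augmentation: random word swap.
# Each augmented sentence keeps exactly the same words in a shuffled order.
import random

def random_swap(sentence, n_swaps=1, seed=0):
    rng = random.Random(seed)
    words = sentence.split()
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

augmented = [random_swap("small data needs careful augmentation", seed=s)
             for s in range(3)]
for s in augmented:
    print(s)
```

Because such transforms only reshuffle training material that is already close to the training distribution, the paper's claim that they amount to extended fine-tuning is plausible on its face.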
How to Do Machine Learning with Small Data? -- A Review from an Industrial Perspective
Kraljevski, Ivan, Ju, Yong Chul, Ivanov, Dmitrij, Tschöpe, Constanze, Wolff, Matthias
Artificial intelligence has experienced a technological breakthrough in science, industry, and everyday life in recent decades. The advancements can be credited to the ever-increasing availability and miniaturization of computational resources, which resulted in exponential data growth. However, because of the insufficient amount of data in some cases, employing machine learning to solve complex tasks is not straightforward or even possible. As a result, machine learning with small data is of rising importance in data science and in applications across several fields. The authors focus on interpreting the general term "small data" and its role in engineering and industrial applications. They give a brief overview of the most important industrial applications of machine learning with small data. Small data is defined in terms of various characteristics compared to big data, and a machine learning formalism is introduced. Five critical challenges of machine learning with small data in industrial applications are presented: unlabeled data, imbalanced data, missing data, insufficient data, and rare events. Based on these definitions, an overview of the considerations in domain representation and data acquisition is given, along with a taxonomy of machine learning approaches in the context of small data.
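Of the five challenges listed, imbalanced data is the one with the simplest textbook remedy: random oversampling of the minority class until class counts match. A minimal stdlib sketch with invented toy labels, shown only to make the challenge concrete, not taken from the review:

```python
# Random oversampling: duplicate minority-class samples until every class
# reaches the size of the largest class.
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_s, out_l = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - n):
            out_s.append(rng.choice(pool))
            out_l.append(cls)
    return out_s, out_l

X = [[0.1], [0.2], [0.3], [0.9]]
y = ["ok", "ok", "ok", "rare"]      # 3-to-1 imbalance
Xb, yb = oversample(X, y)
print(Counter(yb))  # both classes now equally represented
```

In genuinely small data settings this only re-weights what little evidence exists, which is why the review treats imbalance alongside, rather than as a substitute for, insufficient data.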
Gauge-optimal approximate learning for small data classification problems
Vecchi, Edoardo, Bassetti, Davide, Graziato, Fabio, Pospisil, Lukas, Horenko, Illia
Small data learning problems are characterized by a significant discrepancy between the limited number of response variable observations and the large dimension of the feature space. In this setting, common learning tools struggle to separate the features important for the classification task from those that bear no relevant information, and cannot derive an appropriate learning rule to discriminate between the different classes. As a potential solution to this problem, here we exploit the idea of reducing and rotating the feature space in a lower-dimensional gauge and propose the Gauge-Optimal Approximate Learning (GOAL) algorithm, which provides an analytically tractable joint solution to the dimension reduction, feature segmentation and classification problems for small data learning. We prove that the optimal solution of the GOAL algorithm consists of piecewise-linear functions in the Euclidean space, and that it can be approximated through a monotonically convergent algorithm which presents -- under the assumption of a discrete segmentation of the feature space -- a closed-form solution for each optimization substep and an overall linear iteration cost scaling. The GOAL algorithm has been compared to other state-of-the-art machine learning (ML) tools on both synthetic data and challenging real-world applications from climate science and bioinformatics (i.e., prediction of the El Nino Southern Oscillation and inference of epigenetically-induced gene-activity networks from limited experimental data). The experimental results show that the proposed algorithm outperforms the reported best competitors for these problems, both in learning performance and in computational cost.
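GOAL itself solves dimension reduction, segmentation and classification jointly; as a much simpler hedged sketch of the underlying recipe (reduce the feature space first, then apply a rule whose decision boundary is piecewise linear), here is top-variance feature selection followed by a nearest-centroid classifier. The data and helper names are invented for illustration and this is not the GOAL algorithm:

```python
# Reduce-then-classify sketch: keep the k highest-variance features,
# then classify by nearest class centroid (a piecewise-linear rule).
def variance(col):
    m = sum(col) / len(col)
    return sum((x - m) ** 2 for x in col) / len(col)

def top_k_features(X, k):
    d = len(X[0])
    vs = [(variance([row[j] for row in X]), j) for j in range(d)]
    keep = sorted(j for _, j in sorted(vs, reverse=True)[:k])
    return [[row[j] for j in keep] for row in X], keep

def centroids(X, y):
    cents = {}
    for cls in set(y):
        rows = [r for r, l in zip(X, y) if l == cls]
        cents[cls] = [sum(c) / len(c) for c in zip(*rows)]
    return cents

def predict(cents, x):
    def d2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(cents, key=lambda c: d2(cents[c], x))

X = [[0.0, 5.0, 0.01], [0.1, 4.8, 0.02],   # class A: low feature 0
     [1.0, 5.1, 0.01], [0.9, 4.9, 0.02]]   # class B: high feature 0
y = ["A", "A", "B", "B"]
Xr, keep = top_k_features(X, 2)            # feature 2 is near-constant, dropped
cents = centroids(Xr, y)
print(predict(cents, [0.05, 5.0]))         # query given in the reduced space → A
```

Between any two centroids the decision boundary is a hyperplane, so with several classes the overall rule is piecewise linear, mirroring (in a toy way) the structure the GOAL paper proves optimal.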
Transfer-Once-For-All: AI Model Optimization for Edge
Kundu, Achintya, Wynter, Laura, Lee, Rhui Dih, Bathen, Luis Angel
Weight-sharing neural architecture search aims to optimize a configurable neural network model (supernet) for a variety of deployment scenarios across many devices with different resource constraints. Existing approaches use evolutionary search to extract models of different sizes from a supernet trained on a very large data set, and then fine-tune the extracted models on the typically small, real-world data set of interest. The computational cost of training thus grows linearly with the number of different model deployment scenarios. Hence, we propose Transfer-Once-For-All (TOFA) for supernet-style training on small data sets with constant computational training cost over any number of edge deployment scenarios. Given a task, TOFA obtains custom neural networks, both the topology and the weights, optimized for any number of edge deployment scenarios. To overcome the challenges arising from small data, TOFA utilizes a unified semi-supervised training loss to simultaneously train all subnets within the supernet, coupled with on-the-fly architecture selection at deployment time.
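The weight-sharing idea behind supernet training can be illustrated with a toy over-parameterized layer whose smaller subnets reuse prefixes of the shared weight matrix, so extracting a narrower deployment variant needs no retraining from scratch. Toy numbers only; this illustrates weight sharing, not TOFA itself:

```python
# A toy "supernet" linear layer: subnets of different widths reuse the
# first `width` columns of one shared weight matrix.
shared_W = [[0.2, -0.1, 0.4, 0.3],   # 4 hidden units (maximum width)
            [0.5,  0.0, -0.2, 0.1]]  # 2 input features

def subnet_forward(x, width):
    # a width-w subnet computes only the first w hidden activations,
    # using exactly the same shared weights as the full model
    return [sum(x[i] * shared_W[i][j] for i in range(len(x)))
            for j in range(width)]

x = [1.0, 2.0]
print(subnet_forward(x, 2))  # narrow edge variant
print(subnet_forward(x, 4))  # full-width variant, same shared weights
```

Because every subnet reads the same parameters, training the supernet once amortizes the cost across all deployment widths, which is the constant-cost property TOFA targets.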
A Survey of Learning on Small Data: Generalization, Optimization, and Challenge
Cao, Xiaofeng, Bu, Weixin, Huang, Shengjun, Zhang, Minling, Tsang, Ivor W., Ong, Yew Soon, Kwok, James T.
Learning on big data brings success for artificial intelligence (AI), but the annotation and training costs are expensive. In the future, learning on small data that approximates the generalization ability of big data is one of the ultimate goals of AI, requiring machines to recognize objectives and scenarios from small data, as humans do. A series of learning topics, such as active learning and few-shot learning, follows this direction. However, there are few theoretical guarantees for their generalization performance. Moreover, most of their settings are passive, that is, the label distribution is explicitly controlled by finite training resources from known distributions. This survey follows the agnostic active sampling theory under a PAC (Probably Approximately Correct) framework to analyze the generalization error and label complexity of learning on small data in a model-agnostic supervised and unsupervised fashion. Considering that multiple learning communities could produce small data representations and that related topics have been well surveyed, we subjoin novel geometric representation perspectives for small data: the Euclidean and non-Euclidean (hyperbolic) mean, where optimization solutions including the Euclidean gradient, non-Euclidean gradients, and Stein gradient are presented and discussed. We then summarize multiple learning communities that may be improved by learning on small data and that yield data-efficient representations, such as transfer learning, contrastive learning, and graph representation learning. Meanwhile, we find that meta-learning may provide effective parameter update policies for learning on small data. Next, we explore multiple challenging scenarios for small data, such as weak supervision and multi-label learning. Finally, we survey multiple data applications that may benefit from efficient small data representation.
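The Euclidean mean the survey starts from is the minimizer of the summed squared distances to the data points, and its "Euclidean gradient" view can be sketched by gradient descent on that objective (the closed form is, of course, just the arithmetic average). Step size and iteration count below are arbitrary choices for the sketch:

```python
# Euclidean mean as an optimization problem:
# minimize f(m) = sum_p (m - p)^2 by gradient descent.
pts = [1.0, 2.0, 6.0]
m = 0.0
for _ in range(200):
    grad = sum(2 * (m - p) for p in pts)  # df/dm
    m -= 0.05 * grad / len(pts)
print(round(m, 4))  # converges to the arithmetic mean 3.0
```

The survey's non-Euclidean (hyperbolic) mean replaces this gradient with one respecting the curved geometry, but the "mean as minimizer" formulation stays the same.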
DA-VEGAN: Differentiably Augmenting VAE-GAN for microstructure reconstruction from extremely small data sets
Zhang, Yichi, Seibert, Paul, Otto, Alexandra, Raßloff, Alexander, Ambati, Marreddy, Kästner, Markus
Microstructure reconstruction is an important and emerging field of research and an essential foundation for improving inverse computational materials engineering (ICME). Much of the recent progress in the field is based on generative adversarial networks (GANs). Although excellent results have been achieved across a variety of materials, challenges remain regarding the interpretability of the model's latent space as well as the applicability to extremely small data sets. The present work addresses these issues by introducing DA-VEGAN, a model with two central innovations. First, a $\beta$-variational autoencoder is incorporated into a hybrid GAN architecture that allows strong nonlinearities in the latent space to be penalized by an additional parameter, $\beta$. Second, a custom differentiable data augmentation scheme is developed specifically for this architecture. The differentiability allows the model to learn from extremely small data sets without mode collapse or deteriorated sample quality. An extensive validation on a variety of structures demonstrates the potential of the method, and future directions of investigation are discussed.
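The role of $\beta$ can be seen directly in the standard $\beta$-VAE objective, where it weights the KL term: larger $\beta$ penalizes latent codes that stray from the prior more heavily. A scalar-Gaussian sketch with made-up numbers, not the paper's model:

```python
# Standard beta-VAE loss for a scalar Gaussian latent:
# reconstruction error + beta * KL(q(z|x) || N(0, 1)).
import math

def beta_vae_loss(x, x_hat, mu, log_var, beta):
    recon = (x - x_hat) ** 2  # squared-error reconstruction term
    # closed-form KL between N(mu, exp(log_var)) and N(0, 1)
    kl = 0.5 * (math.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + beta * kl

low = beta_vae_loss(1.0, 0.9, mu=0.5, log_var=0.0, beta=1.0)
high = beta_vae_loss(1.0, 0.9, mu=0.5, log_var=0.0, beta=4.0)
print(low, high)  # the larger beta penalizes the same latent deviation more
```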
Pinaki Laskar on LinkedIn: #artificialintelligence #AItechnology #machinelearning
Today's AI is largely machine learning: deep learning algorithms and deep neural networks that cannot identify causality, with its elements and structures, processes and mechanisms, rules and relationships, data and models, all of what makes up our world. This leads to all sorts of decision and prediction errors, data and algorithmic biases, a lack of quality data, and implementation failings, i.e., the absence of real machine intelligence and learning. Limitations of state-of-the-art, correlation-based ML:
- Predictions only
- Limited explainability
- Spirals out of control in novel situations
- Minimal human-machine interaction
- Constrained by historical data
- No guarantees on fairness
- Needs a lot of data
True AI will emerge as Causal AI. Decision-making AI: Causal AI doesn't just predict the future, it shapes it. Explainable AI: put the "cause" in "because" with next-generation explainable AI.