Reviews: Powerset Convolutional Neural Networks

Neural Information Processing Systems

The authors present a neural network architecture for set functions, i.e., functions defined on the subsets of a larger set. The authors provide a clear introduction to the problem in terms of convolutional operators and design CNN architectures on top of [1] through the addition of pooling operations for set-function problems. The resulting networks performed competitively with baseline graph convolutional networks, although they were slightly outperformed on a subset of tasks. The reviewers greatly appreciated the presentation of the work: the ideas were well motivated, the explanations were clear, and the overall presentation was well organized. Reviewers noted, however, that the experiments were conducted on relatively small datasets.


Reviews: How to tell when a clustering is (approximately) correct using convex relaxations

Neural Information Processing Systems

This paper presents a general method to derive bounds on how close a given clustering is to an optimal clustering, where optimality is defined by a chosen loss function. The method can be used with any clustering loss for which a convex relaxation exists (e.g., K-means, spectral clustering). In experiments, the authors show that they obtain much better bounds than the only related work (as far as I know) on K-means [Mei06]. The paper is well written, easy to follow, and addresses the important problem of evaluating the quality of clusterings. Its main contribution is the use of tighter convex relaxations to bound the distance between an optimal clustering and a given clustering.


The Power of LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions

Wagner, Stefan Sylvius, Behrendt, Maike, Ziegele, Marc, Harmeling, Stefan

arXiv.org Artificial Intelligence

Stance detection holds great potential for enhancing the quality of online political discussions, as it has been shown to be useful for summarizing discussions, detecting misinformation, and evaluating opinion distributions. Usually, transformer-based models are used directly for stance detection, and these require large amounts of data. However, the broad range of debate questions in online political discussions creates a variety of possible scenarios that the model is faced with and thus makes data acquisition for model training difficult. In this work, we show how to leverage LLM-generated synthetic data to train and improve stance detection agents for online political discussions: (i) we generate synthetic data for specific debate questions by prompting a Mistral-7B model and show that fine-tuning on the generated synthetic data can substantially improve the performance of stance detection; (ii) we examine the impact of combining synthetic data with the most informative samples from an unlabelled dataset: first, we use the synthetic data to select the most informative samples, and second, we combine these samples with the synthetic data for fine-tuning. This approach reduces labelling effort and consistently surpasses the performance of a baseline model trained on fully labelled data. Overall, we show in comprehensive experiments that LLM-generated data greatly improves stance detection performance for online political discussions.
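The sample-selection step described in (ii) can be sketched with a standard uncertainty criterion: rank unlabelled samples by the predictive entropy of a stance classifier and keep the most uncertain ones. This is a minimal illustration only; the function name and the entropy criterion are assumptions, and the paper's exact selection strategy may differ.

```python
import numpy as np

def select_most_informative(probs, k):
    """Pick the k unlabelled samples whose predicted stance distribution
    has the highest entropy, i.e. where the model is least certain.

    probs: (n_samples, n_classes) class probabilities, e.g. from a stance
    detector fine-tuned on the LLM-generated synthetic data.
    Returns indices into the unlabelled pool, most uncertain first.
    """
    eps = 1e-12                                   # avoid log(0)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return np.argsort(entropy)[::-1][:k]          # highest entropy first
```

For example, given probabilities `[[0.5, 0.5], [0.99, 0.01], [0.6, 0.4]]`, the first row (a near-uniform prediction) would be selected before the confident second row.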


Generalizing Analytic Shrinkage for Arbitrary Covariance Structures

Neural Information Processing Systems

Analytic shrinkage is a statistical technique that offers a fast alternative to cross-validation for the regularization of covariance matrices and has appealing consistency properties. We show that the proof of consistency requires bounds on the growth rates of eigenvalues and their dispersion, which are often violated in data. We prove consistency under assumptions which do not restrict the covariance structure and therefore better match real world data. In addition, we propose an extension of analytic shrinkage, orthogonal complement shrinkage, which adapts to the covariance structure. Finally, we demonstrate the superior performance of our novel approach on data from the domains of finance, spoken letter and optical character recognition, and neuroscience.
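For readers unfamiliar with analytic shrinkage, here is a minimal sketch of the classical Ledoit-Wolf-style linear shrinkage that the paper builds on: the sample covariance is pulled toward a scaled identity, with the intensity computed in closed form rather than by cross-validation. This illustrates the baseline technique only, not the paper's orthogonal complement shrinkage, and assumes centered data.

```python
import numpy as np

def ledoit_wolf_shrinkage(X):
    """Linear shrinkage of the sample covariance toward a scaled identity.

    X: (n_samples, n_features) data matrix, assumed centered.
    Returns the shrunk covariance and the shrinkage intensity in [0, 1].
    """
    n, p = X.shape
    S = X.T @ X / n                       # sample covariance
    mu = np.trace(S) / p                  # scale of the identity target
    target = mu * np.eye(p)
    d2 = np.sum((S - target) ** 2) / p    # dispersion of S around the target
    # average squared deviation of per-sample outer products from S
    b2 = 0.0
    for x in X:
        b2 += np.sum((np.outer(x, x) - S) ** 2) / p
    b2 = min(b2 / n**2, d2)
    lam = b2 / d2 if d2 > 0 else 1.0      # closed-form shrinkage intensity
    return (1 - lam) * S + lam * target, lam
```

Even when there are more features than samples (so the sample covariance is singular), the shrunk estimate remains positive definite, which is the practical appeal of the method.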


A novel algorithm can generate data to train machine learning models in conditions of extreme scarcity of real world data

Niel, Olivier

arXiv.org Artificial Intelligence

Training machine learning models requires large datasets. However, collecting, curating, and operating large and complex sets of real world data poses problems of costs, ethical and legal issues, and data availability. Here we propose a novel algorithm to generate large artificial datasets to train machine learning models in conditions of extreme scarcity of real world data. The algorithm is based on a genetic algorithm, which mutates randomly generated datasets subsequently used for training a neural network. After training, the performance of the neural network on a batch of real world data is considered a surrogate for the fitness of the generated dataset used for its training. As selection pressure is applied to the population of generated datasets, unfit individuals are discarded, and the fitness of the fittest individuals increases through generations. The performance of the data generation algorithm was measured on the Iris dataset and on the Breast Cancer Wisconsin diagnostic dataset. In conditions of real world data abundance, mean accuracy of machine learning models trained on generated data was comparable to mean accuracy of models trained on real world data (0.956 in both cases on the Iris dataset, p = 0.6996, and 0.9377 versus 0.9472 on the Breast Cancer dataset, p = 0.1189). In conditions of simulated extreme scarcity of real world data, mean accuracy of machine learning models trained on generated data was significantly higher than mean accuracy of comparable models trained on scarce real world data (0.9533 versus 0.9067 on the Iris dataset, p < 0.0001, and 0.8692 versus 0.7701 on the Breast Cancer dataset, p = 0.0091). In conclusion, this novel algorithm can generate large artificial datasets to train machine learning models, in conditions of extreme scarcity of real world data, or when cost or data sensitivity prevent the collection of large real world datasets.
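The loop described above (mutate candidate datasets, train a model on each, use accuracy on a real batch as fitness, keep the fittest) can be sketched as follows. All names and parameters here are illustrative, the `train`/`evaluate` callables stand in for whatever model the user plugs in, and label mutation is omitted for brevity; this is an assumption-laden sketch, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(dataset, real_X, real_y, train, evaluate):
    # train a model on the generated dataset, score it on the real batch
    model = train(*dataset)
    return evaluate(model, real_X, real_y)

def evolve(real_X, real_y, train, evaluate, pop=20, gens=30,
           n_points=50, n_feat=4, n_classes=3, sigma=0.1):
    """Evolve a population of random labelled datasets; each dataset's
    fitness is the accuracy (on real data) of a model trained on it."""
    population = [(rng.normal(size=(n_points, n_feat)),
                   rng.integers(0, n_classes, size=n_points))
                  for _ in range(pop)]
    for _ in range(gens):
        scored = sorted(population,
                        key=lambda d: fitness(d, real_X, real_y, train, evaluate),
                        reverse=True)
        elite = scored[:pop // 2]              # selection pressure: keep the fittest
        children = [(X + rng.normal(scale=sigma, size=X.shape), y.copy())
                    for X, y in elite]         # Gaussian mutation of features
        population = elite + children
    return max(population,
               key=lambda d: fitness(d, real_X, real_y, train, evaluate))
```

Because the elite survives unchanged each generation and fitness is deterministic given the model, the best fitness is non-decreasing across generations, which matches the abstract's claim that fitness increases under selection pressure.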


Density peak clustering using tensor network

Shi, Xiao, Shang, Yun

arXiv.org Artificial Intelligence

Tensor networks, which have been traditionally used to simulate many-body physics, have recently gained significant attention in the field of machine learning due to their powerful representation capabilities. In this work, we propose a density-based clustering algorithm inspired by tensor networks. We encode classical data into tensor network states on an extended Hilbert space and train the tensor network states to capture the features of the clusters. Here, we define density and related concepts in terms of fidelity, rather than using a classical distance measure. We evaluate the performance of our algorithm on six synthetic data sets, four real world data sets, and three commonly used computer vision data sets. The results demonstrate that our method provides state-of-the-art performance on several synthetic data sets and real world data sets, even when the number of clusters is unknown. Additionally, our algorithm performs competitively with state-of-the-art algorithms on the MNIST, USPS, and Fashion-MNIST image data sets. These findings reveal the great potential of tensor networks for machine learning applications.
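As background, the underlying density peak clustering scheme (Rodriguez and Laio) works by ranking points by local density and by distance to the nearest denser point, then growing clusters from the density peaks. The sketch below uses plain Euclidean distances; the paper's contribution is to replace the distance-based density with a fidelity between tensor-network states, which is not reproduced here.

```python
import numpy as np

def density_peak_clusters(X, d_c, n_clusters):
    """Classical density peak clustering on Euclidean distances.

    d_c: cutoff radius for the local density estimate.
    Returns an integer cluster label per point.
    """
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    rho = (D < d_c).sum(axis=1) - 1            # local density: neighbors within d_c
    rank = np.lexsort((np.arange(n), rho))     # strict (density, index) order
    pos = np.empty(n, dtype=int)
    pos[rank] = np.arange(n)
    delta = np.zeros(n)
    nearest_denser = np.full(n, -1)
    for i in range(n):
        denser = rank[pos[i] + 1:]             # all points ranked strictly above i
        if len(denser) == 0:
            delta[i] = D[i].max()              # global density maximum
        else:
            j = denser[np.argmin(D[i, denser])]
            delta[i], nearest_denser[i] = D[i, j], j
    # peaks: simultaneously high density and far from any denser point
    peaks = np.argsort(rho * delta)[::-1][:n_clusters]
    labels = np.full(n, -1)
    labels[peaks] = np.arange(n_clusters)
    for i in rank[::-1]:                       # assign in decreasing density order
        if labels[i] == -1 and nearest_denser[i] >= 0:
            labels[i] = labels[nearest_denser[i]]
    return labels
```

On two well-separated blobs this recovers the blobs without being told where they are, only how many clusters to extract; the fidelity-based variant in the paper additionally relaxes the need to know the number of clusters.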


CoMadOut -- A Robust Outlier Detection Algorithm based on CoMAD

Lohrer, Andreas, Kazempour, Daniyal, Hünemörder, Maximilian, Kröger, Peer

arXiv.org Artificial Intelligence

Unsupervised learning methods are well established in the area of anomaly detection and achieve state-of-the-art performance on outlier data sets. Outliers play a significant role, since they bear the potential to distort the predictions of a machine learning algorithm on a given data set. Especially among PCA-based methods, outliers have an additional destructive potential regarding the result: they not only distort the orientation and translation of the principal components, they also make it more complicated to detect outliers. To address this problem, we propose the robust outlier detection algorithm CoMadOut, which satisfies two required properties: (1) being robust towards outliers and (2) detecting them. Our outlier detection method using coMAD-PCA defines, depending on its variant, an inlier region with a robust noise margin by measures of in-distribution (ID) and out-of-distribution (OOD). These measures allow distribution-based outlier scoring for each principal component, and thus an appropriate alignment of the decision boundary between normal and abnormal instances. Experiments comparing CoMadOut with traditional, deep, and other comparable robust outlier detection methods showed that the introduced CoMadOut approach is competitive with well-established methods in terms of average precision (AP), recall, and area under the receiver operating characteristic curve (AUROC). In summary, our approach can be seen as a robust alternative for outlier detection tasks.
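The coMAD-PCA building block can be illustrated with the co-median (comedian) matrix, a median-based analogue of the covariance matrix: center each feature by its median and take medians of the pairwise products, then eigendecompose. This sketch shows only the robust component extraction; CoMadOut's variant-specific inlier regions and outlier scoring are omitted, and the function names are illustrative.

```python
import numpy as np

def comad_matrix(X):
    """Co-median (coMAD) matrix: a robust analogue of the covariance
    matrix that uses medians instead of means, so a few extreme rows
    barely move the estimate."""
    Z = X - np.median(X, axis=0)            # center by column medians
    p = X.shape[1]
    C = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            C[i, j] = np.median(Z[:, i] * Z[:, j])
    return C

def comad_pca(X, n_components):
    """Principal directions from the coMAD matrix (the coMAD-PCA step);
    outlier scoring on top of these components is not shown here."""
    C = comad_matrix(X)
    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(vals)[::-1][:n_components]
    return vals[order], vecs[:, order]
```

On data lying along a line with one gross outlier, the leading coMAD direction stays aligned with the line, whereas the leading direction of a classical covariance PCA would be pulled toward the outlier. This is exactly the robustness property the abstract attributes to coMAD-PCA.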


Phenotype Detection in Real World Data via Online MixEHR Algorithm

Xu, Ying, Gauriau, Romane, Decker, Anna, Oppenheim, Jacob

arXiv.org Artificial Intelligence

Understanding patterns of diagnoses, medications, procedures, and laboratory tests from electronic health records (EHRs) and health insurer claims is important for understanding disease risk and for efficient clinical development, and often requires rules-based curation in collaboration with clinicians. We extended an unsupervised phenotyping algorithm, mixEHR, to an online version, allowing us to use it on order-of-magnitude larger datasets, including a large US-based claims dataset and a rich regional EHR dataset. In addition to recapitulating previously observed disease groups, we discovered clinically meaningful disease subtypes and comorbidities. This work scaled up an effective unsupervised learning method, reinforced existing clinical knowledge, and is a promising approach for efficient collaboration with clinicians.


How to Deploy Machine Learning with Messy, Real World Data

#artificialintelligence

Machine learning and artificial intelligence give global health practitioners the ability to glean new insights from data they are already collecting as part of implementing their programs. However, little practice-based research has been documented on how to incorporate machine learning into international development programs. Current systems mirror, in form and format, the use of manually completed paper records to create periodic reports for leadership. This has vexed health officials with a proliferation of systems, leaving some "data rich, but information poor". Yet the growth of available analytical systems and the exponential growth of data require the global digital health community to become conversant in this technology in order to continue making contributions that help fulfill our missions.


No, You're Not Alone. Google Is Also Making This Big Mistake On AI

#artificialintelligence

Just this past month, an article was shared showing that over 30% of the data used by Google for one of their shared machine learning models was mislabeled. Not only was the model itself full of errors, but the training data used by that model was full of mistakes. How could anyone using Google's model ever hope to trust the results if it's full of human-induced errors that computers can't fix? And Google isn't alone in major data mislabeling: an MIT study in 2021 found that almost 6% of the images in the industry-standard ImageNet database are mislabeled, and furthermore found "label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets". How can we hope to trust or use these models if the data used to train them is so bad?