
A Approximate Sampling from k-DPP Marginals

Neural Information Processing Systems

In view of this, Barthelmé et al. (2019) propose an approximation to k-DPPs, valid for large-scale ground sets, which has better numerical properties. Let L(h): H → [0, 1] be a random variable. The first equality uses Proposition 4; the second uses Proposition 3. We decompose the game regret into the sum of the player regret and the sampler regret. If the decision set has diameter D and the gradients are bounded by G, then a player running the SGD algorithm suffers regret at most O(GD√T). For convex regression and classification models we use linear models.
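The SGD regret bound quoted above, in its standard form O(GD√T) for gradient norms bounded by G and a decision set of diameter D, can be sketched with projected online gradient descent (a generic illustration, not this paper's construction):

```python
import numpy as np

def online_gradient_descent(loss_grad, d, D, G, T):
    """Projected online gradient descent over the Euclidean ball of
    diameter D. With gradient norms bounded by G and step size
    eta = D / (G * sqrt(T)), the regret against any fixed comparator
    in the ball is at most G * D * sqrt(T)."""
    w = np.zeros(d)
    eta = D / (G * np.sqrt(T))
    iterates = []
    for t in range(T):
        iterates.append(w.copy())
        g = loss_grad(t, w)          # gradient of the round-t loss at w
        w = w - eta * g
        norm = np.linalg.norm(w)
        if norm > D / 2:             # project back onto the feasible ball
            w = w * (D / 2) / norm
    return iterates
```

For instance, with the 1-Lipschitz losses f_t(w) = |w − 1| the cumulative regret stays well under GD√T.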


MLFMF: Data Sets for Machine Learning for Mathematical Formalization

Neural Information Processing Systems

We introduce MLFMF, a collection of data sets for benchmarking recommendation systems used to support formalization of mathematics with proof assistants. These systems help humans identify which previous entries (theorems, constructions, datatypes, and postulates) are relevant in proving a new theorem or carrying out a new construction. Each data set is derived from a library of formalized mathematics written in proof assistants Agda or Lean. The collection includes the largest Lean 4 library Mathlib, and some of the largest Agda libraries: the standard library, the library of univalent mathematics Agda-unimath, and the TypeTopology library. Each data set represents the corresponding library in two ways: as a heterogeneous network, and as a list of s-expressions representing the syntax trees of all the entries in the library.
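The s-expression representation of library entries can be consumed with a few lines of parsing. A minimal sketch (not the MLFMF tooling itself, and assuming plain whitespace-separated atoms in balanced parentheses):

```python
def parse_sexpr(text):
    """Parse a single s-expression into nested Python lists of atom strings.
    A minimal sketch; real MLFMF entries may use richer atom syntax."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] == "(":
            node, pos = [], pos + 1
            while tokens[pos] != ")":
                child, pos = read(pos)
                node.append(child)
            return node, pos + 1      # skip the closing paren
        return tokens[pos], pos + 1   # atom

    tree, _ = read(0)
    return tree
```

The nested-list output maps directly onto the syntax trees the data sets describe.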


A Russian Jeopardy! Data Set for Question-Answering Systems

Mikhalkova, Elena

arXiv.org Artificial Intelligence

Question answering (QA) is one of the most common NLP tasks and relates to named entity recognition, fact extraction, semantic search, and other fields. In industry, it is much appreciated in chatbots and corporate information systems. It is also a challenging task that attracted the attention of a very general audience through the quiz show Jeopardy! In this article we describe a Jeopardy!-like Russian QA data set collected from the official Russian quiz database Chgk (che ge ka). The data set includes 379,284 quiz-like questions, 29,375 of which come from the Russian analogue of Jeopardy!, "Own Game". We analyze its linguistic features and the associated QA task, and discuss the prospects for a QA competition based on this data set.


The Influence of Faulty Labels in Data Sets on Human Pose Estimation

Schwarz, Arnold, Hernadi, Levente, Bießmann, Felix, Hildebrand, Kristian

arXiv.org Artificial Intelligence

In this study we provide empirical evidence demonstrating that the quality of training data impacts model performance in Human Pose Estimation (HPE). Inaccurate labels in widely used data sets, ranging from minor errors to severe mislabeling, can negatively influence learning and distort performance metrics. We perform an in-depth analysis of popular HPE data sets to show the extent and nature of label inaccuracies. Our findings suggest that accounting for the impact of faulty labels will facilitate the development of more robust and accurate HPE models for a variety of real-world applications. We show improved performance with cleansed data.
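How faulty labels distort performance metrics can be seen in a standard keypoint metric such as PCK (Percentage of Correct Keypoints), where the ground-truth label itself enters the score. A minimal sketch (a generic illustration, not the authors' evaluation code):

```python
import numpy as np

def pck(pred, gt, threshold):
    """Percentage of Correct Keypoints: fraction of predicted keypoints
    lying within `threshold` (in pixels) of the ground-truth label.
    If the labels `gt` are themselves faulty, identical predictions
    receive a different score."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dists <= threshold))
```

Shifting a single label past the threshold changes the reported accuracy even though the model's predictions are unchanged, which is exactly the distortion the study measures.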


The US wants to use facial recognition to identify migrant children as they age

MIT Technology Review

As Boyd explained at a conference in June, the key question for OBIM is, "If we pick up someone from Panama at the southern border at age four, say, and then pick them up at age six, are we going to recognize them?" Facial recognition technology (FRT) has traditionally not been applied to children, largely because training data sets of real children's faces are few and far between, and consist of either low-quality images drawn from the internet or small sample sizes with little diversity. Such limitations reflect the significant sensitivities regarding privacy and consent when it comes to minors. According to Syracuse University's Transactional Records Access Clearinghouse (TRAC), 339,234 children arrived at the US-Mexico border in 2022, the last year for which numbers are currently available. Of those children, 150,000 were unaccompanied--the highest annual number on record.


A rational model of causal induction with continuous causes

Neural Information Processing Systems

Rational models of causal induction have been successful in accounting for people's judgments about causal relationships. However, these models have focused on explaining inferences from discrete data of the kind that can be summarized in a 2 × 2 contingency table. This severely limits the scope of these models, since the world often provides non-binary data. We develop a new rational model of causal induction using continuous dimensions, which aims to narrow the gap between empirical and theoretical approaches to real-world causal induction. This model predicts human judgments from previous studies better than models of discrete causal inference, and outperforms several other plausible models of causal induction with continuous causes in accounting for people's inferences in a new experiment.
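The discrete models referred to here operate on a 2 × 2 contingency table of cause/effect co-occurrence counts. Two classic statistics computed from such a table are the contrast ΔP and Cheng's (1997) causal power; a minimal sketch:

```python
def delta_p(n_ec, n_c, n_enc, n_nc):
    """Contrast Delta-P = P(e|c) - P(e|~c) from a 2x2 contingency table.
    n_ec: effect-present count with cause present, n_c: cause-present total;
    n_enc: effect-present count with cause absent, n_nc: cause-absent total."""
    return n_ec / n_c - n_enc / n_nc

def causal_power(n_ec, n_c, n_enc, n_nc):
    """Cheng's (1997) causal power for a generative cause:
    Delta-P / (1 - P(e|~c))."""
    p_e_nc = n_enc / n_nc
    return delta_p(n_ec, n_c, n_enc, n_nc) / (1.0 - p_e_nc)
```

For example, with the effect occurring in 8 of 10 cause-present trials and 2 of 10 cause-absent trials, ΔP = 0.6 and causal power = 0.75.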


Review of "A simple example of Dirichlet process mixture inconsistency for the number of components"

Neural Information Processing Systems

The title of this paper is much like the paper itself: to-the-point, descriptive, and readable. "A simple example of Dirichlet process mixture inconsistency for the number of components" delivers on its promise by providing two easy-to-understand demonstrations of the severity of the problem of using Dirichlet process mixtures to estimate the number of components in a mixture model. The authors start by demonstrating that making such a component-cardinality estimate is widespread in the literature (and therefore a problem deserving of interest), briefly describe the Dirichlet process mixture (DPM) model (with particular emphasis on the popular normal-likelihood case), and then demonstrate with a simple single-component mixture example how poorly estimation of component cardinality can go (their convincing answer: very poorly). Not only was the paper enjoyable to read but, refreshingly, it didn't try to fit 20 pages of material into an 8-page limit. One potential criticism is that this result should, in some sense, already be well known in the community.
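The behavior at issue can be traced to the Chinese restaurant process underlying the DPM prior, whose number of occupied tables keeps growing (roughly α log n in expectation) rather than concentrating on a fixed value. A toy simulation, independent of the paper's own analysis:

```python
import random

def crp_num_clusters(n, alpha, rng):
    """Sample the number of occupied tables in a Chinese restaurant
    process with concentration alpha after n customers. Customer i
    starts a new table with probability alpha / (i + alpha), else
    joins an existing table with probability proportional to its size."""
    counts = []
    for i in range(n):
        if rng.random() < alpha / (i + alpha):
            counts.append(1)                 # open a new table
        else:
            r = rng.random() * i             # i customers already seated
            acc = 0
            for j, c in enumerate(counts):
                acc += c
                if r < acc:
                    counts[j] += 1           # join table j
                    break
    return len(counts)
```

With alpha = 1 and n = 1000, the expected number of tables is the harmonic number H_1000 ≈ 7.5, illustrating why the posterior on component cardinality need not settle at the true value.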


Variable selection for Naïve Bayes classification

Blanquero, Rafael, Carrizosa, Emilio, Ramírez-Cobo, Pepa, Sillero-Denamiel, M. Remedios

arXiv.org Artificial Intelligence

The Naïve Bayes classifier has proven to be a tractable and efficient method for classification in multivariate analysis. However, features are usually correlated, a fact that violates the Naïve Bayes assumption of conditional independence and may deteriorate the method's performance. Moreover, datasets are often characterized by a large number of features, which may complicate the interpretation of the results as well as slow down the method's execution. In this paper we propose a sparse version of the Naïve Bayes classifier that is characterized by three properties. First, sparsity is achieved while taking into account the correlation structure of the covariates. Second, different performance measures can be used to guide the selection of features. Third, performance constraints on groups of higher interest can be included. Our proposal leads to a smart search with competitive running times, while retaining flexibility in the choice of performance measure for classification. Our findings show that, when compared against well-referenced feature-selection approaches, the proposed sparse Naïve Bayes obtains competitive results in terms of accuracy, sparsity, and running time on balanced datasets. On datasets with unbalanced classes (or classes of differing importance), it achieves a better compromise between the classification rates of the different classes.
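The two ingredients involved, correlation-aware feature selection followed by a Naïve Bayes fit, can be sketched as follows. This is a crude greedy filter stand-in, not the authors' sparse optimization formulation, and the 0.9 threshold is an assumption for illustration:

```python
import numpy as np

def select_decorrelated(X, threshold=0.9):
    """Greedy filter: keep a feature unless its absolute Pearson
    correlation with an already-kept feature exceeds `threshold`.
    A simple stand-in for correlation-aware sparsity."""
    corr = np.corrcoef(X, rowvar=False)
    kept = []
    for j in range(X.shape[1]):
        if all(abs(corr[j, k]) < threshold for k in kept):
            kept.append(j)
    return kept

def fit_gaussian_nb(X, y):
    """Per-class means, variances, and priors for Gaussian Naive Bayes."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(y))
    return params

def predict_gaussian_nb(params, x):
    """Pick the class maximizing log prior + sum of log Gaussian densities."""
    best, best_score = None, -np.inf
    for c, (mu, var, prior) in params.items():
        score = np.log(prior) - 0.5 * np.sum(
            np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        if score > best_score:
            best, best_score = c, score
    return best
```

Dropping a near-duplicate feature before fitting restores the conditional-independence assumption that the duplicate would violate.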


Symbolic Equation Solving via Reinforcement Learning

Dabelow, Lennart, Ueda, Masahito

arXiv.org Artificial Intelligence

Machine-learning methods are gradually being adopted in a great variety of social, economic, and scientific contexts, yet they are notorious for struggling with exact mathematics. A typical example is computer algebra, which includes tasks like simplifying mathematical terms, calculating formal derivatives, or finding exact solutions of algebraic equations. Traditional software packages for these purposes are commonly based on a huge database of rules for how a specific operation (e.g., differentiation) transforms a certain term (e.g., sine function) into another one (e.g., cosine function). Thus far, these rules have usually needed to be discovered and subsequently programmed by humans. Focusing on the paradigmatic example of solving linear equations in symbolic form, we demonstrate how the process of finding elementary transformation rules and step-by-step solutions can be automated using reinforcement learning with deep neural networks.
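The target behavior on the simplest case, applying elementary transformation rules step by step to a linear equation, can be sketched directly; the RL system in the paper discovers such rules for general symbolic terms rather than having them hard-coded as below:

```python
from fractions import Fraction

def solve_linear(a, b, c):
    """Solve a*x + b = c (a != 0) by two elementary transformation rules,
    recording each intermediate equation. Exact rational arithmetic via
    Fraction keeps the result symbolic-friendly."""
    a, b, c = Fraction(a), Fraction(b), Fraction(c)
    steps = [f"{a}*x + {b} = {c}"]
    c -= b                          # rule 1: subtract b from both sides
    steps.append(f"{a}*x = {c}")
    x = c / a                       # rule 2: divide both sides by a
    steps.append(f"x = {x}")
    return x, steps
```

For 2x + 3 = 7 this yields the step sequence "2*x + 3 = 7", "2*x = 4", "x = 2"; an RL agent would instead learn which rule to apply at each state of the term tree.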