Collaborating Authors

 Karlaš, Bojan


DMLR: Data-centric Machine Learning Research -- Past, Present and Future

arXiv.org Artificial Intelligence

Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and prior meetings, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.


DataPerf: Benchmarks for Data-Centric AI Development

arXiv.org Artificial Intelligence

Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We aim to foster innovation in data-centric AI through competition, comparability, and reproducibility. We enable the ML community to iterate on datasets, instead of just architectures, and we provide an open, online platform with multiple rounds of challenges to support this iterative development. The first iteration of DataPerf contains five benchmarks covering a wide spectrum of data-centric techniques, tasks, and modalities in vision, speech, acquisition, debugging, and diffusion prompting, and we support hosting new contributed benchmarks from the community. The benchmarks, online evaluation platform, and baseline implementations are open source, and the MLCommons Association will maintain DataPerf to ensure long-term benefits to academia and industry.
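To make concrete what iterating on datasets rather than architectures can look like, the sketch below mimics the shape of a data-centric benchmark task such as training-set selection: the reference model and evaluation pipeline are held fixed, and only the submitted training subset changes. The function and variable names are illustrative assumptions, not the actual DataPerf API.

```python
# Hypothetical sketch of a data-centric evaluation loop (not the DataPerf API):
# the reference model and test set stay fixed; submissions only change which
# training examples are used.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def score_training_subset(selected_indices, X_train, y_train, X_test, y_test):
    """Train the fixed reference model on the submitted subset, report test accuracy."""
    X_sel, y_sel = X_train[selected_indices], y_train[selected_indices]
    model = LogisticRegression(max_iter=1000)  # fixed reference model
    model.fit(X_sel, y_sel)
    return accuracy_score(y_test, model.predict(X_test))
```

Under this framing, two submissions are compared purely by the quality of the data they select, which is the kind of comparability the benchmark suite aims to provide.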


Online Active Model Selection for Pre-trained Classifiers

arXiv.org Machine Learning

Model selection from a set of pre-trained models is an emerging problem in machine learning and has implications in several practical scenarios. Industrial examples include cases in which a telecommunication company or a flight booking company has multiple ML models trained over different sliding windows of data and hopes to pick the one that performs the best on a given day. For many real-world problems, unlabeled data is abundant and can be inexpensively collected, while labels are expensive to acquire and require human expertise. Consequently, there is a need to robustly identify the best model under limited labeling resources. Similarly, one often needs reasonable predictions for the unlabeled data while keeping the labeling budget low. Depending on data availability, one can consider two settings: the pool-based setting assumes that the learner has access to a pool of unlabeled data, and she tries to select informative data samples from the pool to achieve her task. The online (streaming) setting works with a stream of data, where the data arrives one sample at a time, and the learner decides on the fly whether to ask for the label of each sample or to simply discard it. While offering fewer options on which data to label next, this streaming setting alleviates the scalability challenge of storing and processing a large pool of examples in the pool-based setting. Another important aspect is the nature of the data: the instance/label pairs might be sampled i.i.d.
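As a rough illustration of the streaming setting, the sketch below implements a simplified disagreement-based baseline (not the algorithm proposed in the paper): a label is requested only when the pre-trained candidates disagree on an example, and the model with the best empirical accuracy on the queried examples is returned. The names and the `stream` interface are assumptions made for this example.

```python
import numpy as np

def select_model_online(models, stream, budget):
    """Simplified disagreement-based baseline for picking the best pre-trained
    model from a label-scarce stream: a label is requested only when the
    candidate models disagree, and the model with the most correct predictions
    on the queried examples wins.  `stream` yields (x, label_oracle) pairs;
    calling label_oracle() spends one unit of the labeling budget."""
    correct = np.zeros(len(models))
    labels_used = 0
    for x, label_oracle in stream:
        preds = [m.predict([x])[0] for m in models]
        if len(set(preds)) == 1 or labels_used >= budget:
            continue  # unanimous prediction or budget exhausted: skip labeling
        y = label_oracle()
        labels_used += 1
        correct += np.array([p == y for p in preds], dtype=float)
    return int(np.argmax(correct))  # index of the best-performing candidate
```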


Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions

arXiv.org Machine Learning

Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP) -- a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of the data would yield the same prediction. We study two fundamental CP queries: (Q1) a checking query that determines whether a data example can be CP'ed; and (Q2) a counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumptions about the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed -- we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of "data cleaning for machine learning (DC for ML)." We show that our proposed CPClean approach, built on CP, can often significantly outperform existing techniques in terms of classification accuracy with mild manual cleaning effort.
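To make the definition of a Certain Prediction concrete, the following brute-force sketch answers the checking query (Q1) for a k-NN classifier by enumerating every possible world, i.e., every way of filling the missing cells from their candidate values. The paper's contribution is precisely to avoid this exponential enumeration with linear- or polynomial-time algorithms; this code, with its hypothetical data layout, is only meant to illustrate the semantics.

```python
from itertools import product
import numpy as np

def is_certainly_predicted(x_test, incomplete_rows, complete_rows, k=1):
    """Brute-force illustration of the checking query (Q1): a test point is
    certainly predicted iff every possible world leads k-NN to the same label.

    incomplete_rows: list of (candidate_feature_vectors, label) pairs, where
        the candidates enumerate the possible completions of that row.
    complete_rows:   list of (feature_vector, label) pairs with no missing values.
    """
    def knn_label(train):
        X = np.array([f for f, _ in train])
        y = [lbl for _, lbl in train]
        dists = np.linalg.norm(X - np.asarray(x_test), axis=1)
        nearest = np.argsort(dists)[:k]
        votes = [y[i] for i in nearest]
        return max(set(votes), key=votes.count)  # majority vote among neighbors

    predictions = set()
    for world in product(*[cands for cands, _ in incomplete_rows]):
        train = complete_rows + [
            (vec, label) for vec, (_, label) in zip(world, incomplete_rows)
        ]
        predictions.add(knn_label(train))
    return len(predictions) == 1  # True: CP'ed; False: the prediction is uncertain
```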


Continuous Integration of Machine Learning Models with ease.ml/ci: Towards a Rigorous Yet Practical Treatment

arXiv.org Machine Learning

Continuous integration is an indispensable step of modern software engineering practices to systematically manage the life cycles of system development. Developing a machine learning model is no different - it is an engineering process with a life cycle, including design, implementation, tuning, testing, and deployment. However, most, if not all, existing continuous integration engines do not support machine learning as a first-class citizen. In this paper, we present ease.ml/ci, to the best of our knowledge the first continuous integration system for machine learning. The challenge of building ease.ml/ci is to provide rigorous guarantees, e.g., a single accuracy point of error tolerance with 0.999 reliability, with a practical amount of labeling effort, e.g., 2K labels per test. We design a domain-specific language that allows users to specify integration conditions with reliability constraints, and develop simple novel optimizations that can lower the number of labels required by up to two orders of magnitude for test conditions popularly used in real production systems.
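For intuition about why the label-saving optimizations matter, the snippet below computes the number of fresh labels a naive two-sided Hoeffding bound would require to certify accuracy within one point at 0.999 reliability. The choice of bound and parameters is an illustrative assumption rather than the estimator used by ease.ml/ci.

```python
import math

def hoeffding_labels(eps, delta):
    """Labels needed so the empirical accuracy deviates from the true accuracy
    by more than eps with probability at most delta (two-sided Hoeffding bound):
    n >= ln(2 / delta) / (2 * eps**2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# One accuracy point of tolerance (eps = 0.01) at 0.999 reliability
# (delta = 0.001) needs roughly 38,000 fresh labels per test under this
# naive bound, which is why optimizations that bring the cost down to the
# order of 2K labels per test matter in practice.
print(hoeffding_labels(eps=0.01, delta=0.001))  # -> 38005
```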