
 Morgenstern, Jamie


Reconstruction Attacks on Machine Unlearning: Simple Models are Vulnerable

arXiv.org Artificial Intelligence

As model training on personal data becomes commonplace, there has been a growing literature on data protection in machine learning (ML), which includes at least two thrusts. Data Privacy: The primary concern regarding data privacy in machine learning applications is that models might inadvertently reveal details about the individual data points used in their training. This type of privacy risk can manifest in various ways, ranging from membership inference attacks [27], which only seek to confirm whether a specific individual's data was used in the training, to more severe reconstruction attacks [10] that attempt to recover entire data records of numerous individuals. To address these risks, algorithms that adhere to differential privacy standards [12] provide proven safeguards, specifically limiting the ability to infer information about individual training data. Machine Unlearning: Proponents of data autonomy have advocated for individuals to have the right to decide how their data is used, including the right to retroactively ask that their data and its influence be removed from any model trained on it. Data deletion, or machine unlearning, refers to technical approaches that allow such removal of influence [15, 4]. The idea is that, after an individual's data is deleted, the resulting model should be in the state it would have been in had it originally been trained without that individual's data. The primary focus of this literature has been on achieving or approximating this condition for complex models in ways that are more computationally efficient than full retraining (see, e.g., ...).
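The vulnerability of simple models can be illustrated with a toy case: if the released "model" is as simple as a sample mean and unlearning is exact retraining, then an observer who sees the model before and after a deletion (and knows the dataset size) recovers the deleted record exactly. The sketch below only illustrates that intuition under these assumptions; it is not the paper's attack.

```python
# Toy illustration (not the paper's attack): exact reconstruction of a deleted record
# from a trivially simple "model" -- the sample mean -- observed before and after
# exact unlearning. Assumes the attacker knows the original dataset size n.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # private training data: n=100 records in R^5
n = len(X)

deleted = X[42]                               # record whose owner requests deletion
model_before = X.mean(axis=0)                 # "model" trained on all n records
model_after = np.delete(X, 42, axis=0).mean(axis=0)   # exact unlearning = retrain without it

# The pair of released models reveals the deleted record exactly:
reconstructed = n * model_before - (n - 1) * model_after
print(np.allclose(reconstructed, deleted))    # True
```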


Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp

arXiv.org Artificial Intelligence

As training datasets become increasingly drawn from unstructured, uncontrolled environments such as the web, researchers and industry practitioners have increasingly relied upon data filtering techniques to "filter out the noise" of web-scraped data. While datasets have been widely shown to reflect the biases and values of their creators, in this paper we contribute to an emerging body of research that assesses the filters used to create these datasets. We show that image-text data filtering also has biases and is value-laden, encoding specific notions of what is counted as "high-quality" data. In our work, we audit a standard approach of image-text CLIP-filtering on the academic benchmark DataComp's CommonPool by analyzing discrepancies of filtering through various annotation techniques across multiple modalities of image, text, and website source. We find that data relating to several imputed demographic groups -- such as LGBTQ+ people, older women, and younger men -- are associated with higher rates of exclusion. Moreover, we demonstrate cases of exclusion amplification: not only are certain marginalized groups already underrepresented in the unfiltered data, but CLIP-filtering excludes data from these groups at higher rates. The data-filtering step in the machine learning pipeline can therefore exacerbate representation disparities already present in the data-gathering step, especially when existing filters are designed to optimize a specifically chosen downstream performance metric like zero-shot image classification accuracy. Finally, we show that the NSFW filter fails to remove sexually-explicit content from CommonPool, and that CLIP-filtering includes several categories of copyrighted content at high rates. Our conclusions point to a need for fundamental changes in dataset creation and filtering practices.
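For readers unfamiliar with the filtering step being audited, a rough sketch of CLIP-score filtering is given below: score each image-caption pair by CLIP image/text cosine similarity and keep pairs above a threshold. The checkpoint name and threshold value are illustrative placeholders, not DataComp's exact configuration.

```python
# Rough sketch of CLIP-score filtering: keep an image-text pair only if its CLIP
# image/text cosine similarity clears a threshold. Model choice and threshold are
# illustrative placeholders, not the DataComp pipeline's exact settings.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image, caption):
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

THRESHOLD = 0.28   # placeholder cutoff; the benchmark's actual threshold may differ

def clip_filter(pairs):
    """pairs: iterable of (PIL.Image, caption string); returns the retained subset."""
    return [(img, cap) for img, cap in pairs if clip_similarity(img, cap) >= THRESHOLD]
```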


Initializing Services in Interactive ML Systems for Diverse Users

arXiv.org Artificial Intelligence

This paper studies ML systems that interactively learn from users across multiple subpopulations with heterogeneous data distributions. The primary objective is to provide specialized services for different user groups while also predicting user preferences. Once the users select a service based on how well the service anticipated their preference, the services subsequently adapt and refine themselves based on the user data they accumulate, resulting in an iterative, alternating minimization process between users and services (learning dynamics). Employing such tailored approaches has two main challenges: (i) Unknown user preferences: Typically, data on user preferences are unavailable without interaction, and uniform data collection across a large and diverse user base can be prohibitively expensive. (ii) Suboptimal Local Solutions: The total loss (sum of loss functions across all users and all services) landscape is not convex even if the individual losses on a single service are convex, making it likely for the learning dynamics to get stuck in local minima. The final outcome of the aforementioned learning dynamics is thus strongly influenced by the initial set of services offered to users, and is not guaranteed to be close to the globally optimal outcome. In this work, we propose a randomized algorithm to adaptively select very few users to collect preference data from, while simultaneously initializing a set of services. We prove that under mild assumptions on the loss functions, the expected total loss achieved by the algorithm right after initialization is within a factor of the globally optimal total loss with complete user preference data, and this factor scales only logarithmically in the number of services. Our theory is complemented by experiments on real as well as semi-synthetic datasets.
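The paper's initialization algorithm is not reproduced here; as a loose illustration of the adaptive-selection idea, the sketch below uses k-means++-style D² sampling, in which each newly queried user (chosen with probability proportional to how poorly the current services serve them) seeds a new service. Function and variable names are illustrative.

```python
# Loose illustration (k-means++-style D^2 sampling), not the paper's algorithm:
# adaptively pick a handful of users to query, seeding one service per queried user,
# with badly served users more likely to be queried.
import numpy as np

def seed_services(user_points, k, seed=0):
    """user_points: (n, d) array of user preference vectors; returns k initial services."""
    rng = np.random.default_rng(seed)
    n = len(user_points)
    services = [user_points[rng.integers(n)]]        # first service: a uniformly random user
    for _ in range(k - 1):
        centers = np.array(services)
        # each user's current loss: squared distance to their best-matching service
        d2 = ((user_points[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
        # query badly served users more often (uniform fallback if all losses are zero)
        probs = d2 / d2.sum() if d2.sum() > 0 else np.full(n, 1.0 / n)
        services.append(user_points[rng.choice(n, p=probs)])
    return np.array(services)
```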


Fair Active Learning in Low-Data Regimes

arXiv.org Machine Learning

In critical machine learning applications, ensuring fairness is essential to avoid perpetuating social inequities. In this work, we address the challenges of reducing bias and improving accuracy in data-scarce environments, where the cost of collecting labeled data prohibits the use of large, labeled datasets. In such settings, active learning promises to maximize the marginal accuracy gains from small amounts of labeled data. However, existing applications of active learning for fairness fail to deliver on this, typically requiring large labeled datasets or failing to ensure the desired fairness tolerance is met on the population distribution. To address such limitations, we introduce an innovative active learning framework that combines an exploration procedure inspired by posterior sampling with a fair classification subroutine. We demonstrate that this framework performs effectively in very data-scarce regimes, maximizing accuracy while satisfying fairness constraints with high probability. We evaluate our proposed approach using well-established real-world benchmark datasets and compare it against state-of-the-art methods, demonstrating its effectiveness in producing fair models and its improvement over existing methods.
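A rough sketch of the two ingredients is given below, with substitutions: the approximate posterior is a bootstrap ensemble, the query rule is ensemble disagreement, and the fair-classification subroutine is a simple demographic-parity-style per-group thresholding. None of this is the paper's algorithm; the names and the fairness post-processing are illustrative choices.

```python
# Illustrative sketch only: (i) posterior-sampling-flavored querying via a bootstrap
# ensemble, (ii) a fairness-aware step via per-group thresholds. Not the paper's method.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ensemble(X, y, n_models=10, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(len(X), size=len(X))     # bootstrap resample of the labeled pool
        if len(np.unique(y[idx])) < 2:              # guard: need both classes to fit
            idx = np.arange(len(X))
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return models

def query_index(models, X_unlabeled):
    # pick the unlabeled point the ensemble disagrees on most (a proxy for posterior uncertainty)
    preds = np.stack([m.predict_proba(X_unlabeled)[:, 1] for m in models])
    return int(np.argmax(preds.var(axis=0)))

def group_thresholds(model, X, group, target_rate=0.5):
    # demographic-parity-style post-processing: per-group score threshold so each group's
    # positive rate on the labeled data is close to a common target rate
    scores = model.predict_proba(X)[:, 1]
    return {g: float(np.quantile(scores[group == g], 1 - target_rate)) for g in np.unique(group)}
```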


Emergent segmentation from participation dynamics and multi-learner retraining

arXiv.org Artificial Intelligence

The choice to participate in a data-driven service, often made on the basis of the quality of that service, influences the ability of the service to learn and improve. We study the participation and retraining dynamics that arise when both the learners and the sub-populations of users are risk-reducing, a condition that covers a broad class of updates including gradient descent, multiplicative weights, etc. Suppose, for example, that individuals choose to spend their time amongst social media platforms proportionally to how well each platform works for them. Each platform also gathers data about its active users, which it uses to update parameters with a gradient step. For this example and for our general class of dynamics, we show that the only asymptotically stable equilibria are segmented, with sub-populations allocated to a single learner. Under mild assumptions, the utilitarian social optimum is a stable equilibrium. In contrast to previous work, which shows that repeated risk minimization can result in representation disparity and high overall loss for a single learner [Hashimoto et al., 2018; Miller et al., 2021], we find that repeated myopic updates with multiple learners lead to better outcomes. We illustrate the phenomena via a simulated example initialized from real data.
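A small simulation in the spirit of the social-media example (all parameters invented) is sketched below: two learners fit scalar means with gradient steps on the users they currently attract, while two sub-populations shift their participation toward the learner with lower loss for them; the allocation typically ends up segmented, with each learner serving one sub-population.

```python
# Toy simulation in the spirit of the example above (parameters are made up):
# two mean-estimating learners, two sub-populations, risk-reducing participation
# updates plus gradient retraining. The run typically ends segmented.
import numpy as np

pop_means = np.array([-2.0, 2.0])      # the two sub-populations' true preferences
theta = np.array([0.1, -0.1])          # the two learners' parameters
alloc = np.full((2, 2), 0.5)           # alloc[i, j]: share of population i on learner j

for _ in range(200):
    loss = (pop_means[:, None] - theta[None, :]) ** 2        # loss[i, j]
    pref = np.exp(-loss)                                     # lower loss -> more participation
    alloc = 0.9 * alloc + 0.1 * pref / pref.sum(axis=1, keepdims=True)
    # each learner takes a gradient step on the mixture of users it currently attracts
    mix = alloc / alloc.sum(axis=0, keepdims=True)            # columns: each learner's user mix
    theta -= 0.1 * 2 * (theta - (mix * pop_means[:, None]).sum(axis=0))

print(np.round(alloc, 2))   # near-segmented: each population almost entirely on one learner
print(np.round(theta, 2))   # each learner close to the mean of "its" population
```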


Scalable Membership Inference Attacks via Quantile Regression

arXiv.org Artificial Intelligence

The basic goal of privacy-preserving machine learning is to find models that are predictive on some underlying data distribution, without being disclosive of the particular data points on which they were trained. The simplest kind of attack that can be launched on a trained model, falsifying its privacy guarantees, is a membership inference attack. A membership inference attack, informally, is a statistical test that is able to reliably determine whether a particular data point was included in the training set used to train the model or not. Almost all membership inference attacks are based on the observation that models tend to overfit their training sets in different ways. In particular, they tend to systematically predict higher confidence in the true labels of data points from their training set, compared to points drawn from the same distribution that were not in their training set. The confidence that a model places on the true label of a data point is thus a natural test statistic to build a membership inference hypothesis test around. A variety of recent methods [Shokri et al., 2017, Long et al., 2020, Sablayrolles et al., 2019, Song and Mittal, 2021, Carlini et al., 2022] are based around this idea, and aim to estimate the distribution of the test statistic (the confidence assigned to the true label of a datapoint) over the distribution of datapoints that were not used in training.
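The general recipe can be sketched as follows (a hedged approximation, not the paper's exact attack): fit a quantile regressor on data known not to be in the training set to predict a high quantile of the target model's true-label confidence as a function of the features, then flag points whose observed confidence exceeds that predicted quantile.

```python
# Sketch of a quantile-regression-style membership inference test (illustrative, not the
# paper's exact construction): calibrate on known non-members, no shadow models required.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def true_label_confidence(target_model, X, y):
    """Confidence the target model assigns to each point's true label."""
    proba = target_model.predict_proba(X)
    cols = np.searchsorted(target_model.classes_, y)
    return proba[np.arange(len(y)), cols]

def fit_quantile_attack(X_nonmember, conf_nonmember, alpha=0.95):
    # regress the alpha-quantile of the non-member confidence score on the features
    return GradientBoostingRegressor(loss="quantile", alpha=alpha).fit(X_nonmember, conf_nonmember)

def membership_test(quantile_model, X_candidates, conf_candidates):
    # flag "member" when observed confidence exceeds the predicted non-member quantile
    return conf_candidates > quantile_model.predict(X_candidates)
```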


Distributionally Robust Data Join

arXiv.org Artificial Intelligence

Suppose we are given two datasets: a labeled dataset and an unlabeled dataset which also has additional auxiliary features not present in the first dataset. What is the most principled way to use these datasets together to construct a predictor? The answer should depend upon whether these datasets are generated by the same or different distributions over their mutual feature sets, and how similar the test distribution will be to either of those distributions. In many applications, the two datasets will likely follow different distributions, but both may be close to the test distribution. We introduce the problem of building a predictor which minimizes the maximum loss over all probability distributions over the original features, auxiliary features, and binary labels, whose Wasserstein distance is $r_1$ away from the empirical distribution over the labeled dataset and $r_2$ away from that of the unlabeled dataset. This can be thought of as a generalization of distributionally robust optimization (DRO), which allows for two data sources, one of which is unlabeled and may contain auxiliary features.
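Written out (reading the feasible set as Wasserstein balls of radii $r_1$ and $r_2$, the usual DRO convention, with $\ell$ an assumed loss function), the objective described above is:

\[
\min_{\theta}\ \max_{\substack{P:\ W(P,\hat{P}_1)\le r_1,\\ \phantom{P:\ } W(P,\hat{P}_2)\le r_2}} \ \mathbb{E}_{(x,\,x_{\mathrm{aux}},\,y)\sim P}\big[\ell(\theta;\,x,\,x_{\mathrm{aux}},\,y)\big],
\]

where $\hat{P}_1$ is the empirical distribution of the labeled dataset, $\hat{P}_2$ that of the unlabeled dataset, $x$ the original features, $x_{\mathrm{aux}}$ the auxiliary features, and $y$ the binary label.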


Doubly Constrained Fair Clustering

arXiv.org Artificial Intelligence

The remarkable attention that fair clustering has received in the last few years has resulted in a significant number of different notions of fairness. Despite the fact that these notions are well justified, they are often motivated and studied in a disjoint manner, where one fairness desideratum is considered exclusively in isolation from the others. This leaves the understanding of the relations between different fairness notions as an important open problem in fair clustering. In this paper, we take the first step in this direction. Specifically, we consider the two most prominent demographic representation fairness notions in clustering: (1) Group Fairness (GF), where the different demographic groups are supposed to have close to population-level representation in each cluster, and (2) Diversity in Center Selection (DS), where the selected centers are supposed to have close to population-level representation of each group. We show that given a constant approximation algorithm for one constraint (GF or DS only), we can obtain a constant approximation solution that satisfies both constraints simultaneously. Interestingly, we prove that any given solution that satisfies the GF constraint can always be post-processed, at a bounded degradation to the clustering cost, to additionally satisfy the DS constraint, while the reverse is not true. Furthermore, we show that both GF and DS are incompatible (having an empty feasibility set in the worst case) with a collection of other distance-based fairness notions. Finally, we carry out experiments to validate our theoretical findings.
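To make the two notions concrete, the sketch below checks a given clustering against GF and DS as paraphrased from the abstract, using an additive tolerance on per-group proportions; the tolerance form is an assumption, since the paper's exact parameterization (e.g., per-group upper and lower bounds) is not reproduced here.

```python
# Illustrative constraint checkers for Group Fairness (GF) and Diversity in Center
# Selection (DS), paraphrased from the abstract. The additive tolerance is an assumption.
import numpy as np

def proportions(groups):
    vals, counts = np.unique(groups, return_counts=True)
    return dict(zip(vals, counts / len(groups)))

def satisfies_gf(assignment, groups, tol=0.1):
    """GF: every cluster's group proportions are within tol of population-level proportions."""
    assignment, groups = np.asarray(assignment), np.asarray(groups)
    target = proportions(groups)
    for c in np.unique(assignment):
        props = proportions(groups[assignment == c])
        if any(abs(props.get(g, 0.0) - p) > tol for g, p in target.items()):
            return False
    return True

def satisfies_ds(center_groups, groups, tol=0.1):
    """DS: the selected centers' group proportions are within tol of population-level proportions."""
    target = proportions(np.asarray(groups))
    props = proportions(np.asarray(center_groups))
    return all(abs(props.get(g, 0.0) - p) <= tol for g, p in target.items())
```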


Auctions and Peer Prediction for Academic Peer Review

arXiv.org Artificial Intelligence

Peer-reviewed publications are considered the gold standard in certifying and disseminating ideas that a research community considers valuable. However, we identify two major drawbacks of the current system: (1) the overwhelming demand for reviewers due to a large volume of submissions, and (2) the lack of incentives for reviewers to participate and expend the necessary effort to provide high-quality reviews. In this work, we adopt a mechanism-design approach to propose improvements to the peer review process, tying together the paper submission and review processes and simultaneously incentivizing high-quality submissions and reviews. In the submission stage, authors participate in a VCG auction for review slots by submitting their papers along with a bid that represents their expected value for having their paper reviewed. For the reviewing stage, we propose a novel peer prediction mechanism (H-DIPP), building on recent work in the information elicitation literature, which incentivizes participating reviewers to provide honest and effortful reviews. The revenue raised in the submission-stage auction is used to pay reviewers based on the quality of their reviews in the reviewing stage.
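As a toy version of the submission stage, under a simplifying assumption the paper may not make (k identical review slots and unit-demand authors), VCG reduces to "the top-k bids win and every winner pays the (k+1)-th highest bid":

```python
# Toy sketch of a VCG auction for k identical review slots with unit-demand authors
# (a simplifying assumption for illustration). Each winner pays the externality they
# impose: the bid of the highest-bidding displaced author.
def vcg_review_slots(bids, k):
    """bids: dict mapping author -> bid; returns (winners, payment_per_winner, revenue)."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winners = [author for author, _ in ranked[:k]]
    payment = ranked[k][1] if len(ranked) > k else 0.0
    return winners, payment, payment * len(winners)

# Example: the revenue would fund reviewer payments in the reviewing stage.
winners, pay, revenue = vcg_review_slots({"A": 9, "B": 5, "C": 3, "D": 2}, k=2)
print(winners, pay, revenue)   # ['A', 'B'] 3 6
```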


Active Learning with Safety Constraints

arXiv.org Machine Learning

Active learning methods have shown great promise in reducing the number of samples necessary for learning. As automated learning systems are adopted into real-time, real-world decision-making pipelines, it is increasingly important that such algorithms are designed with safety in mind. In this work, we investigate the complexity of learning the best safe decision in interactive environments. We reduce this problem to a constrained linear bandits problem, where our goal is to find the best arm satisfying certain (unknown) safety constraints. We propose an adaptive experimental-design-based algorithm, which we show efficiently trades off between the difficulty of showing an arm is unsafe and the difficulty of showing it is suboptimal. To our knowledge, our results are the first on best-arm identification in linear bandits with safety constraints. In practice, we demonstrate that this approach performs well on synthetic and real-world datasets.
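The sketch below is a much-simplified stand-in for the paper's method: round-robin sampling with confidence-interval elimination rather than an adaptive experimental design, for arms with an unknown linear reward and an unknown linear safety value. All constants and the confidence-width formula are ad hoc.

```python
# Simplified stand-in (NOT the adaptive experimental-design algorithm): round-robin
# sampling with elimination for linear bandits with an unknown linear safety constraint.
import numpy as np

rng = np.random.default_rng(0)
d, n_arms, tau = 3, 8, 0.0                      # arm x is "safe" iff mu_star @ x >= tau
arms = rng.normal(size=(n_arms, d))
theta_star, mu_star = rng.normal(size=d), rng.normal(size=d)   # unknown reward / safety params

active = list(range(n_arms))
obs_x, obs_r, obs_s = [], [], []
for _ in range(20):
    for a in active:                            # pull each surviving arm once per round
        x = arms[a]
        obs_x.append(x)
        obs_r.append(theta_star @ x + 0.1 * rng.normal())   # noisy reward observation
        obs_s.append(mu_star @ x + 0.1 * rng.normal())      # noisy safety observation
    X = np.array(obs_x)
    A = X.T @ X + 1e-3 * np.eye(d)              # regularized design matrix
    theta_hat = np.linalg.solve(A, X.T @ np.array(obs_r))
    mu_hat = np.linalg.solve(A, X.T @ np.array(obs_s))
    width = lambda x: 2.0 * np.sqrt(x @ np.linalg.solve(A, x))   # crude confidence width
    # drop arms that are confidently unsafe
    active = [a for a in active if mu_hat @ arms[a] + width(arms[a]) >= tau]
    # among confidently safe arms, find the best pessimistic reward; drop any arm
    # whose optimistic reward cannot beat it
    safe = [a for a in active if mu_hat @ arms[a] - width(arms[a]) >= tau]
    if safe:
        best_lcb = max(theta_hat @ arms[a] - width(arms[a]) for a in safe)
        active = [a for a in active if theta_hat @ arms[a] + width(arms[a]) >= best_lcb]
    if len(active) == 1:
        break

print("surviving candidate(s) for best safe arm:", active)
```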