Extending the WILDS Benchmark for Unsupervised Adaptation

Artificial Intelligence

Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data. However, existing distribution shift benchmarks for unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. In this work, we present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. To maintain consistency, the labeled training, validation, and test sets, as well as the evaluation metrics, are exactly the same as in the original WILDS benchmark. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). We systematically benchmark state-of-the-art methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on WILDS 2.0 is limited. To facilitate method development and evaluation, we provide an open-source package that automates data loading and contains all of the model architectures and methods used in this paper. Code and leaderboards are available at

Brutal, brazen crimes shake L.A., leaving city at a crossroads

Los Angeles Times

And this week, the fatal shooting of 81-year-old Jacqueline Avant, an admired philanthropist and wife of music legend Clarence Avant, in her Beverly Hills home. After two years of rising violent crime in Los Angeles, these incidents have sparked a national conversation and led to local concern about both the crimes themselves and where the outrage over the violence will lead. "The fact that this has happened, her being shot and killed in her own home, after giving, sharing, and caring for 81 years has shaken the laws of the Universe," declared Oprah Winfrey, expressing her grief over Avant's killing to her 43 million Twitter followers. "The world is upside down." While overall city crime rates remain far below records set during the notorious gang wars of the 1990s, violent crime has jumped sharply in L.A., as it has in other cities.

Randomized Classifiers vs Human Decision-Makers: Trustworthy AI May Have to Act Randomly and Society Seems to Accept This

Artificial Intelligence

As artificial intelligence (AI) systems are increasingly involved in decisions affecting our lives, ensuring that automated decision-making is fair and ethical has become a top priority. Intuitively, we feel that, akin to human decisions, judgments of artificial agents should necessarily be grounded in some moral principles. Yet a decision-maker (whether human or artificial) can only make truly ethical (based on any ethical theory) and fair (according to any notion of fairness) decisions if full information on all the relevant factors on which the decision is based is available at the time of decision-making. This raises two problems: (1) In settings where we rely on AI systems that use classifiers obtained with supervised learning, some induction/generalization is present and some relevant attributes may not be present even during learning. (2) Modeling such decisions as games reveals that any pure strategy, however ethical, is inevitably susceptible to exploitation. Moreover, in many games, a Nash Equilibrium can only be obtained by using mixed strategies, i.e., to achieve mathematically optimal outcomes, decisions must be randomized. In this paper, we argue that in supervised learning settings, there exist random classifiers that perform at least as well as deterministic classifiers, and may hence be the optimal choice in many circumstances. We support our theoretical results with an empirical study indicating a positive societal attitude towards randomized artificial decision-makers, and discuss policy and implementation issues in the use of random classifiers that are relevant to current AI policy and standardization initiatives.
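The point about pure strategies being exploitable can be illustrated with a toy zero-sum game. The sketch below uses matching pennies, which is not an example taken from the paper. The payoff matrix and the helper name `worst_case_payoff` are assumptions chosen for illustration. It compares the guaranteed (worst-case) expected payoff of each pure strategy with that of the uniform mixed strategy.

```python
# Matching pennies: the row player wins (+1) on a match and loses (-1) otherwise.
# Illustrative toy game only; not taken from the paper.
PAYOFF = [
    [1, -1],   # row plays Heads; column plays Heads / Tails
    [-1, 1],   # row plays Tails; column plays Heads / Tails
]

def worst_case_payoff(row_strategy):
    """Expected payoff to the row player against a best-responding opponent.

    row_strategy is a probability distribution over the row player's actions.
    """
    n_cols = len(PAYOFF[0])
    expected_per_column = [
        sum(p * PAYOFF[i][j] for i, p in enumerate(row_strategy))
        for j in range(n_cols)
    ]
    # The opponent picks the column that minimizes the row player's payoff.
    return min(expected_per_column)

pure_heads = worst_case_payoff([1.0, 0.0])
pure_tails = worst_case_payoff([0.0, 1.0])
mixed = worst_case_payoff([0.5, 0.5])
print(pure_heads, pure_tails, mixed)
```

In this game either pure strategy can be fully exploited by an opponent who anticipates it, while the uniform mixed strategy cannot be pushed below the value of the game.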

Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey

Artificial Intelligence

Large, pre-trained transformer-based language models such as BERT have drastically changed the Natural Language Processing (NLP) field. We present a survey of recent work that uses these large language models to solve NLP tasks via pre-training then fine-tuning, prompting, or text generation approaches. We also present approaches that use pre-trained language models to generate data for training augmentation or other purposes. We conclude with discussions on limitations and suggested directions for future research.

College admissions scam case set for Sept. 8 trial in Boston

Boston Herald

USC's Pat Haden and now two "Varsity Blues" defendants want to file briefs in the college admissions scam case under seal. What they want to share, they argue, is "sensitive, confidential, and personally identifiable information." Haden, the former athletic director at the University of Southern California, has filed a motion in federal court in Boston to "quash a trial subpoena for testimony issued by counsel for defendants," as the Herald has reported. He was just granted permission to state his case in private. Defendants Gamal Abdelaziz and John Wilson are seeking that same protection to keep their arguments out of the public eye -- for now.

Machine Unlearning

Artificial Intelligence

Once users have shared their data online, it is generally difficult for them to revoke access and ask for the data to be deleted. Machine learning (ML) exacerbates this problem because any model trained with said data may have memorized it, putting users at risk of a successful privacy attack exposing their information. Yet, having models unlearn is notoriously difficult. After a data point is removed from a training set, one often resorts to entirely retraining downstream models from scratch. We introduce SISA training, a framework that decreases the number of model parameters affected by an unlearning request and caches intermediate outputs of the training algorithm to limit the number of model updates that need to be computed to have these parameters unlearn. This framework reduces the computational overhead associated with unlearning, even in the worst-case setting where unlearning requests are made uniformly across the training set. In some cases, we may have a prior on the distribution of unlearning requests that will be issued by users. We may take this prior into account to partition and order data accordingly and further decrease overhead from unlearning. Our evaluation spans two datasets from different application domains, with corresponding motivations for unlearning. Under no distributional assumptions, we observe that SISA training improves unlearning for the Purchase dataset by 3.13x, and 1.658x for the SVHN dataset, over retraining from scratch. We also validate how knowledge of the unlearning distribution provides further improvements in retraining time by simulating a scenario where we model unlearning requests that come from users of a commercial product that is available in countries with varying sensitivity to privacy. Our work contributes to practical data governance in machine learning.
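The idea of limiting how much of a model an unlearning request touches can be sketched as follows. This is a minimal illustration, assuming the training set is split into disjoint shards with one constituent model per shard, so that removing a point only requires retraining the shard that held it. The class `ShardedEnsemble`, the use of a mean as a stand-in "model", and the point-count cost measure are all hypothetical, and the caching of intermediate training outputs mentioned in the abstract is not shown.

```python
from statistics import mean

class ShardedEnsemble:
    """Toy ensemble: one 'model' (here just a mean) per disjoint data shard."""

    def __init__(self, data, num_shards):
        # Round-robin partition so each point lives in exactly one shard.
        self.shards = [list(data[i::num_shards]) for i in range(num_shards)]
        self.retrained_points = 0  # crude proxy for training cost
        self.models = [self._train(shard) for shard in self.shards]

    def _train(self, shard):
        # Stand-in for an expensive training run; cost is counted in points.
        self.retrained_points += len(shard)
        return mean(shard)

    def predict(self):
        # Aggregate the constituent models by averaging their outputs.
        return mean(self.models)

    def unlearn(self, point):
        # Retrain only the shard that contained the removed point.
        for idx, shard in enumerate(self.shards):
            if point in shard:
                shard.remove(point)
                self.models[idx] = self._train(shard)
                return idx
        raise ValueError("point not found in any shard")

data = list(range(12))
ensemble = ShardedEnsemble(data, num_shards=4)
cost_before = ensemble.retrained_points      # cost of initial training
shard_idx = ensemble.unlearn(5)
cost_of_unlearning = ensemble.retrained_points - cost_before
print(shard_idx, cost_of_unlearning)
```

In this toy setup, unlearning one point retrains a single shard rather than all remaining points, which is the kind of saving over retraining from scratch that the abstract describes.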

Federal tax changes could be why California's budget is more than $2 billion below projections

Los Angeles Times

Gov. Gavin Newsom's hopes for a record-setting tax revenue windfall this year could depend on whether California's wealthiest residents are simply waiting until the last moment to pay up -- a reaction to the 2017 federal tax changes championed by President Trump. State financial experts on Tuesday reported fiscal year-to-date revenues are more than $2.3 billion below the expectations set by Newsom's first spending plan. But they believe the money is simply delayed, not missing. "We don't think it reflects any underlying weakness in the economy," said H.D. Palmer, a spokesman for the California Department of Finance. Instead, what state economists are now projecting is the state's first and most significant ripple effect from the tax overhaul written by Republicans in Congress and signed into law by Trump in December 2017.

Google case set to examine if EU data rules extend globally

USATODAY - Tech Top Stories

LONDON – Google is going to Europe's top court in its legal fight against an order requiring it to extend "right to be forgotten" rules to its search engines globally.

Police: Man put dismembered wife in suitcase, set it ablaze

FOX News

LOS ANGELES – Investigators believe a homeless man killed his wife in an abandoned restaurant, chopped up her body, stuffed it into a suitcase and then calmly rode with it aboard a train before he burned her remains in a parking lot, Los Angeles police said Tuesday.

Probabilistic Graphical Models for Credibility Analysis in Evolving Online Communities

Machine Learning

One of the major hurdles preventing the full exploitation of information from online communities is the widespread concern regarding the quality and credibility of user-contributed content. Prior works in this domain operate on a static snapshot of the community, make strong assumptions about the structure of the data (e.g., relational tables), or consider only shallow features for text classification. To address these limitations, we propose probabilistic graphical models that can leverage the joint interplay between multiple factors in online communities (such as user interactions, community dynamics, and textual content) to automatically assess the credibility of user-contributed online content, as well as the expertise of users and its evolution, with user-interpretable explanations. To this end, we devise new models based on Conditional Random Fields for different settings, such as incorporating partial expert knowledge for semi-supervised learning, and handling discrete labels as well as numeric ratings for fine-grained analysis. This enables applications such as extracting reliable side-effects of drugs from user-contributed posts in health forums, and identifying credible content in news communities. Online communities are dynamic, as users join and leave, adapt to evolving trends, and mature over time. To capture these dynamics, we propose generative models based on Hidden Markov Models, Latent Dirichlet Allocation, and Brownian Motion to trace the continuous evolution of user expertise and their language model over time. This allows us to identify expert users and credible content jointly over time, improving state-of-the-art recommender systems by explicitly considering the maturity of users. This also enables applications such as identifying helpful product reviews, and detecting fake and anomalous reviews with limited information.