
STEVE HILTON: Why I'm launching a legal war against California Democrats' unconstitutional power grab

FOX News

California gubernatorial candidate Steve Hilton on former Vice President Kamala Harris declining to run for the California governorship and Gov. Gavin Newsom's idea to redraw California's map if Texas redistricts. California Democrats are once again trying to rig the system, overturn elections and steal congressional seats from Republicans. Gov. Gavin Newsom and Attorney General Rob Bonta are planning to redraw California's congressional maps in 2025 or 2026, halfway through the decade and years before the next census. It's a blatant, unconstitutional power grab designed to silence millions of voters and cement one-party rule in California. Democrats are already trying to rewrite the history of this redistricting fight, claiming it's just retaliation for Republican maps in Texas.


Advancing Annotation of Stance in Social Media Posts: A Comparative Analysis of Large Language Models and Crowd Sourcing

Li, Mao, Conrad, Frederick

arXiv.org Artificial Intelligence

In the rapidly evolving landscape of Natural Language Processing (NLP), the use of Large Language Models (LLMs) for automated text annotation in social media posts has garnered significant interest. Despite the impressive innovations in developing LLMs like ChatGPT, their efficacy and accuracy as annotation tools are not well understood. In this paper, we analyze the performance of eight open-source and proprietary LLMs for annotating the stance expressed in social media posts, benchmarking their performance against human annotators' (i.e., crowd-sourced) judgments. Additionally, we investigate the conditions under which LLMs are likely to disagree with human judgment. A significant finding of our study is that the explicitness of text expressing a stance plays a critical role in how faithfully LLMs' stance judgments match humans'. We argue that LLMs perform well when human annotators do, and when LLMs fail, this often corresponds to situations in which human annotators struggle to reach an agreement. We conclude with recommendations for a comprehensive approach that combines the precision of human expertise with the scalability of LLM predictions. This study highlights the importance of improving the accuracy and comprehensiveness of automated stance detection, aiming to advance these technologies for more efficient and unbiased analysis of social media.
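The benchmarking setup the abstract describes — comparing an LLM's stance labels against crowd-sourced judgments — can be sketched roughly as below. This is an illustrative reconstruction, not the paper's code; the label names and the majority-vote aggregation are assumptions.

```python
# Hypothetical sketch: score an LLM's stance labels against the crowd's
# majority vote per post. Labels and data are illustrative only.
from collections import Counter

def majority_vote(labels):
    """Return the most common label among crowd annotators for one post."""
    return Counter(labels).most_common(1)[0][0]

def agreement_rate(llm_labels, crowd_label_sets):
    """Fraction of posts where the LLM label matches the crowd majority."""
    matches = sum(
        llm == majority_vote(crowd)
        for llm, crowd in zip(llm_labels, crowd_label_sets)
    )
    return matches / len(llm_labels)

# Toy data: three posts, one LLM label and three crowd labels each.
llm = ["favor", "against", "none"]
crowd = [["favor", "favor", "against"],
         ["against", "against", "against"],
         ["favor", "none", "none"]]
print(agreement_rate(llm, crowd))  # 1.0
```

Breaking agreement down by how explicitly the post states its stance, as the paper does, would only require grouping posts before computing the rate.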


EchoPrompt: Instructing the Model to Rephrase Queries for Improved In-context Learning

Mekala, Rajasekhar Reddy, Razeghi, Yasaman, Singh, Sameer

arXiv.org Artificial Intelligence

Language models are achieving impressive performance on various tasks by aggressively adopting inference-time prompting techniques, such as zero-shot and few-shot prompting. In this work, we introduce EchoPrompt, a simple yet effective approach that prompts the model to rephrase its queries before answering them. EchoPrompt is adapted for both zero-shot and few-shot in-context learning with standard and chain-of-thought prompting. Experimental results show that EchoPrompt yields substantial improvements across all these settings for four families of causal language models. These improvements are observed across various numerical reasoning (e.g. GSM8K, SVAMP), reading comprehension (e.g. DROP), and logical reasoning (e.g. Coin Flipping) tasks. On average, EchoPrompt improves the Zero-shot-CoT performance of code-davinci-002 by 5% in numerical tasks and 13% in reading comprehension tasks. We investigate the factors contributing to EchoPrompt's effectiveness through ablation studies, which reveal that both the original query and the model-generated rephrased version are instrumental in its performance gains. Our empirical results indicate that EchoPrompt is an effective technique that enhances in-context learning performance. We recommend incorporating EchoPrompt into various baseline prompting strategies to achieve performance boosts.
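The core idea — asking the model to restate the query before answering — can be sketched as a prompt template. The wording below is illustrative, not the exact prompt used in the paper.

```python
# Hypothetical sketch of EchoPrompt-style prompting: the model is first
# instructed to rephrase the query, then to answer (optionally with
# chain-of-thought reasoning). Template text is an assumption.
def echo_prompt(query, chain_of_thought=True):
    parts = [
        f"Question: {query}",
        "First, rephrase the question in your own words.",
    ]
    if chain_of_thought:
        parts.append("Then reason step by step before giving the final answer.")
    else:
        parts.append("Then give the final answer.")
    return "\n".join(parts)

print(echo_prompt("If I flip a coin twice, how many outcomes are there?"))
```

The resulting string would be sent to the language model as-is; per the paper's ablations, both the original query and the model-generated rephrasing contribute to the gains.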


Lessons Learned: Defending Against Property Inference Attacks

Stock, Joshua, Wettlaufer, Jens, Demmler, Daniel, Federrath, Hannes

arXiv.org Artificial Intelligence

This work investigates and evaluates multiple defense strategies against property inference attacks (PIAs), a privacy attack against machine learning models. Given a trained machine learning model, PIAs aim to extract statistical properties of its underlying training data, e.g., reveal the ratio of men and women in a medical training data set. While for other privacy attacks like membership inference, a lot of research on defense mechanisms has been published, this is the first work focusing on defending against PIAs. With the primary goal of developing a generic mitigation strategy against white-box PIAs, we propose the novel approach property unlearning. Extensive experiments with property unlearning show that while it is very effective when defending target models against specific adversaries, property unlearning is not able to generalize, i.e., protect against a whole class of PIAs. To investigate the reasons behind this limitation, we present the results of experiments with the explainable AI tool LIME. They show how state-of-the-art property inference adversaries with the same objective focus on different parts of the target model. We further elaborate on this with a follow-up experiment, in which we use the visualization technique t-SNE to exhibit how severely statistical training data properties are manifested in machine learning models. Based on this, we develop the conjecture that post-training techniques like property unlearning might not suffice to provide the desirable generic protection against PIAs. As an alternative, we investigate the effects of simpler training data preprocessing methods like adding Gaussian noise to images of a training data set on the success rate of PIAs. We conclude with a discussion of the different defense approaches, summarize the lessons learned and provide directions for future work.
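The simpler preprocessing defense mentioned at the end — perturbing training images with Gaussian noise — might look like the following. The noise scale and pixel representation are illustrative choices, not values from the paper.

```python
# Minimal sketch of Gaussian-noise preprocessing as a PIA mitigation:
# perturb each pixel (values assumed in [0, 1]) and clip back into range.
# sigma is a hypothetical setting, not one reported in the paper.
import random

def add_gaussian_noise(image, sigma=0.05):
    """Return a noisy copy of a flat list of pixel intensities."""
    return [min(1.0, max(0.0, px + random.gauss(0.0, sigma)))
            for px in image]

random.seed(0)
print(add_gaussian_noise([0.2, 0.5, 0.9]))
```

Applied to every training image before fitting the target model, this blurs the statistical property an adversary tries to infer, at some cost to model utility.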


FairLay-ML: Intuitive Remedies for Unfairness in Data-Driven Social-Critical Algorithms

Yu, Normen, Tan, Gang, Tizpaz-Niari, Saeid

arXiv.org Artificial Intelligence

This thesis explores open-sourced machine learning (ML) model explanation tools to understand whether these tools can allow a layman to visualize, understand, and suggest intuitive remedies to unfairness in ML-based decision-support systems. Machine learning models trained on datasets biased against minority groups are increasingly used to guide life-altering social decisions, prompting the urgent need to study their logic for unfairness. Due to this problem's impact on vast populations of the general public, it is critical for the layperson -- not just subject matter experts in social justice or machine learning experts -- to understand the nature of unfairness within these algorithms and the potential trade-offs. Existing research on fairness in machine learning focuses mostly on the mathematical definitions and tools to understand and remedy unfair models, with some directly citing user-interactive tools as necessary for future work. This thesis presents FairLay-ML, a proof-of-concept GUI that provides intuitive explanations for unfair logic in ML models by integrating existing research tools (e.g., Local Interpretable Model-Agnostic Explanations) with existing ML-focused GUI frameworks (e.g., Python Streamlit). We test FairLay-ML using models of varying accuracy and fairness generated by an unfairness detector tool, Parfait-ML, and validate our results using Themis. Our study finds that the technology stack used for FairLay-ML makes it easy to install and provides real-time black-box explanations of pre-trained models to users. Furthermore, the explanations provided translate to actionable remedies.


More efficient manual review of automatically transcribed tabular data

Pedersen, Bjørn-Richard, Johansen, Rigmor Katrine, Holsbø, Einar, Sommerseth, Hilde, Bongo, Lars Ailo

arXiv.org Artificial Intelligence

Machine learning methods have proven useful in transcribing historical data. However, results from even highly accurate methods require manual verification and correction. Such manual review can be time-consuming and expensive; the objective of this paper is therefore to make it more efficient. Previously, we used machine learning to transcribe 2.3 million handwritten occupation codes from the Norwegian 1950 census with high accuracy (97%). We manually reviewed the 90,000 (3%) codes with the lowest model confidence. We allocated those 90,000 codes to human reviewers, who used our annotation tool to review the codes. To assess reviewer agreement, some codes were assigned to multiple reviewers. We then analyzed the review results to understand the relationship between accuracy improvements and effort. Additionally, we interviewed the reviewers to improve the workflow. The reviewers corrected 62.8% of the labels and agreed with the model label in 31.9% of cases. About 0.2% of the images could not be assigned a label, while for 5.1% the reviewers were uncertain, or they assigned an invalid label. 9,000 images were independently reviewed by multiple reviewers, resulting in an agreement of 86.43% and disagreement of 8.96%. We learned that our automatic transcription is biased towards the most frequent codes, with a higher degree of misclassification for the lowest frequency codes. Our interview findings show that the reviewers did internal quality control and found our custom tool well-suited. We conclude that only one reviewer is needed, but reviewers should report uncertainty.
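The triage step described above — routing only the lowest-confidence predictions (the bottom 3%) to manual review — reduces to a sort and a slice. The record layout below is a hypothetical stand-in for the paper's data.

```python
# Sketch of confidence-based triage: sort predictions by model confidence
# (ascending) and send the bottom fraction to human reviewers.
# Field layout (image_id, code, confidence) is an assumption.
def select_for_review(predictions, fraction=0.03):
    """Return the least-confident fraction of (image_id, code, conf) tuples."""
    ranked = sorted(predictions, key=lambda p: p[2])
    n_review = max(1, round(len(ranked) * fraction))
    return ranked[:n_review]

preds = [("img1", "123", 0.99), ("img2", "456", 0.41),
         ("img3", "789", 0.97), ("img4", "321", 0.88)]
print(select_for_review(preds, fraction=0.5))  # two least-confident codes
```

With 2.3 million codes and fraction=0.03, this yields the roughly 90,000 codes the paper's reviewers handled.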


Understanding how Differentially Private Generative Models Spend their Privacy Budget

Ganev, Georgi, Xu, Kai, De Cristofaro, Emiliano

arXiv.org Artificial Intelligence

Generative models trained with Differential Privacy (DP) are increasingly used to produce synthetic data while reducing privacy risks. Navigating their specific privacy-utility tradeoffs makes it challenging to determine which models would work best for specific settings/tasks. In this paper, we fill this gap in the context of tabular data by analyzing how DP generative models distribute privacy budgets across rows and columns, arguably the main source of utility degradation. We examine the main factors contributing to how privacy budgets are spent, including underlying modeling techniques, DP mechanisms, and data dimensionality. Our extensive evaluation of both graphical and deep generative models sheds light on the distinctive features that render them suitable for different settings and tasks. We show that graphical models distribute the privacy budget horizontally and thus cannot handle relatively wide datasets, while their performance on the task they were optimized for increases monotonically with more data. Deep generative models spend their budget per iteration, so their behavior is less predictable with varying dataset dimensions but could perform better if trained on more features. Also, low levels of privacy ($\epsilon\geq100$) could help some models generalize, achieving better results than without applying DP.
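The "horizontal" budget spending attributed to graphical models can be illustrated with a toy sketch: splitting a total epsilon evenly across per-column count queries under sequential composition, with Laplace noise calibrated to each query. This is an illustration of the general mechanism, not any specific model from the paper.

```python
# Illustrative sketch (not from the paper): divide a total privacy budget
# across per-column count queries and add Laplace noise. More columns means
# a smaller per-query epsilon and thus more noise per count, which is why
# wide datasets are hard for models that spend budget this way.
import random

def laplace_noise(scale):
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def noisy_marginals(column_counts, epsilon_total):
    """Noisy per-column counts under even budget splitting.

    Sensitivity of each count query is 1, so the Laplace scale is 1/eps.
    """
    eps = epsilon_total / len(column_counts)  # sequential composition
    return {col: c + laplace_noise(1.0 / eps)
            for col, c in column_counts.items()}

random.seed(1)
print(noisy_marginals({"age": 120, "sex": 200, "income": 80},
                      epsilon_total=3.0))
```

Doubling the number of columns here halves the per-query epsilon and doubles the noise scale, a simple view of the width/utility tradeoff the paper measures.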


Max-Min Diversification with Fairness Constraints: Exact and Approximation Algorithms

Wang, Yanhao, Mathioudakis, Michael, Li, Jia, Fabbri, Francesco

arXiv.org Artificial Intelligence

This has raised concerns about the possibility that algorithms may produce unfair and discriminatory decisions for specific population groups, particularly in sensitive socio-computational domains such as voting, hiring, banking, education, and criminal justice [12, 25]. To alleviate such concerns, there has been a lot of research devoted to incorporating fairness into the algorithms for automated decision tasks, including classification [14], clustering [10], ranking [24, 32], matching [28], and data summarization [8, 20]. This paper considers the diversity maximization problem and addresses its fairness-aware variant. The problem consists in selecting a diverse subset of items from a given dataset and is encountered in data summarization [8, 23], web search [2], recommendation [21], feature selection [31], and elsewhere [34]. Existing literature on the problem of diversity maximization primarily focuses on two objectives, namely max-min diversification (MMD), which aims to maximize the minimum distance between any pair of selected items, and max-sum diversification (MSD), which seeks to maximize the sum of pairwise distances between selected items. As shown in Figure 1, MMD tends to cover the data range uniformly, while MSD tends to pick "outliers" and may include highly similar items in the solution. Since the notion of diversity captured by MMD better represents the property that data summarization, feature selection, and many other tasks target with their solutions, we will only consider MMD in this paper. To be precise, given a set V of n items in a metric space and a positive integer k ≤ n, MMD asks for a size-k subset S of V to maximize the minimum pairwise distance within S. In particular, we study the fair max-min diversification (FMMD) problem, a variant of MMD that aims not only to maximize the diversity measure defined above but also to guarantee the satisfaction of group fairness constraints as described below.
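For intuition on plain MMD, the classic greedy farthest-point heuristic (a well-known 2-approximation for the unconstrained problem) is sketched below; the fairness constraints that FMMD adds are deliberately not handled here.

```python
# Sketch of the greedy farthest-point heuristic for (unconstrained) MMD:
# start from an arbitrary item, then repeatedly add the item whose distance
# to its nearest already-selected item is largest.
def greedy_mmd(points, k, dist):
    selected = [points[0]]  # arbitrary starting item
    while len(selected) < k:
        nxt = max((p for p in points if p not in selected),
                  key=lambda p: min(dist(p, s) for s in selected))
        selected.append(nxt)
    return selected

euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
pts = [(0, 0), (1, 0), (5, 0), (5, 5), (0, 5)]
print(greedy_mmd(pts, 3, euclid))
```

Note how the heuristic spreads the selection across the data range, matching the "uniform coverage" behavior of MMD described above; enforcing group quotas on top of this greedy choice is exactly where the paper's FMMD algorithms come in.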


Census Achieves Premier Partner Status with Snowflake

#artificialintelligence

Census, the leading reverse ETL platform that syncs customer data from data warehouses to key business tools, announced that it has expanded its partnership with Snowflake, the Data Cloud company, and achieved Premier Partner status. As a Premier Partner, Census will continue collaborating closely with Snowflake to develop new solutions that help joint customers extract more value from data stored in the cloud. Unlike legacy data platforms or point-to-point solutions prone to error, Census's reverse ETL provides users with a single source of truth that is always current, complete, and consistent. In a matter of hours, data and operations teams can set up Census workflows that integrate seamlessly with Snowflake's platform to deliver actionable data with immediate revenue impact. "We are thrilled to add Census to Snowflake's roster of Premier Partners," said Tarik Dwiek, Senior Director of Technology Alliances at Snowflake.


Data Cleaning - Filter

#artificialintelligence

Learn how to filter numbers, words, and just about anything in order to reduce bias in your dataset. Filtering through data is a very common transformation; it takes in a conditional and checks through all the data to keep only the data that meets the condition. By filtering you can improve your machine learning models by training on a specific subset of data to specialize the model, remove incorrect data and outliers, or prune biased features. At its core, filtering takes a pile of data and turns it into something smaller and (hopefully) easier to work with. When you create a filter, you start from what will stay, not what will go.
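The idea above can be shown in a few lines; the records, fields, and thresholds below are made-up illustrations, not from the article.

```python
# A minimal illustration of filtering: keep only rows that satisfy a
# condition, phrased in terms of what stays rather than what goes.
rows = [{"age": 34, "income": 52000},
        {"age": -1, "income": 48000},    # invalid age: dropped
        {"age": 29, "income": 999999},   # outlier income: dropped
        {"age": 41, "income": 61000}]

keep = lambda r: 0 <= r["age"] <= 120 and r["income"] < 500000
cleaned = [r for r in rows if keep(r)]
print(len(cleaned))  # 2
```

The `keep` predicate states the condition for staying; everything failing it is removed, which is exactly the "start from what will stay" framing.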