 tenure


Flow Matching for Tabular Data Synthesis

arXiv.org Machine Learning

Synthetic data generation is an important tool for privacy-preserving data sharing. While diffusion models have set recent benchmarks, flow matching (FM) offers a promising alternative. This paper presents different ways to implement flow matching for tabular data synthesis. We provide a comprehensive empirical study that compares flow matching (FM and variational FM) with state-of-the-art diffusion methods (TabDDPM and TabSyn) in tabular data synthesis. We evaluate both the standard Optimal Transport (OT) and the Variance Preserving (VP) probability paths, and also compare deterministic and stochastic samplers (a comparison made possible by learning to generate with variational flow matching), characterising the empirical relationship between data utility and privacy risk. Our key findings reveal that flow matching, particularly TabbyFlow, outperforms diffusion baselines. Flow matching methods also achieve better performance with remarkably few function evaluations (≤ 100 steps), offering a substantial computational advantage. The choice of probability path is also crucial: the OT path demonstrates superior performance, while VP has potential for producing synthetic data with lower disclosure risk. Lastly, our results show that making flows stochastic not only preserves marginal distributions but, in some instances, enables the generation of high-utility synthetic data with reduced disclosure risk.
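
To make the OT probability path and the low step counts mentioned in the abstract concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function names (`ot_path_point`, `ot_target_velocity`, `euler_sample`) are hypothetical, the path is assumed to be the standard straight-line interpolation between a noise sample `x0` and a data sample `x1`, and the sampler is a plain deterministic Euler integrator with a hand-written velocity field standing in for a trained network.

```python
import numpy as np


def ot_path_point(x0, x1, t):
    """Point on the straight-line (OT) path between noise x0 and data x1 at time t."""
    return (1.0 - t) * x0 + t * x1


def ot_target_velocity(x0, x1):
    """Regression target for the velocity field on the OT path (constant in t)."""
    return x1 - x0


def euler_sample(velocity_fn, x0, n_steps=100):
    """Deterministic sampler: integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h  # t never reaches 1, so fields with a 1/(1 - t) factor stay finite
        x = x + h * velocity_fn(x, t)
    return x


# Illustration only: a closed-form field that transports any start point to a fixed target.
# In practice velocity_fn would be a learned model (an assumption, not shown here).
target = np.array([2.0, -1.0])
start = np.array([0.0, 1.0])
sample = euler_sample(lambda x, t: (target - x) / (1.0 - t), start, n_steps=10)
```

Each Euler step costs one evaluation of the velocity field, so `n_steps` corresponds to the number of function evaluations discussed in the abstract.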


Another Big Reason to Worry About Bari Weiss' Tenure at CBS News

Mother Jones

Right now, a potential peril is at hand: the end of truth. The appointment of Bari Weiss, the former opinion writer who started the heterodox website, to lead venerable CBS News set the media world in a tizzy. Since she had no experience in television broadcast news operations, David Ellison, the CEO of Paramount Skydance, must have selected her for ideological and editorial reasons. Weiss had positioned herself as the scourge of supposedly woke and DEI-driven liberal media, presumably a stance that appealed to Ellison, the son of tech billionaire Larry Ellison, a Trump supporter who put up much of the money that financed his son's recent takeover of Paramount. Weiss' first days at the network yielded worrisome signs.


The FTC Is Disappearing Blog Posts About AI Published During Lina Khan's Tenure

WIRED

The Federal Trade Commission removed several blog posts in recent months about open source and potential risks to consumers from the rapid spread of commercial AI tools. In late July 2024, Lina Khan, then the chair of the US Federal Trade Commission, gave a speech at an event hosted by the San Francisco startup accelerator Y Combinator in which she positioned herself as an advocate for open source artificial intelligence. The event took place as California lawmakers were considering a landmark bill called SB 1047 that would have imposed new testing and safety requirements on AI companies. Critics of the legislation, which was later vetoed by California governor Gavin Newsom, argued it would hamper the development and release of open source AI models.


The AI Data Scientist

arXiv.org Artificial Intelligence

Imagine decision-makers uploading data and, within minutes, receiving clear, actionable insights delivered straight to their fingertips. That is the promise of the AI Data Scientist, an autonomous Agent powered by large language models (LLMs) that closes the gap between evidence and action. Rather than simply writing code or responding to prompts, it reasons through questions, tests ideas, and delivers end-to-end insights at a pace far beyond traditional workflows. Guided by the scientific tenet of the hypothesis, this Agent uncovers explanatory patterns in data, evaluates their statistical significance, and uses them to inform predictive modeling. It then translates these results into recommendations that are both rigorous and accessible. At the core of the AI Data Scientist is a team of specialized LLM Subagents, each responsible for a distinct task such as data cleaning, statistical testing, validation, and plain-language communication. These Subagents write their own code, reason about causality, and identify when additional data is needed to support sound conclusions. Together, they achieve in minutes what might otherwise take days or weeks, enabling a new kind of interaction that makes deep data science both accessible and actionable.


Elon Musk's Grok chatbot melts down – and then wins a military contract

The Guardian

This week, Elon Musk's X, formerly Twitter, saw its artificial intelligence chatbot Grok go Nazi. In the past three years of Musk's ownership of the social network, it feels like X has weathered at least one public crisis per week, more often multiple. Last week, Musk's artificial intelligence firm, xAI, saw its flagship chatbot Grok declare itself a super-Nazi, referring to itself as "MechaHitler". It made racist, sexist and antisemitic posts, which the company deleted. One example, via my colleague Josh Taylor: Grok referred to a person with a common Jewish surname as someone who was "celebrating the tragic deaths of white kids" in the Texas floods as "future fascists".


Generate-then-Verify: Reconstructing Data from Limited Published Statistics

arXiv.org Machine Learning

We study the problem of reconstructing tabular data from aggregate statistics, in which the attacker aims to identify interesting claims about the sensitive data that can be verified with 100% certainty given the aggregates. Successful attempts in prior work have conducted studies in settings where the set of published statistics is rich enough that entire datasets can be reconstructed with certainty. In our work, we instead focus on the regime where many possible datasets match the published statistics, making it impossible to reconstruct the entire private dataset perfectly (i.e., when approaches in prior work fail). We propose the problem of partial data reconstruction, in which the goal of the adversary is to instead output a subset of rows and/or columns that are guaranteed to be correct. We introduce a novel integer programming approach that first generates a set of claims and then verifies whether each claim holds for all possible datasets consistent with the published aggregates. We evaluate our approach on the housing-level microdata from the U.S. Decennial Census release, demonstrating that privacy violations can still persist even when information published about such data is relatively sparse.
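
The generate-then-verify idea can be illustrated on a toy table. The sketch below is an assumption-laden stand-in, not the paper's method: it swaps the integer programming approach for brute-force enumeration of every dataset consistent with the published counts, which only works for tiny domains. The helper names (`consistent_datasets`, `verified_claims`), the two binary columns, and the counts are all hypothetical.

```python
from itertools import combinations_with_replacement, product


def consistent_datasets(domain, n_rows, stats):
    """Yield every multiset of n_rows rows whose counts match all published statistics.

    stats is a list of (predicate, count) pairs, e.g. (lambda r: r[0] == 1, 2).
    """
    for rows in combinations_with_replacement(domain, n_rows):
        if all(sum(1 for r in rows if pred(r)) == count for pred, count in stats):
            yield rows


def verified_claims(domain, n_rows, stats):
    """Generate claims of the form 'row r appears at least k times' and keep those
    that hold in every consistent dataset (brute force, not integer programming)."""
    datasets = list(consistent_datasets(domain, n_rows, stats))
    if not datasets:
        return {}
    claims = {}
    for row in domain:
        k = min(ds.count(row) for ds in datasets)  # guaranteed multiplicity
        if k > 0:
            claims[row] = k
    return claims


# Hypothetical toy release: 3 rows, columns (age_over_65, owner), two published counts.
domain = list(product([0, 1], [0, 1]))
stats = [(lambda r: r[0] == 1, 2), (lambda r: r[1] == 1, 1)]
claims = verified_claims(domain, 3, stats)
```

In this toy release two different datasets match the counts, so full reconstruction is impossible, yet the row `(1, 0)` is present in both and is therefore a verified claim.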


Does this new tent repel both water and the laws of physics?

New Scientist

Feedback is New Scientist's popular sideways look at the latest science and technology news. You can submit items you believe may amuse readers to Feedback by emailing feedback@newscientist.com. Ophthalmologist Gus Gazzard writes in after taking a close look at a marketing email he received from WildBounds. It advertised a revolutionary new range of tents from Colorado-based company Big Agnes, which has created a new kind of waterproofing called HyperBead. Marketing is often detached from reality, but one sentence stood out: "Waterproof at the molecular level, this proprietary material shrugs off rain without relying on coatings or chemicals, meaning no reproofing and no PFAS."


Towards Sustainable Workplace Mental Health: A Novel Approach to Early Intervention and Support

arXiv.org Artificial Intelligence

Employee well-being is a critical concern in the contemporary workplace, as highlighted by the American Psychological Association's 2021 report, indicating that 71% of employees experience stress or tension. This stress contributes significantly to workplace attrition and absenteeism, with 61% of attrition and 16% of sick days attributed to poor mental health. A major challenge for employers is that employees often remain unaware of their mental health issues until they reach a crisis point, resulting in limited utilization of corporate well-being benefits. This research addresses this challenge by presenting a groundbreaking stress detection algorithm that provides real-time support preemptively. Leveraging automated chatbot technology, the algorithm objectively measures mental health levels by analyzing chat conversations, offering personalized treatment suggestions in real-time based on linguistic biomarkers. The study explores the feasibility of integrating these innovations into practical learning applications within real-world contexts and introduces a chatbot-style system integrated into the broader employee experience platform. This platform, encompassing various features, aims to enhance overall employee well-being, detect stress in real time, and proactively engage with individuals to improve support effectiveness, demonstrating a 22% increase when assistance is provided early. Overall, the study emphasizes the importance of fostering a supportive workplace environment for employees' mental health.


Who is Ali Akbar Ahmadian, Iran's new security chief?

Al Jazeera

Iranian President Ebrahim Raisi has appointed a veteran commander with the Islamic Revolutionary Guard Corps (IRGC) as the country's new security chief. Ali Akbar Ahmadian, 62, was named on Monday as the new secretary of the Supreme National Security Council (SNSC), replacing Ali Shamkhani, who held the post for close to a decade. Ahmadian takes the reins of the SNSC at a time of rapidly accelerating diplomatic regional efforts facilitated by his predecessor, including the re-establishment of ties with rival Saudi Arabia after a China-brokered agreement in March. Iran's relations with the West, however, remain sour. A landmark 2015 nuclear deal with world powers remains in limbo, while Iran has been accused of supplying Russia with armed drones for the war in Ukraine and tensions have steadily risen following nationwide protests that erupted across the country in September last year.


Improved Churn Causal Analysis Through Restrained High-Dimensional Feature Space Effects in Financial Institutions

arXiv.org Artificial Intelligence

Customer churn describes terminating a relationship with a business or reducing customer engagement over a specific period. Customer acquisition cost can be five to six times that of customer retention, hence investing in customers with churn risk is wise. Causal analysis of the churn model can predict whether a customer will churn in the foreseeable future and identify effects and possible causes for churn. In general, this study presents a conceptual framework to discover the confounding features that correlate with independent variables and are causally related to those dependent variables that impact churn. We combine different algorithms, including SMOTE, ensemble ANN, and Bayesian networks, to address churn prediction problems on massive, high-dimensional financial data of the kind usually generated in financial institutions through the interval-based features used in Customer Relationship Management systems. The effects of the curse and blessing of dimensionality are assessed using the Recursive Feature Elimination method to overcome the high-dimensional feature space problem. Moreover, causal discovery is performed to find possible interpretation methods that describe the cause probabilities leading to customer churn. Evaluation metrics on validation data confirm that the random forest and our ensemble ANN model, with 86% accuracy, outperformed other approaches. Causal analysis results confirm that some independent causal variables representing the level of super guarantee contribution, account growth, and account balance amount were identified as confounding variables that cause customer churn with a high degree of belief. This article provides a real-world customer churn analysis from current status inference to future directions in local superannuation funds.