REALITrees: Rashomon Ensemble Active Learning for Interpretable Trees
Nguyen, Simon D., McTavish, Hayden, Hoffman, Kentaro, Rudin, Cynthia, McCormick, Tyler H.
Active learning reduces labeling costs by selecting samples that maximize information gain. A dominant framework, Query-by-Committee (QBC), typically relies on perturbation-based diversity, inducing model disagreement through random feature subsetting or data blinding. While this approximates one notion of epistemic uncertainty, it sacrifices direct characterization of the plausible hypothesis space. We propose a complementary approach, Rashomon Ensembled Active Learning (REAL), which constructs a committee by exhaustively enumerating the Rashomon Set of all near-optimal models. To address functional redundancy within this set, we adopt a PAC-Bayesian framework using a Gibbs posterior to weight committee members by their empirical risk. Leveraging recent algorithmic advances, we exactly enumerate this set for the class of sparse decision trees. Across synthetic and established active learning baselines, REAL outperforms randomized ensembles, particularly in moderately noisy environments where it strategically leverages expanded model multiplicity to achieve faster convergence.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.48)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)
- Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.35)
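The committee weighting described in the REALITrees abstract can be illustrated with a minimal sketch. It assumes a Gibbs posterior of the form exp(-risk / temperature) and uses weighted vote entropy as the disagreement score. The abstract names neither a temperature parameter nor a specific acquisition function, so `temperature`, `weighted_vote_entropy`, and `select_query` are hypothetical names chosen for illustration, not the authors' implementation.

```python
import math

def gibbs_weights(risks, temperature=1.0):
    """Weight each committee member proportionally to exp(-risk / temperature).

    `temperature` is an assumed hyperparameter; subtracting the minimum risk
    only improves numerical stability and does not change the weights.
    """
    lowest = min(risks)
    unnormalized = [math.exp(-(r - lowest) / temperature) for r in risks]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def weighted_vote_entropy(votes, weights):
    """Entropy of the weighted label distribution for one candidate sample."""
    mass = {}
    for label, weight in zip(votes, weights):
        mass[label] = mass.get(label, 0.0) + weight
    return -sum(p * math.log(p) for p in mass.values() if p > 0)

def select_query(committee_votes, weights):
    """Return the index of the candidate with the highest weighted disagreement."""
    scores = [weighted_vote_entropy(votes, weights) for votes in committee_votes]
    return max(range(len(scores)), key=scores.__getitem__)

# Three committee members (e.g. near-optimal trees) with their empirical risks.
risks = [0.10, 0.10, 0.30]
weights = gibbs_weights(risks, temperature=0.1)

# committee_votes[i][j] is the label member j predicts for unlabeled candidate i.
committee_votes = [
    [0, 0, 0],  # unanimous
    [0, 1, 0],  # the two low-risk members disagree
    [1, 1, 1],  # unanimous
]
chosen = select_query(committee_votes, weights)
```

Weighting by risk means that disagreement between two low-risk members counts for more than disagreement involving a high-risk member, which is one way to discount functionally redundant or weaker trees.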
The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning
However, text-davinci-002 is able to benefit more substantially. We further show that explanations generated by the LLMs may not entail the models' predictions nor be factually grounded in the input, even on simple tasks with extractive explanations. However, these flawed explanations can still be useful as a way to verify LLMs' predictions post-hoc.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Louisiana (0.04)
- North America > Canada > Alberta > Census Division No. 15 > Improvement District No. 9 > Banff (0.04)
Disney advert banned for showing 'disturbing' severed body
A menacing Disney advert featuring a severed body has been banned by the advertising regulator, which said it was likely to frighten and cause distress to children. The Advertising Standards Authority (ASA) found the entertainment giant had broken its rules with its advert for the Predator Badlands film. Parents complained that the digital poster, which featured a large alien holding aloft the severed body of a smaller, human figure, was inappropriate and disturbing for young children. Disney said the severed body was actually that of a robot, and the fact it had been cut in two further emphasised its non-human nature. The advert, which was seen on the roadside in Giffnock, Glasgow, was promoting the Disney sci-fi film ahead of its release in November.
- North America > United States (0.31)
- North America > Central America (0.15)
- Oceania > Australia (0.06)
- (11 more...)
- Leisure & Entertainment (1.00)
- Media > Film (0.97)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.67)
Distributional Treatment Effect Estimation across Heterogeneous Sites via Optimal Transport
Bateni, Borna, Yuan, Yubai, Xu, Qi, Qu, Annie
We propose a novel framework for synthesizing counterfactual treatment group data in a target site by integrating full treatment and control group data from a source site with control group data from the target. Departing from conventional average treatment effect estimation, our approach adopts a distributional causal inference perspective by modeling treatment and control as distinct probability measures on the source and target sites. We formalize the cross-site heterogeneity (effect modification) as a push-forward transformation that maps the joint feature-outcome distribution from the source to the target site. This transformation is learned by aligning the control group distributions between sites using an Optimal Transport-based procedure, and subsequently applied to the source treatment group to generate the synthetic target treatment distribution. Under general regularity conditions, we establish theoretical guarantees for the consistency and asymptotic convergence of the synthetic treatment group data to the true target distribution. Simulation studies across multiple data-generating scenarios and a real-world application to patient-derived xenograft data demonstrate that our framework robustly recovers the full distributional properties of treatment effects.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > California > Santa Barbara County > Santa Barbara (0.14)
- North America > United States > Pennsylvania > Centre County > University Park (0.04)
- (10 more...)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.67)
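The align-then-apply idea in the optimal transport abstract can be sketched in one dimension, where the optimal transport map between two empirical distributions reduces to quantile matching. This is a deliberately simplified illustration on scalar outcomes only. The paper works with joint feature-outcome distributions, and the function name `fit_quantile_map` and the toy data are assumptions made for this example.

```python
from bisect import bisect_right

def fit_quantile_map(source_control, target_control):
    """Learn a 1-D transport map from source-control to target-control outcomes.

    In one dimension the optimal transport map sends each source quantile to
    the matching target quantile, so sorting both samples is sufficient.
    """
    src = sorted(source_control)
    tgt = sorted(target_control)
    n, m = len(src), len(tgt)

    def transport(x):
        # Number of source-control points <= x (empirical CDF times n).
        k = bisect_right(src, x)
        # Integer ceiling of k * m / n gives the matching target rank.
        idx = max(0, -(-k * m // n) - 1)
        return tgt[min(m - 1, idx)]

    return transport

# Toy data: the target site's control outcomes are shifted by +10.
source_control = [1.0, 2.0, 3.0, 4.0]
target_control = [11.0, 12.0, 13.0, 14.0]
source_treated = [2.0, 4.0]

transport = fit_quantile_map(source_control, target_control)
# Apply the map learned on controls to the source treatment group.
synthetic_target_treated = [transport(y) for y in source_treated]
```

The map is fit only on the two control groups and then reused on the source treatment group, mirroring the two-stage procedure the abstract describes. A multivariate version would need a general optimal transport solver rather than sorting.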
A Supplementary Material Learning Compositional Rules via Neural Program Synthesis
All models were implemented in PyTorch. For all experiments, we report standard error below. Primitive rules map a word to a color (e.g. In a higher-order rule, the left-hand side can be one or two variables and a word, and the right-hand side can be any sequence of bracketed forms of those variables. Figure A.2 shows several example training grammars sampled from the meta-grammar.
KLAAD: Refining Attention Mechanisms to Reduce Societal Bias in Generative Language Models
Kim, Seorin, Lee, Dongyoung, Lee, Jaejin
Large language models (LLMs) often exhibit societal biases in their outputs, prompting ethical concerns regarding fairness and harm. In this work, we propose KLAAD (KL-Attention Alignment Debiasing), an attention-based debiasing framework that implicitly aligns attention distributions between stereotypical and anti-stereotypical sentence pairs without directly modifying model weights. KLAAD introduces a composite training objective combining Cross-Entropy, KL divergence, and Triplet losses, guiding the model to consistently attend across biased and unbiased contexts while preserving fluency and coherence. Experimental evaluation of KLAAD demonstrates improved bias mitigation on both the BBQ and BOLD benchmarks, with minimal impact on language modeling quality. The results indicate that attention-level alignment offers a principled solution for mitigating bias in generative language models.
- North America > United States (1.00)
- Asia > Middle East > UAE (0.46)
- Leisure & Entertainment (1.00)
- Energy > Renewable (0.46)
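The composite objective named in the KLAAD abstract (Cross-Entropy, KL divergence, and Triplet losses) can be sketched as a weighted sum. The abstract does not give the loss weights, the direction of the KL term, or what serves as anchor, positive, and negative in the triplet term, so `w_kl`, `w_triplet`, `margin`, and the toy inputs below are all assumptions for illustration, not the authors' formulation.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions, e.g. attention weights."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss with Euclidean distance."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def composite_loss(ce, kl, triplet, w_kl=1.0, w_triplet=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return ce + w_kl * kl + w_triplet * triplet

# Toy attention distributions over four tokens for a sentence pair.
attn_stereotypical = [0.70, 0.10, 0.10, 0.10]
attn_anti_stereotypical = [0.40, 0.20, 0.20, 0.20]

kl = kl_divergence(attn_stereotypical, attn_anti_stereotypical)
trip = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5], margin=1.0)
total = composite_loss(ce=2.0, kl=kl, triplet=trip, w_kl=0.1, w_triplet=0.5)
```

The KL term shrinks to zero as the two attention distributions coincide, which is the sense in which it pushes the model to attend consistently across the paired contexts.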
GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
Chen, Zihong, Jiang, Wanli, Li, Jinzhe, Yuan, Zhonghang, Kong, Huanjun, Ouyang, Wanli, Dong, Nanqing
Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data. Experimental results on knowledge-intensive tasks under closed-book settings demonstrate that GraphGen outperforms conventional synthetic data methods, offering a more reliable and comprehensive solution to the data scarcity challenge in supervised fine-tuning. The code and data are publicly available at https://github.com/open-sciencelab/GraphGen.
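The expected calibration error (ECE) mentioned in the GraphGen abstract has a standard binned definition: group predictions by confidence, then take the sample-weighted average of the gap between accuracy and mean confidence in each bin. The sketch below implements that standard definition only. How GraphGen obtains confidences or applies ECE to rank knowledge gaps is not stated in the abstract, and the equal-width binning and `n_bins=10` are common defaults assumed here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(n_bins - 1, int(conf * n_bins))
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece

# Four answers given with 90% confidence, of which three are right.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
# Fully confident and fully correct answers are perfectly calibrated.
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

A large gap between stated confidence and observed accuracy is one signal that a model's knowledge of a fact is unreliable, which is consistent with using a calibration metric to prioritize what to generate training data for.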
TAMIS: Tailored Membership Inference Attacks on Synthetic Data
Andrey, Paul, Bars, Batiste Le, Tommasi, Marc
Membership Inference Attacks (MIA) make it possible to empirically assess the privacy of a machine learning algorithm. In this paper, we propose TAMIS, a novel MIA against differentially-private synthetic data generation methods that rely on graphical models. This attack builds upon MAMA-MIA, a recently published state-of-the-art method, while lowering its computational cost and requiring less attacker knowledge. Our attack is the product of a two-fold improvement. First, we recover the graphical model that generated a synthetic dataset using solely that dataset, rather than shadow-modeling over an auxiliary one. This proves less costly and more performant. Second, we introduce a more mathematically grounded attack score that provides a natural threshold for binary predictions. In our experiments, TAMIS achieves performance better than or similar to that of MAMA-MIA on replicas of the SNAKE challenge.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- (2 more...)