Goto

Collaborating Authors

 Law


GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language can often provide a much richer learning medium for LLMs, compared with policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain. Across four tasks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% across two LLMs, and demonstrates promising results as an inference-time search strategy for code optimization.


Forest-Guided Clustering -- Shedding Light into the Random Forest Black Box

arXiv.org Artificial Intelligence

As machine learning models are increasingly deployed in sensitive application areas, the demand for interpretable and trustworthy decision-making has increased. Random Forests (RF), despite their widespread use and strong performance on tabular data, remain difficult to interpret due to their ensemble nature. We present Forest-Guided Clustering (FGC), a model-specific explainability method that reveals both local and global structure in RFs by grouping instances according to shared decision paths. FGC produces human-interpretable clusters aligned with the model's internal logic and computes cluster-specific and global feature importance scores to derive decision rules underlying RF predictions. FGC accurately recovered latent subclass structure on a benchmark dataset and outperformed classical clustering and post-hoc explanation methods. Applied to an AML transcriptomic dataset, FGC uncovered biologically coherent subpopulations, disentangled disease-relevant signals from confounders, and recovered known and novel gene expression patterns. FGC bridges the gap between performance and interpretability by providing structure-aware insights that go beyond feature-level attribution.


Objectifying the Subjective: Cognitive Biases in Topic Interpretations

arXiv.org Artificial Intelligence

Interpretation of topics is crucial for their downstream applications. State-of-the-art evaluation measures of topic quality such as coherence and word intrusion do not measure how much a topic facilitates the exploration of a corpus. To design evaluation measures grounded on a task, and a population of users, we do user studies to understand how users interpret topics. We propose constructs of topic quality and ask users to assess them in the context of a topic and provide rationale behind evaluations. We use reflexive thematic analysis to identify themes of topic interpretations from rationales. Users interpret topics based on availability and representativeness heuristics rather than probability. We propose a theory of topic interpretation based on the anchoring-and-adjustment heuristic: users anchor on salient words and make semantic adjustments to arrive at an interpretation. Topic interpretation can be viewed as making a judgment under uncertainty by an ecologically rational user, and hence cognitive biases aware user models and evaluation frameworks are needed.


Closing the Modality Gap for Mixed Modality Search

arXiv.org Artificial Intelligence

Mixed modality search -- retrieving information across a heterogeneous corpus composed of images, texts, and multimodal documents -- is an important yet underexplored real-world application. In this work, we investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space, where image and text embeddings form distinct clusters, leading to intra-modal ranking bias and inter-modal fusion failure. To address this issue, we propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP's embedding space. Evaluated on MixBench -- the first benchmark specifically designed for mixed modality search -- GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP, surpasses recent vision-language generative embedding models by 4 percentage points, while using 75x less compute.


Legal Document Summarization: Enhancing Judicial Efficiency through Automation Detection

arXiv.org Artificial Intelligence

Legal document summarization represents a significant advancement towards improving judicial efficiency through the automation of key information detection. Our approach leverages state-of-the-art natural language processing techniques to meticulously identify and extract essential data from extensive legal texts, which facilitates a more efficient review process. By employing advanced machine learning algorithms, the framework recognizes underlying patterns within judicial documents to create precise summaries that encapsulate the crucial elements. This automation alleviates the burden on legal professionals, concurrently reducing the likelihood of overlooking vital information that could lead to errors. Through comprehensive experiments conducted with actual legal datasets, we demonstrate the capability of our method to generate high-quality summaries while preserving the integrity of the original content and enhancing processing times considerably. The results reveal marked improvements in operational efficiency, allowing legal practitioners to direct their efforts toward critical analytical and decision-making activities instead of manual reviews. This research highlights promising technology-driven strategies that can significantly alter workflow dynamics within the legal sector, emphasizing the role of automation in refining judicial processes.


REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?

arXiv.org Artificial Intelligence

Assessing the reproducibility of social science papers is essential for promoting rigor in research processes, but manual assessment is costly. With recent advances in agentic AI systems (i.e., AI agents), we seek to evaluate their capability to automate this process. However, existing benchmarks for reproducing research papers (1) focus solely on reproducing results using provided code and data without assessing their consistency with the paper, (2) oversimplify real-world scenarios, and (3) lack necessary diversity in data formats and programming languages. To address these issues, we introduce REPRO-Bench, a collection of 112 task instances, each representing a social science paper with a publicly available reproduction report. The agents are tasked with assessing the reproducibility of the paper based on the original paper PDF and the corresponding reproduction package. REPRO-Bench features end-to-end evaluation tasks on the reproducibility of social science papers with complexity comparable to real-world assessments. We evaluate three representative AI agents on REPRO-Bench, with the best-performing agent achieving an accuracy of only 21.4%. Building on our empirical analysis, we develop REPRO-Agent, which improves the highest accuracy achieved by existing agents by 71%. We conclude that more advanced AI agents should be developed to automate real-world reproducibility assessment. REPRO-Bench is publicly available at https://github.com/uiuc-kang-lab/REPRO-Bench.


Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

arXiv.org Artificial Intelligence

Despite widespread adoption, the impact of AI tools on software development in the wild remains understudied. We conduct a randomized controlled trial (RCT) to understand how AI tools at the February-June 2025 frontier affect the productivity of experienced open-source developers. 16 developers with moderate AI experience complete 246 tasks in mature projects on which they have an average of 5 years of prior experience. Each task is randomly assigned to allow or disallow usage of early 2025 AI tools. When AI tools are allowed, developers primarily use Cursor Pro, a popular code editor, and Claude 3.5/3.7 Sonnet. Before starting tasks, developers forecast that allowing AI will reduce completion time by 24%. After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19%--AI tooling slowed developers down. This slowdown also contradicts predictions from experts in economics (39% shorter) and ML (38% shorter). To understand this result, we collect and evaluate evidence for 20 properties of our setting that a priori could contribute to the observed slowdown effect--for example, the size and quality standards of projects, or prior developer experience with AI tooling. Although the influence of experimental artifacts cannot be entirely ruled out, the robustness of the slowdown effect across our analyses suggests it is unlikely to primarily be a function of our experimental design.


Concept-TRAK: Understanding how diffusion models learn concepts through concept-level attribution

arXiv.org Artificial Intelligence

While diffusion models excel at image generation, their growing adoption raises critical concerns around copyright issues and model transparency. Existing attribution methods identify training examples influencing an entire image, but fall short in isolating contributions to specific elements, such as styles or objects, that matter most to stakeholders. To bridge this gap, we introduce \emph{concept-level attribution} via a novel method called \emph{Concept-TRAK}. Concept-TRAK extends influence functions with two key innovations: (1) a reformulated diffusion training loss based on diffusion posterior sampling, enabling robust, sample-specific attribution; and (2) a concept-aware reward function that emphasizes semantic relevance. We evaluate Concept-TRAK on the AbC benchmark, showing substantial improvements over prior methods. Through diverse case studies--ranging from identifying IP-protected and unsafe content to analyzing prompt engineering and compositional learning--we demonstrate how concept-level attribution yields actionable insights for responsible generative AI development and governance.


Hackers steal images from women's dating safety app that vets men

BBC News

The app has recently experienced a surge in popularity - as well as criticism from some who claim it is anti-men. Tea lets women check whether potential partners are married or registered sex offenders as well as run reverse image searches to protect against "catfishing", where people use fake online identities. But one of the most controversial aspects of Tea is that it allows women to share information on men they have dated to "avoid red flags" but also highlight those with "green flag" qualities. The company said the breached photos "can in no way be linked to posts within Tea". The firm blocks screenshots so that posts are not shared outside the app.


California man accused by feds of scamming 2 million from people on dating apps

FOX News

Fox News Flash top headlines are here. Check out what's clicking on Foxnews.com. A California man was federally charged for allegedly scamming more than 2 million from people over popular dating apps by posing as someone who was "financially successful and knowledgeable about investments," prosecutors said. Christopher Earl Lloyd, 39, of Whittier, is now facing a 14-count federal indictment in connection with the alleged scheme he carried out for nearly three years on dating apps such as Tinder, Hinge and Bumble, according to the U.S. Attorney's Office of the Central District of California. "According to the indictment that a federal grand jury returned on July 2, from April 2021 to February 2024, Lloyd used dating apps and websites to befriend and engage in romantic relationships with his victims. Lloyd lied to his victims to give them the impression that he was financially successful and knowledgeable about investments," the Attorney's Office said.