SyncTwin: Treatment Effect Estimation with Longitudinal Outcomes
Most medical observational studies estimate causal treatment effects using electronic health records (EHR), where a patient's covariates and outcomes are both observed longitudinally. However, previous methods focus only on adjusting for the covariates while neglecting the temporal structure in the outcomes. To bridge the gap, this paper develops a new method, SyncTwin, that learns a patient-specific time-constant representation from the pre-treatment observations. SyncTwin issues a counterfactual prediction for a target patient by constructing a synthetic twin that closely matches the target in representation. The reliability of the estimated treatment effect can be assessed by comparing the observed and synthetic pre-treatment outcomes. Medical experts can interpret the estimate by examining the individuals who contribute most to the synthetic twin. In the real-data experiment, SyncTwin successfully reproduced the findings of a randomized controlled clinical trial using observational data, which demonstrates its usability on complex real-world EHR data.
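The synthetic-twin idea can be illustrated with a small sketch. It is not the authors' implementation: it assumes the representations are already learned (here they are hand-written 2-D vectors), it assumes convex weights as in classical synthetic control, and it fits them with an exponentiated-gradient update chosen only for simplicity. The names `fit_twin_weights` and `synthetic_outcome` and all data values are hypothetical.

```python
import math

def fit_twin_weights(target, controls, lr=0.2, steps=5000):
    """Find weights w (w_j >= 0, sum_j w_j = 1) such that
    sum_j w_j * controls[j] is close to target in squared error."""
    n, dim = len(controls), len(target)
    w = [1.0 / n] * n  # start from uniform weights
    for _ in range(steps):
        # Representation of the current synthetic twin and its residual.
        synth = [sum(w[j] * controls[j][d] for j in range(n)) for d in range(dim)]
        resid = [synth[d] - target[d] for d in range(dim)]
        # Gradient of the squared error with respect to each weight.
        grad = [2.0 * sum(resid[d] * controls[j][d] for d in range(dim))
                for j in range(n)]
        # Multiplicative update keeps weights positive; renormalizing
        # keeps them on the probability simplex.
        w = [w[j] * math.exp(-lr * grad[j]) for j in range(n)]
        total = sum(w)
        w = [x / total for x in w]
    return w

def synthetic_outcome(weights, control_outcomes):
    """Weighted average of the control patients' outcomes."""
    return sum(wj * yj for wj, yj in zip(weights, control_outcomes))

# Hypothetical representations of three untreated patients and one treated target.
controls = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
target = [1.0, 0.0]
weights = fit_twin_weights(target, controls)

# Hypothetical post-treatment outcomes of the controls and of the target.
control_outcomes = [10.0, 20.0, 30.0]
observed_outcome = 18.0
twin_outcome = synthetic_outcome(weights, control_outcomes)
estimated_effect = observed_outcome - twin_outcome
```

In this toy setup the target lies midway between the first two controls, so nearly all weight goes to them, and the weights themselves show which individuals contribute most to the twin. Applying the same weights to pre-treatment outcomes and comparing with the target's observed pre-treatment outcomes would give the reliability check the abstract describes.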
Auditing the Auditors: Does Community-based Moderation Get It Right?
Alimohammadi, Yeganeh, Huang, Karissa, Borgs, Christian, Chayes, Jennifer
Online social platforms increasingly rely on crowd-sourced systems to label misleading content at scale, but these systems must both aggregate users' evaluations and decide whose evaluations to trust. To address the latter, many platforms audit users by rewarding agreement with the final aggregate outcome, a design we term consensus-based auditing. We analyze the consequences of this design in X's Community Notes, which in September 2022 adopted consensus-based auditing that ties users' eligibility for participation to agreement with the eventual platform outcome. We find evidence of strategic conformity: minority contributors' evaluations drift toward the majority and their participation share falls on controversial topics, where independent signals matter most. We formalize this mechanism in a behavioral model in which contributors trade off private beliefs against anticipated penalties for disagreement. Motivated by these findings, we propose a two-stage auditing and aggregation algorithm that weights contributors by the stability of their past residuals rather than by agreement with the majority. The method first accounts for differences across content and contributors, and then measures how predictable each contributor's evaluations are relative to the latent-factor model. Contributors whose evaluations are consistently informative receive greater influence in aggregation, even when they disagree with the prevailing consensus. In the Community Notes data, this approach improves out-of-sample predictive performance while avoiding penalization of disagreement.
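The two-stage idea can be sketched roughly as follows. This is a simplified illustration, not the paper's algorithm: stage one uses a one-pass additive fit (global mean, contributor bias, note effect) as a stand-in for the latent-factor model, and stage two uses inverse mean-squared residual as the stability weight, which is an assumption made here. All function names and ratings are hypothetical.

```python
from collections import defaultdict

def mean(xs):
    return sum(xs) / len(xs)

def stage_one(ratings):
    """ratings maps (contributor, note) -> score.
    Returns per-contributor bias and per-rating residuals
    after removing contributor and note effects."""
    mu = mean(list(ratings.values()))
    by_user = defaultdict(list)
    for (u, _), r in ratings.items():
        by_user[u].append(r)
    bias = {u: mean(rs) - mu for u, rs in by_user.items()}
    by_note = defaultdict(list)
    for (u, n), r in ratings.items():
        by_note[n].append(r - mu - bias[u])
    note_effect = {n: mean(vs) for n, vs in by_note.items()}
    resid = {(u, n): r - mu - bias[u] - note_effect[n]
             for (u, n), r in ratings.items()}
    return bias, resid

def stage_two(resid, eps=1e-6):
    """Weight each contributor by the stability of their residuals:
    small, predictable residuals give a large weight."""
    sq = defaultdict(list)
    for (u, _), e in resid.items():
        sq[u].append(e * e)
    return {u: 1.0 / (mean(v) + eps) for u, v in sq.items()}

def aggregate(ratings, bias, weights):
    """Weighted mean of bias-corrected ratings for each note."""
    num, den = defaultdict(float), defaultdict(float)
    for (u, n), r in ratings.items():
        num[n] += weights[u] * (r - bias[u])
        den[n] += weights[u]
    return {n: num[n] / den[n] for n in num}

# Hypothetical data: A and B form the majority, C always rates one point
# lower (a stable dissenter), and D is erratic.
raw = {
    "A": [4, 2, 4, 2],
    "B": [4, 2, 4, 2],
    "C": [3, 1, 3, 1],
    "D": [2, 4, 4, 2],
}
ratings = {(u, f"n{i + 1}"): float(r)
           for u, rs in raw.items() for i, r in enumerate(rs)}

bias, resid = stage_one(ratings)
weights = stage_two(resid)
scores = aggregate(ratings, bias, weights)
```

In this toy data the consistent dissenter C ends up with the same weight as the majority contributors, while the erratic contributor D is down-weighted. A consensus-agreement rule would instead penalize C for systematically disagreeing with the majority.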
Wikipedia's Existential Threats Feel Greater Than Ever
As the free online encyclopedia turns 25, it's facing political opposition, AI scraping, dwindling volunteers, and a public that may no longer believe in its ideals. In 2010, the FBI sent Wikipedia a letter that would be intimidating for any organization to receive. The missive demanded that the free online encyclopedia remove the FBI's logo from an entry about the agency, claiming that reproducing the emblem was illegal and punishable with fines, imprisonment, "or both." Rather than back down, a lawyer for the Wikimedia Foundation, which hosts Wikipedia, shot back a sharp refusal outlining how the FBI's interpretation of the relevant statute was incorrect and saying that Wikipedia was "prepared to argue our view in court." It worked--the FBI dropped the matter.
Contributor: Rob Reiner reshaped how California understands and invests in children
Hollywood director Rob Reiner engineered Proposition 10, a 1998 tobacco tax that created First 5 California, generating more than $11 billion for early childhood programs statewide. After his tragic death Sunday, the world remembers Rob Reiner as a cinematic force -- and he was one, as an unforgettable presence on the ambitious 1970s sitcom "All in the Family" and later as the director of beloved films. I came to know him differently: as a restless thinker who transformed his own life story into bold public policy, reshaping how California understands and invests in its youngest children.
Eka-Eval: An Evaluation Framework for Low-Resource Multilingual Large Language Models
Sinha, Samridhi Raj, Sheth, Rajvee, Upperwal, Abhishek, Singh, Mayank
The rapid evolution of Large Language Models has underscored the need for evaluation frameworks that are globally applicable, flexible, and modular, and that support a wide range of tasks, model types, and linguistic settings. We introduce EKA-EVAL, a unified, end-to-end framework that combines a zero-code web interface and an interactive CLI to ensure broad accessibility. It integrates 50+ multilingual benchmarks across nine evaluation categories, supports local and proprietary models, and provides 11 core capabilities through a modular, plug-and-play architecture. Designed for scalable, multilingual evaluation with support for low-resource languages, EKA-EVAL is, to the best of our knowledge, the first suite to offer comprehensive coverage in a single platform. Comparisons against five existing baselines indicate improvements of at least 2x on key usability measures, with the highest user satisfaction, faster setup times, and consistent benchmark reproducibility. The framework is open-source and publicly available at https://github.com/lingo-iitgn/eka-eval.
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks
Chervyakov, Artem, Kharitonov, Alexander, Zadorozhny, Pavel, Pavel, Adamenko, Levichev, Rodion, Vorobev, Dmitrii, Salikhov, Dmitrii, Valeev, Aidar, Pestova, Alena, Dziuba, Maria, Alimova, Ilseyar, Zavgorodnev, Artem, Medvedev, Aleksandr, Moiseev, Stanislav, Bruches, Elena, Grebenkin, Daniil, Derunets, Roman, Vladimir, Vikulov, Emelyanov, Anton, Babaev, Dmitrii, Ivanov, Vladimir V., Malykh, Valentin, Fenogenova, Alena
Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.
Trustless Federated Learning at Edge-Scale: A Compositional Architecture for Decentralized, Verifiable, and Incentive-Aligned Coordination
Onobhayedo, Pius, Oamen, Paul Osemudiame
Artificial intelligence is retracing the Internet's path from centralized provision to distributed creation. Initially, resource-intensive computation concentrates within institutions capable of training and serving large models. Eventually, as federated learning matures, billions of edge devices holding sensitive data will be able to collectively improve models without surrendering raw information, enabling both contribution and consumption at scale. This democratic vision remains unrealized due to four compositional gaps: aggregators handle updates without accountability; economic mechanisms are lacking and, even when present, remain vulnerable to gaming; coordination serializes state modifications, limiting scalability; and governance permits retroactive manipulation. This work addresses these gaps by leveraging cryptographic receipts to prove aggregation correctness, geometric novelty measurement to prevent incentive gaming, parallel object ownership to achieve linear scalability, and time-locked policies to check retroactive manipulation. The product of this work is a design architecture--not an actual implementation--that seeks to pass the baton in the race toward truly collaborative intelligence: an intelligence of the people, by the people, for the people.