Mysterious UFO-shaped 'Dorito' aircraft spotted over Area 51 as strange military code is heard
Trump orders a massive armada toward Iran with ominous warning about what could come next: 'We're watching'
Florida, Texas and California lead America's housing crash as other Sun Belt states start to crack as values plunge 7.6 percent
Meghan Trainor's teary photo with her new baby born via surrogate has sparked an almost unsayable thought. Most women won't admit it... but I will: CAROLINE BULLOCK
Billionaire who predicted 2008 crash issues stark warning over 'worrying' new US trend but there's one way to protect your savings AND make money
Canadian woman was euthanized 'against her will' after husband was fed-up with caring for her
Another awkward moment between Victoria Beckham and Nicola Peltz goes viral as fans claim Brooklyn's mum 'is not the problem'
Chilling video shows high school student rampaging through classroom with knife... before teacher steps in
Trump describes excruciating ...
Learning from Synthetic Data: Limitations of ERM
Amin, Kareem, Bie, Alex, Kong, Weiwei, Syed, Umar, Vassilvitskii, Sergei
The first generation of LLMs was largely trained on human-generated data. However, the success of LLMs and their increased adoption have had the unexpected consequence of AI-generated content appearing in places where there was previously none. Thus machine learning practitioners should be aware that there is an increased chance that their training data is contaminated by LLM-generated content. Previous work has looked into the value of synthetic (i.e., AI-generated) data, and showed that while naively adding this data to the training mix may lead to model collapse, being more diligent about which data is added, the amount of curation it undergoes, and the specifics of the training process may mitigate that risk, or reverse it, leading to improved performance. These works focus almost exclusively on the LLM setting, trying to improve state-of-the-art performance on a set of benchmarks. In contrast, in this work we take a traditional learning theory view on this problem. We begin by formalizing the setting and developing a framework that captures the invariants of having natural training data contaminated by synthetic additions. Specifically, we see three salient points: Ground truth. There exists a (potentially small) set of natural data, coming from the true data generation distribution.
Statistical Reinforcement Learning in the Real World: A Survey of Challenges and Future Directions
Gazi, Asim H., Guo, Yongyi, Gao, Daiqi, Xu, Ziping, Zhang, Kelly W., Murphy, Susan A.
Reinforcement learning (RL) has achieved remarkable success in real-world decision-making across diverse domains, including gaming, robotics, online advertising, public health, and natural language processing. Despite these advances, a substantial gap remains between RL research and its deployment in many practical settings. Two recurring challenges often underlie this gap. First, many settings offer limited opportunity for the agent to interact extensively with the target environment due to practical constraints. Second, many target environments undergo substantial changes, requiring redesign and redeployment of RL systems (e.g., advancements in science and technology that change the landscape of healthcare delivery). Addressing these challenges and bridging the gap between basic research and application requires theory and methodology that directly inform the design, implementation, and continual improvement of RL systems in real-world settings. In this paper, we frame the application of RL in practice as a three-component process: (i) online learning and optimization during deployment, (ii) post- or between-deployment offline analyses, and (iii) repeated cycles of deployment and redeployment to continually improve the RL system. We provide a narrative review of recent advances in statistical RL that address these components, including methods for maximizing data utility for between-deployment inference, enhancing sample efficiency for online learning within-deployment, and designing sequences of deployments for continual improvement. We also outline future research directions in statistical RL that are use-inspired, aiming for impactful application of RL in practice.
Fairness-informed Pareto Optimization: An Efficient Bilevel Framework
Tanji, Sofiane, Vaiter, Samuel, Laguel, Yassine
Despite their promise, fair machine learning methods often yield Pareto-inefficient models, in which the performance of certain groups can be improved without degrading that of others. This issue arises frequently in traditional in-processing approaches such as fairness-through-regularization. In contrast, existing Pareto-efficient approaches are biased towards a certain perspective on fairness and fail to adapt to the broad range of fairness metrics studied in the literature. In this paper, we present BADR, a simple framework to recover the optimal Pareto-efficient model for any fairness metric. Our framework recovers its models through a Bilevel Adaptive Rescalarisation procedure. The lower level is a weighted empirical risk minimization task where the weights are a convex combination of the groups, while the upper level optimizes the chosen fairness objective. We equip our framework with two novel large-scale, single-loop algorithms, BADR-GD and BADR-SGD, and establish their convergence guarantees. We release badr, an open-source Python toolbox implementing our framework for a variety of learning tasks and fairness metrics. Finally, we conduct extensive numerical experiments demonstrating the advantages of BADR over existing Pareto-efficient approaches to fairness.
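The bilevel structure described in this abstract can be written out as a rough sketch. The notation is assumed for illustration and is not taken from the paper: K groups, per-group empirical risks R_k, model parameters \theta, group weights \lambda on the probability simplex \Delta_K, and a chosen fairness objective F.

```latex
\begin{aligned}
\text{(upper level)}\quad & \min_{\lambda \in \Delta_K} \; F\big(\theta^\star(\lambda)\big) \\
\text{(lower level)}\quad & \theta^\star(\lambda) \in \arg\min_{\theta} \; \sum_{k=1}^{K} \lambda_k \, R_k(\theta), \\
& \Delta_K = \Big\{ \lambda \in \mathbb{R}^K : \lambda_k \ge 0, \; \sum_{k=1}^{K} \lambda_k = 1 \Big\}.
\end{aligned}
```

Under this reading, the lower level is the weighted empirical risk minimization task, and the upper level searches over the weights for the one whose minimizer scores best on the fairness objective. The exact formulation used by BADR may differ.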
Online Continual Learning for Time Series: a Natural Score-driven Approach
Urettini, Edoardo, Atzeni, Daniele, Tsaknaki, Ioanna-Yvonni, Carta, Antonio
Online continual learning (OCL) methods adapt to changing environments without forgetting past knowledge. Similarly, online time series forecasting (OTSF) is a real-world problem where data evolve in time and success depends on both rapid adaptation and long-term memory. Indeed, time-varying and regime-switching forecasting models have been extensively studied, offering a strong justification for the use of OCL in these settings. Building on recent work that applies OCL to OTSF, this paper aims to strengthen the theoretical and practical connections between time series methods and OCL. First, we reframe neural network optimization as a parameter filtering problem, showing that natural gradient descent is a score-driven method and proving its information-theoretic optimality. Then, we show that using a Student's t likelihood in addition to natural gradient induces a bounded update, which improves robustness to outliers. Finally, we introduce Natural Score-driven Replay (NatSR), which combines our robust optimizer with a replay buffer and a dynamic scale heuristic that improves fast adaptation at regime drifts. Empirical results demonstrate that NatSR achieves stronger forecasting performance than more complex state-of-the-art methods.
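The bounded-update claim can be illustrated on a toy case. The sketch below is not NatSR and not the authors' optimizer; it assumes a scalar location parameter mu with a Student's t likelihood of fixed scale sigma and degrees of freedom nu, and all function names are hypothetical. It uses the standard score and Fisher information of the t location parameter.

```python
import math

def student_t_score(y, mu, sigma, nu):
    """Gradient of the Student's t log-density with respect to the location mu."""
    e = y - mu
    return (nu + 1.0) * e / (nu * sigma**2 + e**2)

def fisher_information(sigma, nu):
    """Fisher information of the location parameter of a Student's t density."""
    return (nu + 1.0) / ((nu + 3.0) * sigma**2)

def natural_score_step(y, mu, sigma=1.0, nu=5.0, lr=0.1):
    """One natural-gradient (score-driven) update of mu: score scaled by inverse Fisher information."""
    return mu + lr * student_t_score(y, mu, sigma, nu) / fisher_information(sigma, nu)

def max_step(sigma=1.0, nu=5.0, lr=0.1):
    """Largest possible |update|, attained when |y - mu| = sqrt(nu) * sigma."""
    return lr * (nu + 3.0) * sigma / (2.0 * math.sqrt(nu))

if __name__ == "__main__":
    mu = 0.0
    for y in [0.5, 2.0, 50.0, 1e6]:  # the last two act as outliers
        print(y, natural_score_step(y, mu) - mu, "bound:", max_step())
```

In this toy model the update grows with the residual only up to a point and then shrinks toward zero, so a single extreme observation cannot move the parameter by more than a fixed amount. A Gaussian likelihood would instead give an update proportional to the residual, with no such bound.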
She Was Given Up by Her Chinese Parents--and Spent 14 Years Trying to Find a Way Back
More and more Chinese adoptees in the US are trying to reunite with their birth parents. For Youxue, it took more than a decade, and a remarkable coincidence. A girl is found on a street in Ma'Anshan, China, in May 1993. Her paternal grandfather, the story goes, set her down and walked away. It's unclear how long she's been outside when somebody arrives and takes her to the orphanage. A white woman adopts the girl and brings her to America in August 1994. She gives her an English name. In spring 2010, when Youxue (her Chinese name) was a high school sophomore in Dallas, Texas, she decided to start searching for her birth parents.
Thousands of Companies Are Driving China's AI Boom. A Government Registry Tracks Them All
How the Cyberspace Administration of China inadvertently made a guide to the country's homegrown AI revolution. When DeepSeek burst onto the global stage in January 2025, it seemed to appear out of nowhere. But the large language model was just one of the thousands of generative AI tools that have been released in China since 2023--and there's a public archive of every single one of them.