
Competitive Programming with Large Reasoning Models

OpenAI: Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, Jerry Tworek, Lorenz Kuhn, Lukasz Kaiser, Mark Chen, Max Schwarzer, Mostafa Rohaninejad, Nat McAleese, o3 contributors, Oleg Mürk, Rhythm Garg, Rui Shu, Szymon Sidor, Vineet Kosaraju, Wenda Zhou

arXiv.org Artificial Intelligence

We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. Additionally, we compare two general-purpose reasoning models - OpenAI o1 and an early checkpoint of o3 - with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a Codeforces rating on par with elite human competitors. Overall, these results indicate that scaling general-purpose reinforcement learning, rather than relying on domain-specific techniques, offers a robust path toward state-of-the-art AI in reasoning domains, such as competitive programming.


Minimax Value Interval for Off-Policy Evaluation and Policy Optimization

Neural Information Processing Systems

We study minimax methods for off-policy evaluation (OPE) using value functions and marginalized importance weights. Although these methods hold the promise of overcoming the exponential variance of traditional importance sampling, several key problems remain: (1) they require function approximation and are generally biased. For the sake of trustworthy OPE, is there any way to quantify the biases? In this paper we answer both questions positively. By slightly altering the derivation of previous methods (one from each style), we unify them into a single value interval that comes with a special type of double robustness: when either the value-function or the importance-weight class is well specified, the interval is valid and its length quantifies the misspecification of the other class.
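The double robustness described in the abstract can be made concrete with a small sketch. The setup below is my own illustration, not code from the paper: a two-state Markov chain evaluated with exact expectations, and a Lagrangian of the standard minimax-OPE form, L(v, w) = (1-γ)·E[v(s₀)] + E[w(s)·(Bellman residual at s)]. The key property is that when the value function v is correct, the residual term vanishes and L(v, w) recovers the (normalized) policy value for any weight function w, no matter how badly the weight class is specified.

```python
# Hedged, illustrative sketch (names and MDP are assumptions, not the paper's):
# the Lagrangian behind minimax OPE, on a 2-state chain with exact expectations.

GAMMA = 0.9
P = [[0.9, 0.1], [0.2, 0.8]]    # transition probabilities P[s][s']
R = [1.0, 0.0]                  # expected reward per state
D0 = [1.0, 0.0]                 # initial-state distribution

def solve_v(p, r, gamma):
    """Solve v = r + gamma * P v in closed form for two states."""
    a = 1 - gamma * p[0][0]; b = -gamma * p[0][1]
    c = -gamma * p[1][0];    d = 1 - gamma * p[1][1]
    det = a * d - b * c
    return [(d * r[0] - b * r[1]) / det, (a * r[1] - c * r[0]) / det]

def lagrangian(v, w, d):
    """L(v, w): normalized initial value plus w-weighted Bellman residuals
    under a data distribution d over states."""
    init = (1 - GAMMA) * sum(D0[s] * v[s] for s in range(2))
    resid = sum(
        d[s] * w[s] * (R[s] + GAMMA * sum(P[s][t] * v[t] for t in range(2)) - v[s])
        for s in range(2)
    )
    return init + resid

v_true = solve_v(P, R, GAMMA)
j_true = (1 - GAMMA) * sum(D0[s] * v_true[s] for s in range(2))

# With the true value function, the weight class can be arbitrarily wrong:
# every choice of w still yields the correct policy value.
for w in ([1.0, 1.0], [5.0, -3.0], [0.0, 100.0]):
    assert abs(lagrangian(v_true, w, [0.5, 0.5]) - j_true) < 1e-9
```

The symmetric case (correct weights, misspecified values) holds analogously, which is why the paper can turn the two one-sided guarantees into a single valid interval.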



UNIFY: a Unified Policy Designing Framework for Solving Constrained Optimization Problems with Machine Learning

Silvestri, Mattia, De Filippo, Allegra, Lombardi, Michele, Milano, Michela

arXiv.org Artificial Intelligence

Methods for combining Machine Learning (ML) and Constrained Optimization (CO) for decision support have attracted considerable interest in recent years. This is motivated by the possibility of tackling complex decision-making problems subject to uncertainty (sometimes over multiple stages), and having a partially specified structure where knowledge is available both in explicit form (cost function, constraints) and implicit form (historical data or simulators). As a practical example, an Energy Management System (EMS) needs to allocate minimum-cost power flows from different Distributed Energy Resources (DERs) [1]. Based on actual energy prices, and forecasts of DER availability and consumption, the EMS decides which power generators should be used and whether the surplus should be stored or sold to the market. Such a problem involves hard constraints (maintaining power balance, power flow limits), a clear cost structure, elements of uncertainty that are partially known via historical data, and multiple decision stages likely subject to execution-time restrictions. In this type of use case, pure CO methods struggle with robustness and scalability, while pure ML methods such as Reinforcement Learning (RL) have trouble dealing with hard constraints and combinatorial decision spaces. Motivated by the opportunity to obtain improvements via a combination of ML and CO, multiple lines of research have emerged, such as Decision Focused Learning, Constrained Reinforcement Learning, and Algorithm Configuration. While existing methods have achieved a good measure of success, to the best of the authors' knowledge no existing method can deal with all the challenges we have identified. Ideally, one wishes to obtain a solution policy capable of providing feasible (and high-quality) solutions, handling robustness, taking advantage of existing data, and carrying a reasonable computational load.


ep.8: New Voices in AI: philosophy, cognitive science and AI, with Dimitri Coelho Mollo

AIHub

Coelho Mollo: Yeah, exactly. What we do is the benchmark for intelligence: if other animals and AI systems don't, you know, meet our capacities, then they are not intelligent. And if you think about the kinds of capacities that until recently a lot of AI was interested in, it was the kinds of things that we humans tend to think are markers of intelligence. Playing Go, playing chess, proving mathematical theorems and stuff like that, right? While all the things that we tend to think are not intelligence, like just moving around, being able to look here, picking up things, or opening doors and so on and so forth, we take as not being intelligent.


Why Unify?

#artificialintelligence

"What is the point of unifying all ML frameworks?" you may ask. You may be perfectly happy with the framework you currently use, and that's great! We live in a time where great ML tools are in abundance, and that's a wonderful thing! We'll give two clear examples of how Ivy can streamline your ML workflow and save you weeks of development time. Let's say DeepMind release an awesome paper in JAX, and you'd love to try it out using your own framework of choice.
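The scenario above, running a JAX codebase from your own framework of choice, rests on the core idea of a unifying layer: a single shared API whose functions dispatch to whichever backend is active. The sketch below is purely illustrative (Ivy's real API and its transpilation machinery are far more involved, and the backends here are toy stand-ins, not real frameworks), but it shows the dispatch pattern that makes "write once, run on any backend" possible.

```python
# Illustrative sketch only; not Ivy's actual API. A unified function looks up
# the active backend's implementation, so the same user code runs unchanged
# under whichever backend is selected.
import functools

_BACKENDS = {
    # toy stand-ins for real frameworks; each supplies its own `mean`
    "loops": {"mean": lambda xs: sum(xs) / len(xs)},
    "folds": {"mean": lambda xs: functools.reduce(lambda a, b: a + b, xs) / len(xs)},
}
_active = "loops"

def set_backend(name):
    """Select which framework backs the unified API."""
    global _active
    if name not in _BACKENDS:
        raise ValueError(f"unknown backend: {name}")
    _active = name

def mean(xs):
    """Unified entry point: delegates to the active backend."""
    return _BACKENDS[_active]["mean"](xs)

# The same user code gives the same answer under either backend:
set_backend("loops"); a = mean([1.0, 2.0, 3.0])
set_backend("folds"); b = mean([1.0, 2.0, 3.0])
assert a == b == 2.0
```

In a real unifying layer each backend entry would wrap an actual framework (JAX, PyTorch, TensorFlow, ...), and the hard work lies in keeping semantics, dtypes, and gradients consistent across them.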


MLOps Interview Questions?

#artificialintelligence

Machine Learning Operations (MLOps) is an emerging domain within the broader AI/DS/ML space that addresses operationalizing ML models.