proposal
i) Training phaseii) Evaluation phase WSCMR Query-video pair WSCMRTest-Trivial Novel-Words Novel-Composition VMRFine-grainedtimestamps Query-video pair VMRTest-Trivial
With the exponential growth of video content, aiming at localizing relevant video moments based on natural language queries, video moment retrieval (VMR) has gained significant attention. Existing weakly supervised VMR methods focus on designing various feature modeling and modal interaction modules to alleviate the reliance on precise temporal annotations. However, these methods have poor generalization capabilities on compositional queries with novel syntactic structures or vocabulary in real-world scenarios. To this end, we propose a new task: weakly supervised compositional moment retrieval (WSCMR). This task trains models using only video-query pairs without precise temporal annotations, while enabling generalization to complex compositional queries.
MLR-Bench: Evaluating AIAgents on Open-Ended Machine Learning Research Hui Chen Miao Xiong Yujie Lu Wei Han Ailin Deng Yufei He Jiaying Wu Yibo Li
Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLMbased reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages, and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results--posing a major barrier to scientific reliability.
Object-Centric Concept-Bottlenecks
Developing high-performing, yet interpretable models remains a critical challenge in modern AI. Concept-based models (CBMs) attempt to address this by extracting human-understandable concepts from a global encoding (e.g., image encoding) and then applying a linear classifier on the resulting concept activations, enabling transparent decision-making. However, their reliance on holistic image encodings limits their expressiveness in object-centric real-world settings and thus hinders their ability to solve complex vision tasks beyond single-label classification. To tackle these challenges, we introduce Object-Centric Concept Bottlenecks (OCB), a framework that combines the strengths of CBMs and pre-trained object-centric foundation models, boosting performance and interpretability. We evaluate OCB on complex image datasets and conduct a comprehensive ablation study to analyze key components of the framework, such as strategies for aggregating object-concept encodings. The results show that OCB outperforms traditional CBMs and allows one to make interpretable decisions for complex visual tasks.
APrincipled Approach to Randomized Selection under Uncertainty: Applications to Peer Review and Grant Funding
Many decision-making processes involve evaluating and selecting items, including scientific peer review, job hiring, school admissions, and investment decisions. These domains feature error-prone evaluations and uncertainty about outcomes, which undermine deterministic selection rules. Consequently, randomized selection mechanisms are gaining traction. However, current randomized approaches are ad hoc and, as we prove, inappropriate for their purported objectives. We propose a principled framework for randomized decision-making based on interval estimates of item quality. We introduce MERIT (Maximin Efficient Randomized Interval Top-k), which maximizes the worst-case expected number of top candidates selected under uncertainty represented by overlapping intervals. MERIT provides optimal resource allocation under an interpretable robustness notion. We develop a polynomial-time, practically efficient algorithm and prove our approach satisfies desirable axiomatic properties not guaranteed by existing methods. Experiments on synthetic peer review data from grant funding and conferences demonstrate that MERIT matches existing algorithms' expected utility under fully probabilistic models while outperforming them under our worst-case formulation.
AI HardwareObject Detection ModelsEvaluate and ValidateAdversarial DigitalExamplesEVADE
"Caught in a landslide, no escape from reality" summarizes the state of the research in AI offense: an attack might work on paper but does not necessarily in practice. In the last 5 years, we have seen the rise of latency attacks against computer vision systems. Most of them targeted 2D object detection, especially its Non-MaxSuppression (NMS) block, via adversarial images. However, we uncovered that, when tested in realistic deployment settings, the NMS latency attacks, accepted to top conferences, have very limited negative effects. In this paper, we define an evaluation framework (EVADE) to assess the practicality of attacks, and apply it to state-of-the-art NMS latency attacks.
Matching Markets Meet LLMs: Algorithmic Reasoning with Ranked Preferences
The rise of Large Language Models (LLMs) has driven progress in reasoning tasks, from program synthesis to scientific hypothesis generation, yet their ability to handle ranked preferences and structured algorithms in combinatorial domains remains underexplored. We study matching markets, a core framework behind applications like resource allocation and ride-sharing, which require reconciling individual ranked preferences to ensure stable outcomes. We evaluate seven stateof-the-art models on a hierarchy of preference-based reasoning tasks--ranging from stable-matching generation to instability detection, instability resolution, and finegrained preference queries--to systematically expose their logical and algorithmic limitations in handling ranked inputs. Surprisingly, even top-performing models with advanced reasoning struggle to resolve instability in large markets, often failing to identify blocking pairs or execute algorithms iteratively. We further show that parameter-efficient fine-tuning (LoRA) significantly improves performance in small markets, but fails to bring about a similar improvement in large instances, suggesting the need for more sophisticated strategies to improve LLMs' reasoning with larger-context inputs.
The FCC Wants to Kill Burner Phones
After WIRED reported last week that Meta's smart glasses app contained code that would enable the company to activate face-recognition features on the devices, the company removed the code this week without commenting on why or whether it plans to add such functionality back into the app later. Another WIRED investigation this week found that xAI's Grok is still hosting sexualized deepfakes, including "nudified" images and videos, of celebrities and at least one prominent US politician. After limiting the release of its new Mythos-class AI model over concerns about its potential impacts on cybersecurity, Anthropic announced a model upgrade for partners in its limited-access group this week and launched a "safe" version of the model to the public with guardrails meant to keep the system from being used to fuel cyberattacks. Meanwhile, the United States Cybersecurity and Infrastructure Security Agency issued a new directive to federal agencies this week in reaction to new AI threats that includes a requirement to fix the most urgent software vulnerabilities in as little as three days. As Europe looks to separate and insulate itself from US Big Tech, WIRED created a timeline that tracks all the ways EU governments, companies, and other organizations are moving away from US tech.
Reverse Diffusion Sequential Monte Carlo Samplers
We propose a novel sequential Monte Carlo (SMC) method for sampling from unnormalized target distributions based on a reverse denoising diffusion process. While recent diffusion-based samplers simulate the reverse diffusion using approximate score functions, they can suffer from accumulating errors due to time discretization and imperfect score estimation. In this work, we introduce a principled SMC framework that formalizes diffusion-based samplers as proposals while systematically correcting for their biases. The core idea is to construct informative intermediate target distributions that progressively steer the sampling trajectory toward the final target distribution. Although ideal intermediate targets are intractable, we develop \emph{exact approximations} using quantities from the score estimation-based proposal, without requiring additional model training or inference overhead. The resulting sampler, termed \textit{\ourmethodfull}, enables consistent sampling and unbiased estimation of the target's normalization constant under mild conditions. We demonstrate the effectiveness of our method on a range of synthetic targets and real-world Bayesian inference problems.