Industry
Caption This, Reason That: VLMs Caught in the Middle
Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. To understand the underlying limitations, we adopt methodologies from cognitive science, analyzing VLM performance along core cognitive axes: Perception, Attention, and Memory. Using a suite of tasks targeting these abilities, we evaluate state-of-the-art VLMs, including GPT-4o. Our analysis reveals distinct cognitive profiles: while advanced models approach ceiling performance on some tasks (e.g.
Streamer IShowSpeed Is Gen Z's ESPN
At 21, Speed has pushed the limits of streaming by transforming a distinctly solo format into a global group chat. His song for this year's World Cup is becoming the tournament's unofficial anthem. Streamer IShowSpeed is a huge soccer fan who plans to bring this year's World Cup to his millions of followers. In the days leading up to the 2026 World Cup, the streamer IShowSpeed--one of the most watched people on the planet, who occasionally moonlights as a rapper--released the music video "World Cup (Champions)," a song about flexing national pride where he mentions all 48 teams. As with everything the 21-year-old born Darren Watkins Jr. does, the video was instantly everywhere. The song racked up over 7 million views on YouTube in under 24 hours. The internet rushed to christen it the anthem of the tournament, even though the World Cup already has one. FIFA, following a ridiculous outpouring from fans and perhaps realizing the massive instant exposure he could bring, added the song to its official album.
All that structure matches does not glitter
Generative models for materials, especially inorganic crystals, hold potential to transform the theoretical prediction of novel compounds and structures. Advancement in this field depends critically on robust benchmarks and minimal, information-rich datasets that enable meaningful model evaluation. This paper critically examines common datasets and reported metrics for a crystal structure prediction task--generating the most likely structures given the chemical composition of a material. We focus on three key issues: First, materials datasets should contain unique crystal structures; for example, we show that the widely-utilized carbon-24 dataset only contains 40% unique structures. Second, materials datasets should not be split randomly if polymorphs of many different compositions are numerous, which we find to be the case for the perov-5 and MP-20 datasets.
Reinforcing Image Generation with Collaborative Semantic level and Token level CoT
Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generated CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. All the training code and data are available at https://github.com/CaraJ7/T2I-R1.
Generalized and Invariant Single-Neuron In-Vivo Activity Representation Learning
In neuroscience, models that learn representations of single-neuron in-vivo activity are essential for understanding the functional identities of individual neurons. The primary goal of these models--spanning Transformer-based, contrastive, and variational autoencoder frameworks--is not to predict neural activity, but to distill it into a stable, low-dimensional embedding that captures a neuron's intrinsic features. These learned identity embeddings should be invariant to changing experimental conditions while reflecting the neuron's molecular type and anatomical location, thus enabling downstream tasks like in-vivo cell type prediction. However, current models suffer from limited generalizability due to batch effects: non-biological variations arising from differences in experimental design, animal subjects, or recording platforms. These batch effects cause overfitting, reducing model robustness and utility.
KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge
To address these challenges, we introduce KnowMol-100K, a large-scale dataset with 100K fine-grained molecular annotations across multiple levels.
Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention
Despite their powerful capabilities, Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention or [CLS] attention to assess and discard redundant visual tokens. In this work, we identify a critical limitation of such attention-first pruning approaches, i.e., they tend to preserve semantically similar tokens, resulting in pronounced performance drops under high pruning ratios. To this end, we propose HoloV, a simple yet effective, plug-and-play visual token pruning framework for efficient inference.
Efficient Fairness-Performance Pareto Front Computation
There is a well-known intrinsic trade-off between the fairness of a representation and the performance of classifiers derived from the representation. In this paper we propose a new method to compute the optimal Pareto front of this trade-off. In contrast to the existing methods, this approach does not require the training of complex fair representation models. Our approach is derived through three main steps: We analyze fair representations theoretically, and derive several structural properties of optimal representations. We then show that these properties enable a reduction of the computation of the Pareto front to a compact discrete problem. Finally, we show that these compact approximating problems can be efficiently solved via off-the-shelf concave-convex programming methods.
'Pretty Crazy' Token Usage Is Testing Bosses' Bet on AI
A Silicon Valley software maker and an ecommerce company reveal to WIRED how they are navigating the emerging challenge of "tokenomics." At the software company 8x8, employees are using Anthropic's Claude to draft emails, analyze customer feedback, and write code, but so far, their growing reliance on the artificial intelligence chatbot hasn't troubled the finance team. While other Silicon Valley companies, such as Meta, Uber, and Salesforce, have publicly expressed concerns about the growing cost of generative AI tools and have begun introducing usage caps in some cases, 8x8 says it finds itself in the black. Over the past 18 months, the company estimates it has saved about $5 million in annual costs by canceling subscriptions to dozens of software and educational tools it deemed unnecessary in part because Claude could provide similar capabilities. So far, 8x8's annualized bill for Claude is "well below" that figure, says Joel Neeb, the company's chief transformation and business operations officer.