Jiayu Wang, Yifei Ming

Neural Information Processing Systems

Large language models (LLMs) and vision-language models (VLMs) have demonstrated remarkable performance across a wide range of tasks and domains. Despite this promise, spatial understanding and reasoning, a fundamental component of human cognition, remain under-explored. We propose SpatialEval, a novel benchmark that covers diverse aspects of spatial reasoning such as relationship understanding, navigation, and counting. We conduct a comprehensive evaluation of competitive language and vision-language models. Our findings reveal several counter-intuitive insights that have been overlooked in the literature: (1) spatial reasoning poses significant challenges, and competitive models can fall behind random guessing; (2) despite additional visual input, VLMs often underperform compared with their LLM counterparts; (3) when both textual and visual information is available, multimodal language models become less reliant on visual information if sufficient textual clues are provided. Additionally, we demonstrate that leveraging redundancy between vision and text can significantly enhance model performance. We hope our study will inform the development of multimodal models to improve spatial intelligence and further close the gap with human intelligence.
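The three evaluation conditions the abstract contrasts (text-only input for LLMs, vision-only and vision-plus-text input for VLMs) can be expressed as a small harness. A minimal sketch follows; `query_model`, `Example`, and the prompt format are illustrative assumptions, not the SpatialEval API.

```python
# Sketch of a modality-ablation harness: the same question is posed with
# text-only, vision-only, and combined inputs, so LLM vs. VLM accuracy can
# be compared directly. `query_model` is a hypothetical stub.
from dataclasses import dataclass

@dataclass
class Example:
    question: str          # e.g. "Which object is left of the chair?"
    text_scene: str        # textual description of the scene
    image_path: str        # rendered image of the same scene
    answer: str

def query_model(model, prompt: str, image_path: str | None = None) -> str:
    """Hypothetical wrapper around an LLM/VLM chat endpoint."""
    raise NotImplementedError

def evaluate(model, examples, mode: str) -> float:
    correct = 0
    for ex in examples:
        if mode == "text_only":          # LLM setting: textual scene only
            pred = query_model(model, ex.text_scene + "\n" + ex.question)
        elif mode == "vision_only":      # VLM setting: image only
            pred = query_model(model, ex.question, image_path=ex.image_path)
        else:                            # "vision_text": redundant inputs
            pred = query_model(model, ex.text_scene + "\n" + ex.question,
                               image_path=ex.image_path)
        correct += (pred.strip().lower() == ex.answer.lower())
    return correct / len(examples)
```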


Structural Inference of Dynamical Systems with Conjoined State Space Models

Jun Pang

Neural Information Processing Systems

This paper introduces SICSM, a novel structural inference framework that integrates Selective State Space Models (selective SSMs) with Generative Flow Networks (GFNs) to handle the challenges posed by dynamical systems with irregularly sampled trajectories and partial observations. By utilizing the robust temporal modeling capabilities of selective SSMs, our approach learns input-dependent transition functions that adapt to non-uniform time intervals, thereby enhancing the accuracy of structural inference. By aggregating dynamics across diverse temporal dependencies and channeling them into the GFN, SICSM adeptly approximates the posterior distribution of the system's structure. This process not only enables precise inference of complex interactions within partially observed systems but also allows seamless integration of prior knowledge, enhancing the model's accuracy and robustness. Extensive evaluations on sixteen diverse datasets demonstrate that SICSM outperforms existing methods, particularly in scenarios characterized by irregular sampling and incomplete observations, highlighting its potential as a reliable tool for scientific discovery and system diagnostics in disciplines that demand precise modeling of complex interactions.
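To make the irregular-sampling point concrete, here is a minimal NumPy sketch of a selective SSM whose discretization step depends on both the current input and the actual gap between observation times. The diagonal dynamics, softplus step size, and zero-order-hold update are standard SSM ingredients used for illustration, not the authors' SICSM code.

```python
# Toy selective SSM over irregularly sampled observations: the transition
# depends on the input (selectivity) and on the true time gap, so
# non-uniform intervals change the dynamics instead of being ignored.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state = 4, 8

A = -np.exp(rng.normal(size=d_state))          # stable diagonal dynamics
W_B = rng.normal(size=(d_in, d_state)) * 0.1   # input-dependent B (selective)
W_C = rng.normal(size=(d_in, d_state)) * 0.1   # input-dependent C
W_dt = rng.normal(size=d_in) * 0.1             # input-dependent step size

def selective_ssm(xs, ts):
    """xs: (T, d_in) observations at irregular times ts: (T,)."""
    h = np.zeros(d_state)
    ys, t_prev = [], ts[0]
    for x, t in zip(xs, ts):
        gap = max(t - t_prev, 1e-3)             # non-uniform interval
        dt = np.log1p(np.exp(x @ W_dt)) * gap   # softplus, scaled by the gap
        decay = np.exp(A * dt)                  # zero-order-hold discretization
        h = decay * h + (decay - 1.0) / A * (x @ W_B)
        ys.append(float((x @ W_C) @ h))
        t_prev = t
    return np.array(ys)

ts = np.cumsum(rng.exponential(size=20))        # irregular sample times
xs = rng.normal(size=(20, d_in))
print(selective_ssm(xs, ts).shape)              # (20,)
```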


MambaTree: Tree Topology is All You Need in State Space Model

Neural Information Processing Systems

State space models, which employ recursively propagated features, demonstrate strong representation capabilities comparable to Transformer models together with superior efficiency. However, constrained by the inherent geometry of sequences, they still fall short in modeling long-range dependencies. To address this issue, we propose the MambaTree network, which first dynamically generates a tree topology based on spatial relationships and input features. Feature propagation is then performed on this tree, breaking the original sequence constraints to achieve stronger representation capabilities. Additionally, we introduce a dynamic programming algorithm of linear complexity to enhance long-range interactions without increasing computational cost. MambaTree is a versatile multimodal framework that can be applied to both visual and textual tasks. Extensive experiments demonstrate that our method significantly outperforms existing structured state space models on image classification, object detection, and segmentation. Moreover, by fine-tuning large language models, our approach achieves consistent improvements on multiple textual tasks at minor training cost.
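A hedged sketch of the two stages the abstract names: generating a tree topology from spatial relationships and input features (here, a minimum spanning tree over a 4-connected grid weighted by feature dissimilarity), then propagating features along the tree instead of the raster-scan order. This is an illustrative reconstruction, not the MambaTree implementation or its linear-time dynamic program.

```python
# Build a tree topology from feature affinities, then propagate features
# along tree edges with distance-based decay (a simplification of the
# paper's propagation; assumptions throughout).
import numpy as np

def tree_topology(feat):                           # feat: (H, W, C)
    H, W, _ = feat.shape
    idx = lambda r, c: r * W + c
    edges = []
    for r in range(H):
        for c in range(W):
            for dr, dc in ((0, 1), (1, 0)):        # 4-connected grid
                r2, c2 = r + dr, c + dc
                if r2 < H and c2 < W:
                    w = np.linalg.norm(feat[r, c] - feat[r2, c2])
                    edges.append((w, idx(r, c), idx(r2, c2)))
    parent = list(range(H * W))                    # Kruskal with union-find
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = [[] for _ in range(H * W)]
    for w, u, v in sorted(edges):                  # lightest edges first -> MST
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree[u].append((v, w))
            tree[v].append((u, w))
    return tree

def propagate(feat, tree, root=0, alpha=1.0):
    """Aggregate features down the tree, decayed by edge weight."""
    H, W, C = feat.shape
    flat = feat.reshape(-1, C).copy()
    seen, stack = {root}, [root]
    while stack:
        u = stack.pop()
        for v, w in tree[u]:
            if v not in seen:
                seen.add(v)
                flat[v] += np.exp(-alpha * w) * flat[u]
                stack.append(v)
    return flat.reshape(H, W, C)

feat = np.random.rand(8, 8, 16).astype(np.float32)
out = propagate(feat, tree_topology(feat))         # (8, 8, 16)
```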


DePLM: Denoising Protein Language Models for Property Optimization

Ming Qin

Neural Information Processing Systems

Protein optimization is a fundamental biological task aimed at enhancing the performance of proteins by modifying their sequences. Computational methods primarily rely on evolutionary information (EI) encoded by protein language models (PLMs) to predict the fitness landscape for optimization. However, these methods suffer from a few limitations.


On scalable oversight with weak LLMs judging strong LLMs

Neural Information Processing Systems

Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and compare both to a baseline of direct question answering, where the judge simply answers outright without an AI. We use large language models (LLMs) both as AI agents and as stand-ins for human judges, taking the judge models to be weaker than the agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry to also include mathematics, coding, logic, and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct or incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry, debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed. Previous work assigned debaters and consultants an answer to argue for. When we instead allow them to choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy. Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.
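The two protocols being compared reduce to simple interaction loops between strong agent models and a weaker judge model. The sketch below is schematic: `llm` is a hypothetical completion stub, and the model names, prompts, and round counts are placeholders rather than the paper's setup.

```python
# Schematic debate vs. consultancy loops with a weak judge and strong agents.
def llm(model: str, prompt: str) -> str:
    """Hypothetical call into a hosted LLM; replace with a real client."""
    raise NotImplementedError

def debate(question, answers, judge="weak-model", agent="strong-model", rounds=3):
    transcript = []
    for _ in range(rounds):
        for side, ans in zip(("A", "B"), answers):   # two AIs, opposing answers
            arg = llm(agent, f"Argue that '{ans}' answers: {question}\n"
                             f"Transcript so far: {transcript}")
            transcript.append((side, arg))
    return llm(judge, f"Question: {question}\nDebate: {transcript}\n"
                      "Which answer, A or B, is better supported?")

def consultancy(question, assigned_answer, judge="weak-model",
                agent="strong-model", turns=3):
    transcript = []
    for _ in range(turns):                           # judge interrogates one AI
        q = llm(judge, f"Ask a probing question about: {question}\n{transcript}")
        a = llm(agent, f"Defend '{assigned_answer}'. Judge asks: {q}")
        transcript.append((q, a))
    return llm(judge, f"{question}\n{transcript}\n"
                      "Is the consultant's answer correct?")
```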


abb4847bbd60f38b1b7649d26c7a0067-Supplemental-Conference.pdf

Neural Information Processing Systems

While existing multimodal learning methods need modality-complete data to learn cross-modal correspondences, our method achieves more effective cross-modal fusion for unseen modality combinations. In Table 4 of the main paper, we summarized the multimedia retrieval results with the Mean Rank (MnR) averaged between video-to-text and text-to-video. Here, we provide the full comparison with all metrics for both text-to-video and video-to-text retrieval in Table 5. Because recent multimodal learning methods need modality-complete data for training, our model outperforms these approaches on all metrics by effectively accumulating information from any modality combination. With fewer learnable tokens, the projected features become less discriminative, since many features with different semantic meanings must match the same learnable token.
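The final observation, that too small a token budget forces semantically different features onto the same token, can be illustrated with a toy fusion module in which K learnable tokens cross-attend over projected modality features. Module and parameter names here are assumptions, not the paper's architecture.

```python
# Toy token-based fusion: K learnable tokens pool projected features from
# any available modalities; with small K, features with different semantics
# collide on the same token and discriminability drops.
import torch
import torch.nn as nn

class TokenFusion(nn.Module):
    def __init__(self, dim=256, num_tokens=8):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feats):                 # feats: (B, N, dim), any modalities
        B = feats.shape[0]
        q = self.tokens.unsqueeze(0).expand(B, -1, -1)
        fused, weights = self.attn(q, feats, feats)   # tokens attend to features
        return fused, weights                 # fused: (B, K, dim)

fusion = TokenFusion()
fused, w = fusion(torch.randn(2, 40, 256))    # 40 projected features, K = 8
```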


RL: Efficient Exploration for Nonepisodic RL

Neural Information Processing Systems

We study the problem of nonepisodic reinforcement learning (RL) for nonlinear dynamical systems, where the system dynamics are unknown and the RL agent has to learn from a single trajectory, i.e., adapt online and without resets. This setting is ubiquitous in the real world, where resetting is impossible or requires human intervention.


Ukraine under Russian missile, drone attacks for second night, 12 killed

Al Jazeera

Russia has targeted Ukraine for a second consecutive night with drones and missiles, killing at least 12 people as the two countries pursue a major prisoner swap. Ukraine's air force said on Sunday that Russian forces attacked Ukrainian regions with 298 drones and 69 missiles overnight, one of the largest aerial attacks of the war. "Most regions of Ukraine were affected by the hostile attack. Enemy air strikes were recorded in 22 areas, and downed cruise missiles and attack UAVs (drones) fell in 15 locations," the air force said on Telegram. Ukraine's security service reported that at least four people were killed and 16 were injured in the capital, Kyiv.


Neural Flow Diffusion Models: Learnable Forward Process for Improved Diffusion Modelling

Neural Information Processing Systems

Conventional diffusion models often rely on a fixed forward process, which implicitly defines complex marginal distributions over latent variables. This can complicate the reverse process's task of learning generative trajectories and results in costly inference. To address these limitations, we introduce Neural Flow Diffusion Models (NFDM), a novel framework that enhances diffusion models by supporting a broader range of forward processes beyond the standard linear Gaussian. We also propose a novel parameterization technique for learning the forward process. Our framework provides an end-to-end, simulation-free optimization objective, effectively minimizing a variational upper bound on the negative log-likelihood. Experimental results demonstrate NFDM's strong performance, evidenced by state-of-the-art likelihoods across a range of image generation tasks. Furthermore, we investigate NFDM's capacity for learning generative dynamics with specific characteristics, such as deterministic straight-line trajectories, and demonstrate how the framework can be adapted to learn bridges between two distributions. The results underscore NFDM's versatility and its potential for a wide range of applications.
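To illustrate what a learnable forward process means in practice, here is a toy PyTorch sketch in the spirit of NFDM: the corruption path q(z_t | x) is parameterized by a network with its endpoints anchored (z_0 near x, z_1 near N(0, I)), and a reparameterized sample lets gradients from a simple reconstruction loss, standing in for the variational bound, flow through both the forward and reverse networks. All architecture choices are placeholders, not the paper's parameterization.

```python
# Toy learnable forward process trained end-to-end with the denoiser.
import torch
import torch.nn as nn

dim = 2
fwd_net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))
denoiser = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

def forward_process(x, t):
    """Learnable q(z_t | x): endpoints anchored so z_0 ~ x and z_1 ~ N(0, I)."""
    residual = fwd_net(torch.cat([x, t], dim=-1))
    mu = (1 - t) * x + t * (1 - t) * residual    # network reshapes the path
    sigma = t + 1e-4                             # noise grows toward t = 1
    return mu + sigma * torch.randn_like(x)      # reparameterized sample

x = torch.randn(128, dim)                        # toy data batch
t = torch.rand(128, 1)
z_t = forward_process(x, t)
x_hat = denoiser(torch.cat([z_t, t], dim=-1))    # reverse model predicts x
loss = ((x_hat - x) ** 2).mean()                 # stand-in for the bound
loss.backward()                                  # gradients reach BOTH networks
```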


Hybrid Mamba for Few-Shot Segmentation

Neural Information Processing Systems

Many few-shot segmentation (FSS) methods use cross attention to fuse support foreground (FG) features into query features, despite its quadratic complexity. Mamba, a recent advance, can also capture intra-sequence dependencies well, yet with only linear complexity. Hence, we aim to devise a cross (attention-like) Mamba that captures inter-sequence dependencies for FSS. A simple idea is to scan support features, selectively compressing them into the hidden state, which is then used as the initial hidden state for sequentially scanning query features. Nevertheless, this suffers from (1) a support forgetting issue: query features are also gradually compressed during scanning, so the support information in the hidden state keeps shrinking and many query pixels cannot fuse sufficient support features; and (2) an intra-class gap issue: query FG is essentially more similar to itself than to support FG, i.e., the query may prefer to fuse its own features from the hidden state rather than the support features, yet the success of FSS relies on the effective use of support information. To tackle these issues, we design a hybrid Mamba network (HMNet), including (1) a support-recapped Mamba that periodically recaps the support features while scanning the query, so the hidden state always contains rich support information, and (2) a query-intercepted Mamba that forbids mutual interactions among query pixels, encouraging them to fuse more support features from the hidden state. Consequently, the support information is better utilized, leading to better performance. Extensive experiments on two public benchmarks show the superiority of HMNet. The code is available at https://github.com/Sam1224/HMNet.
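A toy sketch of the cross-Mamba scheme and the support-recap remedy described above: support features are first compressed into the hidden state, and that state is periodically refreshed while scanning the query so support information is not squeezed out. The linear recurrence below stands in for a real selective-SSM scan; nothing here is the HMNet code, and `recap_every` is an assumed hyperparameter.

```python
# Cross-scan with periodic support recap (toy stand-in for HMNet's idea).
import numpy as np

def scan(h, seq, decay=0.9):
    """Toy linear recurrence standing in for a selective-SSM scan."""
    for x in seq:
        h = decay * h + (1 - decay) * x     # compress the sequence into state
    return h

def cross_mamba(support, query, recap_every=16):
    d = support.shape[1]
    h = scan(np.zeros(d), support)          # init hidden state from support
    fused = []
    for i, q in enumerate(query):
        if i > 0 and i % recap_every == 0:  # support recap: refresh the state
            h = scan(h, support)
        h = 0.9 * h + 0.1 * q               # query pixel fuses support info
        fused.append(h)
    # The paper's query-intercepted variant would instead restart from the
    # support-only state for every query pixel, blocking query-query mixing.
    return np.stack(fused)

sup = np.random.randn(50, 8)
qry = np.random.randn(100, 8)
print(cross_mamba(sup, qry).shape)          # (100, 8)
```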