
How Japan built the world's first 3D-printed train station in 6 hours

Mashable



PUZZLES: A Benchmark for Neural Algorithmic Reasoning

Neural Information Processing Systems

Algorithmic reasoning is a fundamental cognitive ability that plays a pivotal role in problem-solving and decision-making processes. Reinforcement Learning (RL) has demonstrated remarkable proficiency in tasks such as motor control, handling perceptual input, and managing stochastic environments. These advancements have been enabled in part by the availability of benchmarks. In this work, we introduce PUZZLES, a benchmark based on Simon Tatham's Portable Puzzle Collection, aimed at fostering progress in algorithmic and logical reasoning in RL. PUZZLES contains 40 diverse logic puzzles of adjustable sizes and varying levels of complexity; many puzzles also feature a rich set of additional configuration parameters.
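
A minimal sketch of what training against such a benchmark might look like, assuming a Gymnasium-style interface. The environment id "puzzles/Sudoku" and the size/difficulty keyword arguments are illustrative placeholders standing in for the adjustable sizes and configuration parameters the abstract mentions, not the benchmark's documented API.

```python
# Hypothetical usage sketch for a PUZZLES-style RL environment via the
# standard Gymnasium API. The environment id and kwargs are assumptions.
import gymnasium as gym

env = gym.make("puzzles/Sudoku", size=9, difficulty="hard")  # assumed id/kwargs
obs, info = env.reset(seed=0)

done = False
while not done:
    action = env.action_space.sample()  # stand-in for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```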


Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image

Neural Information Processing Systems

In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial understanding, due to the difficulty of modeling 3D spatial features. In this work, we model SI2T and ST2I together under a dual learning framework. Within this framework, we propose to represent the 3D spatial scene features with a novel 3D scene graph (3DSG) representation that can be shared across and benefit both tasks.
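
A minimal sketch of the dual-learning idea under stated assumptions: a shared scene-graph encoder produces the 3DSG features consumed by both an SI2T head and an ST2I head, and the two task losses are optimized jointly. All module names and signatures here are hypothetical, not the paper's architecture.

```python
# Illustrative dual-learning skeleton: one shared 3DSG encoder, two task heads.
import torch.nn as nn

class DualSpatialModel(nn.Module):
    def __init__(self, sg_encoder: nn.Module, si2t_head: nn.Module, st2i_head: nn.Module):
        super().__init__()
        self.sg_encoder = sg_encoder  # produces the shared 3DSG features
        self.si2t_head = si2t_head    # scene features + image -> text loss
        self.st2i_head = st2i_head    # scene features + text -> image loss

    def forward(self, image, text):
        sg = self.sg_encoder(image, text)                 # shared 3D scene-graph features
        text_loss = self.si2t_head(image, sg, target=text)
        image_loss = self.st2i_head(text, sg, target=image)
        return text_loss + image_loss                     # joint dual objective
```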



Improving CLIP Training with Language Rewrites (Lijie Fan, Phillip Isola, Dina Katabi)

Neural Information Processing Systems

Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective and scalable methods for training transferable vision models using paired image and text data. CLIP models are trained using contrastive loss, which typically relies on data augmentations to prevent overfitting and shortcuts. However, in the CLIP training paradigm, data augmentations are exclusively applied to image inputs, while language inputs remain unchanged throughout the entire training process, limiting the exposure of diverse texts to the same image. In this paper, we introduce Language augmented CLIP (LaCLIP), a simple yet highly effective approach to enhance CLIP training through language rewrites. Leveraging the in-context learning capability of large language models, we rewrite the text descriptions associated with each image. These rewritten texts exhibit diversity in sentence structure and vocabulary while preserving the original key concepts and meanings. During training, LaCLIP randomly selects either the original texts or the rewritten versions as text augmentations for each image. Extensive experiments on CC3M, CC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with language rewrites significantly improves the transfer performance without computation or memory overhead during training.
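
The augmentation step described above reduces to a simple per-image sampling choice between the original caption and its LLM-generated rewrites. The sketch below illustrates that choice with placeholder data; the example captions are invented, not from the paper.

```python
# Sketch of LaCLIP's text augmentation: for each image, sample the training
# caption uniformly from the original text plus its rewrites.
import random

def sample_caption(original: str, rewrites: list[str], rng: random.Random) -> str:
    """Pick the training caption for one image: the original or one rewrite."""
    return rng.choice([original] + rewrites)

rng = random.Random(42)
caption = sample_caption(
    "a dog catching a frisbee in the park",
    ["a park scene where a dog leaps for a frisbee",
     "a frisbee-catching dog outdoors"],
    rng,
)
# `caption` is then paired with the image in the standard CLIP contrastive loss,
# so text augmentation adds no compute or memory overhead at training time.
```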


Supplementary Material: A Comprehensive Benchmark for Neural Human Radiance Fields

Neural Information Processing Systems

Due to space constraints in the main paper, we elaborate on the following in the supplementary material, including additional ablation studies of different settings. For multi-view video sequences, consecutive frames may exhibit severe information redundancy that is detrimental to reconstructing human geometry and appearance. We therefore fix the number of training frames and training views, and set the frame sampling interval to 1, 5, and 10. The experimental results are shown in Tab. 1. In general, the performance of the selected representative scene-specific methods increases with a larger sampling interval.
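
A small illustrative helper for this fixed-interval sampling, assuming frames are indexed 0..N-1; the function name and signature are our own, not from the benchmark's code.

```python
# Keep a fixed number of training frames while spacing them `interval` frames
# apart, reducing redundancy between consecutive frames.
def sample_frames(num_total: int, num_train: int, interval: int) -> list[int]:
    """Return `num_train` frame indices spaced `interval` apart."""
    indices = list(range(0, num_total, interval))[:num_train]
    if len(indices) < num_train:
        raise ValueError("sequence too short for this interval/frame count")
    return indices

print(sample_frames(num_total=300, num_train=30, interval=10))
# -> [0, 10, 20, ..., 290]
```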



Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Neural Information Processing Systems

Web-scale visual entity recognition, the task of associating images with their corresponding entities within vast knowledge bases like Wikipedia, presents significant challenges due to the lack of clean, large-scale training data. In this paper, we propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Instead of relying on the multimodal LLM to directly annotate data, which we found to be suboptimal, we prompt it to reason about potential candidate entity labels by accessing additional contextually relevant information (such as Wikipedia), resulting in more accurate annotations. We further use the multimodal LLM to enrich the dataset by generating question-answer pairs and a grounded fine-grained textual description (referred to as "rationale") that explains the connection between images and their assigned entities. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
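
A hypothetical sketch of the label-verification step this pipeline might use: the multimodal LLM is asked to reason about a candidate entity given retrieved context rather than to annotate from scratch. `call_multimodal_llm` is a stand-in for whatever model API is used; neither it nor the prompt wording is from the paper.

```python
# Hedged sketch: verify a candidate entity label with retrieved Wikipedia
# context, asking the model for a yes/no answer plus a rationale.
def build_verification_prompt(candidate_entity: str, wiki_summary: str) -> str:
    return (
        f"Candidate entity: {candidate_entity}\n"
        f"Context from Wikipedia: {wiki_summary}\n"
        "Given the attached image, does the candidate entity correctly "
        "describe its main subject? Answer yes/no and explain briefly "
        "(the explanation serves as the rationale)."
    )

def verify_label(image_bytes: bytes, candidate_entity: str, wiki_summary: str) -> str:
    prompt = build_verification_prompt(candidate_entity, wiki_summary)
    return call_multimodal_llm(image=image_bytes, prompt=prompt)  # assumed helper
```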


Deep Belief Markov Models for POMDP Inference

arXiv.org Artificial Intelligence

This work introduces a novel deep learning-based architecture, termed the Deep Belief Markov Model (DBMM), which provides efficient, model-formulation agnostic inference in Partially Observable Markov Decision Process (POMDP) problems. The POMDP framework allows for modeling and solving sequential decision-making problems under observation uncertainty. In complex, high-dimensional, partially observable environments, existing methods for inference based on exact computations (e.g., via Bayes' theorem) or sampling algorithms do not scale well. Furthermore, ground-truth states may not be available for learning the exact transition dynamics. DBMMs extend deep Markov models into the partially observable decision-making framework and allow efficient belief inference entirely based on available observation data via variational inference methods. By leveraging the potency of neural networks, DBMMs can infer and simulate non-linear relationships in the system dynamics and naturally scale to problems with high dimensionality and discrete or continuous variables. In addition, neural network parameters can be updated dynamically and efficiently as data become available. DBMMs can thus be used to infer a belief variable, enabling the derivation of POMDP solutions over the belief space. We evaluate the efficacy of the proposed methodology by assessing the capability of DBMMs to perform model-formulation agnostic inference in benchmark problems that include discrete and continuous variables.
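
A minimal sketch of a DBMM-style belief filter, assuming a Gaussian belief over the latent state: a network maps the previous belief parameters, the action taken, and the new observation to the parameters of the updated belief. The architecture and dimensions are illustrative, not the paper's.

```python
# Illustrative neural belief update for a POMDP: (belief, action, obs) -> belief.
import torch
import torch.nn as nn

class BeliefFilter(nn.Module):
    def __init__(self, belief_dim: int, action_dim: int, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * belief_dim + action_dim + obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * belief_dim),  # outputs mean and log-variance
        )

    def forward(self, mean, logvar, action, obs):
        x = torch.cat([mean, logvar, action, obs], dim=-1)
        new_mean, new_logvar = self.net(x).chunk(2, dim=-1)
        return new_mean, new_logvar  # parameters of the updated Gaussian belief

f = BeliefFilter(belief_dim=4, action_dim=2, obs_dim=3)
mean, logvar = f(torch.zeros(1, 4), torch.zeros(1, 4), torch.zeros(1, 2), torch.zeros(1, 3))
```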


Netflix's Most Expensive Movie Ever Is Here, and It's a Monumental Disaster

Slate

When he got his first glimpse of a movie studio, Orson Welles excitedly proclaimed it "the biggest electric train set any boy ever had." But with a reported budget of more than $300 million, Joe and Anthony Russo's The Electric State makes Welles' train set look like a busted caboose. The most expensive movie in Netflix's history, it's also among the costliest of all time, joining a list that includes the brothers' own Avengers: Infinity War and Avengers: Endgame. If the Russos are the most profligate creators in history--their Amazon series Citadel is also one of the most expensive TV shows ever made--they're among the most successful too. And yet for all the money they're making, and all that they're allowed to spend, they don't seem to be enjoying themselves very much.