The New Yorker Film "I'm Not a Robot" Wins a 2025 Academy Award

The New Yorker

A film released by The New Yorker was among the winners at Sunday's Academy Awards. "I'm Not a Robot," a darkly comic portrayal of a woman trying to convince her computer that she is human, claimed the prize for Best Live Action Short. It is the second film released by the magazine to be honored with an Oscar. The film, written and directed by Victoria Warmerdam, opens with a seemingly typical office scene that quickly unravels. When the protagonist, a music producer, fails a series of CAPTCHA tests, she begins to question her own grip on reality.


Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Kim, Moo Jin, Finn, Chelsea, Liang, Percy

arXiv.org Artificial Intelligence

Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective to altogether improve inference efficiency, policy performance, and flexibility in the model's input-output specifications. We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26×. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and outperform other VLAs (π0 and RDT-1B) fine-tuned using their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT) by up to 15% (absolute) in average success rate. We release code for OFT and pretrained model checkpoints at https://openvla-oft.github.io/.
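
The heart of the OFT recipe (parallel decoding, action chunking, continuous actions, L1 regression) can be sketched in a few lines. The PyTorch snippet below is a minimal illustration under assumed dimensions; ChunkedActionHead and its pooled-embedding input are hypothetical stand-ins, not the released OpenVLA-OFT code.

```python
import torch
import torch.nn as nn

class ChunkedActionHead(nn.Module):
    """Maps a policy embedding to a chunk of continuous actions in parallel.

    Hypothetical stand-in for the action head in an OFT-style recipe:
    instead of autoregressively decoding discrete action tokens, the head
    emits all `chunk_len` continuous actions in a single forward pass.
    """

    def __init__(self, embed_dim: int, action_dim: int, chunk_len: int):
        super().__init__()
        self.chunk_len = chunk_len
        self.action_dim = action_dim
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 512),
            nn.ReLU(),
            nn.Linear(512, chunk_len * action_dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, embed_dim) pooled VLA features -> (batch, chunk_len, action_dim)
        return self.mlp(h).view(-1, self.chunk_len, self.action_dim)

# Simple L1 regression objective over the whole action chunk.
head = ChunkedActionHead(embed_dim=768, action_dim=7, chunk_len=8)
h = torch.randn(4, 768)        # placeholder for backbone features
target = torch.randn(4, 8, 7)  # placeholder ground-truth action chunk
loss = torch.abs(head(h) - target).mean()
loss.backward()
```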


Visual Question Answering From Another Perspective: CLEVR Mental Rotation Tests

Beckham, Christopher, Weiss, Martin, Golemo, Florian, Honari, Sina, Nowrouzezahrai, Derek, Pal, Christopher

arXiv.org Artificial Intelligence

Different types of mental rotation tests have been used extensively in psychology to understand human visual reasoning and perception. Understanding what an object or visual scene would look like from another viewpoint is a challenging problem that is made even harder if it must be performed from a single image. We explore a controlled setting whereby questions are posed about the properties of a scene as if that scene were observed from another viewpoint. To do this we have created a new version of the CLEVR dataset that we call CLEVR Mental Rotation Tests (CLEVR-MRT). Using CLEVR-MRT we examine standard methods, show how they fall short, and then explore novel neural architectures that infer volumetric representations of a scene. These volumes can be manipulated via camera-conditioned transformations to answer the question. We examine different model variants through rigorous ablations and demonstrate the efficacy of volumetric representations.
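
One way to picture a camera-conditioned transformation is resampling a 3D feature volume under a rotation. The PyTorch sketch below is a minimal illustration of that idea; rotate_volume and its rotation parameterization are assumptions, not the paper's architecture.

```python
import math
import torch
import torch.nn.functional as F

def rotate_volume(volume: torch.Tensor, rot: torch.Tensor) -> torch.Tensor:
    """Resample a feature volume under a camera rotation.

    Minimal sketch: `volume` is (B, C, D, H, W) and `rot` is a batch of
    3x3 rotation matrices. The paper's exact transform may differ.
    """
    B = volume.shape[0]
    # Build a (B, 3, 4) affine matrix with zero translation.
    theta = torch.cat([rot, torch.zeros(B, 3, 1)], dim=2)
    grid = F.affine_grid(theta, volume.shape, align_corners=False)
    return F.grid_sample(volume, grid, align_corners=False)

# Rotate a random 8-channel volume 90 degrees about the vertical axis.
c, s = math.cos(math.pi / 2), math.sin(math.pi / 2)
rot = torch.tensor([[[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]])
vol = torch.randn(1, 8, 16, 16, 16)
vol_rotated = rotate_volume(vol, rot)
```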


Adversarial Attacks are a Surprisingly Strong Baseline for Poisoning Few-Shot Meta-Learners

Oldewage, Elre T., Bronskill, John, Turner, Richard E.

arXiv.org Artificial Intelligence

This paper examines the robustness of deployed few-shot meta-learning systems when they are fed an imperceptibly perturbed few-shot dataset. We attack amortized meta-learners, which allows us to craft colluding sets of inputs that are tailored to fool the system's learning algorithm when used as training data. Jointly crafted adversarial inputs might be expected to synergistically manipulate a classifier, allowing for very strong data-poisoning attacks that would be hard to detect. We show that in a white-box setting, these attacks are very successful and can cause the target model's predictions to become worse than chance. However, in contrast to the well-known transferability of adversarial examples in general, the colluding sets do not transfer well to different classifiers. We explore two hypotheses to explain this: "overfitting" by the attack, and mismatch between the model on which the attack is generated and that to which the attack is transferred. Regardless of the mitigation strategies suggested by these hypotheses, the colluding inputs transfer no better than adversarial inputs that are generated independently in the usual way.
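
The white-box attack can be pictured as PGD-style optimization of a joint perturbation over the entire support set, ascending the loss on clean query points. The sketch below assumes a hypothetical meta_learner callable mapping (support set, query inputs) to query logits; it illustrates the colluding-poison idea rather than reproducing the paper's attack.

```python
import torch

def poison_support_set(meta_learner, support_x, support_y, query_x, query_y,
                       epsilon=8 / 255, steps=40, lr=0.01):
    """Jointly craft imperceptible perturbations for a few-shot support set.

    All support inputs are perturbed together so that the classifier the
    amortized meta-learner produces from them fails on clean query points.
    `meta_learner` is a hypothetical callable returning query logits.
    """
    delta = torch.zeros_like(support_x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = meta_learner(support_x + delta, support_y, query_x)
        # Ascend the query loss: make the adapted classifier wrong.
        loss = -torch.nn.functional.cross_entropy(logits, query_y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)  # keep perturbation imperceptible
    return (support_x + delta).detach()
```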


Prompter: Utilizing Large Language Model Prompting for a Data Efficient Embodied Instruction Following

Inoue, Yuki, Ohashi, Hiroki

arXiv.org Artificial Intelligence

Embodied Instruction Following (EIF) studies how mobile manipulator robots should be controlled to accomplish long-horizon tasks specified by natural language instructions. While most research on EIF is conducted in simulators, the ultimate goal of the field is to deploy the agents in real life. As such, it is important to minimize the data cost required for training an agent, to ease the transition from sim to real. However, many studies focus only on performance and overlook the data cost: modules that require separate training on extra data are often introduced without consideration of deployability. In this work, we propose FILM++, which extends the existing work FILM with modifications that do not require extra data. While all data-driven modules are kept constant, FILM++ more than doubles FILM's performance. Furthermore, we propose Prompter, which replaces FILM++'s semantic search module with language model prompting. Unlike FILM++'s implementation, which requires training on extra sets of data, our prompting-based implementation needs no training while achieving better or at least comparable performance. Prompter achieves 42.64% and 45.72% on the ALFRED benchmark with high-level instructions only and with step-by-step instructions, respectively, outperforming the previous state of the art by 6.57% and 10.31%.
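
The prompting idea is simple to illustrate: score object-landmark pairings with a pretrained language model instead of a semantic-search module trained on extra data. In the sketch below, rank_landmarks, llm_score, and the prompt template are all hypothetical stand-ins, not the paper's exact implementation.

```python
def rank_landmarks(target: str, landmarks: list[str], llm_score) -> list[str]:
    """Rank candidate landmarks for object search via LM prompting.

    `llm_score` is a hypothetical callable returning a log-likelihood
    for a text string; the paper's prompt and scoring may differ.
    """
    template = "You are likely to find a {obj} near the {place}."
    scores = {
        place: llm_score(template.format(obj=target, place=place))
        for place in landmarks
    }
    # Search the most plausible landmarks first.
    return sorted(landmarks, key=lambda p: scores[p], reverse=True)

# Toy scorer standing in for a real LM: prefers conventional placements.
def toy_score(prompt: str) -> float:
    plausibility = {"fridge": -1.2, "sofa": -6.8, "countertop": -2.0}
    return next(v for k, v in plausibility.items() if k in prompt)

print(rank_landmarks("apple", ["sofa", "fridge", "countertop"], toy_score))
# -> ['fridge', 'countertop', 'sofa']
```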


Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue

Min, So Yeon, Zhu, Hao, Salakhutdinov, Ruslan, Bisk, Yonatan

arXiv.org Artificial Intelligence

Embodied dialogue instruction following requires an agent to complete a complex sequence of tasks from a natural language exchange. The recent introduction of benchmarks (Padmakumar et al., 2022) raises the question of how best to train and evaluate models for this multi-turn, multi-agent, long-horizon task. This paper contributes to that conversation by arguing that imitation learning (IL) and related low-level metrics are misleading: they do not align with the goals of embodied dialogue research and may hinder progress. We provide empirical comparisons of metrics, analyze three models, and make suggestions for how the field might best progress. First, we observe that models trained with IL take spurious actions during evaluation. Second, we find that existing models fail to ground query utterances, which are essential for task completion. Third, we argue that evaluation should focus on higher-level semantic goals.
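
The gap between low-level and higher-level evaluation is easy to demonstrate. The toy functions below are illustrative metrics in the spirit of the paper's argument, not its exact metrics: an agent that takes a valid alternative route scores zero on teacher-copying yet still satisfies the goal.

```python
def trajectory_match(pred_actions, gold_actions):
    """Low-level IL-style metric: fraction of steps copying the teacher."""
    n = max(len(gold_actions), 1)
    return sum(p == g for p, g in zip(pred_actions, gold_actions)) / n

def goal_success(final_state, goal_conditions):
    """Higher-level metric: fraction of semantic goal conditions satisfied."""
    met = sum(cond(final_state) for cond in goal_conditions)
    return met / max(len(goal_conditions), 1)

# A valid alternative route fails step matching but achieves the goal.
gold = ["forward", "left", "pickup_mug"]
alt = ["left", "forward", "forward", "pickup_mug"]
state = {"mug_held": True}
print(trajectory_match(alt, gold))                     # 0.0 by step matching
print(goal_success(state, [lambda s: s["mug_held"]]))  # 1.0 by goal check
```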


This robot crossed a line it shouldn't have because humans told it to

#artificialintelligence

Video of a sidewalk delivery robot crossing yellow caution tape and rolling through a crime scene in Los Angeles went viral this week, amassing more than 650,000 views on Twitter and sparking debate about whether the technology is ready for prime time. It turns out the robot's error, at least in this case, was caused by humans. The video of the event was taken and posted on Twitter by William Gude, the owner of Film the Police LA, an LA-based police watchdog account. Gude was in the area of a suspected school shooting at Hollywood High School at around 10 a.m. when he captured on video the bot as it hovered on the street corner, looking confused, until someone lifted the tape, allowing the bot to continue on its way through the crime scene. A food delivery robot forces its way across a police crime scene.


FiLM: Frequency improved Legendre Memory Model for Long-term Time Series Forecasting

Zhou, Tian, Ma, Ziqing, Wang, Xue, Wen, Qingsong, Sun, Liang, Yao, Tao, Yin, Wotao, Jin, Rong

arXiv.org Artificial Intelligence

Recent studies have shown that deep learning models such as RNNs and Transformers have brought significant performance gains for long-term forecasting of time series because they effectively utilize historical information. We found, however, that there is still great room for improvement in how to preserve historical information in neural networks while avoiding overfitting to noise present in the history. Addressing this allows better utilization of the capabilities of deep learning models. To this end, we design a Frequency improved Legendre Memory model, or FiLM: it applies Legendre polynomial projections to approximate historical information, uses Fourier projection to remove noise, and adds a low-rank approximation to speed up computation. Our empirical studies show that the proposed FiLM significantly improves the accuracy of state-of-the-art models in multivariate and univariate long-term forecasting by 20.3% and 22.6%, respectively. We also demonstrate that the representation module developed in this work can be used as a general plug-in to improve the long-term prediction performance of other deep learning modules. Code is available at https://github.com/tianzhou2011/FiLM/
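
The memory idea is straightforward to illustrate: projecting a history window onto a few Legendre polynomials keeps the smooth trend and discards high-frequency noise. The NumPy sketch below uses a batch least-squares fit as a stand-in; the actual model uses a recurrent HiPPO-style projection with Fourier-domain filtering, not this.

```python
import numpy as np

def legendre_compress(series: np.ndarray, degree: int) -> np.ndarray:
    """Approximate a history window with a truncated Legendre expansion.

    Fitting the window with the first few Legendre polynomials preserves
    the smooth historical trend while suppressing noise. Illustrative
    only; FiLM itself uses a recurrent projection, not a batch fit.
    """
    t = np.linspace(-1.0, 1.0, len(series))  # Legendre domain
    coeffs = np.polynomial.legendre.legfit(t, series, degree)
    return np.polynomial.legendre.legval(t, coeffs)

# Noisy ramp: a degree-4 fit tracks the trend and suppresses the noise.
rng = np.random.default_rng(0)
clean = np.linspace(0, 1, 256)
noisy = clean + 0.1 * rng.standard_normal(256)
smooth = legendre_compress(noisy, degree=4)
print(float(np.abs(smooth - clean).mean()))  # small residual vs. the trend
```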


The Digital Afterlife in Film

#artificialintelligence

For decades, science fiction film, television, and literature have addressed our human desire for connection with our dead loved ones. With the creation of artificial intelligence, our imaginings of machine-learning holograms and robots have become reality. More recent films and documentary programs have addressed this new technology, and I will be examining numerous mediated stories throughout my research.


Three ways that big data reveals what you really like to watch, read and listen to

#artificialintelligence

Anyone who's watched "Bridget Jones's Diary" knows one of her New Year's resolutions is "Not go out every night but stay in and read books and listen to classical music." The reality, however, is substantially different. What people actually do in their leisure time often doesn't match what they say they'll do. Economists have termed this phenomenon "hyperbolic discounting." In a famous study titled "Paying Not to Go to the Gym," a couple of economists found that, when people were offered the choice between a pay-per-visit contract and a monthly fee, they were more likely to choose the monthly fee and actually ended up paying more per visit.