Human-Computer Interaction
ActionSense: A Multimodal Dataset and Recording Framework for Human Activities Using Wearable Sensors in a Kitchen Environment
Joseph DelPreto, Chao Liu, Yiyue Luo, Michael Foshey, Yunzhu Li, Antonio Torralba, Wojciech Matusik, Daniela Rus
This paper introduces ActionSense, a multimodal dataset and recording framework with an emphasis on wearable sensing in a kitchen environment. It provides rich, synchronized data streams along with ground truth data to facilitate learning pipelines that could extract insights about how humans interact with the physical world during activities of daily living, and help lead to more capable and collaborative robot assistants. The wearable sensing suite captures motion, force, and attention information; it includes eye tracking with a first-person camera, forearm muscle activity sensors, a body-tracking system using 17 inertial sensors, finger-tracking gloves, and custom tactile sensors on the hands that use a matrix of conductive threads. This is coupled with activity labels and with externally captured data from multiple RGB cameras, a depth camera, and microphones. The specific tasks recorded in ActionSense are designed to highlight lower-level physical skills and higher-level scene reasoning or action planning. They include simple object manipulations (e.g., stacking plates), dexterous actions (e.g., peeling or cutting vegetables), and complex action sequences (e.g., setting a table or loading a dishwasher).
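Because the abstract emphasizes synchronized, heterogeneous streams, a minimal sketch of aligning such streams on a common clock may be helpful; the stream names, channel counts, and sampling rates below are illustrative assumptions, not the ActionSense file format.

import numpy as np

def resample_stream(timestamps_s, values, target_times_s):
    # Linearly interpolate each channel of `values` onto the shared clock.
    values = np.asarray(values, dtype=float)
    return np.stack(
        [np.interp(target_times_s, timestamps_s, values[:, c])
         for c in range(values.shape[1])], axis=1)

# Example: a 160 Hz EMG stream and a 60 Hz IMU stream over the same 10 s window.
t_emg = np.arange(0, 10, 1 / 160.0)
emg = np.random.randn(t_emg.size, 8)        # 8 hypothetical forearm EMG channels
t_imu = np.arange(0, 10, 1 / 60.0)
imu = np.random.randn(t_imu.size, 17 * 3)   # 17 IMUs x 3-axis acceleration

t_common = np.arange(0, 10, 1 / 50.0)       # shared 50 Hz analysis clock
aligned = np.hstack([resample_stream(t_emg, emg, t_common),
                     resample_stream(t_imu, imu, t_common)])   # shape (500, 59)

Simple interpolation like this is only a starting point; a real pipeline would also need to handle clock drift and dropped samples.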
Avalon: A Benchmark for RL Generalization Using Procedurally Generated Worlds
Despite impressive successes, deep reinforcement learning (RL) systems still fall short of human performance on generalization to new tasks and environments that differ from their training. As a benchmark tailored for studying RL generalization, we introduce Avalon, a set of tasks in which embodied agents in highly diverse procedural 3D worlds must survive by navigating terrain, hunting or gathering food, and avoiding hazards. Avalon is unique among existing RL benchmarks in that the reward function, world dynamics, and action space are the same for every task, with tasks differentiated solely by altering the environment; its 20 tasks, ranging in complexity from eat and throw to hunt and navigate, each create worlds in which the agent must perform specific skills in order to survive. This setup enables investigations of generalization within tasks, between tasks, and to compositional tasks that require combining skills learned from previous tasks. Avalon includes a highly efficient simulator, a library of baselines, and a benchmark with scoring metrics evaluated against hundreds of hours of human performance, all of which are open-source and publicly available. We find that standard RL baselines make progress on most tasks but are still far from human performance, suggesting Avalon is challenging enough to advance the quest for generalizable RL.
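A rough sketch of the evaluation loop implied by this design is shown below, assuming a gym-style environment constructor; the names make_env, task, and world_seed are placeholders rather than Avalon's published API.

def evaluate(policy, make_env, tasks=("eat", "throw", "hunt", "navigate"), episodes_per_task=10):
    # Average episode return per task; only the generated world differs,
    # while the survival reward and action space stay fixed across tasks.
    scores = {}
    for task in tasks:
        returns = []
        for seed in range(episodes_per_task):
            env = make_env(task=task, world_seed=seed)   # procedurally generated world
            obs, done, total = env.reset(), False, 0.0
            while not done:
                obs, reward, done, info = env.step(policy(obs))
                total += reward
            returns.append(total)
        scores[task] = sum(returns) / len(returns)
    return scores

Because every task shares the same interface, the same policy object can be scored on all of them, which is what enables the within-task, between-task, and compositional generalization studies described above.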
Interaction-Grounded Learning with Action-Inclusive Feedback
Consider the problem setting of Interaction-Grounded Learning (IGL), in which a learner's goal is to optimally interact with the environment with no explicit reward to ground its policies. The agent observes a context vector, takes an action, and receives a feedback vector--using this information to effectively optimize a policy with respect to a latent reward function. Previously analyzed approaches fail when the feedback vector contains the action, which significantly limits IGL's success in many potential scenarios such as brain-computer interface (BCI) or human-computer interface (HCI) applications. We address this by creating an algorithm and analysis which allow IGL to work even when the feedback vector contains the action, encoded in any fashion. We provide theoretical guarantees and large-scale experiments based on supervised datasets to demonstrate the effectiveness of the new approach.
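The interaction protocol described here is simple to state in code; the following is an illustrative sketch of one IGL round (not the paper's algorithm), where reward_decoder stands for a learned function that must recover the latent reward from a feedback vector that may encode the action.

import numpy as np

def igl_round(policy, reward_decoder, env, rng):
    x = env.sample_context()                # context vector
    p = policy(x)                           # probability distribution over actions
    a = rng.choice(len(p), p=p)             # sampled action
    y = env.feedback(x, a)                  # feedback vector; may contain a in any encoding
    r_hat = reward_decoder(y, a)            # estimate of the latent reward
    return x, a, r_hat / max(p[a], 1e-6)    # importance-weighted reward estimate for policy updates

Roughly speaking, the difficulty the paper addresses is that when y encodes a, a naive decoder can latch onto the action information in the feedback rather than the latent reward.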
emg2pose: A Large and Diverse Benchmark for Surface Electromyographic Hand Pose Estimation
Sasha Salter, Richard Warren
Hands are the primary means through which humans interact with the world. Reliable and always-available hand pose inference could yield new and intuitive control schemes for human-computer interactions, particularly in virtual and augmented reality. Computer vision is effective but requires one or multiple cameras and can struggle with occlusions, limited field of view, and poor lighting. Wearable wrist-based surface electromyography (sEMG) presents a promising alternative as an always-available modality sensing muscle activities that drive hand motion. However, sEMG signals are strongly dependent on user anatomy and sensor placement; existing sEMG models have thus required hundreds of users and device placements to effectively generalize for tasks other than pose inference. To facilitate progress on sEMG pose inference, we introduce the emg2pose benchmark, which is to our knowledge the first publicly available dataset of high-quality hand pose labels and wrist sEMG recordings.
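As a concrete, hypothetical illustration of the task the benchmark targets, a small regression model can map a window of multichannel wrist sEMG to a vector of joint angles; the channel count, window length, and number of joint angles below are assumptions, not the dataset's specification.

import torch
import torch.nn as nn

class EmgPoseRegressor(nn.Module):
    # 1-D CNN mapping an sEMG window (channels x time) to hand joint angles.
    def __init__(self, emg_channels=16, num_joint_angles=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(emg_channels, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(128, num_joint_angles)

    def forward(self, emg_window):                 # (batch, channels, time)
        feats = self.encoder(emg_window).squeeze(-1)
        return self.head(feats)                    # (batch, num_joint_angles)

model = EmgPoseRegressor()
angles = model(torch.randn(4, 16, 400))           # four 400-sample sEMG windows -> (4, 20) angles

The generalization challenge noted above is that such a model, trained on some users and sensor placements, tends to degrade on unseen ones.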
Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Autonomous agents that accomplish complex computer tasks with minimal human intervention can significantly enhance the accessibility and productivity of human-computer interactions. Existing benchmarks either lack interactive environments or are limited to specific applications/domains, failing to reflect the diversity and complexity of real-world computer use and limiting agent scalability.
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially built interfaces to the software they use. We investigate how interface design affects the performance of language model agents. As a result of this exploration, we introduce SWE-agent: a system that enables LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5% and 87.7%, respectively, far exceeding the previous state-of-the-art achieved with non-interactive LMs. Finally, we provide insight into how the design of the ACI can impact agents' behavior and performance.
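To make the notion of an agent-computer interface concrete, here is a hypothetical sketch of the kind of compact, feedback-rich commands such an interface might expose; the command names, context-window size, and test command are illustrative and are not SWE-agent's actual ACI.

import subprocess
from pathlib import Path

WINDOW = 100  # lines shown per view, keeping observations short for the LM

def open_file(path: str, start: int = 1) -> str:
    # Show a numbered window of the file instead of dumping it whole.
    lines = Path(path).read_text().splitlines()
    window = lines[start - 1 : start - 1 + WINDOW]
    return "\n".join(f"{start + i}: {line}" for i, line in enumerate(window))

def edit(path: str, start: int, end: int, replacement: str) -> str:
    # Replace a line range and confirm the edit with a short message.
    lines = Path(path).read_text().splitlines()
    lines[start - 1 : end] = replacement.splitlines()
    Path(path).write_text("\n".join(lines) + "\n")
    return f"Edited {path}: lines {start}-{end} replaced."

def run_tests(cmd: str = "pytest -x -q") -> str:
    # Run the test suite and return truncated output to bound context size.
    result = subprocess.run(cmd.split(), capture_output=True, text=True)
    return (result.stdout + result.stderr)[-2000:]

The intuition, per the abstract, is that interfaces tailored to LM agents' needs and abilities improve their ability to edit code, navigate repositories, and run tests.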
Our favorite budget video doorbell gets an upgrade - see what's new with Amazon's Blink
ZDNET's pick for the best budget-friendly video doorbell is now getting an upgrade, as Amazon has announced the launch of a new generation of the Blink Video Doorbell. The second-generation Blink Video Doorbell is being released four years after the first; it comes alongside the new Blink Sync Module Core and, unfortunately, sports a price increase. The new Blink Video Doorbell is available now for $70, a $20 price increase compared to the first model, which cost $50 when it was released. However, the higher price brings some upgrades. The new Blink Video Doorbell has a 150-degree field of view, which gives you a bigger picture of people from head to toe and any packages that may be on your porch.
New Blink Video Doorbell promises 2 years of runtime on 3 AA batteries
Amazon's Blink division--the budget-oriented sibling of Ring in the smart home space--has announced an all-new version of its affordable battery-powered Blink Video Doorbell, which is available now for $69.99. While the price has gone up by $10 compared to the previous model, the new version's higher-resolution camera (1440 × 1440 pixels), 150-degree field of view, and 1:1 aspect ratio will provide a head-to-toe view of visitors. Blink is bundling the new doorbell with its new AC-powered Blink Sync Module Core, which acts as a bridge to the home's Wi-Fi network (it does not have a chime that sounds when a visitor rings the doorbell). Blink says having the doorbell connect to the Core, instead of directly to a home's router, is what enables the doorbell's remarkable 2-year battery life. Since the Sync Module Core does not offer any means of local storage--unlike the Sync Module 2 (with a USB-A socket) and the Sync Module XR (with a microSD card slot)--buyers of the latest Blink Video Doorbell will need to either sign up for a subscription to store video recordings in the cloud, or replace the Core with either of Blink's other two Sync Modules.
Acoustic Volume Rendering for Neural Impulse Response Fields
Zitong Lan
Realistic audio synthesis that captures accurate acoustic phenomena is essential for creating immersive experiences in virtual and augmented reality. Synthesizing the sound received at any position relies on the estimation of the impulse response (IR), which characterizes how sound propagates through a scene along different paths before arriving at the listener's position. In this paper, we present Acoustic Volume Rendering (AVR), a novel approach that adapts volume rendering techniques to model acoustic impulse responses. While volume rendering has been successful in modeling radiance fields for images and neural scene representations, IRs present unique challenges as time-series signals. To address these challenges, we introduce frequency-domain volume rendering and use spherical integration to fit the IR measurements. Our method constructs an impulse response field that inherently encodes wave propagation principles and achieves state-of-the-art performance in synthesizing impulse responses for novel poses. Experiments show that AVR surpasses current leading methods by a substantial margin. Additionally, we develop an acoustic simulation platform, AcoustiX, which provides more accurate and realistic IR simulations than existing simulators. Code for AVR and AcoustiX is available at https://zitonglan.github.io/avr.
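A schematic sketch of what frequency-domain volume rendering of an impulse response along a single ray could look like is given below; this is one plausible reading of the idea in the abstract, not the paper's exact formulation, and the field functions and constants are placeholders.

import numpy as np

def render_ray_ir(densities, signals, distances_m, freqs_hz, c=343.0):
    # densities: (N,) non-negative; signals: (N, F) complex per-frequency emission;
    # distances_m: (N,) sample distances along the ray; freqs_hz: (F,) frequency bins.
    deltas = np.diff(distances_m, prepend=0.0)
    alphas = 1.0 - np.exp(-densities * deltas)                   # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = trans * alphas                                     # volume-rendering weights
    phase = np.exp(-2j * np.pi * freqs_hz[None, :] * distances_m[:, None] / c)
    spectrum = (weights[:, None] * signals * phase).sum(axis=0)  # accumulate along the ray
    return np.fft.irfft(spectrum)                                # back to a time-domain IR

In the paper's framing, many such rays would then be integrated over the sphere around the listener to fit measured IRs; what the sketch illustrates is that accumulating in the frequency domain lets the distance-dependent phase (i.e., time delay) be applied per sample.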