Gear News of the Week: Veo 3 Comes to Google Photos, and Garmin Adds Satellite Comms to a Watch

WIRED

A few months ago, Google debuted a feature in Google Photos that converts your existing photos into short videos using generative AI. These videos add slight synthetic movements to your stills, so a person may shift subtly in the frame, or a picture of your sleeping pup could gain a leg twitch. This week, the company upgraded the feature with its Veo 3 video generation model, which boosts the quality of the results. To try it, open any photo in Google Photos, tap the three-dot button at the top right, and tap Create. Choose the Photo to Video option, then pick between Subtle Movement and I'm Feeling Lucky, which is a little more creative.


The Echo Spot hits its best price ahead of Prime Day, now 44% off

PCWorld

Prime Day is just days away, and we're already reveling in the early deals we're spotting. Here's one such deal: the gorgeous Amazon Echo Spot has been slashed to just $45. That's 44% off its $80 MSRP and matches the best price it's ever hit. The 2024 version of the Echo Spot has a sleek redesign that sets it apart from previous models. Instead of the big speaker we normally associate with the Echo line, this one is a proper digital display, with the speaker as more of a supplement.


EdgeVidSum: Real-Time Personalized Video Summarization at the Edge

Mujtaba, Ghulam, Ryu, Eun-Seok

arXiv.org Artificial Intelligence

EdgeVidSum is a lightweight method that generates personalized, fast-forward summaries of long-form videos directly on edge devices. The approach enables real-time video summarization while safeguarding user privacy: all data is processed locally, using thumbnail-based techniques and efficient neural architectures. Unlike conventional methods that process entire videos frame by frame, the proposed method uses thumbnail containers to significantly reduce computational complexity without sacrificing semantic relevance. The framework employs a hierarchical analysis approach, in which a lightweight 2D CNN model identifies user-preferred content from thumbnails and generates timestamps to create fast-forward summaries. Our interactive demo highlights the system's ability to create tailored summaries of long-form videos, such as movies, sports events, and TV shows, based on individual user preferences. The entire computation runs on resource-constrained devices such as the Jetson Nano, demonstrating how EdgeVidSum addresses the critical challenges of computational efficiency, personalization, and privacy in modern video consumption environments.
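
The abstract describes the core mechanism: a lightweight 2D CNN scores thumbnails against a user preference, and runs of preferred thumbnails become timestamp ranges for a fast-forward summary. The sketch below illustrates that selection logic; the names (score_thumbnails, interval_s, threshold) are illustrative assumptions, and a simple cosine similarity stands in for the paper's CNN.

    import numpy as np

    def score_thumbnails(feats, profile):
        # Stand-in for the paper's lightweight 2D CNN: cosine similarity
        # between per-thumbnail feature vectors and a user-preference vector.
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        return feats @ (profile / np.linalg.norm(profile))

    def summarize(feats, profile, interval_s=2.0, threshold=0.5):
        # Turn runs of preferred thumbnails into (start, end) timestamp
        # ranges; a player fast-forwards through everything else.
        scores = score_thumbnails(feats, profile)
        segments, start = [], None
        for i, s in enumerate(scores):
            if s >= threshold and start is None:
                start = i * interval_s
            elif s < threshold and start is not None:
                segments.append((start, i * interval_s))
                start = None
        if start is not None:
            segments.append((start, len(scores) * interval_s))
        return segments

On an edge device, the expensive part is the per-thumbnail CNN forward pass; the selection logic above is negligible, which is what makes thumbnail-level analysis attractive compared with frame-by-frame processing.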


Decoding Phone Pairs from MEG Signals Across Speech Modalities

de Zuazo, Xabier, Navas, Eva, Saratxaga, Ibon, Bourguignon, Mathieu, Molinaro, Nicola

arXiv.org Artificial Intelligence

Understanding the neural mechanisms underlying speech production is essential both for advancing cognitive neuroscience theory and for developing practical communication technologies. In this study, we investigated magnetoencephalography (MEG) signals to decode phones from brain activity during speech production and perception (passive listening and voice playback) tasks. Using a dataset comprising 17 participants, we performed pairwise phone classification, extending our analysis to 15 phonetic pairs. Multiple machine learning approaches, including regularized linear models and neural network architectures, were compared to determine their effectiveness in decoding phonetic information. Our results demonstrate significantly higher decoding accuracy during speech production (76.6%) than during passive listening and playback (~51%), emphasizing the richer neural information available during overt speech. Among the models, the Elastic Net classifier consistently outperformed more complex neural networks, highlighting the effectiveness of traditional regularization techniques when applied to limited, high-dimensional MEG datasets. In addition, analysis of specific brain frequency bands revealed that low-frequency oscillations, particularly Delta (0.2-3 Hz) and Theta (4-7 Hz), contributed most substantially to decoding accuracy, suggesting that these bands encode critical speech production-related neural processes. Despite the use of advanced denoising methods, it remains unclear whether decoding reflects neural activity alone or whether residual muscular or movement artifacts also contributed, indicating the need for further methodological refinement. Overall, our findings underline the critical importance of examining overt speech production paradigms, which, despite their complexity, offer opportunities to improve brain-computer interfaces for individuals with severe speech impairments.
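
As a rough illustration of the winning approach, the sketch below runs pairwise phone classification with an elastic-net-penalized logistic regression in scikit-learn. The data here is random noise standing in for flattened, band-filtered MEG epochs; the shapes and labels are assumptions, not the paper's.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    n_trials, n_features = 120, 500      # e.g., channels x time points, flattened
    X = rng.standard_normal((n_trials, n_features))  # stand-in for MEG epochs
    y = rng.integers(0, 2, n_trials)                 # phone A vs. phone B labels

    clf = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, max_iter=5000),
    )
    # On random data this hovers near chance (~0.5); the paper reports
    # ~76.6% for production and ~51% for listening/playback on real epochs.
    print(cross_val_score(clf, X, y, cv=5).mean())

The elastic-net penalty mixes L1 and L2 regularization, which is why such linear models can remain competitive on small, high-dimensional neuroimaging datasets where deeper networks overfit.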


Calliope: An Online Generative Music System for Symbolic Multi-Track Composition

Tchemeube, Renaud Bougueng, Ens, Jeff, Pasquier, Philippe

arXiv.org Artificial Intelligence

With the rise of artificial intelligence in recent years, there has been a rapid increase in its application to creative domains, including music. Many existing systems apply machine learning approaches to the problem of computer-assisted music composition (CAC). Calliope is a web application that assists users in performing a variety of multi-track composition tasks in the symbolic domain. The user can upload MIDI (Musical Instrument Digital Interface) files, visualize and edit MIDI tracks, and generate partial (via bar in-filling) or complete multi-track content using the Multi-Track Music Machine (MMM). New MIDI excerpts can be generated in batch, and generation can be combined with active playback listening for an enhanced assisted-composition workflow. The user can export generated MIDI materials or stream MIDI playback directly from the system to their favorite Digital Audio Workstation (DAW). We present a demonstration of the system, its features and generative parameters, and describe the co-creative workflows that it affords.
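
To make the bar in-filling idea concrete, here is a small sketch that masks the note events of one bar in a MIDI track so a generative model can refill the gap. The fixed 4/4 meter and the helper names are illustrative assumptions, and the real MMM interface is not shown.

    import mido

    def bar_ticks(mid, bar_index, beats_per_bar=4):
        # Map a bar index to a tick window, assuming a fixed 4/4 meter.
        ticks_per_bar = mid.ticks_per_beat * beats_per_bar
        return bar_index * ticks_per_bar, (bar_index + 1) * ticks_per_bar

    def mask_bar(track, start, end):
        # Remove note events whose absolute tick falls in [start, end),
        # preserving the timing of every message that remains.
        kept, t = [], 0
        for msg in track:
            t += msg.time
            if not (msg.type in ("note_on", "note_off") and start <= t < end):
                kept.append((t, msg))
        rebuilt, prev = [], 0
        for t, msg in kept:
            rebuilt.append(msg.copy(time=t - prev))
            prev = t
        return rebuilt  # the emptied bar is what the model would in-fill

    # Usage: mid = mido.MidiFile("song.mid")
    #        start, end = bar_ticks(mid, 8)
    #        mid.tracks[0][:] = mask_bar(mid.tracks[0], start, end)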


Setup-Invariant Augmented Reality for Teaching by Demonstration with Surgical Robots

Banks, Alexandre, Cook, Richard, Salcudean, Septimiu E.

arXiv.org Artificial Intelligence

Augmented reality (AR) is an effective tool in robotic surgery education as it combines exploratory learning with three-dimensional guidance. However, existing AR systems require expert supervision and do not account for differences in the mentor and mentee robot configurations. To enable novices to train outside the operating room while receiving expert-informed guidance, we present dV-STEAR: an open-source system that plays back task-aligned expert demonstrations without assuming identical setup joint positions between expert and novice. Pose estimation was rigorously quantified, showing a registration error of 3.86 mm (SD = 2.01 mm). In a user study (N=24), dV-STEAR significantly improved novice performance on tasks from the Fundamentals of Laparoscopic Surgery. In a single-handed ring-over-wire task, dV-STEAR increased completion speed (p=0.03) and reduced collision time (p=0.01) compared to dry-lab training alone. During a pick-and-place task, it improved success rates (p=0.004). Across both tasks, participants using dV-STEAR exhibited significantly more balanced hand use and reported lower frustration levels. This work presents a novel educational tool implemented on the da Vinci Research Kit, demonstrates its effectiveness in teaching novices, and builds the foundation for further AR integration into robot-assisted surgery.
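
A registration error like the one quoted above is typically computed as the residual after rigidly aligning corresponding 3D points. The Kabsch-based sketch below shows one standard way to obtain such a number; it is a generic illustration, not dV-STEAR's actual evaluation code.

    import numpy as np

    def kabsch_rmse(P, Q):
        # RMSE between point sets P and Q (both N x 3) after the optimal
        # rigid alignment of P onto Q (Kabsch algorithm).
        Pc, Qc = P - P.mean(0), Q - Q.mean(0)   # remove translation
        U, _, Vt = np.linalg.svd(Pc.T @ Qc)     # cross-covariance SVD
        d = np.sign(np.linalg.det(U @ Vt))      # guard against reflection
        R = U @ np.diag([1.0, 1.0, d]) @ Vt     # optimal rotation
        return np.sqrt(((Pc @ R - Qc) ** 2).sum(axis=1).mean())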


Reolink unveils Altas Wireless Security System with 24/7 2K recording

PCWorld

Reolink has unveiled the Altas Wireless Security System, a battery-powered camera setup capable of delivering 24/7 recording in 2K resolution. Designed with flexibility and ease of use in mind, the system targets homeowners who want reliable surveillance without technical headaches. Unveiled this week at CES in Las Vegas, the Altas Wireless Security System includes two 2K bullet-style Altas cameras, two 6-watt solar panels, and a Home Hub for centralized management. Each camera features a 20,000mAh battery, providing up to seven days of continuous recording. With just two hours of sunlight daily, the solar panels keep the cameras running around the clock, reducing reliance on motion detection.


Variable-Speed Teaching-Playback as Real-World Data Augmentation for Imitation Learning

Masuya, Nozomu, Sato, Hiroshi, Yamane, Koki, Kusume, Takuya, Sakaino, Sho, Tsuji, Toshiaki

arXiv.org Artificial Intelligence

Because imitation learning relies on human demonstrations, and force-control interactions are hard to simulate, incorporating force control into imitation learning has suffered from a shortage of training data, even for something as simple as a change in speed. Although data augmentation has addressed the lack of data, conventional augmentation methods for robot manipulation are limited to simulation-based approaches or downsampling for position control. This paper proposes a novel data augmentation method that is applicable to force control and preserves the advantages of real-world datasets. We applied teaching-playback at variable speeds as real-world data augmentation to increase both the quantity and quality of environmental reactions at variable speeds. An experiment was conducted on bilateral control-based imitation learning using an imitation learning method equipped with position-force control. We evaluated the effect of real-world data augmentation on two tasks, pick-and-place and wiping, at variable speeds, each from two human demonstrations at fixed speed. The results showed up to a 55% increase in success rate from a simple change in the speed of real-world reactions, and improved accuracy in following duration/frequency commands by gathering environmental reactions at variable speeds.
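
The paper's augmentation replays a taught trajectory on the physical robot at new speeds and records the fresh force responses, so only the command-side time scaling can be sketched offline. The helper below shows that resampling step; the function name, uniform-timestep assumption, and the toy trajectory are illustrative.

    import numpy as np

    def rescale_trajectory(t, q, speed):
        # Resample joint positions q(t) so the same motion plays back at
        # `speed`x while keeping the original control period.
        # t: (N,) timestamps of the taught demonstration in seconds
        # q: (N, DoF) joint positions
        t_new = np.arange(0.0, t[-1] / speed, t[1] - t[0])
        q_new = np.stack([np.interp(t_new * speed, t, q[:, j])
                          for j in range(q.shape[1])], axis=1)
        return t_new, q_new

    # Example: a 10 s demonstration replayed at half speed becomes a 20 s
    # command; executing it on the robot yields new force data "for free".
    t = np.linspace(0.0, 10.0, 1001)
    q = np.stack([np.sin(t), np.cos(t)], axis=1)
    t_slow, q_slow = rescale_trajectory(t, q, speed=0.5)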