Mobile


It's the End of the World (And It's Their Fault)

The Atlantic - Technology

It's late morning on a Monday in March and I am, for reasons I will explain momentarily, in a private bowling alley deep in the bowels of a $65 million mansion in Utah. Jesse Armstrong, the showrunner of HBO's hit series Succession, approaches me, monitor headphones around his neck and a wide grin on his face. "I take it you've seen the news," he says, flashing his phone and what appears to be his X feed in my direction. Everyone had: An hour earlier, my boss Jeffrey Goldberg had published a story revealing that U.S. national-security leaders had accidentally added him to a Signal group chat where they discussed their plans to conduct then-upcoming military strikes in Yemen. "Incredibly fucking depressing," Armstrong said.


The Drone Wars

Slate

The war between Ukraine and Russia is being fought increasingly via drone, and NATO and U.S. military leadership is training troops for future conflicts that will pit man against machine. Subscribe to Slate Plus to access ad-free listening to the whole What Next family and all your favorite Slate podcasts. Subscribe today on Apple Podcasts by clicking "Try Free" at the top of our show page. Sign up now at slate.com/whatnextplus to get access wherever you listen.


10 must-try Google Photos tips and tricks - including a new AI editor

ZDNet

Google Photos has just reached its 10th birthday, and the company is celebrating. To mark the occasion, Google is serving up a host of tips and tricks designed to enhance your photos via your mobile device. But first, here are a few stats to show the reach of Google Photos. More than 1.5 billion people use Google Photos on a monthly basis, according to Google. Each month, people run more than 370 million searches, edit 210 million photos, and share 440 million of them. First up is a new and improved photo editor that employs AI to help you fine-tune your images.


Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Neural Information Processing Systems

Autonomous agents that accomplish complex computer tasks with minimal human intervention can significantly enhance the accessibility and productivity of human-computer interactions. Existing benchmarks either lack interactive environments or are limited to specific applications/domains, failing to reflect the diversity and complexity of real-world computer use and limiting agent scalability.


NeurIPS 2021 Sparse Training (camera-ready)

Neural Information Processing Systems

The example in Figure A.1 (d) is defined as a 4-entry kernel pattern, since every kernel preserves 4 non-zero weights out of the original 3×3 kernel. Besides that, connectivity sparsity cuts the connections between some input and output channels, which is equivalent to removing the corresponding whole kernels. Consider a sparse model with a sparsity ratio s ∈ [0, 1] obtained from a dense model with a total of N weights. For sparse models, we need indices denoting the sparse topology of weights/gradients within the dense model. Generally, mobile edge devices can support 8-bit fixed-point, 16-bit floating-point, and 32-bit floating-point numbers. Weights and gradients usually use 16-bit or 32-bit formats. Due to the data storage format on edge devices, 8-bit or 16-bit is preferred for indices.
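The storage arithmetic implied above can be made concrete. This is a minimal sketch, not the paper's accounting scheme: it assumes one index per surviving weight, and the bit widths (16-bit weights, 16-bit indices) are illustrative choices from the ranges the text names.

```python
# Hedged sketch: storage estimate for a sparse model's nonzero weights plus
# their topology indices, given sparsity ratio s (fraction of the N dense
# weights removed). Bit widths are illustrative, not values from the paper.

def sparse_storage_bits(n_weights: int, sparsity: float,
                        weight_bits: int = 16, index_bits: int = 16) -> int:
    """Total bits to store the surviving weights plus one index per weight."""
    n_nonzero = round(n_weights * (1.0 - sparsity))
    return n_nonzero * (weight_bits + index_bits)

dense_bits = 1_000_000 * 16                        # dense 16-bit baseline
sparse_bits = sparse_storage_bits(1_000_000, 0.9)  # 90% sparsity
print(sparse_bits / dense_bits)  # 0.2: 5x smaller despite index overhead
```

Even with indices doubling the per-weight cost, high sparsity still yields a large net saving, which is why narrower 8-bit or 16-bit indices are attractive on edge devices.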


MultiScan: Scalable RGBD scanning for 3D environments with articulated objects

Neural Information Processing Systems

We introduce MultiScan, a scalable RGBD dataset construction pipeline leveraging commodity mobile devices to scan indoor scenes with articulated objects and web-based semantic annotation interfaces to efficiently annotate object and part semantics and part mobility parameters. We use this pipeline to collect 273 scans of 117 indoor scenes containing 10,957 objects and 5,129 parts. The resulting MultiScan dataset provides RGBD streams with per-frame camera poses, textured 3D surface meshes, richly annotated part-level and object-level semantic labels, and part mobility parameters.


HydraViT: Stacking Heads for a Scalable ViT

Neural Information Processing Systems

The architecture of Vision Transformers (ViTs), particularly the Multi-head Attention (MHA) mechanism, imposes substantial hardware demands. Deploying ViTs on devices with varying constraints, such as mobile phones, requires multiple models of different sizes. However, this approach has limitations, such as training and storing each required model separately. This paper introduces HydraViT, a novel approach that addresses these limitations by stacking attention heads to achieve a scalable ViT. By repeatedly changing the size of the embedding dimension throughout each layer, and the corresponding number of attention heads in MHA, during training, HydraViT induces multiple subnetworks. Thereby, HydraViT achieves adaptability across a wide spectrum of hardware environments while maintaining performance. Our experimental results demonstrate the efficacy of HydraViT in achieving a scalable ViT with up to 10 subnetworks, covering a wide range of resource constraints. HydraViT achieves up to 5 p.p. more accuracy with the same GMACs and up to 7 p.p. more accuracy with the same throughput on ImageNet-1K compared to the baselines, making it an effective solution for scenarios where hardware availability is diverse or varies over time.
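The core weight-sharing idea, subnetworks that reuse the first k attention heads of the largest model, can be sketched as array slicing. The shapes and the slicing scheme below are illustrative assumptions, not HydraViT's exact implementation:

```python
import numpy as np

# Hedged sketch: a smaller subnetwork's QKV projection is a slice of the
# full model's parameters (first k heads, first k*head_dim embedding dims),
# so no separate model needs to be trained or stored for each device.

rng = np.random.default_rng(0)
n_heads, head_dim = 12, 64
embed = n_heads * head_dim
W_qkv = rng.standard_normal((3, n_heads, head_dim, embed))  # full model

def subnetwork(W, k):
    """QKV weights of the k-head subnetwork (a view, so storage is shared)."""
    return W[:, :k, :, : k * head_dim]

small = subnetwork(W_qkv, 4)  # 4-head subnetwork
print(small.shape)            # (3, 4, 64, 256)
print(small.base is W_qkv)    # True: parameters shared, not copied
```

Because the slice is a NumPy view, updating the small subnetwork's weights during training also updates the full model, which is the point of inducing subnetworks jointly rather than training separate models.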


DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
Hao Bai, Yifei Zhou, Jiayi Pan

Neural Information Processing Systems

While training with static demonstrations has shown some promise, we show that such methods fall short for controlling real GUIs due to their failure to deal with real-world stochasticity and non-stationarity not captured in static observational data. This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device-control agents by fine-tuning a pre-trained VLM in two stages: offline RL to initialize the model, followed by offline-to-online RL. To do this, we build a scalable and parallelizable Android learning environment equipped with a VLM-based evaluator and develop a simple yet effective RL approach for learning in this domain. Our approach runs advantage-weighted RL with advantage estimators enhanced to account for stochasticity, along with an automatic curriculum for deriving maximal learning signal. We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild (AitW) dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement (from 17.7% to 67.2% success rate) over supervised fine-tuning with static human demonstration data. These results significantly surpass the prior best agents, including AppAgent with GPT-4V (8.3% success rate) and the 17B CogAgent trained with AitW data (38.5%).
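The advantage-weighted scheme the abstract names can be sketched generically. This follows the standard AWR-style recipe, not DigiRL's actual estimators or curriculum; the temperature `beta` and the weight clip are illustrative hyperparameters:

```python
import math

# Hedged sketch: advantage-weighted RL turns policy improvement into
# weighted supervised learning. Each step's log-likelihood loss is scaled
# by exp(advantage / beta); steps that outperformed the baseline are
# up-weighted, harmful steps damped. Clipping bounds the weight variance.

def awr_weights(advantages, beta=1.0, clip=20.0):
    """Per-step weights for the policy's log-likelihood loss."""
    return [min(math.exp(a / beta), clip) for a in advantages]

print([round(w, 3) for w in awr_weights([-1.0, 0.0, 2.0])])
# [0.368, 1.0, 7.389]
```

In a device-control setting, the advantage estimate must absorb the environment stochasticity the abstract highlights; otherwise a lucky rollout through a flaky GUI gets the same up-weighting as a genuinely good action.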


I switched my search engine to DuckDuckGo, and it made Google better

PCWorld

I've been trying to disentangle my online life from Google for a while. And as someone who wrote about Android professionally for years, it hasn't been easy. I've ditched Chrome, but I still use a Samsung Galaxy phone and Google Pixel Watch, for example. But when I finally got off the big daddy, Google Search, and switched to DuckDuckGo, it had a surprising effect: Google got better. That's a broad statement, so let me be more particular right away.


Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

Neural Information Processing Systems

Mobile device operation tasks are increasingly becoming a popular multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), constrained by their training data, lack the capability to function effectively as operation assistants. Instead, MLLM-based agents, which enhance capabilities through tool invocation, are gradually being applied to this scenario. However, the two major navigation challenges in mobile device operation tasks -- task progress navigation and focus content navigation -- are difficult to effectively solve under the single-agent architecture of existing work. This is due to the overly long token sequences and the interleaved text-image data format, which limit performance.