
RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins

arXiv.org Artificial Intelligence

In the rapidly advancing field of robotics, dual-arm coordination and complex object manipulation are essential capabilities for developing advanced autonomous systems. However, the scarcity of diverse, high-quality demonstration data and real-world-aligned evaluation benchmarks severely limits such development. To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks. Specifically, RoboTwin creates varied digital twins of objects from single 2D images, generating realistic and interactive scenarios. It also introduces a spatial relation-aware code generation framework that combines object annotations with large language models to break down tasks, determine spatial constraints, and generate precise robotic movement code. Our framework offers a comprehensive benchmark with both simulated and real-world data, enabling standardized evaluation and better alignment between simulated training and real-world performance. We validated our approach using the open-source COBOT Magic Robot platform. Policies pre-trained on RoboTwin-generated data and fine-tuned with limited real-world samples demonstrate significant potential for enhancing dual-arm robotic manipulation systems by improving success rates by over 70% for single-arm tasks and over 40% for dual-arm tasks compared to models trained solely on real-world data.


Privacy-Preserving Edge Speech Understanding with Tiny Foundation Models

arXiv.org Artificial Intelligence

Robust speech recognition systems rely on cloud service providers for inference. Such systems need to ensure that an untrustworthy provider cannot deduce the sensitive content in speech. Speech content can be sanitized, but sanitization must not compromise transcription accuracy. Recognizing the underutilized capabilities of tiny speech foundation models (FMs), we propose, for the first time, a novel use: enhancing speech privacy on resource-constrained devices. We introduce XYZ, an edge/cloud privacy-preserving speech inference engine that can filter sensitive entities without compromising transcript accuracy. We use a timestamp-based on-device masking approach that relies on a token-to-entity prediction model to filter sensitive entities. Our choice of mask strategically conceals parts of the input and hides sensitive data. The masked input is sent to a trusted cloud service or to a local hub to generate the masked output. The effectiveness of XYZ hinges on how well the entity time segments are masked. Our recovery is a confidence-score-based approach that chooses the best prediction between the cloud and on-device models. We implement XYZ on a 64-bit Raspberry Pi 4B. Experiments show that our solution leads to robust speech recognition without forsaking privacy. With less than 100 MB of memory, XYZ achieves state-of-the-art (SOTA) speech transcription performance while filtering about 83% of private entities directly on-device. XYZ is 16x smaller in memory and 17x more compute efficient than prior privacy-preserving speech frameworks, and it achieves a relative reduction in word error rate (WER) of 38.8-77.5% compared to existing offline transcription services.
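
The masking and recovery steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of zeroing samples as the mask, and the per-segment (text, confidence) hypothesis format are all assumptions made for illustration.

```python
def mask_segments(samples, sample_rate, segments):
    """Return a copy of `samples` with every (start_s, end_s) span silenced.

    Zeroing the samples is an assumed masking choice; the abstract does not
    say which mask the authors use.
    """
    masked = list(samples)
    for start_s, end_s in segments:
        lo = max(0, int(start_s * sample_rate))
        hi = min(len(masked), int(end_s * sample_rate))
        for i in range(lo, hi):
            masked[i] = 0.0
    return masked


def recover(device_hyps, cloud_hyps):
    """Pick, per segment, the hypothesis with the higher confidence score.

    Each hypothesis is an assumed (text, confidence) pair. Ties go to the
    on-device result, which is also an assumption.
    """
    chosen = []
    for (dev_text, dev_conf), (cloud_text, cloud_conf) in zip(device_hyps, cloud_hyps):
        chosen.append(dev_text if dev_conf >= cloud_conf else cloud_text)
    return " ".join(chosen)


# Hypothetical usage: silence 0.25 s to 0.5 s of a one-second clip sampled at 8 Hz.
audio = [1.0] * 8
masked_audio = mask_segments(audio, sample_rate=8, segments=[(0.25, 0.5)])

transcript = recover(
    device_hyps=[("please call", 0.55), ("alice", 0.80)],
    cloud_hyps=[("please call", 0.95), ("[masked]", 0.10)],
)
```

In this sketch the entity segment never reaches the cloud unmasked, so the on-device hypothesis is the only one that can carry the sensitive word, and the confidence comparison selects it for that segment.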


UNet: A Generic and Reliable Multi-UAV Communication and Networking Architecture for Heterogeneous Applications

arXiv.org Artificial Intelligence

The rapid growth of UAV applications necessitates a robust communication and networking architecture capable of addressing the diverse requirements of various applications concurrently, rather than relying on application-specific solutions. This paper proposes a generic and reliable multi-UAV communication and networking architecture designed to support the varying demands of heterogeneous applications, including short-range and long-range communication, star and mesh topologies, different data rates, and multiple wireless standards. Our architecture accommodates both ad hoc and infrastructure networks, ensuring seamless connectivity throughout the network. Additionally, we present the design of a multi-protocol UAV gateway that enables interoperability among various communication protocols. Furthermore, we introduce a data processing and service layer framework with a graphical user interface of a ground control station that facilitates remote control and monitoring from any location at any time. We implemented the proposed architecture in practice and evaluated its performance using different metrics, demonstrating its effectiveness.


On Triangular versus Edge Representations -- Towards Scalable Modeling of Networks

Neural Information Processing Systems

In this paper, we argue for representing networks as a bag of triangular motifs, particularly for important network problems that current model-based approaches handle poorly due to computational bottlenecks incurred by using edge representations.
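
As a rough illustration of what a "bag of triangular motifs" could look like, the sketch below converts an edge list into closed triangles and open two-edge wedges. Treating both as triangular motifs is an assumption, as are the function name and the output format; the abstract does not define the representation.

```python
from itertools import combinations


def triangular_motifs(edges):
    """Split an undirected edge list into closed triangles and open wedges.

    Closed triangles are returned as sorted node triples. Open wedges are
    returned as (center, a, b) with a < b, where a and b are not connected.
    """
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    closed, wedges = set(), set()
    for center, neighbors in adjacency.items():
        for a, b in combinations(sorted(neighbors), 2):
            if b in adjacency[a]:
                closed.add(tuple(sorted((center, a, b))))
            else:
                wedges.add((center, a, b))
    return closed, wedges


# Hypothetical example: one triangle (1, 2, 3) plus a pendant node 4.
closed, wedges = triangular_motifs([(1, 2), (2, 3), (1, 3), (3, 4)])
```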


Leveraging cache to enable SLU on tiny devices

arXiv.org Artificial Intelligence

This paper addresses spoken language understanding (SLU) on microcontroller-like embedded devices, integrating on-device execution with cloud offloading in a novel fashion. We exploit temporal locality in a device's speech inputs and accordingly reuse recent SLU inferences. Our idea is simple: let the device match new inputs against cached results, and only offload unmatched inputs to the cloud for full inference. Realizing this idea, however, is non-trivial: the device needs to compare acoustic features in a robust, low-cost way. To this end, we present XYZ, a speech cache for tiny devices. It matches speech inputs at two levels of representation: first as clustered sequences of raw sound units, then as sequences of phonemes. Working in tandem, the two representations offer complementary cost/accuracy tradeoffs. To further boost accuracy, our cache learns: using the inputs that were mismatched and then offloaded, it continuously fine-tunes the device's feature extractors (with the assistance of the cloud). We implement XYZ on an off-the-shelf STM32 microcontroller. The resulting implementation has a small memory footprint of 2 MB. Evaluated on challenging speech benchmarks, our system resolves 45% to 90% of inputs on device, reducing the average latency by up to 80% compared to offloading to popular cloud speech services. The benefit is pronounced even in adversarial settings: noisy environments, a cold cache, or one device shared by a number of users.
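
The match-or-offload idea in this abstract can be sketched as a small cache keyed by phoneme sequences. This is a simplified illustration, not the system described: it uses only the phoneme level, compares sequences with plain edit distance, and the class name, threshold, and `offload` callback are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


class PhonemeCache:
    """Caches SLU results by phoneme sequence (hypothetical structure)."""

    def __init__(self, max_normalized_distance=0.2):
        self.entries = []  # list of (phoneme tuple, cached result)
        self.max_normalized_distance = max_normalized_distance

    def lookup(self, phonemes):
        best_result, best_score = None, None
        for cached, result in self.entries:
            longest = max(len(phonemes), len(cached), 1)
            score = edit_distance(phonemes, cached) / longest
            if best_score is None or score < best_score:
                best_result, best_score = result, score
        if best_score is not None and best_score <= self.max_normalized_distance:
            return best_result
        return None

    def insert(self, phonemes, result):
        self.entries.append((tuple(phonemes), result))


def understand(phonemes, cache, offload):
    """Resolve on device when the cache matches; otherwise offload and cache."""
    hit = cache.lookup(phonemes)
    if hit is not None:
        return hit, "device"
    result = offload(phonemes)
    cache.insert(phonemes, result)
    return result, "cloud"


# Hypothetical usage with a stand-in for the cloud SLU service.
cloud_calls = []

def fake_cloud(phonemes):
    cloud_calls.append(tuple(phonemes))
    return "intent:" + "-".join(phonemes)

cache = PhonemeCache()
first = understand(["l", "ay", "t", "s", "aa", "n"], cache, fake_cloud)
second = understand(["l", "ay", "t", "z", "aa", "n"], cache, fake_cloud)
third = understand(["s", "t", "aa", "p"], cache, fake_cloud)
```

The second input differs from the cached one by a single phoneme, so it falls under the assumed threshold and is answered on device; the third is too different and goes to the cloud.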


Simulating Opinion Dynamics with Networks of LLM-based Agents

arXiv.org Artificial Intelligence

Accurately simulating human opinion dynamics is crucial for understanding a variety of societal phenomena, including polarization and the spread of misinformation. However, the agent-based models (ABMs) commonly used for such simulations lack fidelity to human behavior. We propose a new approach to simulating opinion dynamics based on populations of Large Language Models (LLMs). Our findings reveal a strong inherent bias in LLM agents towards accurate information, leading to consensus in line with scientific reality. However, this bias limits the simulation of individuals with resistant views on issues like climate change. After inducing confirmation bias through prompt engineering, we observed opinion fragmentation in line with existing agent-based research. These insights highlight the promise and limitations of LLM agents in this domain and suggest a path forward: refining LLMs with real-world discourse to better simulate the evolution of human beliefs.


Twitch's AI-Generated, 'Seinfeld' Like Show Gets Weird - usalive.xyz

#artificialintelligence

Artificial intelligence's take on a classic sitcom is more than a load of "yada yada yada." "Nothing, Forever" is an AI-generated, "Seinfeld" like show on streaming platform Twitch that's set to never stop broadcasting. The 24/7 show, which has been streaming since December, has grown in popularity over the past week as thousands have tuned in to watch the adventures of animated characters Larry Feinberg, Fred Kastopolous, Yvonne Torres and Zoltan Kalker. As of Saturday morning, "Nothing, Forever" had over 131,000 Twitch followers. The show plays out in a similar fashion to the TV classic: It includes stand-up sequences, laugh tracks and conversations among AI friends similar to Jerry, Elaine, George and Kramer inside of an apartment.


Negative Shannon Information Hides Networks

arXiv.org Artificial Intelligence

Shannon information was defined to characterize the uncertainty of classical probability distributions. As an uncertainty measure it is generally believed to be positive. This holds for any information quantity from two random variables because of the polymatroidal axioms. However, it is unknown why there is negative information for more than two random variables on finite dimensional spaces. We first show that negative tripartite Shannon mutual information implies specific Bayesian network representations of its joint distribution. We then show that the negative Shannon information is obtained from general tripartite Bayesian networks with quantum realizations. This provides a device-independent witness of negative Shannon information. We finally extend the result to general networks. The present result offers new insights into network compatibility from non-Shannon information inequalities.
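
A standard textbook case of negative tripartite mutual information, not taken from the paper itself, is the XOR distribution: X and Y are independent fair bits and Z = X XOR Y. The sketch below computes I(X;Y;Z) = H(X) + H(Y) + H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z) for that distribution; the helper names are illustrative.

```python
from collections import defaultdict
from itertools import product
from math import log2


def entropy(joint, axes):
    """Shannon entropy (in bits) of the marginal over the given axes."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[tuple(outcome[a] for a in axes)] += p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)


def tripartite_information(joint):
    """I(X;Y;Z) via inclusion-exclusion over marginal entropies."""
    def h(*axes):
        return entropy(joint, axes)

    return h(0) + h(1) + h(2) - h(0, 1) - h(0, 2) - h(1, 2) + h(0, 1, 2)


# X and Y are independent fair bits, Z = X XOR Y.
xor_joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}
xor_info = tripartite_information(xor_joint)

# Three fully independent fair bits, for comparison.
independent_joint = {bits: 0.125 for bits in product((0, 1), repeat=3)}
independent_info = tripartite_information(independent_joint)
```

For the XOR distribution the single-variable entropies are 1 bit each and every pairwise and triple entropy is 2 bits, so the sum is 3 - 6 + 2 = -1 bit, while the independent case gives 0.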


The War between AI and the Blockchain

#artificialintelligence

Deepfakes are developing fast, and although faking video and audio is not new, experts agree that we can't win this fight. Machines will be able to create digital media that a normal human consumer cannot recognize as fake. We have written about this threat because it spells disaster. We expect chaos to result in any media or public relations context, motivated by malice, the desire to have fun, or the desire to exploit. Fake news is already a problem, leading to lynchings in some countries, based only on accusations.


The Woman Worked as a Babysitter: On Biases in Language Generation

arXiv.org Artificial Intelligence

We present a systematic study of biases in natural language generation (NLG) by analyzing text generated from prompts that contain mentions of different demographic groups. In this work, we introduce the notion of the regard towards a demographic, use the varying levels of regard towards different demographics as a defining metric for bias in NLG, and analyze the extent to which sentiment scores are a relevant proxy metric for regard. To this end, we collect strategically-generated text from language models and manually annotate the text with both sentiment and regard scores. Additionally, we build an automatic regard classifier through transfer learning, so that we can analyze biases in unseen text. Together, these methods reveal the extent of the biased nature of language model generations. Our analysis provides a study of biases in NLG, bias metrics and correlated human judgments, and empirical evidence on the usefulness of our annotated dataset.