 elsa


Learning Spatially-Aware Language and Audio Embeddings

Neural Information Processing Systems

Humans can picture a sound scene given an imprecise natural language description. For example, it is easy to imagine an acoustic environment given a phrase like "the lion roar came from right behind me!". For a machine to have the same degree of comprehension, the machine must know what a lion is (semantic attribute), what the concept of "behind" is (spatial attribute) and how these pieces of linguistic information align with the semantic and spatial attributes of the sound (what a roar sounds like when it's coming from behind). State-of-the-art audio foundation models, such as CLAP, which learn to map between audio scenes and natural textual descriptions, are trained on non-spatial audio and text pairs, and hence lack spatial awareness. In contrast, sound event localization and detection models are limited to recognizing sounds from a fixed number of classes, and they localize the source to an absolute position (e.g., 0.2m) rather than a position described using natural language (e.g., "next to me").


Drink Whole Milk, Eat Red Meat, and Use ChatGPT

The Atlantic - Technology

Robert F. Kennedy Jr. is an AI guy. Last week, during a stop in Nashville on his Take Back Your Health tour, the Health and Human Services secretary brought up the technology between condemning ultra-processed foods and urging Americans to eat protein. "My agency is now leading the federal government in driving AI into all of our activities," he declared. An army of bots, Kennedy said, will transform medicine, eliminate fraud, and put a virtual doctor in everyone's pocket. RFK Jr. has talked up the promise of infusing his department with AI for months.



Finger-prick diabetes blood test could be early warning for children

BBC News

All UK children could be offered screening for type 1 diabetes using a simple finger-prick blood test, say researchers who have been running a large study. Currently, many young people go undiagnosed and risk developing a life-threatening complication called diabetic ketoacidosis that needs urgent hospital treatment. Identifying diabetes earlier could help avoid this and mean treatments to control problematic blood sugar levels can be given sooner. Some 17,000 children aged three to 13 have already been checked as part of the ELSA (Early Surveillance for Autoimmune diabetes) study, funded by diabetes charities. Imogen, who is 12 and from the West Midlands, is one of those found to have diabetes thanks to the screening.


The Disney-OpenAI Deal Redefines the AI Copyright War

WIRED

Disney is hedging against the future. OpenAI is clearing a path for Sora. And together they've made a blueprint for how AI and Hollywood can move forward. On Thursday, Disney and OpenAI announced a deal that might have seemed unthinkable not so long ago. Starting next year, OpenAI will be able to use Disney characters like Mickey Mouse, Ariel, and Yoda in its Sora video-generation model.



'Wall-E With a Gun': Midjourney Generates Videos of Disney Characters Amid Massive Copyright Lawsuit

WIRED

It's been a busy month for Midjourney. This week, the generative AI startup released its sophisticated new video tool, V1, which lets users make short animated clips from images they generate or upload. The current version of Midjourney's AI video tool requires an image as a starting point; generating videos using text-only prompts is not supported. Midjourney did not immediately respond to requests for comment. Disney and Universal reiterated statements made by their executives about the lawsuit, including Disney's legal head Horacio Gutierrez alleging that Midjourney's output amounts to "piracy."


An extension of linear self-attention for in-context learning

arXiv.org Artificial Intelligence

In-context learning is a remarkable property of transformers and has been the focus of recent research. An attention mechanism is a key component in transformers, in which an attention matrix encodes relationships between words in a sentence and is used as weights for those words. This mechanism is effective for capturing language representations. However, it is questionable whether naive self-attention is suitable for in-context learning in general tasks, since the computation implemented by self-attention is somewhat restrictive in terms of matrix multiplication. In fact, we may need appropriate input form designs when considering heuristic implementations of computational algorithms. In this paper, in the case of linear self-attention, we extend it by introducing a bias matrix in addition to a weight matrix for an input. Despite the simple extension, the extended linear self-attention can output any constant matrix, the input matrix, and multiplications of two or three matrices in the input. Note that the second property implies that it can act as a skip connection. Therefore, flexible matrix manipulations can be implemented by connecting the extended linear self-attention components. As an example of implementation using the extended linear self-attention, we show a heuristic construction of a batch-type gradient descent of ridge regression under a reasonable input form.
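The abstract above gives no formula, so the following is only an illustrative sketch of one possible reading: each of the value, key, and query maps is made affine (weight matrix plus bias matrix), and the softmax is dropped. The function name, the shapes, and the choice of the selector matrix `E` are assumptions for illustration and are not necessarily the paper's parameterization. Under these assumptions, the sketch reproduces three of the behaviors the abstract mentions: a constant output, a pass-through of the input (a skip connection), and a product of three matrices.

```python
import numpy as np

def extended_linear_attention(Z, Wv, Bv, Wk, Bk, Wq, Bq):
    """Linear self-attention (no softmax) with affine value/key/query maps.

    Hypothetical form: V = Wv Z + Bv, K = Wk Z + Bk, Q = Wq Z + Bq,
    output = V (K^T Q). Z has shape (d, n).
    """
    V = Wv @ Z + Bv
    K = Wk @ Z + Bk
    Q = Wq @ Z + Bq
    return V @ (K.T @ Q)

d, n = 4, 3                      # assumed sizes, with d >= n
rng = np.random.default_rng(0)
Z = rng.normal(size=(d, n))      # input matrix
C = rng.normal(size=(d, n))      # arbitrary constant target

I_d = np.eye(d)
O_d = np.zeros((d, d))
O_dn = np.zeros((d, n))
E = np.eye(d, n)                 # E.T @ E is the n x n identity when d >= n

# Constant output: zero weights, biases chosen so that K^T Q = I and V = C.
const_out = extended_linear_attention(Z, O_d, C, O_d, E, O_d, E)

# Skip connection: V = Z while K^T Q is still the identity.
skip_out = extended_linear_attention(Z, I_d, O_dn, O_d, E, O_d, E)

# Product of three matrices from the input: Z Z^T Z (all biases zero).
triple_out = extended_linear_attention(Z, I_d, O_dn, I_d, O_dn, I_d, O_dn)
```

With all biases set to zero the function reduces to plain linear self-attention, which is why the first two outputs are only reachable once the bias matrices are available.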


ELSA: Exploiting Layer-wise N:M Sparsity for Vision Transformer Acceleration

arXiv.org Artificial Intelligence

$N{:}M$ sparsity is an emerging model compression method supported by more and more accelerators to speed up sparse matrix multiplication in deep neural networks. Most existing $N{:}M$ sparsity methods compress neural networks with a uniform setting for all layers in a network or heuristically determine the layer-wise configuration by considering the number of parameters in each layer. However, very few methods have been designed for obtaining a layer-wise customized $N{:}M$ sparse configuration for vision transformers (ViTs), which usually consist of transformer blocks involving the same number of parameters. In this work, to address the challenge of selecting a suitable sparse configuration for ViTs on $N{:}M$ sparsity-supporting accelerators, we propose ELSA, Exploiting Layer-wise $N{:}M$ Sparsity for ViTs. Considering not only all $N{:}M$ sparsity levels supported by a given accelerator but also the expected throughput improvement, our methodology can reap the benefits of accelerators supporting mixed sparsity by trading off negligible accuracy loss against reductions in both memory usage and inference time for ViT models. For instance, our approach achieves a noteworthy 2.9$\times$ reduction in FLOPs for both Swin-B and DeiT-B with only a marginal degradation of accuracy on ImageNet. Our code will be released upon paper acceptance.
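For readers unfamiliar with the pattern, here is a minimal sketch of magnitude-based $N{:}M$ pruning, assuming the common convention that at most $N$ of every $M$ consecutive weights along the last axis stay nonzero (as in 2:4 sparsity). The helper name `nm_prune` is hypothetical, and this is not ELSA's layer-wise selection procedure, only the basic pattern that such methods configure per layer.

```python
import numpy as np

def nm_prune(weights, n, m):
    """Keep the n largest-magnitude entries in each group of m consecutive
    weights along the last axis and zero the rest (assumed N:M convention)."""
    w = np.asarray(weights, dtype=float)
    if w.shape[-1] % m != 0:
        raise ValueError("last dimension must be divisible by m")
    groups = w.reshape(-1, m)
    # Positions of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return np.where(mask, groups, 0.0).reshape(w.shape)

w = np.array([[0.1, -0.9, 0.5, 0.05, 2.0, -0.3, 0.0, 1.0]])
pruned = nm_prune(w, n=2, m=4)
# Each group of 4 keeps its 2 largest-magnitude weights:
# [[0.0, -0.9, 0.5, 0.0, 2.0, 0.0, 0.0, 1.0]]
```

A layer-wise scheme would call such a routine with a different `(n, m)` pair per layer, chosen from the levels the target accelerator supports.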