 Large Language Model


Why Elon Musk and Sam Altman are fighting over OpenAI

BBC News

Musk, who co-founded the company that created ChatGPT with Altman, wants more than $130 billion in damages in a lawsuit that could shake up the artificial intelligence landscape. The BBC's Lily Jamali explains why the two tech giants are facing off in court.


The Download: DeepSeek's latest AI breakthrough, and the race to build world models

MIT Technology Review

Plus: China has blocked Meta's $2 billion acquisition of AI startup Manus. On Friday, Chinese AI firm DeepSeek released a preview of V4, its long-awaited new flagship model. Notably, the model can process much longer prompts than its last generation, thanks to a new design that handles large amounts of text more efficiently. While the model remains open source, its performance matches leading closed-source rivals from Anthropic, OpenAI, and Google. Here are three ways V4 could shake up AI. AI systems have already gained impressive mastery over the digital world, but the physical world remains humanity's domain.


An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching

Neural Information Processing Systems

Multimodal large language models have fueled progress in image captioning. These models, fine-tuned on vast image datasets, exhibit a deep understanding of semantic concepts. In this work, we show that this ability can be re-purposed for audio captioning, where the joint image-language decoder can be leveraged to describe auditory content associated with image sequences within videos featuring audiovisual content. This can be achieved via multimodal alignment. Yet, this multimodal alignment task is non-trivial due to the inherent disparity between audible and visible elements in real-world videos. Moreover, multimodal representation learning often relies on contrastive learning, facing the challenge of the so-called modality gap, which hinders smooth integration between modalities. In this work, we introduce a novel methodology for bridging the audiovisual modality gap by matching the distributions of tokens produced by an audio backbone and those of an image captioner. Our approach aligns the audio token distribution with that of the image tokens, enabling the model to perform zero-shot audio captioning in an unsupervised fashion. This alignment allows for the use of either audio or audiovisual input by combining or substituting the image encoder with the aligned audio encoder. Our method achieves significantly improved performance in zero-shot audio captioning compared to existing approaches.


Elon Musk and Sam Altman face off in court over OpenAI's founding mission

The Guardian

The two Silicon Valley tycoons are headed to court. Musk's lawsuit accuses Altman of fraud, while OpenAI says that Musk is 'motivated by jealousy'. A lawsuit between two of Silicon Valley's biggest tycoons goes to trial Monday in California, the culmination of a years-long bitter feud. Elon Musk has accused Sam Altman of betraying the founding agreement of the non-profit they started together, OpenAI, by changing it to a for-profit enterprise. Musk accuses Altman, OpenAI, its president Greg Brockman, and its major partner Microsoft of breach of contract and unjust enrichment in the lawsuit.



Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions

arXiv.org Machine Learning

As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines and systems approaching general intelligence, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces $N{=}1000$ samples within one response, and Independent Requests, comprising $N{=}1000$ stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 7% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Beyond this asymmetry, we reveal that sampling fidelity degrades monotonically with distributional complexity and aggravates as the sampling horizon $N$ increases. Finally, we demonstrate how the propagation of these failures into downstream real-world application tasks introduces systematic biases: models fail to enforce uniform answer-position constraints in Multiple Choice Question generation and systematically violate demographic targets in attribute-constrained text-to-image prompt synthesis. These findings indicate that current LLMs lack a functional internal sampler, necessitating external tools for applications requiring statistical guarantees.
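To make the idea of a "statistically powered" pass/fail check concrete, the sketch below tests whether a batch of die rolls is consistent with a uniform distribution. It is only an illustration: the abstract does not say which tests the authors used, so the chi-square goodness-of-fit test, the 0.05 significance level, and the function names `chi_square_uniform` and `passes_uniform_die` are all assumptions made here.

```python
from collections import Counter

# Chi-square critical value for 5 degrees of freedom at alpha = 0.05.
# (Assumed significance level; the abstract does not state one.)
CHI2_CRIT_DF5 = 11.070


def chi_square_uniform(samples, faces=6):
    """Chi-square statistic of integer samples against a fair die with `faces` sides."""
    if not samples:
        raise ValueError("need at least one sample")
    if any(s < 1 or s > faces for s in samples):
        raise ValueError("sample outside the range 1..faces")
    expected = len(samples) / faces
    counts = Counter(samples)
    return sum(
        (counts.get(face, 0) - expected) ** 2 / expected
        for face in range(1, faces + 1)
    )


def passes_uniform_die(samples):
    """True if the samples are not rejected as a fair six-sided die at alpha = 0.05."""
    return chi_square_uniform(samples) <= CHI2_CRIT_DF5
```

In the same spirit as the abstract's closing recommendation, an application that needs statistical guarantees could draw its numbers from an external sampler such as Python's `random` module and use a check like this only to audit model-generated batches.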


Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering

Neural Information Processing Systems

Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework to tackle KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations in RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate and (2) relevance scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes these limitations by obtaining image representations that complement those from the image-to-text transforms using a vision model aligned with an existing text-based retriever through a simple alignment network. FLMR also encodes images and questions using multi-dimensional embeddings to capture finer-grained relevance between queries and documents. FLMR significantly improves the original RA-VQA retriever's PRRecall@5 by approximately 8%. Finally, we equipped RA-VQA with two state-of-the-art large multi-modal/language models to achieve 61% VQA score in the OK-VQA dataset.
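The contrast between scoring with one embedding per item and scoring with a matrix of token-level embeddings can be sketched as follows. The abstract does not give FLMR's scoring formula, so this assumes the common late-interaction form in which each query token is matched to its most similar document token and the maxima are summed; the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def single_vector_score(query_vec, doc_vec):
    """Relevance from one embedding per query and per document (DPR-style)."""
    return dot(query_vec, doc_vec)


def late_interaction_score(query_tokens, doc_tokens):
    """Assumed late-interaction relevance over token-level embeddings.

    For every query token embedding, take the highest inner product with
    any document token embedding, then sum those maxima over the query.
    """
    return sum(
        max(dot(q, d) for d in doc_tokens)
        for q in query_tokens
    )
```

Because every query token picks its own best match, a document that covers several distinct aspects of the query can score highly even if no single pooled vector would capture all of them.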


MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models

Neural Information Processing Systems

As large language models (LLMs) develop increasingly sophisticated capabilities and find applications in medical settings, it becomes important to assess their medical safety due to their far-reaching implications for personal and public health, patient safety, and human rights. However, there is little to no understanding of the notion of medical safety in the context of LLMs, let alone how to evaluate and improve it. To address this gap, we first define the notion of medical safety in LLMs based on the Principles of Medical Ethics set forth by the American Medical Association. We then leverage this understanding to introduce MedSafetyBench, the first benchmark dataset designed to measure the medical safety of LLMs. We demonstrate the utility of MedSafetyBench by using it to evaluate and improve the medical safety of LLMs. Our results show that publicly-available medical LLMs do not meet standards of medical safety and that fine-tuning them using MedSafetyBench improves their medical safety while preserving their medical performance. By introducing this new benchmark dataset, our work enables a systematic study of the state of medical safety in LLMs and motivates future work in this area, paving the way to mitigate the safety risks of LLMs in medicine.


Musk and Altman's bitter feud over OpenAI to be laid bare in court

The Guardian

The tech titans are slated to duke it out in court. Tesla chief believes Altman broke company's founding agreement - and legal battle promises to be explosive. The bitter rivalry between two of the tech world's most powerful men arrives in court this week, as Elon Musk's lawsuit against Sam Altman and OpenAI heads to trial in Oakland, California. The case is set to feature some of the biggest names in Silicon Valley, and its outcome could affect the course of the AI boom. Musk's suit, filed in 2024, focuses on the formative years of OpenAI when he, Altman and others co-founded the artificial intelligence company as a nonprofit with a grand purpose.


[Figure: Distribution of Mentioned IDs; axis: # of IDs]

Neural Information Processing Systems

For each image's list of candidate objects, we heuristically downsample to a set of "most interesting" regions by: 1) selecting the at-most k = 4 largest/most central people; 2) keeping the most central/large objects; 3) over-sampling rarer objects according to prior frequency of detection in the LVIS vocabulary; 4) limiting the number of objects of a single type per-image; and 5) downsampling overlapping region proposals to encourage broader coverage of the pixel area of the image.