trove


Trove: A Flexible Toolkit for Dense Retrieval

Esfandiarpoor, Reza, Zuo, Max, Bach, Stephen H.

arXiv.org Artificial Intelligence

We introduce Trove, an easy-to-use open-source retrieval toolkit that simplifies research experiments without sacrificing flexibility or speed. For the first time, we introduce efficient data management features that load and process (filter, select, transform, and combine) retrieval datasets on the fly, with just a few lines of code. This gives users the flexibility to easily experiment with different dataset configurations without the need to compute and store multiple copies of large datasets. Trove is highly customizable: in addition to many built-in options, it allows users to freely modify existing components or replace them entirely with user-defined objects. It also provides a low-code and unified pipeline for evaluation and hard negative mining, which supports multi-node execution without any code changes. Trove's data management features reduce memory consumption by a factor of 2.6. Moreover, Trove's easy-to-use inference pipeline incurs no overhead, and inference times decrease linearly with the number of available nodes. Most importantly, we demonstrate how Trove simplifies retrieval experiments and allows for arbitrary customizations, thus facilitating exploratory research.
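
Trove's actual API is not reproduced here. As a rough illustration of the "process on the fly" pattern the abstract describes, the sketch below uses plain Python generators to filter, transform, and combine two toy corpora lazily, so the combined, filtered view is never materialized as a separate copy. All dataset names and fields are hypothetical stand-ins, not Trove objects.

```python
# Illustrative sketch only: NOT Trove's API. It shows the general on-the-fly
# pattern (filter, transform, combine) using lazy Python generators, so no
# merged or filtered copy of the data is ever stored in memory or on disk.
from itertools import chain, islice

# Two toy corpora standing in for large retrieval datasets (hypothetical data).
corpus_a = ({"id": f"a{i}", "text": f"Passage {i} about dense retrieval"} for i in range(1000))
corpus_b = ({"id": f"b{i}", "text": f"Doc {i} on hard negative mining"} for i in range(1000))

def keep(records, predicate):
    """Lazily filter records without copying the dataset."""
    return (r for r in records if predicate(r))

def transform(records):
    """Lazily normalize each record (here, lowercase the text)."""
    for r in records:
        yield {**r, "text": r["text"].lower()}

# Combine, filter, and transform without writing a merged dataset anywhere.
combined = chain(corpus_a, corpus_b)
view = transform(keep(combined, lambda r: "retrieval" in r["text"].lower()))

for record in islice(view, 3):  # only the items consumed are ever processed
    print(record)
```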


Biotech firm aims to create 'ChatGPT of biology' – will it work?

New Scientist

A British biotech firm called Basecamp Research has spent the past few years collecting troves of genetic data from microbes living in extreme environments around the world, identifying more than a million species and nearly 10 billion genes new to science. It claims that this massive database of the planet's biodiversity will help train a "ChatGPT of biology" that will answer questions about life on Earth – but there's no guarantee this will work. A hydrogen fuel revolution is coming – here's why we might not want it Jörg Overmann at the Leibniz Institute DSMZ in Germany, which houses one of the world's most diverse collections of microbial cultures, says increasing known genetic sequences is valuable, but may not result in useful findings for things like drug discovery or chemistry without more information about the organisms from which they were collected. "I'm not convinced that in the end the understanding of really novel functions will be accelerated by this brute-force increase in the sequence space," he says. Recent years have seen researchers develop a number of machine learning models trained to identify patterns and predict relationships amid vast amounts of biological data.


TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification

Zhu, Junnan, Xiao, Min, Wang, Yining, Zhai, Feifei, Zhou, Yu, Zong, Chengqing

arXiv.org Artificial Intelligence

LLMs have achieved remarkable fluency and coherence in text generation, yet their widespread adoption has raised concerns about content reliability and accountability. In high-stakes domains such as healthcare, law, and news, it is crucial to understand where and how the content is created. To address this, we introduce the Text pROVEnance (TROVE) challenge, designed to trace each sentence of a target text back to specific source sentences within potentially lengthy or multi-document inputs. Beyond identifying sources, TROVE annotates the fine-grained relationships (quotation, compression, inference, and others), providing a deep understanding of how each target sentence is formed. To benchmark TROVE, we construct our dataset by leveraging three public datasets covering 11 diverse scenarios (e.g., QA and summarization) in English and Chinese, spanning source texts of varying lengths (0-5k, 5-10k, 10k+), emphasizing the multi-document and long-document settings essential for provenance. To ensure high-quality data, we employ a three-stage annotation process: sentence retrieval, GPT provenance, and human provenance. We evaluate 11 LLMs under direct prompting and retrieval-augmented paradigms, revealing that retrieval is essential for robust performance, larger models perform better in complex relationship classification, and closed-source models often lead, yet open-source models show significant promise, particularly with retrieval augmentation.
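
The paper's three-stage pipeline is not reproduced here. As a minimal baseline sketch of the source-tracing half of the task, the snippet below maps each target sentence to its most similar source sentence using TF-IDF cosine similarity with scikit-learn; the sentences are toy examples, and real provenance (plus relationship classification) would require much stronger models.

```python
# Minimal baseline sketch for source-sentence tracing (not the TROVE pipeline):
# score each target sentence against all source sentences with TF-IDF cosine
# similarity and report the best-matching source index.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

source_sentences = [
    "The firm reported record revenue in the third quarter.",
    "Analysts attributed the growth to strong cloud demand.",
    "The CEO announced a new share buyback program.",
]
target_sentences = [
    "Revenue hit a record in the third quarter.",       # compression of source 0
    "Growth was driven by strong demand for cloud.",    # paraphrase of source 1
]

vectorizer = TfidfVectorizer().fit(source_sentences + target_sentences)
src_vecs = vectorizer.transform(source_sentences)
tgt_vecs = vectorizer.transform(target_sentences)

scores = cosine_similarity(tgt_vecs, src_vecs)  # shape: (targets, sources)
for i, row in enumerate(scores):
    best = row.argmax()
    print(f"target {i} -> source {best} (score={row[best]:.2f})")
```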


Library Learning Doesn't: The Curious Case of the Single-Use "Library"

Berlot-Attwell, Ian, Rudzicz, Frank, Si, Xujie

arXiv.org Artificial Intelligence

Advances in Large Language Models (LLMs) have spurred a wave of LLM library learning systems for mathematical reasoning. These systems aim to learn a reusable library of tools, such as formal Isabelle lemmas or Python programs, that are tailored to a family of tasks. Many of these systems are inspired by the human practice of structuring knowledge into reusable and extendable concepts, but do current methods actually learn reusable libraries of tools? We study two library learning systems for mathematics, both of which reported increased accuracy: LEGO-Prover and TroVE. We find that function reuse is extremely infrequent on miniF2F and MATH. Our follow-up ablation experiments suggest that, rather than reuse, self-correction and self-consistency are the primary drivers of the observed performance gains. Our code and data are available at https://github.com/ikb-a/curious-case
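
The paper's exact reuse metric is not shown here. As a rough sketch of how one might check cross-task reuse of learned library functions, the snippet below parses generated solutions with Python's `ast` module and tallies how many distinct tasks actually call each library function; the library names and solution strings are toy examples.

```python
# Rough sketch of a reuse check (not the paper's metric): for each library
# function, count how many distinct task solutions actually call it.
import ast
from collections import defaultdict

library_functions = {"gcd_list", "is_prime"}  # hypothetical learned "library"

solutions = {  # task id -> generated program (toy examples)
    "task1": "print(gcd_list([4, 8, 12]))",
    "task2": "print(is_prime(7))",
    "task3": "print(sum(range(10)))",  # solves the task without the library
}

calls_by_function = defaultdict(set)
for task_id, code in solutions.items():
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in library_functions:
                calls_by_function[node.func.id].add(task_id)

for fn in sorted(library_functions):
    n = len(calls_by_function[fn])
    print(f"{fn}: called in {n} task(s)")  # genuine reuse means n > 1
```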


Faux ScarJo and the Descent of the A.I. Vultures

The New Yorker

On May 13th, during a live event, the artificial-intelligence company OpenAI unveiled the next generation of its technology, GPT-4o, the successor to GPT-4. When OpenAI first released its product to the public in late 2022, as the text-based tool ChatGPT, it nearly single-handedly ushered in the A.I. era. The latest version is far more powerful still. The "o" in the name stands for "omni"; the model can communicate seamlessly across various forms of media at once, including text, audio, and video, receiving prompts in one medium and responding in another. It can maintain a memory of everything you tell it.


A Vast New Data Set Could Supercharge the AI Hunt for Crypto Money Laundering

WIRED

One task where AI tools have proven to be particularly superhuman is analyzing vast troves of data to find patterns that humans can't see, or automating and accelerating the discovery of those we can. That makes Bitcoin's blockchain, a public record of nearly a billion transactions between pseudonymous addresses, the perfect sort of puzzle for AI to solve. Now, a new study, along with a vast, newly released trove of crypto crime training data, may be about to trigger a leap forward in automated tools' ability to suss out illicit money flows across the Bitcoin economy. On Wednesday, researchers from cryptocurrency tracing firm Elliptic, MIT, and IBM published a paper that lays out a new approach to finding money laundering on Bitcoin's blockchain. Rather than try to identify cryptocurrency wallets or clusters of addresses associated with criminal entities such as dark-web black markets, thieves, or scammers, the researchers collected patterns of bitcoin transactions that led from one of those known bad actors to a cryptocurrency exchange where dirty crypto might be cashed out.
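
The study's actual features and model are not reproduced here. A toy sketch of the underlying idea, enumerating transaction paths that flow from known illicit addresses to exchange deposit addresses, might look like the following; it uses networkx on a hypothetical graph, with all address names invented.

```python
# Toy sketch (not the Elliptic/MIT/IBM method): enumerate short transaction
# paths that lead from known illicit addresses to known exchange addresses.
# Each such path is a candidate laundering pattern one could featurize.
import networkx as nx

g = nx.DiGraph()  # edges are bitcoin transfers between addresses (toy data)
g.add_edges_from([
    ("darkmarket_1", "mixer_a"),
    ("mixer_a", "peel_1"),
    ("peel_1", "exchange_x"),
    ("thief_1", "exchange_y"),
    ("honest_1", "exchange_x"),
])

illicit = {"darkmarket_1", "thief_1"}
exchanges = {"exchange_x", "exchange_y"}

for src in illicit:
    for dst in exchanges:
        for path in nx.all_simple_paths(g, src, dst, cutoff=4):
            print(" -> ".join(path))
```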


Even the CIA is developing an AI chatbot

Engadget

The CIA and other US intelligence agencies will soon have an AI chatbot similar to ChatGPT. The program, revealed on Tuesday by Bloomberg, will train on publicly available data and provide sources alongside its answers so agents can confirm their validity. The aim is for US spies to more easily sift through ever-growing troves of information, although the exact nature of what constitutes "public data" could spark some thorny privacy issues. "We've gone from newspapers and radio, to newspapers and television, to newspapers and cable television, to basic internet, to big data, and it just keeps going," Randy Nixon, the CIA's director of Open Source Enterprise, said in an interview with Bloomberg. "We have to find the needles in the needle field."


Artificial Intelligence and Extended Reality May Pose Security Risks, Expert Warns

#artificialintelligence

Payton predicted that "AI poisoning" would be something to be concerned about in 2021. As Towards Data Science notes, a "poisoning attack happens when the adversary is able to inject bad data into your model's training pool, and hence get it to learn something it shouldn't." In solidly built AI models, Payton noted, "your [AI] coach should be self-learning and contextually aware and almost become a black box to the engineer" once it gets up and running. "My prediction is that, as we're implementing more AI, hackers will hack in and change that algorithm undetected, so that the AI will do things not initially in the design," she said. "AI is going to be cybercriminals' weapon of choice, to help them crack into more accounts, networks and data stores."
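
As a concrete illustration of the quoted definition (not a model of any real attack), the sketch below flips labels on a fraction of a toy training set and compares test accuracy before and after, which generally degrades as the flip rate grows; the dataset and classifier are arbitrary scikit-learn choices.

```python
# Toy illustration of training-data poisoning via label flipping:
# corrupt a fraction of the training labels and compare test accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def fit_and_score(y_train):
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_train)
    return accuracy_score(y_te, model.predict(X_te))

print("clean training labels:", round(fit_and_score(y_tr), 3))

rng = np.random.default_rng(0)
poisoned = y_tr.copy()
idx = rng.choice(len(poisoned), size=int(0.3 * len(poisoned)), replace=False)
poisoned[idx] = 1 - poisoned[idx]  # adversary flips 30% of the labels

print("30% labels flipped: ", round(fit_and_score(poisoned), 3))
```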


Italy slaps facial recognition firm Clearview AI with €20 million fine

Engadget

Italy's data privacy watchdog said it will fine the controversial facial recognition firm Clearview AI for breaching EU law. An investigation by Garante, Italy's data protection authority, found that the company's database of 10 billion images of faces includes those of Italians and residents in Italy. The New York City-based firm is being fined €20 million, and will also have to delete any facial biometrics it holds of Italian nationals. This isn't the first time that the beleaguered facial recognition tech company has faced legal consequences. The UK data protection authority last November fined the company £17 million after finding that its practices, which include collecting selfies of people without their consent from security camera footage or mugshots, violate the nation's data protection laws.


Meta Unveils New AI Supercomputer

WSJ.com: WSJD - Technology

Meta, which announced the news in a blog post Monday, said its research team currently is using the supercomputer to train AI models in natural-language processing and computer vision for research. The aim is to boost capabilities to one day train models with more than a trillion parameters on data sets as large as an exabyte, which is roughly equivalent to 36,000 years of high-quality video. "The experiences we're building for the metaverse require enormous compute power…and RSC will enable new AI models that can learn from trillions of examples, understand hundreds of languages, and more," Meta CEO Mark Zuckerberg said in a statement provided to The Wall Street Journal. By mid-summer, when the AI Research SuperCluster is fully built, it will house some 16,000 GPUs, becoming the fastest AI supercomputer in the world, Meta said.
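
As a quick sanity check of the exabyte comparison (the article does not state a bitrate, so the ~7 Mbit/s figure below is an assumed typical HD streaming rate):

```python
# Back-of-the-envelope check of "1 exabyte ~ 36,000 years of video".
# The bitrate is an assumption; ~7 Mbit/s is a common HD streaming rate.
EXABYTE_BITS = 1e18 * 8
BITRATE_BPS = 7e6                      # assumed "high-quality video" bitrate
SECONDS_PER_YEAR = 365.25 * 24 * 3600

years = EXABYTE_BITS / BITRATE_BPS / SECONDS_PER_YEAR
print(f"~{years:,.0f} years of video")  # on the order of 36,000 years
```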