AITopics | haystack

Collaborating Authors

haystack

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias

Neural Information Processing SystemsDec-25-2025, 01:03:01 GMT

Despite the phenomenal success of deep neural networks in a broad range of learning tasks, there is a lack of theory to understand the way they work. In particular, Convolutional Neural Networks (CNNs) are known to perform much better than Fully-Connected Networks (FCNs) on spatially structured data: the architectural structure of CNNs benefits from prior knowledge on the features of the data, for instance their translation invariance. The aim of this work is to understand this fact through the lens of dynamics in the loss landscape. We introduce a method that maps a CNN to its equivalent FCN (denoted as eFCN). Such an embedding enables the comparison of CNN and FCN training dynamics directly in the FCN space.

architectural bias, convolution, name change, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.58)

Add feedback

A Controllable Examination for Long-Context Language Models

Yang, Yijun, Huang, Zeyu, Zhu, Wenhao, Qiu, Zihan, Yuan, Fei, Pan, Jeff Z., Titov, Ivan

arXiv.org Artificial IntelligenceOct-21-2025

Existing frameworks for evaluating long-context language models (LCLM) can be broadly categorized into real-world applications (e.g, document summarization) and synthetic tasks (e.g, needle-in-a-haystack). Despite their utility, both approaches are accompanied by certain intrinsic limitations. Real-world tasks often involve complexity that makes interpretation challenging and suffer from data contamination, whereas synthetic tasks frequently lack meaningful coherence between the target information (needle) and its surrounding context (haystack), undermining their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should be characterized by three essential features: 1) seamless context 2) controllable setting and 3) sound evaluation. This study introduces $\textbf{LongBioBench}$, a benchmark that utilizes artificially generated biographies as a controlled environment for assessing LCLMs across dimensions of understanding, reasoning, and trustworthiness. Our experimental evaluation, which includes 18 LCLMs in total, demonstrates that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results and are less trustworthy as context length increases. Our further analysis indicates some design choices employed by existing synthetic benchmarks, such as contextual non-coherence, numerical needles, and the absence of distractors, rendering them vulnerable to test the model's long-context capabilities. To sum up, compared to previous synthetic benchmarks, LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability, and is highly interpretable and configurable.

benchmark, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2506.02921

Country: Asia (1.00)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs

Paulsen, Norman

arXiv.org Artificial IntelligenceSep-29-2025

Large language model (LLM) providers boast big numbers for maximum context window sizes. To test the real world use of context windows, we 1) define a concept of maximum effective context window, 2) formulate a testing method of a context window's effectiveness over various sizes and problem types, and 3) create a standardized way to compare model efficacy for increasingly larger context window sizes to find the point of failure. We collected hundreds of thousands of data points across several models and found significant differences between reported Maximum Context Window (MCW) size and Maximum Effective Context Window (MECW) size. Our findings show that the MECW is, not only, drastically different from the MCW but also shifts based on the problem type. A few top of the line models in our test group failed with as little as 100 tokens in context; most had severe degradation in accuracy by 1000 tokens in context. All models fell far short of their Maximum Context Window by as much as 99 percent. Our data reveals the Maximum Effective Context Window shifts based on the type of problem provided, offering clear and actionable insights into how to improve model accuracy and decrease model hallucination rates.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.21361

Country: North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Decomposing Prediction Mechanisms for In-Context Recall

Daniels, Sultan, Davis, Dylan, Gautam, Dhruv, Liao, Wentinn, Ranade, Gireeja, Sahai, Anant

arXiv.org Artificial IntelligenceJul-3-2025

We introduce a new family of toy problems that combine features of linear-regression-style continuous in-context learning (ICL) with discrete associative recall. We pretrain transformer models on sample traces from this toy, specifically symbolically-labeled interleaved state observations from randomly drawn linear deterministic dynamical systems. We study if the transformer models can recall the state of a sequence previously seen in its context when prompted to do so with the corresponding in-context label. Taking a closer look at this task, it becomes clear that the model must perform two functions: (1) identify which system's state should be recalled and apply that system to its last seen state, and (2) continuing to apply the correct system to predict the subsequent states. Training dynamics reveal that the first capability emerges well into a model's training. Surprisingly, the second capability, of continuing the prediction of a resumed sequence, develops much earlier. Via out-of-distribution experiments, and a mechanistic analysis on model weights via edge pruning, we find that next-token prediction for this toy problem involves at least two separate mechanisms. One mechanism uses the discrete symbolic labels to do the associative recall required to predict the start of a resumption of a previously seen sequence. The second mechanism, which is largely agnostic to the discrete symbolic labels, performs a "Bayesian-style" prediction based on the previous token and the context. These two mechanisms have different learning dynamics. To confirm that this multi-mechanism (manifesting as separate phase transitions) phenomenon is not just an artifact of our toy setting, we used OLMo training checkpoints on an ICL translation task to see a similar phenomenon: a decisive gap in the emergence of first-task-token performance vs second-task-token performance.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2507.01414

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

Quantifying Misattribution Unfairness in Authorship Attribution

Alipoormolabashi, Pegah, Patel, Ajay, Balasubramanian, Niranjan

arXiv.org Artificial IntelligenceJun-4-2025

Authorship misattribution can have profound consequences in real life. In forensic settings simply being considered as one of the potential authors of an evidential piece of text or communication can result in undesirable scrutiny. This raises a fairness question: Is every author in the candidate pool at equal risk of misattribution? Standard evaluation measures for authorship attribution systems do not explicitly account for this notion of fairness. We introduce a simple measure, Misattribution Unfairness Index (MAUIk), which is based on how often authors are ranked in the top k for texts they did not write. Using this measure we quantify the unfairness of five models on two different datasets. All models exhibit high levels of unfairness with increased risks for some authors. Furthermore, we find that this unfairness relates to how the models embed the authors as vectors in the latent search space. In particular, we observe that the risk of misattribution is higher for authors closer to the centroid (or center) of the embedded authors in the haystack. These results indicate the potential for harm and the need for communicating with and calibrating end users on misattribution risk when building and providing such models for downstream use.

centroid, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2506.02321

Country: North America > United States (0.68)

Genre: Research Report > New Finding (0.46)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications > Social Media (0.76)

Add feedback

Enhancing Multi-Image Question Answering via Submodular Subset Selection

Sharma, Aaryan, Gupta, Shivansh, Agarwal, Samar, C., Vishak Prasad, Ramakrishnan, Ganesh

arXiv.org Artificial IntelligenceMay-16-2025

Large multimodal models (LMMs) have achieved high performance in vision-language tasks involving single image but they struggle when presented with a collection of multiple images (Multiple Image Question Answering scenario). These tasks, which involve reasoning over large number of images, present issues in scalability (with increasing number of images) and retrieval performance. In this work, we propose an enhancement for retriever framework introduced in MIRAGE model using submodular subset selection techniques. Our method leverages query-aware submodular functions, such as GraphCut, to pre-select a subset of semantically relevant images before main retrieval component. We demonstrate that using anchor-based queries and augmenting the data improves submodular-retriever pipeline effectiveness, particularly in large haystack sizes.

machine learning, natural language, question answering, (17 more...)

arXiv.org Artificial Intelligence

2505.10533

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.61)

Add feedback

phepy: Visual Benchmarks and Improvements for Out-of-Distribution Detectors

Tyree, Juniper, Rupp, Andreas, Clusius, Petri S., Boy, Michael H.

arXiv.org Artificial IntelligenceMar-7-2025

Applying machine learning to increasingly high-dimensional problems with sparse or biased training data increases the risk that a model is used on inputs outside its training domain. For such out-of-distribution (OOD) inputs, the model can no longer make valid predictions, and its error is potentially unbounded. Testing OOD detection methods on real-world datasets is complicated by the ambiguity around which inputs are in-distribution (ID) or OOD. We design a benchmark for OOD detection, which includes three novel and easily-visualisable toy examples. These simple examples provide direct and intuitive insight into whether the detector is able to detect (1) linear and (2) non-linear concepts and (3) identify thin ID subspaces (needles) within high-dimensional spaces (haystacks). We use our benchmark to evaluate the performance of various methods from the literature. Since tactile examples of OOD inputs may benefit OOD detection, we also review several simple methods to synthesise OOD inputs for supervised training. We introduce two improvements, $t$-poking and OOD sample weighting, to make supervised detectors more precise at the ID-OOD boundary. This is especially important when conflicts between real ID and synthetic OOD sample blur the decision boundary. Finally, we provide recommendations for constructing and applying out-of-distribution detectors in machine learning.

boundary, detector, ood input, (15 more...)

arXiv.org Artificial Intelligence

2503.05169

Country:

Europe > Finland > Uusimaa > Helsinki (0.05)
Europe > Finland > South Karelia > Lappeenranta (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

NoLiMa: Long-Context Evaluation Beyond Literal Matching

Modarressi, Ali, Deilamsalehy, Hanieh, Dernoncourt, Franck, Bui, Trung, Rossi, Ryan A., Yoon, Seunghyun, Schütze, Hinrich

arXiv.org Artificial IntelligenceFeb-7-2025

Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.

large language model, llama 3, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2502.05167

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > United States > Florida > Miami-Dade County > Miami (0.04)
(5 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Systematic Evaluation of Long-Context LLMs on Financial Concepts

Gupta, Lavanya, Sharma, Saket, Zhao, Yiyun

arXiv.org Artificial IntelligenceDec-19-2024

Long-context large language models (LC LLMs) promise to increase reliability of LLMs in real-world tasks requiring processing and understanding of long input documents. However, this ability of LC LLMs to reliably utilize their growing context windows remains under investigation. In this work, we evaluate the performance of state-of-the-art GPT-4 suite of LC LLMs in solving a series of progressively challenging tasks, as a function of factors such as context length, task difficulty, and position of key information by creating a real world financial news dataset. Our findings indicate that LC LLMs exhibit brittleness at longer context lengths even for simple tasks, with performance deteriorating sharply as task complexity increases. At longer context lengths, these state-of-the-art models experience catastrophic failures in instruction following resulting in degenerate outputs. Our prompt ablations also reveal unfortunate continued sensitivity to both the placement of the task instruction in the context window as well as minor markdown formatting. Finally, we advocate for more rigorous evaluation of LC LLMs by employing holistic metrics such as F1 (rather than recall) and reporting confidence intervals, thereby ensuring robust and conclusive findings.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2024.emnlp-industry.88

2412.15386

Country: North America > Mexico (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Banking & Finance (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

'A needle in a haystack:' How AI is helping uncover abandoned oil wells

Popular ScienceDec-5-2024, 19:25:43 GMT

The continental United States is jam-packed with reminders of our ravenous oil appetite. Since the 1850s, there have been an estimated 3.5 million oil and gas wells drilled across the country. Many of those were abandoned after the companies running them ran out of business or otherwise ceased operating. These forgotten fossil fuel artifacts, referred to officially as "undocumented orphan wells" (UOWs) are often left behind without meaningful efforts taken to safely seal them. Unplugged orphan wells can leak out dangerous methane, oil, and other chemicals for years which can pollute the air and potentially contaminate nearby water sources.

artificial intelligence, machine learning, orphan well, (7 more...)

Popular Science

Country: North America > United States (1.00)

Industry:

Energy > Oil & Gas > Upstream (1.00)
Government > Regional Government > North America Government > United States Government (0.31)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.54)

Add feedback