AITopics | heap

Collaborating Authors

heap

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Learning-AugmentedPriority Queues

Neural Information Processing SystemsFeb-18-2026, 10:24:44 GMT

Their primary objective is to efficiently support the insertion of new elements with assigned priorities and the extraction of the highest priorityelement.

algorithm, artificial intelligence, prediction, (17 more...)

Neural Information Processing Systems

Country:

Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
North America > United States (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Alpes-Maritimes > Nice (0.04)

Genre: Research Report (0.67)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)

Add feedback

Quadratic Term Correction on Heaps' Law

Fontanelli, Oscar, Li, Wentian

arXiv.org Artificial IntelligenceNov-19-2025

Heaps' or Herdan's law characterizes the word-type vs. word-token relation by a power-law function, which is concave in linear-linear scale but a straight line in log-log scale. However, it has been observed that even in log-log scale, the type-token curve is still slightly concave, invalidating the power-law relation. At the next-order approximation, we have shown, by twenty English novels or writings (some are translated from another language to English), that quadratic functions in log-log scale fit the type-token data perfectly. Regression analyses of log(type)-log(token) data with both a linear and quadratic term consistently lead to a linear coefficient of slightly larger than 1, and a quadratic coefficient around -0.02. Using the ``random drawing colored ball from the bag with replacement" model, we have shown that the curvature of the log-log scale is identical to a ``pseudo-variance" which is negative. Although a pseudo-variance calculation may encounter numeric instability when the number of tokens is large, due to the large values of pseudo-weights, this formalism provides a rough estimation of the curvature when the number of tokens is small.

machine learning, natural language, regression, (19 more...)

arXiv.org Artificial Intelligence

2511.14683

Country:

North America > United States (0.46)
Europe > Netherlands (0.28)

Genre: Research Report (0.90)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)

Add feedback

e08e1a60c006ac3f0c9f953626b0f0c8-Paper-Conference.pdf

Neural Information Processing SystemsOct-10-2025, 19:12:09 GMT

algorithm, prediction, priority queue, (16 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > New York (0.04)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Alpes-Maritimes > Nice (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Telecommunications (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Modeling & Simulation (0.69)
Information Technology > Software > Programming Languages (0.69)
(3 more...)

Add feedback

Refactoring Codebases through Library Design

Kovacic, Ziga, Chiu, Justin T., Lee, Celine, Zhao, Wenting, Ellis, Kevin

arXiv.org Artificial IntelligenceOct-7-2025

Maintainable and general software allows developers to build robust applications efficiently, yet achieving these qualities often requires refactoring specialized solutions into reusable components. This challenge becomes particularly relevant as code agents become used to solve isolated one-off programming problems. We investigate code agents' capacity to refactor code in ways that support growth and reusability. We first investigate what makes a good refactoring, finding via simulation results and a human study that Minimum Description Length best correlates with preferable refactorings. We then present both a benchmark and a method for refactoring: MiniCode, a benchmark where multiple files must be refactored into a shared library, and Librarian, a sample-and-rerank method for generating reusable libraries. We compare Librarian to state-of-the-art library generation methods, and study it on real-world code bases.

large language model, machine learning, programming language, (23 more...)

arXiv.org Artificial Intelligence

2506.11058

Genre: Research Report (0.82)

Industry: Education (0.47)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Scale-free Characteristics of Multilingual Legal Texts and the Limitations of LLMs

Chen, Haoyang, Tanaka-Ishii, Kumiko

arXiv.org Artificial IntelligenceSep-23-2025

We present a comparative analysis of text complexity across domains using scale-free metrics. We quantify linguistic complexity via Heaps' exponent $β$ (vocabulary growth), Taylor's exponent $α$ (word-frequency fluctuation scaling), compression rate $r$ (redundancy), and entropy. Our corpora span three domains: legal documents (statutes, cases, deeds) as a specialized domain, general natural language texts (literature, Wikipedia), and AI-generated (GPT) text. We find that legal texts exhibit slower vocabulary growth (lower $β$) and higher term consistency (higher $α$) than general texts. Within legal domain, statutory codes have the lowest $β$ and highest $α$, reflecting strict drafting conventions, while cases and deeds show higher $β$ and lower $α$. In contrast, GPT-generated text shows the statistics more aligning with general language patterns. These results demonstrate that legal texts exhibit domain-specific structures and complexities, which current generative models do not fully replicate.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/978-3-032-02551-7_10

2509.17367

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.28)

Genre: Research Report > New Finding (0.66)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Add feedback

Analysing the Language of Neural Audio Codecs

Park, Joonyong, Takamichi, Shinnosuke, Chan, David M., Kando, Shunsuke, Saito, Yuki, Saruwatari, Hiroshi

arXiv.org Artificial IntelligenceSep-3-2025

This study presents a comparative analysis of the statistical and linguistic properties of neural audio codecs (NACs). We investigate discrete speech tokens produced by various NAC models, examining their adherence to linguistic statistical laws such as Zipf's law and Heaps' law, as well as their entropy and redundancy. To assess how these token-level properties relate to semantic and acoustic preservation in synthesized speech, we evaluate intelligibility using error rates of automatic speech recognition, and quality using the UTMOS score. Our results reveal that NAC tokens, particularly 3-grams, exhibit language-like statistical patterns. Moreover, these properties, together with measures of information content, are found to correlate with improved performances in speech recognition and resynthesis tasks. These findings offer insights into the structure of NAC token sequences and inform the design of more effective generative speech models.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2509.0139

Country:

Asia (1.00)
North America > United States (0.68)
Europe > Austria (0.46)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.35)

Add feedback

CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures

Pandey, Punya Syon, Yang, Yongjin, Liu, Jiarui, Jin, Zhijing

arXiv.org Artificial IntelligenceAug-19-2025

Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified. In this paper, we present the Conversational Robustness Evaluation Score: CORE, a metric to quantify the effectiveness of language use within multi-agent systems across different game-theoretic interactions. CORE integrates measures of cluster entropy, lexical repetition, and semantic similarity, providing a direct lens of dialog quality. We apply CORE to pairwise LLM dialogs across competitive, cooperative, and neutral settings, further grounding our analysis in Zipf's and Heaps' Laws to characterize word frequency distributions and vocabulary growth. Our findings show that cooperative settings exhibit both steeper Zipf distributions and higher Heap exponents, indicating more repetition alongside greater vocabulary expansion. In contrast, competitive interactions display lower Zipf and Heaps exponents, reflecting less repetition and more constrained vocabularies. These results provide new insights into how social incentives influence language adaptation, and highlight CORE as a robust diagnostic for measuring linguistic robustness in multi-agent LLM systems. Our code is available at https://github.com/psyonp/core.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2508.11915

Country:

North America > United States (0.46)
North America > Canada > Ontario (0.28)

Genre:

Research Report > New Finding (0.68)
Personal > Interview (0.46)

Industry: Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.46)

Add feedback

Bridging the Semantic Gap in Virtual Machine Introspection and Forensic Memory Analysis

Fellicious, Christofer, Reiser, Hans P., Granitzer, Michael

arXiv.org Artificial IntelligenceMar-7-2025

Forensic Memory Analysis (FMA) and Virtual Machine Introspection (VMI) are critical tools for security in a virtualization-based approach. VMI and FMA involves using digital forensic methods to extract information from the system to identify and explain security incidents. A key challenge in both FMA and VMI is the "Semantic Gap", which is the difficulty of interpreting raw memory data without specialized tools and expertise. In this work, we investigate how a priori knowledge, metadata and engineered features can aid VMI and FMA, leveraging machine learning to automate information extraction and reduce the workload of forensic investigators. We choose OpenSSH as our use case to test different methods to extract high level structures. We also test our method on complete physical memory dumps to showcase the effectiveness of the engineered features. Our features range from basic statistical features to advanced graph-based representations using malloc headers and pointer translations. The training and testing are carried out on public datasets that we compare against already recognized baseline methods. We show that using metadata, we can improve the performance of the algorithm when there is very little training data and also quantify how having more data results in better generalization performance. The final contribution is an open dataset of physical memory dumps, totalling more than 1 TB of different memory state, software environments, main memory capacities and operating system versions. Our methods show that having more metadata boosts performance with all methods obtaining an F1-Score of over 80%. Our research underscores the possibility of using feature engineering and machine learning techniques to bridge the semantic gap.

artificial intelligence, machine learning, pointer, (18 more...)

arXiv.org Artificial Intelligence

2503.05482

Country:

Europe > Iceland > Capital Region > Reykjavik (0.04)
Asia > India > Tamil Nadu > Chennai (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)

Add feedback

Efficient distributional regression trees learning algorithms for calibrated non-parametric probabilistic forecasts

Quentin, Duchemin, Guillaume, Obozinski

arXiv.org Artificial IntelligenceFeb-7-2025

The perspective of developing trustworthy AI for critical applications in science and engineering requires machine learning techniques that are capable of estimating their own uncertainty. In the context of regression, instead of estimating a conditional mean, this can be achieved by producing a predictive interval for the output, or to even learn a model of the conditional probability $p(y|x)$ of an output $y$ given input features $x$. While this can be done under parametric assumptions with, e.g. generalized linear model, these are typically too strong, and non-parametric models offer flexible alternatives. In particular, for scalar outputs, learning directly a model of the conditional cumulative distribution function of $y$ given $x$ can lead to more precise probabilistic estimates, and the use of proper scoring rules such as the weighted interval score (WIS) and the continuous ranked probability score (CRPS) lead to better coverage and calibration properties. This paper introduces novel algorithms for learning probabilistic regression trees for the WIS or CRPS loss functions. These algorithms are made computationally efficient thanks to an appropriate use of known data structures - namely min-max heaps, weight-balanced binary trees and Fenwick trees. Through numerical experiments, we demonstrate that the performance of our methods is competitive with alternative approaches. Additionally, our methods benefit from the inherent interpretability and explainability of trees. As a by-product, we show how our trees can be used in the context of conformal prediction and explain why they are particularly well-suited for achieving group-conditional coverage guarantees.

artificial intelligence, bayesian inference, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2502.05157

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Industry:

Health & Medicine (0.67)
Leisure & Entertainment > Games (0.34)

Add feedback

The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models

Katzy, Jonathan, Popescu, Razvan Mihai, van Deursen, Arie, Izadi, Maliheh

arXiv.org Artificial IntelligenceJan-16-2025

To cover more specific use cases, we also include The data-intensive training process of Large Language Models domain-specific languages such as Mathematica, Emacs-Lisp, (LLMs) has driven the release of numerous large-scale and Coq. A complete list of all languages included in the datasets, particularly for code, to facilitate the development dataset is presented in Table I. of new models. This rapid increase in the amount of training B. Query data used to pre-train LLMs has resulted in extensive datasets covering almost all publicly available code [1]-[3]. We focus on repositories that have one of the targeted To assess the success of such LLMs in downstream tasks, languages as the main language of the repository. We further fresh data not seen during training is needed. Otherwise such select only repositories that are licensed under non-permissive evaluations are contaminated, possibly resulting in overly optimistic licenses. We choose non-permissive licenses as an initial filter results. Unfortunately, obtaining such non-contaminated for repositories, as many large-scale datasets focus on exclusively data is increasingly difficult. In fact, a recent study establishes unlicensed or permissively licensed code [2], [3], [5].

dataset, deduplication, duplicate, (12 more...)

arXiv.org Artificial Intelligence

2501.09653

Country:

Europe > Netherlands > South Holland > Delft (0.06)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback