heap
Quadratic Term Correction on Heaps' Law
Fontanelli, Oscar, Li, Wentian
Heaps' or Herdan's law characterizes the word-type vs. word-token relation by a power-law function, which is concave in linear-linear scale but a straight line in log-log scale. However, it has been observed that even in log-log scale, the type-token curve is still slightly concave, invalidating the power-law relation. At the next-order approximation, we have shown, by twenty English novels or writings (some are translated from another language to English), that quadratic functions in log-log scale fit the type-token data perfectly. Regression analyses of log(type)-log(token) data with both a linear and quadratic term consistently lead to a linear coefficient of slightly larger than 1, and a quadratic coefficient around -0.02. Using the ``random drawing colored ball from the bag with replacement" model, we have shown that the curvature of the log-log scale is identical to a ``pseudo-variance" which is negative. Although a pseudo-variance calculation may encounter numeric instability when the number of tokens is large, due to the large values of pseudo-weights, this formalism provides a rough estimation of the curvature when the number of tokens is small.
Refactoring Codebases through Library Design
Kovacic, Ziga, Chiu, Justin T., Lee, Celine, Zhao, Wenting, Ellis, Kevin
Maintainable and general software allows developers to build robust applications efficiently, yet achieving these qualities often requires refactoring specialized solutions into reusable components. This challenge becomes particularly relevant as code agents become used to solve isolated one-off programming problems. We investigate code agents' capacity to refactor code in ways that support growth and reusability. We first investigate what makes a good refactoring, finding via simulation results and a human study that Minimum Description Length best correlates with preferable refactorings. We then present both a benchmark and a method for refactoring: MiniCode, a benchmark where multiple files must be refactored into a shared library, and Librarian, a sample-and-rerank method for generating reusable libraries. We compare Librarian to state-of-the-art library generation methods, and study it on real-world code bases.
Scale-free Characteristics of Multilingual Legal Texts and the Limitations of LLMs
Chen, Haoyang, Tanaka-Ishii, Kumiko
We present a comparative analysis of text complexity across domains using scale-free metrics. We quantify linguistic complexity via Heaps' exponent $β$ (vocabulary growth), Taylor's exponent $α$ (word-frequency fluctuation scaling), compression rate $r$ (redundancy), and entropy. Our corpora span three domains: legal documents (statutes, cases, deeds) as a specialized domain, general natural language texts (literature, Wikipedia), and AI-generated (GPT) text. We find that legal texts exhibit slower vocabulary growth (lower $β$) and higher term consistency (higher $α$) than general texts. Within legal domain, statutory codes have the lowest $β$ and highest $α$, reflecting strict drafting conventions, while cases and deeds show higher $β$ and lower $α$. In contrast, GPT-generated text shows the statistics more aligning with general language patterns. These results demonstrate that legal texts exhibit domain-specific structures and complexities, which current generative models do not fully replicate.
Analysing the Language of Neural Audio Codecs
Park, Joonyong, Takamichi, Shinnosuke, Chan, David M., Kando, Shunsuke, Saito, Yuki, Saruwatari, Hiroshi
This study presents a comparative analysis of the statistical and linguistic properties of neural audio codecs (NACs). We investigate discrete speech tokens produced by various NAC models, examining their adherence to linguistic statistical laws such as Zipf's law and Heaps' law, as well as their entropy and redundancy. To assess how these token-level properties relate to semantic and acoustic preservation in synthesized speech, we evaluate intelligibility using error rates of automatic speech recognition, and quality using the UTMOS score. Our results reveal that NAC tokens, particularly 3-grams, exhibit language-like statistical patterns. Moreover, these properties, together with measures of information content, are found to correlate with improved performances in speech recognition and resynthesis tasks. These findings offer insights into the structure of NAC token sequences and inform the design of more effective generative speech models.
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures
Pandey, Punya Syon, Yang, Yongjin, Liu, Jiarui, Jin, Zhijing
Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified. In this paper, we present the Conversational Robustness Evaluation Score: CORE, a metric to quantify the effectiveness of language use within multi-agent systems across different game-theoretic interactions. CORE integrates measures of cluster entropy, lexical repetition, and semantic similarity, providing a direct lens of dialog quality. We apply CORE to pairwise LLM dialogs across competitive, cooperative, and neutral settings, further grounding our analysis in Zipf's and Heaps' Laws to characterize word frequency distributions and vocabulary growth. Our findings show that cooperative settings exhibit both steeper Zipf distributions and higher Heap exponents, indicating more repetition alongside greater vocabulary expansion. In contrast, competitive interactions display lower Zipf and Heaps exponents, reflecting less repetition and more constrained vocabularies. These results provide new insights into how social incentives influence language adaptation, and highlight CORE as a robust diagnostic for measuring linguistic robustness in multi-agent LLM systems. Our code is available at https://github.com/psyonp/core.
Bridging the Semantic Gap in Virtual Machine Introspection and Forensic Memory Analysis
Fellicious, Christofer, Reiser, Hans P., Granitzer, Michael
Forensic Memory Analysis (FMA) and Virtual Machine Introspection (VMI) are critical tools for security in a virtualization-based approach. VMI and FMA involves using digital forensic methods to extract information from the system to identify and explain security incidents. A key challenge in both FMA and VMI is the "Semantic Gap", which is the difficulty of interpreting raw memory data without specialized tools and expertise. In this work, we investigate how a priori knowledge, metadata and engineered features can aid VMI and FMA, leveraging machine learning to automate information extraction and reduce the workload of forensic investigators. We choose OpenSSH as our use case to test different methods to extract high level structures. We also test our method on complete physical memory dumps to showcase the effectiveness of the engineered features. Our features range from basic statistical features to advanced graph-based representations using malloc headers and pointer translations. The training and testing are carried out on public datasets that we compare against already recognized baseline methods. We show that using metadata, we can improve the performance of the algorithm when there is very little training data and also quantify how having more data results in better generalization performance. The final contribution is an open dataset of physical memory dumps, totalling more than 1 TB of different memory state, software environments, main memory capacities and operating system versions. Our methods show that having more metadata boosts performance with all methods obtaining an F1-Score of over 80%. Our research underscores the possibility of using feature engineering and machine learning techniques to bridge the semantic gap.
Efficient distributional regression trees learning algorithms for calibrated non-parametric probabilistic forecasts
Quentin, Duchemin, Guillaume, Obozinski
The perspective of developing trustworthy AI for critical applications in science and engineering requires machine learning techniques that are capable of estimating their own uncertainty. In the context of regression, instead of estimating a conditional mean, this can be achieved by producing a predictive interval for the output, or to even learn a model of the conditional probability $p(y|x)$ of an output $y$ given input features $x$. While this can be done under parametric assumptions with, e.g. generalized linear model, these are typically too strong, and non-parametric models offer flexible alternatives. In particular, for scalar outputs, learning directly a model of the conditional cumulative distribution function of $y$ given $x$ can lead to more precise probabilistic estimates, and the use of proper scoring rules such as the weighted interval score (WIS) and the continuous ranked probability score (CRPS) lead to better coverage and calibration properties. This paper introduces novel algorithms for learning probabilistic regression trees for the WIS or CRPS loss functions. These algorithms are made computationally efficient thanks to an appropriate use of known data structures - namely min-max heaps, weight-balanced binary trees and Fenwick trees. Through numerical experiments, we demonstrate that the performance of our methods is competitive with alternative approaches. Additionally, our methods benefit from the inherent interpretability and explainability of trees. As a by-product, we show how our trees can be used in the context of conformal prediction and explain why they are particularly well-suited for achieving group-conditional coverage guarantees.
The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models
Katzy, Jonathan, Popescu, Razvan Mihai, van Deursen, Arie, Izadi, Maliheh
To cover more specific use cases, we also include The data-intensive training process of Large Language Models domain-specific languages such as Mathematica, Emacs-Lisp, (LLMs) has driven the release of numerous large-scale and Coq. A complete list of all languages included in the datasets, particularly for code, to facilitate the development dataset is presented in Table I. of new models. This rapid increase in the amount of training B. Query data used to pre-train LLMs has resulted in extensive datasets covering almost all publicly available code [1]-[3]. We focus on repositories that have one of the targeted To assess the success of such LLMs in downstream tasks, languages as the main language of the repository. We further fresh data not seen during training is needed. Otherwise such select only repositories that are licensed under non-permissive evaluations are contaminated, possibly resulting in overly optimistic licenses. We choose non-permissive licenses as an initial filter results. Unfortunately, obtaining such non-contaminated for repositories, as many large-scale datasets focus on exclusively data is increasingly difficult. In fact, a recent study establishes unlicensed or permissively licensed code [2], [3], [5].