ChatGPT Unveils Its Limits: Principles of Law Deliver Checkmate

Molinari, Marianna, Amantea, Ilaria Angela, Quaranta, Marinella, Governatori, Guido

arXiv.org Artificial Intelligence

This study examines the performance of ChatGPT with an experiment in the legal domain. Rather than focusing solely on assessment against human performance, we compare its output with a baseline that uses regular expressions (regex). The study reveals that even when ChatGPT has access to the necessary knowledge and competencies, it is unable to assemble them and reason through them in a way that leads to an exhaustive result. This unveils a major limitation of ChatGPT. Intelligence encompasses the ability to break down complex issues and address them according to multiple required competencies, providing a unified and comprehensive solution. In the legal domain, one of the most crucial tasks is reading legal decisions and extracting the key passages that condense principles of law (PoLs), which are then incorporated into subsequent rulings by judges or into defense documents by lawyers. In performing this task, artificial intelligence lacks an all-encompassing understanding and reasoning, which makes it inherently limited. Genuine intelligence remains a uniquely human trait, at least in this particular field.
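The paper's actual regex baseline is not reproduced in the abstract; as a minimal, hypothetical sketch of the idea, a baseline might flag sentences containing cue phrases typical of principles of law (the cue phrases and function names below are illustrative assumptions, not the authors' patterns):

```python
import re

# Hypothetical cue phrases; the study's actual patterns are not given here.
CUE_PATTERN = re.compile(
    r"\b(it is a settled principle|the court holds|must be interpreted)\b",
    re.IGNORECASE,
)

def extract_candidate_pols(decision_text: str) -> list[str]:
    """Return sentences that contain at least one cue phrase."""
    sentences = re.split(r"(?<=[.!?])\s+", decision_text)
    return [s for s in sentences if CUE_PATTERN.search(s)]

text = ("The appeal was filed in 2019. "
        "It is a settled principle that contracts require mutual consent. "
        "Costs are awarded to the respondent.")
matches = extract_candidate_pols(text)  # only the middle sentence matches
```

Such a baseline has no understanding of the text, which is precisely what makes it a revealing point of comparison for ChatGPT.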


RegexPSPACE: A Benchmark for Evaluating LLM Reasoning on PSPACE-complete Regex Problems

Jin, Hyundong, Hahn, Joonghyuk, Han, Yo-Sub

arXiv.org Artificial Intelligence

Large language models (LLMs) show strong performance across natural language processing (NLP), mathematical reasoning, and programming, and recent large reasoning models (LRMs) further emphasize explicit reasoning. Yet their computational limits, particularly spatial complexity constrained by finite context windows, remain poorly understood. While recent works often focus on problems within the NP complexity class, we push the boundary by introducing a novel benchmark grounded in two PSPACE-complete regular expression (regex) problems: equivalence decision (RegexEQ) and minimization (RegexMin). PSPACE-complete problems serve as a more rigorous standard for assessing computational capacity, as their solutions require massive search space exploration. We perform a double-exponential space exploration to construct a labeled dataset of over a million regex instances with a sound filtering process to build the benchmark. We conduct extensive evaluations on 6 LLMs and 5 LRMs of varying scales, revealing common failure patterns such as verbosity and repetition. With its well-defined structure and quantitative evaluation metrics, this work presents the first empirical investigation into the spatial computational limitations of LLMs and LRMs, offering a new framework for evaluating their advanced reasoning capabilities. Our code is available at https://github.com/hyundong98/RegexPSPACE .
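The benchmark's double-exponential search-space construction is far beyond a toy script, but the RegexEQ task itself can be illustrated with a hedged sketch: a bounded brute-force check that compares two regexes on every string up to a fixed length (the function name, alphabet, and bound are illustrative assumptions, not from the paper):

```python
import re
from itertools import product

def bounded_equiv(r1: str, r2: str, alphabet: str = "ab", max_len: int = 6) -> bool:
    """Approximate RegexEQ: compare full-match behaviour on all strings up to
    max_len. A counterexample proves inequivalence; agreement up to the bound
    only suggests equivalence (exact decision is PSPACE-complete)."""
    p1, p2 = re.compile(r1), re.compile(r2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(p1.fullmatch(s)) != bool(p2.fullmatch(s)):
                return False
    return True

# (a|b)* and (a*b*)* denote the same language over {a, b}
same = bounded_equiv(r"(a|b)*", r"(a*b*)*")
diff = bounded_equiv(r"a*", r"(a|b)*")  # "b" is a counterexample
```

The exponential blow-up of this naive enumeration is exactly why PSPACE-complete problems stress an LLM's bounded context in a way NP-class benchmarks do not.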


Correctness-Guaranteed Code Generation via Constrained Decoding

Li, Lingxiao, Rahili, Salar, Zhao, Yiwei

arXiv.org Artificial Intelligence

Language Models (LMs) are increasingly being used for code generation, but ensuring the correctness of generated programs remains a significant challenge. Although imperfect code may be acceptable during software development with human oversight, domains such as video games and robotics require one-shot correctness for runtime-critical components. We present a constrained decoding algorithm for generating semantically correct programs that incorporates a context-sensitive parser, which, at each step, outputs a regular expression that satisfies a critical non-extensible property to guide the generation of the next token sequence that can continue to a correct program. To build such a context-sensitive parser, we propose a framework of a dynamic tree of parsers (ToP) during parsing, where each parser corresponds to a modular context-free grammar enriched with contextual information such as variable scopes and type constraints, with tree branches representing ambiguity in the future code segment. We demonstrate our approach through sLua, a strongly typed variant of Lua, showing that our method can generate semantically correct programs conforming to any prescribed scripting API. We further show that, with careful design, our semantic guarantees extend to runtime correctness, as validated in the application of generating game mechanics for a roguelike video game.
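The ToP parser itself is not reproduced here; as a minimal sketch under stated assumptions, the core masking step of regex-guided constrained decoding can be illustrated as filtering a vocabulary against a parser-supplied regex (a real decoder would need prefix/partial matching rather than the full match used below, and the vocabulary and regex are hypothetical):

```python
import re

def constrained_step(allowed_regex: str, vocab: list[str], prefix: str) -> list[str]:
    """Return the vocabulary tokens that keep the output consistent with the
    parser-supplied regex. Simplification: each token is a whole lexeme, so a
    full match on prefix+token suffices for this illustration."""
    pattern = re.compile(allowed_regex)
    return [t for t in vocab if pattern.fullmatch(prefix + t)]

# Suppose the parser says the next fragment must be an identifier.
vocab = ["foo", "1abc", "bar2", "+", "local"]
ok = constrained_step(r"[A-Za-z_][A-Za-z0-9_]*", vocab, "")
# ok == ["foo", "bar2", "local"]
```

Masking the sampler to this allowed set at every step is what turns a grammar into a hard guarantee rather than a soft preference.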


Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier

Schmidt, Craig W., Reddy, Varshini, Tanner, Chris, Pinter, Yuval

arXiv.org Artificial Intelligence

Pre-tokenization, the initial step in many modern tokenization pipelines, segments text into smaller units called pretokens, typically splitting on whitespace and punctuation. While this process encourages having full, individual words as tokens, it introduces a fundamental limitation in most tokenization algorithms such as Byte Pair Encoding (BPE). Specifically, pre-tokenization causes the distribution of tokens in a corpus to heavily skew towards common, full-length words. This skewed distribution limits the benefits of expanding to larger vocabularies, since the additional tokens appear with progressively lower counts. To overcome this barrier, we propose BoundlessBPE, a modified BPE algorithm that relaxes the pretoken boundary constraint. Our approach selectively merges two complete pretokens into a larger unit we term a superword. Superwords are not necessarily semantically cohesive. For example, the pretokens " of" and " the" might be combined to form the superword " of the". This merging strategy results in a substantially more uniform distribution of tokens across a corpus than standard BPE, and compresses text more effectively, with an approximate 20% increase in bytes per token.
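The abstract's superword idea can be sketched in a few lines: where standard BPE counts candidate merges only inside each pretoken, a boundary-relaxed variant counts adjacent pairs across the whole token sequence, so a pair like (" of", " the") becomes mergeable. This is a simplified illustration of the concept, not the paper's algorithm:

```python
from collections import Counter

def most_frequent_pair(tokens: list[str]) -> tuple[str, str]:
    """Count adjacent pairs across the whole sequence, ignoring pretoken
    boundaries, so superword merges like (' of', ' the') become possible."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Apply one BPE-style merge of the given pair, left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

toks = [" of", " the", " king", " of", " the", " north"]
pair = most_frequent_pair(toks)   # (' of', ' the') occurs twice
merged = merge_pair(toks, pair)
```

Each such merge replaces two high-frequency tokens with one, which is where the reported gain in bytes per token comes from.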


SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

Miserendino, Samuel, Wang, Michele, Patwardhan, Tejal, Heidecke, Johannes

arXiv.org Artificial Intelligence

We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks--ranging from $50 bug fixes to $32,000 feature implementations--and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split, SWE-Lancer Diamond (https://github.com/openai/SWELancer-Benchmark). By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.


Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)

Koli, Lokesh, Kalra, Shubham, Singh, Karanpreet

arXiv.org Artificial Intelligence

Efficient data management is essential for organizations to ensure that sensitive information such as Personally Identifiable Information (PII), Protected Health Information (PHI), and financial records is systematically identified and protected. Effective classification aids in compliance with regulations such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA), while mitigating security risks through real-time threat detection [3]. Automated tools improve operational efficiency by streamlining access and eliminating redundancies. Customized classification systems fulfill global compliance requirements, while centralized control mechanisms enhance governance through unified policy enforcement [4]. Strategic data classification is crucial to achieving security, compliance, and operational effectiveness in today's digital environment. Identifying PII and PHI across various data formats presents considerable challenges, particularly with unstructured data sets. Differences in encoding and file formats (e.g., PDFs, Word documents, databases, CSV, and other text files) and data storage systems complicate the consistent extraction of sensitive information [5]. Moreover, international regulations such as GDPR, HIPAA, and the California Consumer Privacy Act (CCPA) impose varied compliance mandates, adding further complexity to detection efforts. Customizing detection mechanisms to align with region-specific regulations while ensuring accuracy across different content types is a formidable task. The necessity for real-time detection and the reduction of false positives amplifies this challenge, necessitating advanced algorithms and comprehensive data management strategies.
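The regex side of a hybrid PII detector like the one described can be sketched as follows. These patterns are illustrative assumptions only; a production system (and CHPDA itself, per the title's "context aware hybrid" framing) would pair such regexes with validation and context-aware scoring to cut false positives:

```python
import re

# Illustrative patterns only -- real detectors add checksum validation
# (e.g. Luhn for card numbers) and contextual filtering.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def detect_pii(text: str) -> dict[str, list[str]]:
    """Return, per category, every match found in the text."""
    hits = {name: p.findall(text) for name, p in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

sample = "Contact jane.doe@example.com, SSN 123-45-6789."
found = detect_pii(sample)
```

The weakness of pure regexes -- a nine-digit invoice number looks exactly like an SSN -- is what motivates adding the "context aware" half of the hybrid.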


Challenges and Considerations in Annotating Legal Data: A Comprehensive Overview

Darji, Harshil, Mitrović, Jelena, Granitzer, Michael

arXiv.org Artificial Intelligence

The process of annotating data within the legal sector is filled with distinct challenges that differ from other fields, primarily due to the inherent complexities of legal language and documentation. The initial task usually involves selecting an appropriate raw dataset that captures the intricate aspects of legal texts. Following this, extracting text becomes a complicated task, as legal documents often have complex structures, footnotes, references, and unique terminology. The importance of data cleaning is magnified in this context, ensuring that redundant information is eliminated while maintaining crucial legal details and context. Creating comprehensive yet straightforward annotation guidelines is imperative, as these guidelines serve as the road map for maintaining uniformity and addressing the subtle nuances of legal terminology. Another critical aspect is the involvement of legal professionals in the annotation process. Their expertise is valuable in ensuring that the data not only remains contextually accurate but also adheres to prevailing legal standards and interpretations. This paper provides an expanded view of these challenges and aims to offer a foundational understanding and guidance for researchers and professionals engaged in legal data annotation projects. In addition, we provide links to our created and fine-tuned datasets and language models. These resources are outcomes of our discussed projects and solutions to challenges faced while working on them.


ReflectSumm: A Benchmark for Course Reflection Summarization

Zhong, Yang, Elaraby, Mohamed, Litman, Diane, Butt, Ahmed Ashraf, Menekse, Muhsin

arXiv.org Artificial Intelligence

This paper introduces ReflectSumm, a novel summarization dataset specifically designed for summarizing students' reflective writing. The goal of ReflectSumm is to facilitate developing and evaluating novel summarization techniques tailored to real-world scenarios with little training data, with potential implications for the opinion summarization domain in general and the educational domain in particular. The dataset encompasses a diverse range of summarization tasks and includes comprehensive metadata, enabling the exploration of various research questions and supporting different applications. To showcase its utility, we conducted extensive evaluations using multiple state-of-the-art baselines. The results provide benchmarks for facilitating further research in this area.


Memory Augmented Large Language Models are Computationally Universal

Schuurmans, Dale

arXiv.org Artificial Intelligence

We show that transformer-based large language models are computationally universal when augmented with an external memory. Any deterministic language model that conditions on strings of bounded length is equivalent to a finite automaton, hence computationally limited. However, augmenting such models with a read-write memory creates the possibility of processing arbitrarily large inputs and, potentially, simulating any algorithm. We establish that an existing large language model, Flan-U-PaLM 540B, can be combined with an associative read-write memory to exactly simulate the execution of a universal Turing machine, $U_{15,2}$. A key aspect of the finding is that it does not require any modification of the language model weights. Instead, the construction relies solely on designing a form of stored instruction computer that can subsequently be programmed with a specific set of prompts.
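The paper's actual prompt design for Flan-U-PaLM is not reproduced in the abstract; the following is a minimal, hypothetical sketch of the stored-instruction-computer loop it describes, with a stub lookup function standing in for the prompted language model (all names, the instruction format, and the halting rule are illustrative assumptions):

```python
def run(memory: dict, model, start: str, max_steps: int = 100) -> dict:
    """Stored-instruction loop: fetch an instruction from associative memory,
    let the 'model' produce a result and next address, write the result back.
    The model's weights never change; all state lives in the memory."""
    addr = start
    for _ in range(max_steps):
        instruction = memory.get(addr)
        if instruction == "HALT":
            break
        value = memory.get(instruction["read"], 0)
        # The language model's role: map (opcode, value) -> (result, next addr).
        result, addr = model(instruction["op"], value)
        memory[instruction["write"]] = result
    return memory

def stub_model(op: str, value: int):
    """Stand-in for the prompted model's single-step transition behaviour:
    increment a counter until it reaches 3, then jump to the halt address."""
    if op == "INC" and value < 3:
        return value + 1, "i1"
    return value, "halt"

mem = {"i1": {"op": "INC", "read": "x", "write": "x"}, "halt": "HALT"}
final = run(mem, stub_model, "i1")
```

The point of the construction is visible even in this toy: unbounded computation comes from the external read-write memory, not from any change to the fixed model.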


Are we using Deep Learning where it should be used?

#artificialintelligence

Deep learning has become a trend that everyone attempts to apply to every problem, since it has proven successful at solving many tasks with extremely high accuracy. Even though deep learning is highly effective, many of these problems could have been solved with exact programs and classical computer science techniques. Is deep learning actually applied where it should be? Do people experiment with other techniques before turning to deep learning when solving a problem? In this article, we attempt to answer these questions with the help of two real examples.