
A Appendix

Neural Information Processing Systems

A.1 TPPE Method. We present the pseudocode for TPPE, using the Insertion mode as an example. According to Alg. 1, we reduce the query time complexity. In our study, we assume the worst-case scenario of applying punctuation-level attacks. A softmax layer is adopted to predict the label of the input text. TPPE is combined with Paraphrase (TPPEP) to achieve a single-shot attack. We describe the TPPEP method as decomposed into two parts: training and searching.


Beyond Line-Level Filtering for the Pretraining Corpora of LLMs

Park, Chanwoo, Park, Suyoung, Ahn, Yelim, Kim, Jongmin, Park, Jongyeon, Lee, Jaejin

arXiv.org Artificial Intelligence

While traditional line-level filtering techniques, such as line-level deduplication and trailing-punctuation filters, are commonly used, these basic methods can sometimes discard valuable content, negatively affecting downstream performance. In this paper, we introduce two methods, pattern-aware line-level deduplication (PLD) and pattern-aware trailing-punctuation filtering (PTF), that enhance the conventional filtering techniques. Our approach not only considers line-level signals but also takes into account their sequential distribution across documents, enabling us to retain structurally important content that might otherwise be removed. We evaluate the proposed methods by training small language models (1B parameters) in both English and Korean. The results demonstrate that our methods consistently improve performance on multiple-choice benchmarks and significantly enhance generative question-answering accuracy on both SQuAD v1 and KorQuAD v1.
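The abstract does not spell out the filtering heuristics, so the following is only a minimal sketch of the general idea behind a pattern-aware trailing-punctuation filter: unlike a naive filter that drops every line without terminal punctuation, it looks at runs of such lines across the document and keeps short runs that likely carry structure (headings, list items). The terminal-punctuation set and the `max_run` threshold are illustrative assumptions, not the paper's actual rules.

```python
# Illustrative assumption: which line endings count as "terminated".
TERMINAL = (".", "!", "?", '"', "'", ")")

def naive_filter(lines):
    """Basic trailing-punctuation filter: drop every line that does not
    end with terminal punctuation."""
    return [ln for ln in lines if ln.rstrip().endswith(TERMINAL)]

def pattern_aware_filter(lines, max_run=2):
    """Illustrative pattern-aware variant: a short run of unterminated
    lines (e.g. a heading or a brief list, likely structural) is kept,
    while a long run is treated as noise and dropped."""
    keep, run = [], []

    def flush():
        if len(run) <= max_run:
            keep.extend(run)  # short run: likely structural content
        run.clear()

    for ln in lines:
        if ln.rstrip().endswith(TERMINAL):
            flush()
            keep.append(ln)
        else:
            run.append(ln)
    flush()
    return keep
```

On a toy document such as `["Title", "First sentence.", "a", "b", "c", "Done."]`, the naive filter discards the heading, while the pattern-aware variant retains it and still drops the long run of noise lines.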


Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic

Grigoryan, Lilit, Karpov, Nikolay, Albasiri, Enas, Lavrukhin, Vitaly, Ginsburg, Boris

arXiv.org Artificial Intelligence

Despite Arabic being one of the most widely spoken languages, the development of Arabic Automatic Speech Recognition (ASR) systems faces significant challenges due to the language's complexity, and only a limited number of public Arabic ASR models exist. While much of the focus has been on Modern Standard Arabic (MSA), considerably less attention has been given to the variations within the language. This paper introduces a universal methodology for Arabic speech and text processing designed to address the unique challenges of the language. Using this methodology, we train two novel models based on the FastConformer architecture: one designed specifically for MSA and the other, the first unified public model for both MSA and Classical Arabic (CA). The MSA model sets a new benchmark with state-of-the-art (SOTA) performance on related datasets, while the unified model achieves SOTA accuracy with diacritics for CA while maintaining strong performance for MSA. To promote reproducibility, we open-source the models and their training recipes.


Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts

Pulipaka, Sidharth, Jain, Sparsh, Sankar, Ashwin, Dabre, Raj

arXiv.org Artificial Intelligence

Punctuation plays a vital role in structuring meaning, yet current models often struggle to restore it accurately in transcripts of spontaneous speech, especially in the presence of disfluencies such as false starts and backtracking. These limitations hinder the performance of downstream tasks such as translation, text-to-speech, and summarization, where sentence boundaries are critical for preserving quality. In this work, we introduce Cadence, a generalist punctuation restoration model adapted from a pretrained large language model. Cadence is designed to handle both clean written text and highly spontaneous spoken transcripts. It surpasses the previous state of the art in performance while expanding support from 14 to all 22 Indian languages and English. We conduct a comprehensive analysis of model behavior across punctuation types and language families, identifying persistent challenges under domain shift and with rare punctuation marks. Our findings demonstrate the efficacy of pretrained language models for multilingual punctuation restoration and highlight Cadence's practical value for low-resource NLP pipelines at scale.


Quantifying patterns of punctuation in modern Chinese prose

Dolina, Michał, Dec, Jakub, Drożdż, Stanisław, Kwapień, Jarosław, Liu, Jin, Stanisz, Tomasz

arXiv.org Artificial Intelligence

Recent research shows that punctuation patterns in texts exhibit universal features across languages. Analysis of Western classical literature reveals that the distribution of spaces between punctuation marks aligns with a discrete Weibull distribution, typically used in survival analysis. By extending this analysis to Chinese literature, represented here by three notable contemporary works, it is shown that Zipf's law applies to Chinese texts similarly to Western texts, where punctuation patterns also improve adherence to the law. Additionally, the distance distribution between punctuation marks in Chinese texts follows the Weibull model, though larger spacing is less frequent than in English translations. Sentence-ending punctuation, representing sentence length, diverges more from this pattern, reflecting greater flexibility in sentence length. This variability supports the formation of complex, multifractal sentence structures, particularly evident in Gao Xingjian's "Soul Mountain". These findings demonstrate that both Chinese and Western texts share universal punctuation and word distribution patterns, underscoring their broad applicability across languages.
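As a minimal sketch of the quantity analyzed in this line of work, the snippet below measures the distance, in words, between consecutive punctuation marks and defines the discrete Weibull model fitted to such distances via its survival function P(K >= k) = q**(k**beta). The punctuation set and tokenization are simplifying assumptions, not the study's exact preprocessing.

```python
import re

# Illustrative punctuation set (an assumption, not the paper's).
PUNCT = set(".,;:!?")

def punctuation_distances(text):
    """Word counts between consecutive punctuation marks."""
    tokens = re.findall(r"\w+|[.,;:!?]", text)
    dists, count = [], 0
    for tok in tokens:
        if tok in PUNCT:
            dists.append(count)  # close the current inter-punctuation gap
            count = 0
        else:
            count += 1
    return dists

def discrete_weibull_pmf(k, q, beta):
    """P(K = k) under the discrete Weibull model, defined through its
    survival function P(K >= k) = q ** (k ** beta), 0 < q < 1, beta > 0."""
    return q ** (k ** beta) - q ** ((k + 1) ** beta)
```

For beta = 1 the model reduces to a geometric distribution, which makes it easy to check that the probabilities sum to one; fitting q and beta to empirical distances is the step the papers perform on full literary corpora.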


Punctuation patterns in "Finnegans Wake" by James Joyce are largely translation-invariant

Bartnicki, Krzysztof, Drożdż, Stanisław, Kwapień, Jarosław, Stanisz, Tomasz

arXiv.org Artificial Intelligence

The complexity characteristics of texts written in natural languages are significantly related to the rules of punctuation. In particular, the distances between punctuation marks, measured by the number of words, quite universally follow the family of Weibull distributions known from survival analysis. However, the values of the two parameters marking specific forms of these distributions distinguish specific languages. This is such a strong constraint that the punctuation distributions of texts translated from the original language into another adopt quantitative characteristics of the target language. All these changes take place within Weibull distributions such that the corresponding hazard functions are always increasing. Recent research shows that James Joyce's famous "Finnegans Wake" follows such an extreme distribution from the Weibull family that the corresponding hazard function is clearly decreasing. At the same time, the distances between sentence-ending punctuation marks, which determine the variability of sentence length, have an almost perfect multifractal organization, to an extent so far found nowhere else in the literature. In the present contribution, based on several available translations (Dutch, French, German, Polish, Russian) of "Finnegans Wake", it is shown that the punctuation characteristics of this work remain largely translation-invariant, contrary to the common case. These observations may constitute further evidence that "Finnegans Wake" is a translinguistic work in this respect as well, in line with Joyce's original intention.


Multifractal hopscotch in "Hopscotch" by Julio Cortazar

Dec, Jakub, Dolina, Michał, Drożdż, Stanisław, Kwapień, Jarosław, Stanisz, Tomasz

arXiv.org Artificial Intelligence

Punctuation is the main factor introducing correlations in natural language written texts, and it crucially impacts their overall effectiveness, expressiveness, and readability. Punctuation marks at the end of sentences are of particular importance, as their distribution can determine various complexity features of written natural language. Here, the sentence length variability (SLV) time series representing "Hopscotch" by Julio Cortazar are subjected to quantitative analysis in an attempt to identify their distribution type, long-memory effects, and potential multiscale patterns. The analyzed novel is an important and innovative piece of literature whose essential property is the freedom of movement between its building blocks that the author gives to the reader. The statistical consequences of this freedom are closely investigated in the original Spanish version of the novel as well as in its translations into English and Polish. Clear evidence of rich multifractality with a left-sided asymmetry is observed in the SLV dynamics of all three language versions, as well as in the versions with differently ordered chapters.
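Constructing the SLV time series that such analyses start from is straightforward; the sketch below shows one minimal version, under simplified tokenization assumptions (splitting only on `.`, `!`, `?`): the sequence of sentence lengths, in words, in order of appearance.

```python
import re

def slv_series(text):
    """Sentence length variability series: split on sentence-ending
    punctuation and count the words in each resulting sentence."""
    sentences = re.split(r"[.!?]+", text)
    return [len(s.split()) for s in sentences if s.strip()]
```

The multifractal analyses described in the abstract are then applied to this integer series, e.g. via multifractal detrended fluctuation analysis, rather than to the raw text.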


ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving

Abedin, Zain Ul, Qamar, Shahzeb, Flek, Lucie, Karimi, Akbar

arXiv.org Artificial Intelligence

While Large Language Models (LLMs) have shown impressive capabilities in math problem-solving tasks, their robustness to noisy inputs is not well studied. In this work, we propose ArithmAttack to examine how robust LLMs are when they encounter noisy prompts that contain extra noise in the form of punctuation marks. While being easy to implement, ArithmAttack does not cause any information loss, since no words are added to or deleted from the context. We evaluate the robustness of seven LLMs, including Llama3, Mistral, and Mathstral, on noisy GSM8K and MultiArith datasets. Our experiments suggest that all the studied models are vulnerable to such noise, with more noise leading to poorer performance.
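A hedged sketch of a punctuation-noise perturbation in the spirit of ArithmAttack: punctuation marks are inserted between words, so no words are added or deleted and no information is lost. The exact insertion policy, punctuation inventory, and noise parameterization of the paper are assumptions here.

```python
import random

# Illustrative punctuation inventory (an assumption).
PUNCT = [".", ",", ";", ":", "!", "?"]

def add_punctuation_noise(prompt, noise_ratio=0.3, seed=0):
    """Insert a random punctuation mark after roughly noise_ratio of the
    words in the prompt; the word sequence itself is left untouched."""
    rng = random.Random(seed)  # seeded for reproducible perturbations
    out = []
    for word in prompt.split():
        out.append(word)
        if rng.random() < noise_ratio:
            out.append(rng.choice(PUNCT))
    return " ".join(out)
```

Because the punctuation is inserted as separate tokens, stripping those tokens from the noisy prompt recovers the original word sequence exactly, which is the "no information loss" property the abstract emphasizes.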


See Where You Read with Eye Gaze Tracking and Large Language Model

Yang, Sikai, Yan, Gang, Du, Wan

arXiv.org Artificial Intelligence

Losing track of reading progress during line switching can be frustrating. Eye gaze tracking technology offers a potential solution by highlighting read paragraphs, aiding users in avoiding wrong line switches. However, the gap between gaze tracking accuracy (2-3 cm) and text line spacing (3-5 mm) makes direct application impractical. Existing methods leverage the linear reading pattern but fail during jump reading. This paper presents a reading tracking and highlighting system that supports both linear and jump reading. Based on experimental insights from the gaze nature study of 16 users, two gaze error models are designed to enable both jump reading detection and relocation. The system further leverages the large language model's contextual perception capability in aiding reading tracking. A reading tracking domain-specific line-gaze alignment opportunity is also exploited to enable dynamic and frequent calibration of the gaze results. Controlled experiments demonstrate reliable linear reading tracking, as well as 84% accuracy in tracking jump reading. Furthermore, real field tests with 18 volunteers demonstrated the system's effectiveness in tracking and highlighting read paragraphs, improving reading efficiency, and enhancing user experience.