punctuation
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
ArbESC+: Arabic Enhanced Edit Selection System Combination for Grammatical Error Correction: Resolving conflict and improving system combination in Arabic GEC
Alrehili, Ahlam, Alhothali, Areej
Grammatical Error Correction (GEC) is an important aspect of natural language processing. Arabic has a complicated morphological and syntactic structure, which poses a greater challenge than in other languages. Although modern neural models have improved greatly in recent years, most previous attempts used individual models without considering the potential benefits of combining different systems. In this paper, we present one of the first multi-system approaches for correcting grammatical errors in Arabic, the Arabic Enhanced Edit Selection System Combination (ArbESC+). Several models are used to collect correction proposals, which are represented as numerical features in the framework. A classifier determines and applies the appropriate corrections based on these features. To improve output quality, the framework uses support techniques to filter overlapping corrections and estimate decision reliability. A combination of AraT5, ByT5, mT5, AraBART, AraBART+Morph+GEC, and text-editing systems gave better results than any single model alone, with F0.5 of 82.63% on QALB-14 test data, 84.64% on QALB-15 L1 data, and 65.55% on QALB-15 L2 data. One of the most significant contributions of this work is that it is the first attempt for Arabic to integrate linguistic error correction systems. Improving on existing models is a practical step towards developing advanced tools that will benefit users and researchers of Arabic text processing.
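The edit-selection idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the system names (`sys_a`, `sys_b`, `sys_c`), the weights, the bias, and the threshold are all hypothetical, a simple weighted vote stands in for the trained classifier, and English tokens are used only for readability.

```python
from collections import defaultdict

# An edit is (start, end, replacement): replace tokens[start:end] with the
# replacement tuple. Every name and number below is a hypothetical assumption.

def collect_edits(proposals):
    """Map each distinct edit to the set of systems that proposed it."""
    votes = defaultdict(set)
    for system, edits in proposals.items():
        for edit in edits:
            votes[edit].add(system)
    return votes

def score(systems, weights, bias):
    """Stand-in for a trained classifier: a weighted vote plus a bias."""
    return bias + sum(weights[s] for s in systems)

def overlaps(a, b):
    """True if two edits touch the same span or intersecting spans."""
    (a_start, a_end, _), (b_start, b_end, _) = a, b
    if (a_start, a_end) == (b_start, b_end):
        return True
    return a_start < b_end and b_start < a_end

def select_edits(votes, weights, bias=-0.5, threshold=0.0):
    """Keep edits scoring above the threshold, dropping overlaps greedily."""
    ranked = sorted(
        ((score(systems, weights, bias), edit) for edit, systems in votes.items()),
        key=lambda pair: pair[0],
        reverse=True,
    )
    chosen = []
    for value, edit in ranked:
        if value > threshold and not any(overlaps(edit, c) for c in chosen):
            chosen.append(edit)
    return sorted(chosen)

def apply_edits(tokens, edits):
    """Apply edits right to left so earlier indices stay valid."""
    out = list(tokens)
    for start, end, replacement in sorted(edits, reverse=True):
        out[start:end] = replacement
    return out

tokens = ["he", "go", "to", "school", "yesterday"]
proposals = {
    "sys_a": [(1, 2, ("went",))],
    "sys_b": [(1, 2, ("went",)), (3, 4, ("the", "school"))],
    "sys_c": [(1, 3, ("goes", "to"))],
}
weights = {"sys_a": 0.8, "sys_b": 0.7, "sys_c": 0.6}

selected = select_edits(collect_edits(proposals), weights)
corrected = apply_edits(tokens, selected)
```

Here the edit proposed by two systems wins, the conflicting edit from `sys_c` is filtered because it overlaps a higher-scoring edit, and the non-overlapping edit from `sys_b` is kept. The greedy highest-score-first filter is one simple way to resolve overlaps; the abstract does not say which strategy the framework uses.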
- Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.04)
- Asia > Middle East > Saudi Arabia > Medina Province > Medina (0.04)
- Asia > Middle East > Saudi Arabia > Mecca Province > Jeddah (0.04)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Punctuation-aware treebank tree binarization
Klinger, Eitan, Wadhwa, Vivaan, Park, Jungyeul
This article presents a curated resource and evaluation suite for punctuation-aware treebank binarization. Standard binarization pipelines drop punctuation before head selection, which alters constituent shape and harms head-child identification. We release (1) a reproducible pipeline that preserves punctuation as sibling nodes prior to binarization, (2) derived artifacts and metadata (intermediate @X markers, reversibility signatures, alignment indices), and (3) an accompanying evaluation suite covering head-child prediction, round-trip reversibility, and structural compatibility with derivational resources (CCGbank). On the Penn Treebank, punctuation-aware preprocessing improves head prediction accuracy from 73.66% (Collins rules) and 86.66% (MLP) to 91.85% with the same classifier, and achieves competitive alignment against CCGbank derivations. All code, configuration files, and documentation are released to enable replication and extension to other corpora.
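As a rough illustration of what keeping punctuation as sibling nodes and round-trip reversibility mean in practice, the sketch below right-binarizes a small tree with intermediate `@X` nodes and then restores it. It is a generic right-branching binarization, not the released pipeline: it does no head selection, and the tuple-based tree encoding and the function names are assumptions made for this example.

```python
# A tree is (label, [children]); a leaf is a plain string.
# Assumption: no original label starts with "@".

def binarize(tree):
    """Right-binarize an n-ary tree, keeping punctuation nodes as siblings."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [binarize(child) for child in children]
    while len(children) > 2:
        # Fold the two rightmost children under an intermediate @X node.
        children = children[:-2] + [("@" + label, children[-2:])]
    return (label, children)

def unbinarize(tree):
    """Undo binarize() by splicing the children of @X nodes into the parent."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    flat = []
    for child in children:
        child = unbinarize(child)
        if not isinstance(child, str) and child[0].startswith("@"):
            flat.extend(child[1])
        else:
            flat.append(child)
    return (label, flat)

def max_arity(tree):
    """Largest number of children under any node."""
    if isinstance(tree, str):
        return 0
    return max([len(tree[1])] + [max_arity(child) for child in tree[1]])

tree = ("S", [
    ("NP", [("PRP", ["She"])]),
    (",", [","]),
    ("ADVP", [("RB", ["however"])]),
    (",", [","]),
    ("VP", [("VBD", ["left"])]),
    (".", ["."]),
])

binary = binarize(tree)
restored = unbinarize(binary)
```

Because the commas and the final period stay in the tree as ordinary children, the binarized form has at most two children per node and `unbinarize` recovers the original tree exactly.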
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > New York (0.04)
- North America > United States > Illinois (0.04)
Appendices 1 All codes, data, and instructions for our C
We plan to expand the study to a larger scale in future work. "Please extract as many components as possible from the provided images. Only provide the component names, separated by commas. We treat objects and their attributes (if found) as options for the questions. "These sentences describe the differences between the two images.
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
From Canonical to Complex: Benchmarking LLM Capabilities in Undergraduate Thermodynamics
Geißler, Anna, Bien, Luca-Sophie, Schöppler, Friedrich, Hertel, Tobias
Large language models (LLMs) are increasingly considered as tutoring aids in science education. Yet their readiness for unsupervised use in undergraduate instruction remains uncertain, as reliable teaching requires more than fluent recall: it demands consistent, principle-grounded reasoning. Thermodynamics, with its compact laws and subtle distinctions between state and path functions, reversibility, and entropy, provides an ideal testbed for evaluating such capabilities. Here we present UTQA, a 50-item undergraduate thermodynamics question answering benchmark, covering ideal-gas processes, reversibility, and diagram interpretation. No leading 2025-era model exceeded our 95% competence threshold: the best LLMs achieved 82% accuracy, with text-only items performing better than image reasoning tasks, which often fell to chance levels. Prompt phrasing and syntactic complexity showed modest to little correlation with performance. The gap concentrates in finite-rate/irreversible scenarios and in binding visual features to thermodynamic meaning, indicating that current LLMs are not yet suitable for unsupervised tutoring in this domain.
- North America > United States > New York (0.04)
- Europe > Germany > Bavaria > Lower Franconia > Würzburg (0.04)
- North America > United States > Illinois > Cook County > Evanston (0.04)
Punctuation and Predicates in Language Models
Chauhan, Sonakshi, Chaudhary, Maheep, Choy, Koby, Nellessen, Samuel, Schoots, Nandi
In this paper we explore where information is collected and how it is propagated throughout layers in large language models (LLMs). We begin by examining the surprising computational importance of punctuation tokens, which previous work has identified as attention sinks and memory aids. Using intervention-based techniques, we evaluate the necessity and sufficiency (for preserving model performance) of punctuation tokens across layers in GPT-2, DeepSeek, and Gemma. Our results show stark model-specific differences: for GPT-2, punctuation is both necessary and sufficient in multiple layers, while this holds far less in DeepSeek and not at all in Gemma. Extending beyond punctuation, we ask whether LLMs process different components of input (e.g., subjects, adjectives, punctuation, full sentences) by forming early static summaries reused across the network, or whether the model remains sensitive to changes in these components across layers. We further investigate whether different reasoning rules are processed differently by LLMs. In particular, through interchange intervention and layer-swapping experiments, we find that conditional statements (if, then) and universal quantification (for all) are processed very differently. Our findings offer new insight into the internal mechanisms of punctuation usage and reasoning in LLMs and have implications for interpretability.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Netherlands > Gelderland > Nijmegen (0.04)
- Asia > Middle East > Jordan (0.04)
Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts
ASR systems often struggle to maintain syntactic and semantic accuracy in long audio transcripts, which affects tasks like Named Entity Recognition (NER), capitalization, and punctuation. We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. Our method uses two strategies: (1) token-level distillation with optimal transport to align dimensions and sequence lengths, and (2) representation loss minimization between sentence embeddings of Whisper and LLaMA, blending syntax and semantics. Evaluations on the Spoken Wikipedia dataset, a benchmark with long audio recordings and rich entities, demonstrate significant improvements in Word Error Rate (WER), NER, capitalization, and punctuation success. By introducing novel NER metrics and exploring semantics-aware ASR, our work highlights the value of integrating linguistic context into transcription, setting a foundation for robust, context-aware ASR in long-form speech.
- North America > United States > New York (0.04)
- North America > United States > Florida > Hillsborough County > Tampa (0.04)
- Europe > Germany (0.04)
- Law (0.68)
- Government > Regional Government > North America Government > United States Government (0.46)
emoji-development-face-tears-joy-book-keith-houston.html?via=rss
A couple of years ago, I frequently found myself driving past a roadside ice cream stand under construction. For weeks, the roof of this stand, a gigantic white swirl of fiberglass soft serve, sat on the ground next to the structure, waiting to be lowered onto the finished, cone-shaped building with a crane. I knew what it was supposed to represent, but every time I glimpsed it, my instinctive first thought was "There's a giant poop emoji." Keith Houston's history of emoji, Face With Tears of Joy, argues that emoji have "become so ubiquitous in our writing, so quotidian, that we should be talking about them in the same breath as grammar or punctuation." I don't know about grammar, which seems as fundamental to language, spoken and written, as words themselves.