Punctuation-aware treebank tree binarization
Klinger, Eitan, Wadhwa, Vivaan, Park, Jungyeul
arXiv.org Artificial Intelligence
This article presents a curated resource and evaluation suite for punctuation-aware treebank binarization. Standard binarization pipelines drop punctuation before head selection, which alters constituent shape and harms head-child identification. We release (1) a reproducible pipeline that preserves punctuation as sibling nodes prior to binarization, (2) derived artifacts and metadata (intermediate @X markers, reversibility signatures, alignment indices), and (3) an accompanying evaluation suite covering head-child prediction, round-trip reversibility, and structural compatibility with derivational resources (CCGbank). On the Penn Treebank, punctuation-aware preprocessing improves head prediction accuracy from 73.66% (Collins rules) and 86.66% (MLP) to 91.85% with the same classifier, and achieves competitive alignment against CCGbank derivations. All code, configuration files, and documentation are released to enable replication and extension to other corpora.
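To make the idea of binarizing with punctuation kept in place more concrete, the following is a minimal illustrative sketch, not the released pipeline. It assumes a nested `(label, children)` tuple representation, plain right-branching factoring, and intermediate nodes named `"@" + parent label`; the names `binarize`, `debinarize`, and `example` are hypothetical, and the paper's actual marker scheme and head-selection logic may differ.

```python
# Illustrative sketch only: right-binarization that keeps punctuation
# preterminals as ordinary siblings, plus the inverse operation.
# A tree is either a token (str) or a (label, [children]) tuple.

def binarize(tree):
    """Return a copy of `tree` in which every node has at most two children."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    kids = [binarize(child) for child in children]
    marker = "@" + label  # assumed naming for intermediate nodes
    while len(kids) > 2:
        # Merge the two rightmost children under an intermediate node.
        kids = kids[:-2] + [(marker, kids[-2:])]
    return (label, kids)


def debinarize(tree):
    """Undo `binarize` by splicing out every node whose label starts with '@'."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    flat = []
    for child in children:
        child = debinarize(child)
        if not isinstance(child, str) and child[0].startswith("@"):
            flat.extend(child[1])  # lift the intermediate node's children
        else:
            flat.append(child)
    return (label, flat)


# (S (NP (NNP John)) (, ,) (VP (VBD left)) (. .))
example = (
    "S",
    [
        ("NP", [("NNP", ["John"])]),
        (",", [","]),
        ("VP", [("VBD", ["left"])]),
        (".", ["."]),
    ],
)

binary = binarize(example)
restored = debinarize(binary)
```

Because the comma and the period stay in the tree as siblings, the round trip through `binarize` and `debinarize` recovers the original bracketing, which is the property a reversibility check would test. This relies on no original label beginning with `@`.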
Oct-14-2025
- Country:
- Asia > South Korea (0.04)
- Europe
- North America
- Canada > British Columbia
- United States
- California > Monterey County
- Pacific Grove (0.04)
- Illinois (0.04)
- New York (0.04)
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- Genre:
- Research Report (0.64)