SCALAR: A Part-of-speech Tagger for Identifiers

Newman, Christian D., Scholten, Brandon, Testa, Sophia, Behler, Joshua A. C., Banabilah, Syreen, Collard, Michael L., Decker, Michael J., Mkaouer, Mohamed Wiem, Zampieri, Marcos, AlOmar, Eman Abdullah, Alsuhaibani, Reem, Peruma, Anthony, Maletic, Jonathan I.

Apr-25-2025–arXiv.org Artificial Intelligence

--The paper presents the Source Code Analysis and Lexical Annotation Runtime (SCALAR), a tool specialized for mapping (annotating) source code identifier names to their corresponding part-of-speech tag sequence (grammar pattern). SCALAR's internal model is trained using scikit-learn's GradientBoostingClassifier in conjunction with a manually-curated oracle of identifier names and their grammar patterns. This specializes the tagger to recognize the unique structure of the natural language used by developers to create all types of identifiers (e.g., function names, variable names etc.). SCALAR's output is compared with a previous version of the tagger, as well as a modern off-the-shelf part-of-speech tagger to show how it improves upon other taggers' output for annotating identifiers. The code is available on Github 1 Index T erms --Program comprehension, identifier naming, part-of-speech tagging, natural language processing, software maintenance, software evolution I. I NTRODUCTION The identifiers developers create represent a significant amount of the information other developers must use to understand related code. Given that identifiers represent, on average, 70% of the characters in a code base [1], and developers spend more time reading code than writing [2], [3], it is important for researchers to better understand of how identifiers convey information, and how they can be improved to increase developer reading efficiency.

artificial intelligence, identifier, natural language, (18 more...)

arXiv.org Artificial Intelligence

Apr-25-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - District of Columbia > Washington (0.04)
  - Ohio
    - Wood County > Bowling Green (0.04)
    - Portage County > Kent (0.04)
    - Summit County
      - Green (0.04)
      - Akron (0.04)
  - New York > Monroe County
    - Rochester (0.04)
  - New Jersey > Hudson County
    - Hoboken (0.04)
  - Michigan > Genesee County
    - Flint (0.14)
  - Hawaii > Honolulu County
    - Honolulu (0.04)
  - Delaware > New Castle County
    - Newark (0.04)
- Asia > Middle East
  - Saudi Arabia > Riyadh Province > Riyadh (0.04)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.69)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found