StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features
Ochab, Jeremi K., Matias, Mateusz, Boba, Tymoteusz, Walkowiak, Tomasz
arXiv.org Artificial Intelligence
This submission to the binary AI detection task is based on a modular stylometric pipeline. Public spaCy models are used for text preprocessing (tokenisation, named entity recognition, dependency parsing, part-of-speech tagging, and morphology annotation) and for extracting several thousand features, namely frequencies of n-grams of these linguistic annotations; light gradient-boosting machines (LightGBM) serve as the classifier. We collect a large corpus of more than 500 000 machine-generated texts to train the classifier, and we explore several parameter options that increase the classifier's capacity so that it can take advantage of that training set. Our method follows the non-neural, computationally inexpensive, yet explainable approach that was previously found effective.
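The feature-extraction step described in the abstract (frequencies of n-grams over linguistic annotations) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ngram_frequencies` and `build_feature_matrix`, the `max_features` cut-off, and the toy part-of-speech sequences are all assumptions made for illustration. To keep the example runnable with the standard library alone, the annotation sequences are written out by hand instead of being produced by a spaCy model, and the classifier is omitted; in a full pipeline the resulting matrix would be passed to a gradient-boosted tree classifier.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def ngram_frequencies(tags: Sequence[str], n: int) -> Dict[NGram, float]:
    """Relative frequencies of the n-grams in one annotation sequence."""
    grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    if not grams:
        # Sequence shorter than n: no n-grams, hence no features.
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


def build_feature_matrix(
    docs: Sequence[Sequence[str]], n: int, max_features: int
) -> Tuple[List[NGram], List[List[float]]]:
    """Vectorise documents over the corpus-wide most frequent n-grams."""
    per_doc = [ngram_frequencies(doc, n) for doc in docs]
    # Rank n-grams by their relative frequency summed over all documents.
    totals: Counter = Counter()
    for freqs in per_doc:
        totals.update(freqs)
    vocab = [gram for gram, _ in totals.most_common(max_features)]
    # One row per document, one column per retained n-gram.
    matrix = [[freqs.get(gram, 0.0) for gram in vocab] for freqs in per_doc]
    return vocab, matrix


# Hypothetical part-of-speech sequences standing in for tagger output.
docs = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["PRON", "VERB", "DET", "NOUN"],
]
vocab, X = build_feature_matrix(docs, n=2, max_features=3)
```

Here each row of `X` is one document's feature vector over the three highest-ranked part-of-speech bigrams. The same function could be applied to other annotation layers (dependency labels, morphological tags, entity types) and to other values of `n`, with the resulting columns concatenated into one feature table.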
Jul-17-2025
- Country:
- Asia
- Europe
- France > Auvergne-Rhône-Alpes
- Middle East > Malta
- Eastern Region > Northern Harbour District > St. Julian's (0.04)
- Poland
- Lesser Poland Province > Kraków (0.05)
- Lower Silesia Province > Wroclaw (0.04)
- Spain
- Andalusia > Jaén Province
- Jaén (0.04)
- Galicia > Madrid (0.04)
- Switzerland (0.05)
- North America > United States
- District of Columbia > Washington (0.04)
- New York (0.04)
- South America > Brazil
- Pernambuco > Recife (0.04)
- Genre:
- Overview (0.46)
- Research Report (0.50)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning
- Neural Networks > Deep Learning (1.00)
- Performance Analysis > Accuracy (0.94)
- Natural Language
- Chatbot (1.00)
- Large Language Model (1.00)
- Text Processing (1.00)