Goto

Collaborating Authors

 tree depth



LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics

Shi, Weiye, Zhang, Zhaowei, Yan, Shaoheng, Yang, Yaodong

arXiv.org Artificial Intelligence

Large language models (LLMs) demonstrate remarkable potential across diverse language-related tasks, yet whether they capture deeper linguistic properties--such as syntactic structure, phonetic cues, and metrical patterns--from raw text remains unclear. To analysis whether LLMs can learn these features effectively and apply them to important nature language related tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public domain literary works, comprising thousands of sentences per binary task (poetry vs. novel; drama vs. poetry; drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact on classification performance. Experiments demonstrate that although LLM classifiers can learn latent linguistic structures either from raw text or from explicitly provided features, different features contribute unevenly across tasks, which underscores the importance of incorporating more complex linguistic signals during model training.




Type and Complexity Signals in Multilingual Question Representations

Kokot, Robin, Poelman, Wessel

arXiv.org Artificial Intelligence

This work investigates how a multilingual transformer model represents morphosyntactic properties of questions. We introduce the Question Type and Complexity (QTC) dataset with sentences across seven languages, annotated with type information and complexity metrics including dependency length, tree depth, and lexical density. Our evaluation extends probing methods to regression labels with selectivity controls to quantify gains in generalizability. We compare layer-wise probes on frozen Glot500-m (Imani et al., 2023) representations against subword TF-IDF baselines, and a fine-tuned model. Results show that statistical features classify questions effectively in languages with explicit marking, while neural probes capture fine-grained structural complexity patterns better. We use these results to evaluate when contextual representations outperform statistical baselines and whether parameter updates reduce the availability of pre-trained linguistic information.





A novel gradient-based method for decision trees optimizing arbitrary differential loss functions

Konstantinov, Andrei V., Utkin, Lev V.

arXiv.org Machine Learning

There are many approaches for training decision trees. This work introduces a novel gradient-based method for constructing decision trees that optimize arbitrary differentiable loss functions, overcoming the limitations of heuristic splitting rules. Unlike traditional approaches that rely on heuristic splitting rules, the proposed method refines predictions using the first and second derivatives of the loss function, enabling the optimization of complex tasks such as classification, regression, and survival analysis. We demonstrate the method's applicability to classification, regression, and survival analysis tasks, including those with censored data. Numerical experiments on both real and synthetic datasets compare the proposed method with traditional decision tree algorithms, such as CART, Extremely Randomized Trees, and SurvTree. The implementation of the method is publicly available, providing a practical tool for researchers and practitioners. This work advances the field of decision tree-based modeling, offering a more flexible and accurate approach for handling structured data and complex tasks. By leveraging gradient-based optimization, the proposed method bridges the gap between traditional decision trees and modern machine learning techniques, paving the way for further innovations in interpretable and high-performing models.


Resource efficient data transmission on animals based on machine learning

Kerle-Malcharek, Wilhelm, Klein, Karsten, Wikelski, Martin, Schreiber, Falk, Wild, Timm A.

arXiv.org Artificial Intelligence

Bio-loggers, electronic devices used to track animal behaviour through various sensors, have become essential in wildlife research. Despite continuous improvements in their capabilities, bio-loggers still face significant limitations in storage, processing, and data transmission due to the constraints of size and weight, which are necessary to avoid disturbing the animals. This study aims to explore how selective data transmission, guided by machine learning, can reduce the energy consumption of bio-loggers, thereby extending their operational lifespan without requiring hardware modifications.