
Collaborating Authors

 Niekler, Andreas


Small-Text: Active Learning for Text Classification in Python

arXiv.org Artificial Intelligence

We introduce small-text, an easy-to-use active learning library that offers pool-based active learning for single- and multi-label text classification in Python. It features numerous pre-implemented state-of-the-art query strategies, including some that leverage the GPU. Standardized interfaces allow a variety of classifiers, query strategies, and stopping criteria to be combined, facilitating quick mixing and matching and enabling rapid and convenient development of both active learning experiments and applications. With the objective of making various classifiers and query strategies accessible for active learning, small-text integrates several well-known machine learning libraries, namely scikit-learn, PyTorch, and Hugging Face transformers. The latter integrations are optionally installable extensions, so GPUs can be used but are not required. Using this new library, we investigate the performance of the recently published SetFit training paradigm, which we compare to vanilla transformer fine-tuning, finding that it matches the latter in classification accuracy while outperforming it in area under the curve. The library is available under the MIT License at https://github.com/webis-de/small-text, in version 1.3.0 at the time of writing.
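The pool-based setting described in this abstract can be illustrated with a minimal, library-free sketch. It does not use the small-text API: the `NaiveBayes` classifier, the `least_confidence` query strategy, the fixed-query-count stopping rule, and the toy pool are all assumptions made purely for illustration. The point is only to show how a classifier, a query strategy, and a stopping criterion plug into one loop.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (illustrative only)."""
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (stand-in classifier)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict_proba(self, text):
        """Return class probabilities in the order of self.classes."""
        total = sum(self.class_counts.values())
        log_scores = []
        for c in self.classes:
            n_words = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / total)
            for w in tokenize(text):
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log(
                        (self.word_counts[c][w] + 1) / (n_words + len(self.vocab))
                    )
            log_scores.append(score)
        top = max(log_scores)
        exp = [math.exp(s - top) for s in log_scores]
        norm = sum(exp)
        return [e / norm for e in exp]


def least_confidence(clf, pool, unlabeled, n):
    """Query strategy: pick the n pool items whose top class probability is lowest."""
    ranked = sorted(unlabeled, key=lambda i: max(clf.predict_proba(pool[i])))
    return ranked[:n]


def active_learning_loop(pool, oracle, initial, n_queries, batch_size):
    """Pool-based loop; the stopping criterion here is a fixed number of queries."""
    labeled = list(initial)
    for _ in range(n_queries):
        clf = NaiveBayes().fit([pool[i] for i in labeled], [oracle[i] for i in labeled])
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # In a real application a human annotator would supply these labels.
        labeled.extend(least_confidence(clf, pool, unlabeled, batch_size))
    clf = NaiveBayes().fit([pool[i] for i in labeled], [oracle[i] for i in labeled])
    return clf, labeled


# Hypothetical toy pool; `oracle` plays the role of the annotator.
pool = [
    "great movie loved it",
    "terrible movie hated it",
    "loved the acting great cast",
    "hated the plot terrible pacing",
    "great soundtrack",
    "terrible ending",
    "loved it",
    "boring and terrible",
]
oracle = [1, 0, 1, 0, 1, 0, 1, 0]

clf, labeled = active_learning_loop(pool, oracle, initial=[0, 1], n_queries=2, batch_size=2)
```

Because the classifier and the query strategy only meet through `predict_proba`, either one can be swapped without touching the loop, which is the kind of mix and match the abstract attributes to standardized interfaces.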


Using Language Models on Low-end Hardware

arXiv.org Artificial Intelligence

The transition to neural networks as the primary machine learning paradigm in natural language processing (NLP), and especially the pre-training of language models, has become a major driver of NLP tasks within the Digital Humanities. Many applications in fields such as Library Science, Literature Studies, and Cultural Studies have improved dramatically, and the automation of text-based tasks is becoming widely possible. Current state-of-the-art approaches utilize pre-trained neural language models, which are fine-tuned to a given set of target variables (i.e., by training all parameters of the language model). Training neural networks requires calculating a gradient for every layer and batch element, thus easily tripling the required memory. Training these complex, multi-step architectures efficiently often requires specialized hardware, for example graphics processing units (GPUs).
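The memory claim can be made concrete with a rough back-of-the-envelope sketch. The function below is an assumption-laden estimate, not the authors' calculation: it counts only weights, one gradient per weight, and a configurable number of optimizer state values per weight, and it ignores activation memory, which grows with batch size and sequence length. The parameter count of 110 million is an illustrative figure, roughly the size of a BERT-base model.

```python
def model_memory_bytes(n_params, bytes_per_value=4, optimizer_states_per_param=1, train=True):
    """Rough memory estimate for holding (and optionally training) a model.

    Assumptions: 32-bit values by default, one gradient per parameter, and
    `optimizer_states_per_param` extra values per parameter (1 for SGD with
    momentum, 2 for Adam). Activations are deliberately not counted.
    """
    weights = n_params * bytes_per_value
    if not train:
        return weights
    gradients = n_params * bytes_per_value
    optimizer = n_params * optimizer_states_per_param * bytes_per_value
    return weights + gradients + optimizer


n_params = 110_000_000  # illustrative size, about that of BERT-base

inference = model_memory_bytes(n_params, train=False)
full_finetune = model_memory_bytes(n_params, train=True)
adam_finetune = model_memory_bytes(n_params, optimizer_states_per_param=2)

print(f"inference only:      {inference / 1e9:.2f} GB")
print(f"fine-tune (momentum): {full_finetune / 1e9:.2f} GB")
print(f"fine-tune (Adam):     {adam_finetune / 1e9:.2f} GB")
```

Under these assumptions, training all parameters with a single optimizer state per weight needs three times the memory of inference, and an optimizer that keeps two states per weight needs four times as much, before any activations are stored.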


Using Text Classification with a Bayesian Correction for Estimating Overreporting in the Creditor Reporting System on Climate Adaptation Finance

arXiv.org Artificial Intelligence

There is international consensus on the need to respond to the global threat posed by climate change (Paris Accord, Article 2). Development funds are essential to finance climate change adaptation and are thus an important part of international climate policy. The 2009 Copenhagen Accord (UNFCCC, 2009) aimed to mobilize USD 100 billion by 2020. Implementation of climate change adaptation measures is one of five targets set to reach the 13th Sustainable Development Goal (SDG): "Take urgent action to combat climate change and its impacts". The Creditor Reporting System (CRS), maintained by the OECD Development Assistance Committee (DAC), monitors adaptation finance flows from OECD DAC member countries to developing countries. One of the challenges in ensuring valid reporting, or at least comparable figures, across reporting agencies is that the agreements mentioned above lack indicators. To this end, the OECD DAC established the Rio markers on climate change adaptation (CCA) in 2009. For each aid activity, donors report whether it contributes to CCA, i.e., reducing "the vulnerability of human or natural systems to the current and expected impacts of climate change, including climate variability, by maintaining or increasing resilience, through increased ability to adapt to, or absorb, climate change stresses, shocks and variability and/or by helping reduce exposure to them" (OECD DAC, 2022, p. 4). Activities are eligible for a marker if "a) the climate change adaptation objective is explicitly indicated in the activity documentation; and b) the activity contains specific measures targeting the definition above."