Goto

Collaborating Authors

 clarin


CLASSLA-Express: a Train of CLARIN.SI Workshops on Language Resources and Tools with Easily Expanding Route

arXiv.org Artificial Intelligence

This paper introduces the CLASSLA-Express workshop series as an innovative approach to disseminating linguistic resources and infrastructure provided by the CLASSLA Knowledge Centre for South Slavic languages and the Slovenian CLARIN.SI infrastructure. The workshop series employs two key strategies: (1) conducting workshops directly in countries with interested audiences, and (2) designing the series for easy expansion to new venues. The first iteration of the CLASSLA-Express workshop series encompasses 6 workshops in 5 countries. Its goal is to share knowledge on the use of corpus querying tools, as well as the recently-released CLASSLA-web corpora - the largest general corpora for South Slavic languages. In the paper, we present the design of the workshop series, its current scope and the effortless extensions of the workshop to new venues that are already in sight.


CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation

arXiv.org Artificial Intelligence

This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, covering thereby the whole spectrum of official languages in the South Slavic language space. The collection of these corpora comprises a total of 13 billion tokens of texts from 26 million documents. The comparability of the corpora is ensured by a comparable crawling setup and the usage of identical crawling and post-processing technology. All the corpora were linguistically annotated with the state-of-the-art CLASSLA-Stanza linguistic processing pipeline, and enriched with document-level genre information via the Transformer-based multilingual X-GENRE classifier, which further enhances comparability at the level of linguistic annotation and metadata enrichment. The genre-focused analysis of the resulting corpora shows a rather consistent distribution of genres throughout the seven corpora, with variations in the most prominent genre categories being well-explained by the economic strength of each language community. A comparison of the distribution of genre categories across the corpora indicates that web corpora from less developed countries primarily consist of news articles.


Doing text analytics for Digital Humanities and Social Sciences with CLARIN (LDK tutorial), Galway 2017

VideoLectures.NET

Text is a basic material, a primary data layer, in many areas of humanities and social sciences. If we want to move forward with the agenda that the fields of digital humanities and computational social sciences are projecting, it is vital to bring together the technical areas that deal with automated text processing, and scholars in the humanities and social sciences. Much progress has been made in the last two decades in text analytics, a field that draws on recent advances in computational linguistics, information retrieval and machine learning. By now we know what to expect from basic tools, such as named entity recognition. To foster new areas of research, it is necessary to not only understand what is out there in terms of proven technologies and infrastructures such as CLARIN, but also how the developers of text analytics can work with researchers in the humanities and social sciences to understand the challenges in each other's field better.