AITopics | Corro, Caio

Corro, Caio

EuroBERT: Scaling Multilingual Encoders for European Languages

Boizard, Nicolas, Gisserot-Boukhlef, Hippolyte, Alves, Duarte M., Martins, André, Hammal, Ayoub, Corro, Caio, Hudelot, Céline, Malherbe, Emmanuel, Malaboeuf, Etienne, Jourdan, Fanny, Hautreux, Gabriel, Alves, João, El-Haddad, Kevin, Faysse, Manuel, Peyrard, Maxime, Guerreiro, Nuno M., Fernandes, Patrick, Rei, Ricardo, Colombo, Pierre

arXiv.org Artificial Intelligence, Mar-7-2025

Many important tasks in Natural Language Processing (NLP), including information retrieval, classification, or regression, are built upon general-purpose vector representations. These representations are traditionally obtained from bidirectional encoder models, which aggregate information from the left and right contexts of each token (Devlin et al., 2019; Conneau et al., 2020; He et al., 2023). In contrast, recent advances in generative modeling have shifted the research community's attention towards unidirectional architectures (Bai et al., 2023; Llama Team, 2024; OLMo et al., 2025). Notably, these efforts have identified several key performance drivers that span architectural advances, data improvements, and increased scale. Yet, despite no apparent barrier to transferring these insights to bidirectional architectures, little effort has been devoted towards this objective, forcing practitioners to depend on outdated models. In this paper, we introduce a refreshed recipe for training general-purpose multilingual encoders, resulting in the EuroBERT family. Drawing inspiration from recent progress in decoder models, our models feature an updated architecture (§2.1), and are trained on a 5T-token multilingual dataset, covering widely spoken European and global languages,

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2503.055

Country:

Asia (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Portugal > Lisbon > Lisbon (0.14)

Genre: Research Report (0.41)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)

Training LayoutLM from Scratch for Efficient Named-Entity Recognition in the Insurance Domain

Uthayasooriyar, Benno, Ly, Antoine, Vermet, Franck, Corro, Caio

arXiv.org Artificial Intelligence, Dec-12-2024

Generic pre-trained neural networks may struggle to produce good results in specialized domains like finance and insurance. This is due to a domain mismatch between training data and downstream tasks, as in-domain data are often scarce due to privacy constraints. In this work, we compare different pre-training strategies for LayoutLM. We show that using domain-relevant documents improves results on a named-entity recognition (NER) problem using a novel dataset of anonymized insurance-related financial documents called Payslips. Moreover, we show that we can achieve competitive results using a smaller and faster model.

artificial intelligence, information retrieval, natural language, (15 more...)

arXiv.org Artificial Intelligence

2412.09341

Country:

North America > United States > Minnesota (0.28)
North America > United States > California (0.28)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Few-Shot Domain Adaptation for Named-Entity Recognition via Joint Constrained k-Means and Subspace Selection

Hammal, Ayoub, Uthayasooriyar, Benno, Corro, Caio

arXiv.org Artificial Intelligence, Dec-12-2024

Named-entity recognition (NER) is a task that typically requires large annotated datasets, which limits its applicability across domains with varying entity definitions. This paper addresses few-shot NER, aiming to transfer knowledge to new domains with minimal supervision. Unlike previous approaches that rely solely on limited annotated data, we propose a weakly supervised algorithm that combines small labeled datasets with large amounts of unlabeled data. Our method extends the k-means algorithm with label supervision, cluster size constraints and domain-specific discriminative subspace selection. This unified framework achieves state-of-the-art results in few-shot NER on several English datasets.
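To make the idea of "k-means with label supervision" concrete, here is a minimal sketch of seeded k-means in plain Python. It is an illustration under stated assumptions, not the authors' algorithm: it omits the cluster size constraints and the subspace selection mentioned in the abstract, and the function name `seeded_kmeans`, the data layout, and the requirement that every class has at least one labeled point are all hypothetical choices made for this example.

```python
def seeded_kmeans(points, labels, k, iters=10):
    """Cluster `points` (tuples of floats) into k clusters.

    `labels[i]` is a class id in range(k) for labeled points, or None for
    unlabeled ones. Assumption of this sketch: each class has at least one
    labeled point, so no cluster can ever become empty.
    """
    dim = len(points[0])

    def mean(group):
        return tuple(sum(p[d] for p in group) / len(group) for d in range(dim))

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Initialise each centroid from the labeled points of its class.
    centroids = [mean([p for p, y in zip(points, labels) if y == c])
                 for c in range(k)]
    assign = list(labels)
    for _ in range(iters):
        # Labeled points stay in their class cluster (the supervision);
        # unlabeled points go to the nearest centroid (standard k-means).
        assign = [
            y if y is not None
            else min(range(k), key=lambda c: sqdist(p, centroids[c]))
            for p, y in zip(points, labels)
        ]
        # Recompute centroids from all points currently in each cluster.
        centroids = [mean([p for p, a in zip(points, assign) if a == c])
                     for c in range(k)]
    return assign, centroids
```

For example, with two labeled seeds and four unlabeled points, `seeded_kmeans([(0, 0), (0, 1), (10, 10), (10, 11), (1, 0), (9, 10)], [0, None, 1, None, None, None], k=2)` groups each unlabeled point with the nearer seed.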

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2412.00426

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

A fast and sound tagging method for discontinuous named-entity recognition

Corro, Caio

arXiv.org Artificial Intelligence, Sep-24-2024

We introduce a novel tagging scheme for discontinuous named entity recognition based on an explicit description of the inner structure of discontinuous mentions. We rely on a weighted finite state automaton for both marginal and maximum a posteriori inference. As such, our method is sound in the sense that (1) well-formedness of predicted tag sequences is ensured via the automaton structure and (2) there is an unambiguous mapping between well-formed sequences of tags and (discontinuous) mentions. We evaluate our approach on three English datasets in the biomedical domain, and report comparable results to state-of-the-art while having a way simpler and faster model.
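The notion of well-formedness enforced by an automaton can be illustrated in a much simpler setting. The sketch below uses the standard BIO scheme for continuous mentions, not the paper's tagging scheme for discontinuous mentions, and the automaton is unweighted. The transition table and the function name `is_well_formed` are assumptions made for this example.

```python
# Allowed transitions of a small finite-state automaton over plain BIO tags.
# Assumption: this is the textbook BIO constraint (an "I" tag must continue
# a mention opened by "B"), not the scheme introduced in the paper.
TRANSITIONS = {
    "START": {"O", "B"},
    "O": {"O", "B"},
    "B": {"O", "B", "I"},
    "I": {"O", "B", "I"},
}


def is_well_formed(tags):
    """Return True if the tag sequence is accepted by the automaton."""
    state = "START"
    for tag in tags:
        if tag not in TRANSITIONS[state]:
            return False  # forbidden transition, e.g. "O" followed by "I"
        state = tag
    return True
```

Restricting inference to paths of such an automaton means that an ill-formed sequence such as `["O", "I"]` can never be predicted.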

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2409.16243

Country:

North America > United States > Texas (0.46)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area (0.46)
Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks

Herrera, Santiago, Corro, Caio, Kahane, Sylvain

arXiv.org Artificial Intelligence, Mar-26-2024

Descriptive grammars are highly valuable, but writing them is time-consuming and difficult. Furthermore, while linguists typically use corpora to create them, grammar descriptions often lack quantitative data. As for formal grammars, they can be challenging to interpret. In this paper, we propose a new method to extract and explore significant fine-grained grammar patterns and potential syntactic grammar rules from treebanks, in order to create an easy-to-understand corpus-based grammar. More specifically, we extract descriptions and rules across different languages for two linguistic phenomena, agreement and word order, using a large search space and paying special attention to the ranking order of the extracted rules. For that, we use a linear classifier to extract the most salient features that predict the linguistic phenomena under study. We associate statistical information to each rule, and we compare the ranking of the model's results to those of other quantitative and statistical measures.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2403.17534

Country:

North America > United States (0.46)
Europe > France > Île-de-France (0.14)
Asia > Middle East > Republic of Türkiye (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.41)

SaulLM-7B: A pioneering Large Language Model for Law

Colombo, Pierre, Pires, Telmo Pessoa, Boudiaf, Malik, Culver, Dominic, Melo, Rui, Corro, Caio, Martins, Andre F. T., Esposito, Fabrizio, Raposo, Vera Lúcia, Morgado, Sofia, Desa, Michael

arXiv.org Artificial Intelligence, Mar-7-2024

In this paper, we introduce SaulLM-7B, a large language model (LLM) tailored for the legal domain. With 7 billion parameters, SaulLM-7B is the first LLM designed explicitly for legal text comprehension and generation. Leveraging the Mistral 7B architecture as its foundation, SaulLM-7B is trained on an English legal corpus of over 30 billion tokens. SaulLM-7B exhibits state-of-the-art proficiency in understanding and processing legal documents. Additionally, we present a novel instructional fine-tuning method that leverages legal datasets to further enhance SaulLM-7B's performance in legal tasks. SaulLM-7B is released under the MIT License.

large language model, machine learning, preprint arxiv, (19 more...)

arXiv.org Artificial Intelligence

2403.03883

Country:

North America > United States (0.46)
Europe > Portugal > Lisbon > Lisbon (0.14)
Europe > United Kingdom > Scotland (0.14)

Genre: Research Report (0.82)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

CroissantLLM: A Truly Bilingual French-English Language Model

Faysse, Manuel, Fernandes, Patrick, Guerreiro, Nuno M., Loison, António, Alves, Duarte M., Corro, Caio, Boizard, Nicolas, Alves, João, Rei, Ricardo, Martins, Pedro H., Casademunt, Antoni Bigata, Yvon, François, Martins, André F. T., Viaud, Gautier, Hudelot, Céline, Colombo, Pierre

arXiv.org Artificial Intelligence, Feb-2-2024

We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French Language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases, and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models, and strong translation models. We evaluate our model through the FMTI framework, and validate 81 % of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in language models.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2402.00786

Country:

Europe > France (0.68)
Africa (0.67)
Europe > Portugal > Lisbon > Lisbon (0.14)
(5 more...)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Structural generalization in COGS: Supertagging is (almost) all you need

Petit, Alban, Corro, Caio, Yvon, François

arXiv.org Artificial Intelligence, Oct-21-2023

In many Natural Language Processing applications, neural networks have been found to fail to generalize on out-of-distribution examples. In particular, several recent semantic parsing datasets have put forward important limitations of neural networks in cases where compositional generalization is required. In this work, we extend a neural graph-based semantic parsing framework in several ways to alleviate this issue. Notably, we propose: (1) the introduction of a supertagging step with valency constraints, expressed as an integer linear program; (2) a reduction of the graph prediction problem to the maximum matching problem; (3) the design of an incremental early-stopping training strategy to prevent overfitting. Experimentally, our approach significantly improves results on examples that require structural generalization in the COGS dataset, a known challenging benchmark for compositional generalization. Overall, our results confirm that structural constraints are important for generalization in semantic parsing.

artificial intelligence, natural language, structural generalization, (2 more...)

arXiv.org Artificial Intelligence

2310.14124

Genre: Research Report > New Finding (0.53)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

A dynamic programming algorithm for span-based nested named-entity recognition in O(n^2)

Corro, Caio

arXiv.org Artificial Intelligence, May-26-2023

Named entity recognition (NER) is a fundamental problem in information retrieval that aims to identify mentions of entities and their associated types in natural language documents. As such, the problem can be reduced to the identification and classification of segments of texts. In particular, we focus on mentions that have the following properties:

1. continuous, i.e. a mention corresponds to a contiguous sequence of words;
2. potentially nested, i.e. one mention can be inside another, but they can never partially overlap.

Our main contributions can be summarized as follows:

We present the semi-Markov and CYK-like models for non-nested and nested NER, respectively. Although we do not claim that these approaches for NER are new, our presentation of the CYK-like algorithm differs from previous work as it is tailored to the NER problem and guarantees uniqueness of derivations;
We introduce a novel search space for nested NER that has no significant loss in coverage
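The two mention properties stated in this excerpt (contiguous mentions that may nest but never partially overlap) can be checked with a few lines of Python. This is a minimal sketch for illustration only: the half-open `(start, end)` span representation and the function names are assumptions, not taken from the paper.

```python
from itertools import combinations


def partially_overlap(a, b):
    """True if spans a and b cross: they share words, but neither contains the other.

    Spans are (start, end) word offsets, with end exclusive (an assumption
    of this sketch). Each span is contiguous by construction.
    """
    (s1, e1), (s2, e2) = sorted([a, b])
    # After sorting, s1 <= s2. The spans cross only when the second one
    # starts strictly inside the first and ends strictly after it.
    return s1 < s2 < e1 < e2


def is_valid_nested(spans):
    """True if no pair of mentions partially overlaps (nesting is allowed)."""
    return not any(partially_overlap(a, b) for a, b in combinations(spans, 2))
```

Here `is_valid_nested([(0, 5), (1, 3), (3, 5)])` holds because the two inner spans sit inside the outer one, whereas `[(0, 3), (2, 5)]` is rejected because the spans cross.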

artificial intelligence, information retrieval, natural language, (21 more...)

arXiv.org Artificial Intelligence

2210.04738

Country:

North America > United States (1.00)
Europe (1.00)
Asia (0.68)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.93)

On graph-based reentrancy-free semantic parsing

Petit, Alban, Corro, Caio

arXiv.org Artificial Intelligence, Feb-15-2023

We propose a novel graph-based approach for semantic parsing that resolves two problems observed in the literature: (1) seq2seq models fail on compositional generalization tasks; (2) previous work using phrase structure parsers cannot cover all the semantic parses observed in treebanks. We prove that both MAP inference and latent tag anchoring (required for weakly-supervised learning) are NP-hard problems. We propose two optimization algorithms based on constraint smoothing and conditional gradient to approximately solve these inference problems. Experimentally, our approach delivers state-of-the-art results on Geoquery, Scan and Clevr, both for i.i.d. splits and for splits that test for compositional generalization.

artificial intelligence, computational linguistic, natural language, (18 more...)

arXiv.org Artificial Intelligence

2302.07679

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)
North America > Canada > British Columbia (0.28)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
