Adaptations of AI models for querying the LandMatrix database in natural language
Kbir, Fatiha Ait, Bourgoin, Jérémy, Decoupes, Rémy, Gradeler, Marie, Interdonato, Roberto
The Land Matrix initiative (https://landmatrix.org) and its global observatory aim to provide reliable data on large-scale land acquisitions to inform debates and actions in sectors such as agriculture, extraction, or energy in low- and middle-income countries. Although these data are recognized in academia, they remain underutilized in public policy, mainly because accessing and exploiting them requires technical expertise and a good understanding of the database schema. The objective of this work is to simplify access to data from different database systems; the methods proposed in this article are evaluated on data from the Land Matrix. This work presents comparisons of Large Language Models (LLMs) as well as combinations of LLM adaptations (Prompt Engineering, RAG, Agents) for querying different database systems through GraphQL and REST interfaces. The experiments are reproducible, and a demonstration is available online: https://github.com/tetis-nlp/landmatrix-graphql-python.
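The prompt-engineering adaptation can be illustrated with a minimal sketch: the LLM is shown a schema excerpt and asked to translate a natural-language question into a GraphQL query. The schema fields and the prompt wording below are illustrative assumptions, not the actual Land Matrix schema or the authors' prompts.

```python
# Toy text-to-GraphQL prompt builder. The Deal type below is a made-up
# excerpt, not the real Land Matrix schema.
SCHEMA_EXCERPT = """
type Deal {
  id: ID!
  country: String
  intention_of_investment: String
  deal_size: Int
}
"""

def build_text_to_graphql_prompt(question: str, schema: str = SCHEMA_EXCERPT) -> str:
    """Assemble the prompt sent to the LLM (prompt-engineering adaptation)."""
    return (
        "You translate questions into GraphQL queries.\n"
        f"Schema:\n{schema}\n"
        f"Question: {question}\n"
        "Answer with a single GraphQL query, nothing else."
    )

prompt = build_text_to_graphql_prompt("List deals larger than 1000 ha in Senegal")
print(prompt)
```

The same prompt, sent to any instruction-tuned LLM, yields a candidate query that can then be executed against the API; RAG and agent variants differ mainly in how the schema excerpt is retrieved and how execution errors are fed back.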
Semi Supervised Heterogeneous Domain Adaptation via Disentanglement and Pseudo-Labelling
Dantas, Cassio F., Gaetano, Raffaele, Ienco, Dino
Semi-supervised domain adaptation methods leverage information from a labelled source domain with the goal of generalizing over a scarcely labelled target domain. While this setting already poses challenges due to potential distribution shifts between domains, an even more complex scenario arises when source and target data differ in modality representation (e.g. they are acquired by sensors with different characteristics). For instance, in remote sensing, images may be collected via various acquisition modes (e.g. optical or radar), different spectral characteristics (e.g. RGB or multi-spectral) and spatial resolutions. Such a setting is denoted as Semi-Supervised Heterogeneous Domain Adaptation (SSHDA) and it exhibits an even more severe distribution shift due to modality heterogeneity across domains. To cope with the challenging SSHDA setting, here we introduce SHeDD (Semi-supervised Heterogeneous Domain adaptation via Disentanglement), an end-to-end neural framework tailored to learning a target domain classifier by leveraging both labelled and unlabelled data from heterogeneous data sources. SHeDD is designed to effectively disentangle domain-invariant representations, relevant for the downstream task, from domain-specific information that can hinder cross-modality transfer. Additionally, SHeDD adopts an augmentation-based consistency regularization mechanism that takes advantage of reliable pseudo-labels on the unlabelled target samples to further boost its generalization ability on the target domain. Empirical evaluations on two remote sensing benchmarks, encompassing heterogeneous data in terms of acquisition modes and spectral/spatial resolutions, demonstrate the quality of SHeDD compared to both baseline and state-of-the-art competing approaches. Our code is publicly available here: https://github.com/tanodino/SSHDA/
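The pseudo-labelling step can be illustrated independently of the network: an unlabelled target sample contributes to training only when the classifier's confidence exceeds a threshold, in which case its predicted class becomes the training target for the consistency loss. A minimal, framework-free sketch (the threshold value and function name are ours, not the paper's):

```python
def select_pseudo_labels(probs, threshold=0.95):
    """Given per-sample class-probability lists for unlabelled target
    samples, return (index, pseudo_label) pairs for samples whose top
    confidence exceeds the threshold; the rest are discarded for this
    training step."""
    selected = []
    for i, p in enumerate(probs):
        conf = max(p)
        if conf > threshold:
            selected.append((i, p.index(conf)))
    return selected

probs = [
    [0.97, 0.02, 0.01],  # confident -> pseudo-label 0
    [0.40, 0.35, 0.25],  # ambiguous -> discarded
    [0.01, 0.01, 0.98],  # confident -> pseudo-label 2
]
print(select_pseudo_labels(probs))  # [(0, 0), (2, 2)]
```

In the full method, the probabilities come from a weakly augmented view of the sample and the pseudo-label supervises a strongly augmented view, which is what makes the regularization a consistency constraint.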
Cooperative learning of Pl@ntNet's Artificial Intelligence algorithm: how does it work and how can we improve it?
Lefort, Tanguy, Affouard, Antoine, Charlier, Benjamin, Lombardo, Jean-Christophe, Chouet, Mathias, Goëau, Hervé, Salmon, Joseph, Bonnet, Pierre, Joly, Alexis
Deep learning models for plant species identification rely on large annotated datasets. The PlantNet system enables global data collection by allowing users to upload and annotate plant observations, leading to noisy labels due to diverse user skills. Achieving consensus is crucial for training, but the vast scale of collected data makes traditional label aggregation strategies challenging. Existing methods either retain all observations, resulting in noisy training data, or selectively keep those with sufficient votes, discarding valuable information. Additionally, as many species are rarely observed, user expertise cannot be estimated via inter-user agreement: otherwise, botanical experts would carry less weight in the AI training step than the average user. Our proposed label aggregation strategy aims to cooperatively train plant identification AI models. This strategy estimates user expertise as a trust score per user based on their ability to identify plant species from crowdsourced data. The trust score is recursively estimated from correctly identified species given the current estimated labels. This interpretable score exploits botanical experts' knowledge and the heterogeneity of users. Subsequently, our strategy removes unreliable observations but retains those with limited trusted annotations, unlike other approaches. We evaluate PlantNet's strategy on a released large subset of the PlantNet database focused on European flora, comprising over 6M observations and 800K users. We demonstrate that estimating users' skills based on the diversity of their expertise enhances labeling performance. Our findings emphasize the synergy of human annotation and data filtering in improving AI performance for a refined dataset. We explore incorporating AI-based votes alongside human input, which can further enhance human-AI interactions to detect unreliable observations.
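The recursive trust estimation can be illustrated as a fixed-point iteration: labels are trust-weighted majority votes, and each user's trust is their accuracy against the current labels. This is a deliberately simplified sketch (the actual Pl@ntNet aggregation is richer, e.g. in how rare species and sparse voters are handled):

```python
def aggregate_with_trust(votes, n_iter=10):
    """votes: dict observation_id -> list of (user_id, species) pairs.
    Alternates between (1) estimating labels by trust-weighted majority
    vote and (2) re-estimating each user's trust as the fraction of
    their votes that agree with the current labels."""
    users = {u for vs in votes.values() for u, _ in vs}
    trust = {u: 1.0 for u in users}
    labels = {}
    for _ in range(n_iter):
        # Label step: trust-weighted majority vote per observation.
        for obs, vs in votes.items():
            scores = {}
            for u, sp in vs:
                scores[sp] = scores.get(sp, 0.0) + trust[u]
            labels[obs] = max(scores, key=scores.get)
        # Trust step: per-user accuracy against current labels.
        for u in users:
            mine = [(obs, sp) for obs, vs in votes.items()
                    for uu, sp in vs if uu == u]
            correct = sum(1 for obs, sp in mine if labels[obs] == sp)
            trust[u] = correct / len(mine) if mine else 1.0
    return labels, trust

votes = {
    "obs1": [("expert", "Quercus ilex"), ("novice", "Quercus robur"),
             ("amateur", "Quercus ilex")],
    "obs2": [("expert", "Cistus albidus"), ("novice", "Cistus albidus")],
}
labels, trust = aggregate_with_trust(votes)
```

Here the novice's disagreement on obs1 lowers their trust score, so their future votes weigh less, which is the cooperative mechanism the abstract describes in miniature.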
A lexicon obtained and validated by a data-driven approach for organic residues valorization in emerging and developing countries
Rakotomalala, Christiane, Paillat, Jean-Marie, Feder, Frédéric, Avadí, Angel, Thuriès, Laurent, Vermeire, Marie-Liesse, Médoc, Jean-Michel, Wassenaar, Tom, Hottelart, Caroline, Kieffer, Lilou, Ndjie, Elisa, Picart, Mathieu, Tchamgoue, Jorel, Tulle, Alvin, Valade, Laurine, Boyer, Annie, Duchamp, Marie-Christine, Roche, Mathieu
The text mining method presented in this paper was used for the annotation of terms related to the biological transformation and valorization of organic residues in agriculture in low- and middle-income countries. A specialized lexicon was obtained through several steps: corpus building and term extraction, annotation of the extracted terms, and selection of relevant terms.
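The term-extraction step can be illustrated with a toy frequency-based candidate extractor (a simplified stand-in for the actual text-mining pipeline, whose tooling the abstract does not detail):

```python
import re
from collections import Counter

def candidate_terms(corpus, max_len=3, min_freq=2):
    """Toy candidate-term extraction: count word n-grams (n <= max_len)
    across documents and keep those occurring at least min_freq times,
    producing candidates for manual annotation and selection."""
    counts = Counter()
    for doc in corpus:
        words = re.findall(r"[a-zà-ÿ']+", doc.lower())
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return {t for t, c in counts.items() if c >= min_freq}

corpus = [
    "Composting of organic residues improves soil fertility.",
    "Organic residues valorization through composting.",
]
terms = candidate_terms(corpus)
```

On this two-document corpus, recurring candidates such as "composting" and "organic residues" survive the frequency filter, while one-off phrases are dropped before the annotation stage.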
A two-head loss function for deep Average-K classification
Garcin, Camille, Servajean, Maximilien, Joly, Alexis, Salmon, Joseph
Average-K classification is an alternative to top-K classification in which the number of labels returned varies with the ambiguity of the input image but must average to K over all the samples. A simple method to solve this task is to threshold the softmax output of a model trained with the cross-entropy loss. This approach is theoretically proven to be asymptotically consistent, but it is not guaranteed to be optimal for a finite set of samples. In this paper, we propose a new loss function based on a multi-label classification head in addition to the classical softmax. This second head is trained using pseudo-labels generated by thresholding the softmax head while guaranteeing that K classes are returned on average. We show that this approach allows the model to better capture ambiguities between classes and, as a result, to return more consistent sets of possible classes. Experiments on two datasets from the literature demonstrate that our approach outperforms the softmax baseline, as well as several other loss functions more generally designed for weakly supervised multi-label classification. The gains are larger the higher the uncertainty, especially for classes with few samples.
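The softmax-thresholding baseline the paper starts from fits in a few lines: pool all softmax scores over a calibration set and pick the threshold such that the average number of classes above it equals K. A minimal sketch with toy scores (function names are ours):

```python
def average_k_threshold(score_matrix, k):
    """Return the threshold t such that predicting every class whose
    score is >= t yields, on average over the samples, k classes per
    sample. With n samples this means keeping the n*k largest scores."""
    n = len(score_matrix)
    flat = sorted((s for row in score_matrix for s in row), reverse=True)
    return flat[n * k - 1]  # smallest kept score, used as inclusive threshold

def predict_sets(score_matrix, t):
    """Return, per sample, the list of class indices scoring >= t."""
    return [[j for j, s in enumerate(row) if s >= t] for row in score_matrix]

scores = [
    [0.70, 0.20, 0.10],  # confident sample -> small predicted set
    [0.40, 0.35, 0.25],  # ambiguous sample -> larger predicted set
]
t = average_k_threshold(scores, k=2)
sets = predict_sets(scores, t)
```

Note how the set size adapts per sample (1 class for the confident input, 3 for the ambiguous one) while averaging to K=2; the paper's two-head loss aims to make these sets more consistent than this plain thresholding.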
Structuring ontologies in a context of collaborative system modelling
Chaib, Romy Lynn, Thomopoulos, Rallou, Macombe, Catherine
Prospective studies require discussing and collaborating with stakeholders to create scenarios of the possible evolution of the studied value chain. However, stakeholders do not always use the same words when referring to one idea. Constructing an ontology and homogenizing vocabularies is thus crucial to identify the key variables that serve in the construction of the needed scenarios. Nevertheless, it is a very complex and time-consuming task. In this paper we present the method we used to manually build ontologies adapted to the needs of two complementary system-analysis models (namely the "Godet" and the "MyChoice" models), starting from interviews with the agri-food system's stakeholders.
Towards a Data-Driven Requirements Engineering Approach: Automatic Analysis of User Reviews
Wei, Jialiang, Courbis, Anne-Lise, Lambolais, Thomas, Xu, Binbin, Bernard, Pierre Louis, Dray, Gérard
We are concerned with Data-Driven Requirements Engineering, and in particular with the consideration of users' reviews. These online reviews are a rich source of information for extracting new needs and improvement requests. In this work, we provide an automated analysis using CamemBERT, a state-of-the-art language model for French. We created a multi-label classification dataset of 6,000 user reviews from three applications in the Health & Fitness field. The results are encouraging and suggest that it is possible to automatically identify reviews requesting new features. The dataset is available at: https://github.com/Jl-wei/APIA2022-French-user-reviews-classification-dataset.
Bounds of MIN_NCC and MAX_NCC and filtering scheme for graph domain variables
Justeau-Allaire, Dimitri, Birnbaum, Philippe, Lorca, Xavier
Graph domain variables and constraints are an extension of constraint programming introduced by Dooms et al. This approach was further investigated by Fages in his PhD thesis. On the other hand, Beldiceanu et al. presented a generic filtering scheme for global constraints based on graph properties. This scheme strongly relies on the computation of bounds on graph properties and can be used in the context of graph domain variables and constraints with a few adjustments. Bounds of MIN_NCC and MAX_NCC had been defined for the graph-based representation of global constraints for the path_with_loops graph class. In this note, we generalize those bounds to graph domain variables and to any graph class. We also provide a filtering scheme for any graph class and arbitrary bounds.
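A graph domain variable is framed by a kernel (mandatory vertices and edges) and an envelope (possible ones), and NCC counts connected components of an instantiation. The bound computations the note builds on repeatedly need the component count of such graphs; a compact union-find sketch of that basic ingredient (the toy kernel/envelope pair is illustrative, not from the note):

```python
def ncc(vertices, edges):
    """Number of connected components of an undirected graph,
    computed with union-find (path halving)."""
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(v) for v in vertices})

# Kernel (mandatory) and envelope (possible) graphs of a toy graph variable:
kernel_v, kernel_e = {1, 2, 4}, [(1, 2)]
env_v, env_e = {1, 2, 3, 4}, [(1, 2), (2, 3), (3, 4)]
print(ncc(kernel_v, kernel_e), ncc(env_v, env_e))  # 2 1
```

Any instantiation of the variable lies between these two graphs, which is why NCC bounds can be derived from structural reasoning over the kernel and envelope, and why tightening them enables the filtering scheme.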
The GeoLifeCLEF 2020 Dataset
Cole, Elijah, Deneu, Benjamin, Lorieul, Titouan, Servajean, Maximilien, Botella, Christophe, Morris, Dan, Jojic, Nebojsa, Bonnet, Pierre, Joly, Alexis
Understanding the geographic distribution of species is a key concern in conservation. By pairing species occurrences with environmental features, researchers can model the relationship between an environment and the species which may be found there. To facilitate research in this area, we present the GeoLifeCLEF 2020 dataset, which consists of 1.9 million species observations paired with high-resolution remote sensing imagery, land cover data, and altitude, in addition to traditional low-resolution climate and soil variables. We also discuss the GeoLifeCLEF 2020 competition, which aims to use this dataset to advance the state-of-the-art in location-based species recommendation.
A Language-Agnostic Model for Semantic Source Code Labeling
Gelman, Ben, Hoyle, Bryan, Moore, Jessica, Saxe, Joshua, Slater, David
Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while maintaining up-to-date representations of new programming languages, libraries, and functionalities. Comprehensive labeling of source code enables users to search for documents of interest and obtain a high-level understanding of their contents. We use Stack Overflow code snippets and their tags to train a language-agnostic, deep convolutional neural network to automatically predict semantic labels for source code documents. On Stack Overflow code snippets, we demonstrate a mean area under ROC of 0.957 over a long-tailed list of 4,508 tags. We also manually validate the model outputs on a diverse set of unlabeled source code documents retrieved from GitHub, and we obtain a top-1 accuracy of 86.6%. This strongly indicates that the model successfully transfers its knowledge from Stack Overflow snippets to arbitrary source code documents.
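The reported metric (mean area under ROC across tags) treats each tag as an independent binary problem; for one tag, the AUC equals the probability that a randomly chosen positive snippet outscores a randomly chosen negative one. A dependency-free sketch of that computation (toy labels and scores, not the paper's model outputs):

```python
def roc_auc(labels, scores):
    """AUC for one binary tag: probability that a positive example
    outscores a negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auc(label_matrix, score_matrix):
    """Macro-average the per-tag AUC over tags (columns)."""
    n_tags = len(label_matrix[0])
    aucs = [roc_auc([row[t] for row in label_matrix],
                    [row[t] for row in score_matrix])
            for t in range(n_tags)]
    return sum(aucs) / n_tags

labels = [[1, 0], [0, 1], [1, 1], [0, 0]]   # 4 snippets x 2 tags
scores = [[0.9, 0.2], [0.3, 0.8], [0.8, 0.6], [0.1, 0.4]]
print(mean_auc(labels, scores))
```

Macro-averaging over 4,508 tags, as here over two, gives every tag equal weight regardless of frequency, which is what makes 0.957 meaningful on a long-tailed tag distribution.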