
Collaborating Authors

 Lorré, Jean-Pierre


The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation

arXiv.org Artificial Intelligence

We present both the Lucie Training Dataset and the Lucie-7B foundation model. The Lucie Training Dataset is a multilingual collection of textual corpora centered around French and designed to offset the Anglo-centric biases found in many datasets used for large language model pretraining. Its French data is pulled not only from traditional web sources but also from French cultural heritage documents, filling an important gap in modern datasets. Beyond French, which makes up the largest share of the data, we added documents to support several other European languages, including English, Spanish, German, and Italian. Apart from its value as a resource for French language and culture, an important feature of this dataset is that it prioritizes data rights by minimizing copyrighted material. In addition, building on the philosophy of past open projects, it is redistributed in the form used for training and its processing is described on Hugging Face and GitHub. The Lucie-7B foundation model is trained on equal amounts of data in French and English (roughly 33% each) in an effort to better represent cultural aspects of French-speaking communities. We also describe two instruction fine-tuned models, Lucie-7B-Instruct-v1.1 and Lucie-7B-Instruct-human-data, which we release as demonstrations of Lucie-7B in use. These models achieve promising results compared to state-of-the-art models, demonstrating that an open approach prioritizing data rights can still deliver strong performance. We see these models as an initial step toward developing more performant, aligned models in the near future. Model weights for Lucie-7B and the Lucie instruct models, along with intermediate checkpoints for the former, are published on Hugging Face, while the model training and data preparation code is available on GitHub. This makes Lucie-7B one of the first OSI-compliant language models according to the new OSI definition.
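
As a quick illustration of how the published weights can be used, the sketch below loads Lucie-7B through the Hugging Face transformers library and generates a short French continuation. The repository id "OpenLLM-France/Lucie-7B" is assumed from the project's Hugging Face organization rather than stated in the abstract; adjust it if the weights are hosted under a different name.

```python
# Hedged sketch: load the Lucie-7B weights from Hugging Face and run a short
# generation. The repository id below is an assumption (not stated in the
# abstract). Requires `pip install transformers torch`.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenLLM-France/Lucie-7B"  # assumed Hugging Face repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Smoke test: continue a French prompt for a few tokens.
prompt = "La capitale de la France est"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```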


The Claire French Dialogue Dataset

arXiv.org Artificial Intelligence

The overwhelming success of OpenAI's ChatGPT, whose first version was released one year ago, has led to an undeniable surge of excitement about large language models (LLMs) among researchers and the general public alike. OpenAI's anything-but-open approach to sharing its models or information about training them, however, has led to an equally passionate reaction among those who feel that AI development should be widely accessible, that data usage should be transparent in order to protect the rights of those who have contributed the data, and that data, a resource crucial to the development and understanding of AI models, should be shared with the broader research community. The call for transparency has begun to bear fruit. High-profile language models like Falcon [Almazrouei et al., 2023], Llama 2 [Touvron et al., 2023] and MPT [MosaicML NLP Team, 2023], to name just a few, come very close to a classic definition of open source. A central part of OpenLLM France's mission is to contribute to this momentum by building language models and remaining fully transparent about every step of model training, including the data used for training. Another objective, which we find equally important, is to increase the availability of language models and training data geared to languages other than English and to non-anglophone cultures. Indeed, the majority of the high-profile LLMs available today are trained primarily on English documents coming from anglophone cultures. Only 0.16% of the data used to train Llama 2 comes from French, for example.


Speaker-change Aware CRF for Dialogue Act Classification

arXiv.org Artificial Intelligence

Recent work in Dialogue Act (DA) classification approaches the task as a sequence labeling problem, using neural network models coupled with a Conditional Random Field (CRF) as the last layer. The CRF models the conditional probability of the target DA label sequence given the input utterance sequence. However, the task involves another important input sequence, that of speakers, which is ignored by previous work. To address this limitation, this paper proposes a simple modification of the CRF layer that takes speaker-change into account. Experiments on the SwDA corpus show that our modified CRF layer outperforms the original one, with very wide margins for some DA labels. Further, visualizations demonstrate that our CRF layer can learn meaningful, sophisticated transition patterns between DA label pairs conditioned on speaker-change in an end-to-end way. Code is publicly available.
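
The released code is not reproduced here, but the core modification described in the abstract, conditioning the CRF transition scores on whether the speaker changed between consecutive utterances, can be sketched roughly as follows. The class name, tensor shapes, and the omission of start/stop transitions and batching are illustrative choices, not the authors' exact implementation.

```python
# Hedged sketch (not the authors' released code): a linear-chain CRF whose
# transition scores are conditioned on a binary speaker-change indicator.
# Two transition matrices are learned; at each step the matrix for
# "same speaker" or "speaker changed" is selected.
import torch
import torch.nn as nn


class SpeakerChangeCRF(nn.Module):
    def __init__(self, num_labels: int):
        super().__init__()
        # transitions[c, i, j]: score of moving from label i to label j,
        # where c = 0 (same speaker) or c = 1 (speaker changed).
        self.transitions = nn.Parameter(torch.randn(2, num_labels, num_labels) * 0.01)

    def _score(self, emissions, labels, change):
        # emissions: (T, num_labels); labels, change: (T,); change[0] is unused.
        score = emissions[0, labels[0]]
        for t in range(1, emissions.size(0)):
            trans = self.transitions[change[t], labels[t - 1], labels[t]]
            score = score + trans + emissions[t, labels[t]]
        return score

    def _log_partition(self, emissions, change):
        # Forward algorithm with speaker-conditioned transitions.
        alpha = emissions[0]  # (num_labels,)
        for t in range(1, emissions.size(0)):
            trans = self.transitions[change[t]]  # (num_labels, num_labels)
            alpha = torch.logsumexp(alpha.unsqueeze(1) + trans, dim=0) + emissions[t]
        return torch.logsumexp(alpha, dim=0)

    def neg_log_likelihood(self, emissions, labels, change):
        return self._log_partition(emissions, change) - self._score(emissions, labels, change)


# Toy usage: 5 utterances, 4 DA labels, speaker changes at steps 2 and 4.
T, L = 5, 4
crf = SpeakerChangeCRF(L)
emissions = torch.randn(T, L)           # e.g. output of an utterance encoder
labels = torch.randint(0, L, (T,))
change = torch.tensor([0, 0, 1, 0, 1])  # 1 where the speaker differs from step t-1
loss = crf.neg_log_likelihood(emissions, labels, change)
loss.backward()
```

Because a separate transition matrix is applied whenever the speaker changes, label-to-label transition patterns can differ between same-speaker and cross-speaker steps, which is the kind of behavior the paper's visualizations examine.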