Research projects

EU projects

DARIAH

Digital Research Infrastructure for the Arts and Humanities.

ATRIUM

ATRIUM aims to empower Arts and Humanities scholars in their use of digital methods by facilitating access to a wide range of reusable workflows and interoperable, composable services offered by leading research infrastructures in the Arts and Humanities domain.

ANR projects

MaTOS

The MaTOS (Machine Translation for Open Science) project aims to develop new methods for the machine translation (MT) of complete scientific documents, as well as automatic metrics to evaluate the quality of these translations.

SINNet

Socio-inspired Neural Networks.

TraLaLaM

Translating with large language models.

Other national projects

Cap'FALC

Development of a text simplification algorithm and an accessible tool to ease the production of FALC (the French equivalent of “Easy read”).

LiLT

Linguistic issues in language technology.

Huma-Num

Very large research infrastructure (TGIR) aimed at facilitating the digitalisation of humanities and social sciences.

Patrimoines matériels – innovation, expérimentation et résilience

COLaF

Resources and tools for languages of France.

TIERED

Transforming Interdisciplinary Education and Research for Evolving Democracies.

SaLM

The SaLM project is a collaborative effort between Inria Paris and Sciences Po that seeks to enhance NLP and LLM algorithms by integrating social contexts into their development and assessment.

BackInTime

Automation of document deciphering from ancient, medieval and modern history.

AI4IDF

AI4IDF is a structuring and federating research project devoted to artificial intelligence, a scientific and technological field that has become unavoidable.

Code Commons

Code Commons is a project aiming at solidifying and extending the Software Heritage archive. The goal is to allow its users to build robust applications such as LLM-based code generation tools that will respect copyright laws.

Corpus Liberatum Linguae Graecae

CLLG is a project aimed at creating a FAIR corpus of ancient Greek texts and improving the production of structured documents from digitized books.

FG4H

FG4H is a project aimed at creating a Large Language Model (LLM) over French medical data provided by a large panel of health institutions.

SCRIBE

SCRIBE is a project aimed at producing large language models specialized in specific industrial sectors (finance, legal, etc.) with an emphasis on the French socio-economic context.

PRAIRIE-PSAI

PR[AI]RIE-PSAI (Paris School of AI) is the largest of the AI Clusters established as part of the France 2030 national strategy.

Justine Cassell's Choose France Chair

PEPR eNSEMBLE

BASHtr

Transcription guidelines and text recognition dataset annotation for Arabic-script manuscripts.

International projects

Universal Dependencies Project

The Universal Dependencies project is an open community effort with over 300 contributors producing nearly 200 treebanks in over 100 languages.

Interpersonality

Impact of personality on conversation, impact of conversational features on personality.

SPHERE

Social Physiology and Human-like Embodied Response Engineering.

Past projects

EU projects

CounteR (H2020, 2021-2024): In order to support the fight against radicalization and thus prevent future terrorist attacks from taking place, the CounteR project brings data from diverse sources into an analysis and early alert platform for data mining and prediction of critical areas (e.g. communities), aiming to be a frontline community policing tool which looks at the community and its related risk factors rather than targeting and monitoring individuals. The system will incorporate state of the art NLP technologies combined with expert knowledge in the psychology of radicalization processes to provide a complete solution for law enforcement authorities to understand the when, where and why of radicalization in the community.
enCollect (COST, 2017-2020): Combining language learning and crowdsourcing for developing language teaching materials and more generic language resources for NLP.
DESIR (H2020, 2017-2019): The DESIR project aims at contributing to the sustainability of the DARIAH infrastructure along all its dimensions: dissemination, growth, technology, robustness, trust and education. Inria is responsible for providing a portfolio of text analytics services based on GROBID and entity-fishing.
HIRMEOS (H2020, 2017-2019): Integration of Research Monographs in the European Open Science infrastructure.
Parthenos (H2020, 2015-2019): Strengthening the cohesion of research in the broad sector of Linguistic Studies, Humanities, Cultural Heritage, History, Archaeology and related fields through a thematic cluster of European Research Infrastructures, integrating initiatives, e-infrastructures and other world-class infrastructures, and building bridges between different, although tightly interrelated, fields.
EHRI “European Holocaust Research Infrastructure” (H2020, 2015-2025): Transforming archival research on the Holocaust, by providing methods and tools to integrate and provide access to a wide variety of archival content.
Iperion CH (H2020, 2015-2019): Coordinating infrastructural activities in the cultural heritage domain.

ANR projects

REVITALISE (ANR PRCE, 2022-2025): More than ever, with the increasing use of online video-conferencing solutions in daily professional interactions, public speaking skills are becoming crucial. The aim of this project is to obtain better insights into the best approaches allowing the practice of public speaking skills with technologically mediated tools. To this end, we will investigate different training environments (e.g. w/o a virtual/real audience) and different training approaches (e.g., modeling-based, feedback-based, simulation-based) to help users acquire, improve, and practice public speaking skills in full autonomy. For this purpose, different research challenges will be tackled to 1/ automatically learn, from different corpora, the multimodal cues correlated to the quality of public speaking; 2/ provide pedagogical activities rooted in coaching practice, taking a user-centered approach and 3/ provide a global evaluation of the training session as well as the specific behavioral characteristics to improve.
BASNUM (ANR, 2018-2023): Digitalisation and computational annotation and exploitation of Henri Basnage de Beauval’s encyclopedic dictionary (1701).
Profiterole (ANR, 2017-2021): Modelling and analysis of Medieval French.
ParSiTi (ANR, 2016-2022): Context-aware parsing and machine translation of user-generated content.
TIME-US (ANR, 2016-2021): Digital study of remuneration and time budgets in the textile trades in 18th- and 19th-century France.
SoSweet (ANR, 2015-2020): Studying sociolinguistic variability on Twitter, comparing linguistic and graph-based views on tweets.
PARSE-ME (ANR, 2015-2021): Multi-word expressions in parsing.
VerDI (ANR RAPID, 2015-2018): Automatic identification of information concealment on the internet.

Other national projects

PaRAMHTRS (BNF Datalab, 2025-2025): The PaRAMHTRS project advances large-scale experiments on medieval manuscripts (7th–15th centuries) in Latin and vernacular languages, leveraging HTR technology. It focuses on creating extensive corpora for culturomic studies and training models for ancient languages, while resolving abbreviations in HTR outputs. These efforts aim to enhance manuscript research and computational philology.
HTRogène (Biblissima+ Grant, 2024-2025): The project focuses on the production of transcriptions for literary manuscripts and public or private archives in Romance languages from the 11th to the 16th centuries. The main goal of the project is to produce training data and transcription models that are resilient to language and hand changes. HTRogène is therefore envisaged as a building block for the infrastructure of Biblissima+ and the medieval philology of Romance languages: the project does not focus on a particular text or a small selection of texts, but on the contrary aims to produce examples of transcription capable of constituting a representative sample. This sampling is based on specific criteria of language, script, genre and even dating.
HTRomance (BNF Datalab, 2023-2023): The HTRomance project is based on handwriting recognition (HTR). In particular, it proposes to evaluate and improve the capabilities of this technology when applied to literary manuscripts and public and private archives, in Latin and Romance languages, from the 11th to the 19th century, kept at the French National Library. The main objective of the project is the production of training data and transcription models resistant to changes in handwriting and language. It also intends to produce language models applicable to documents in ancient languages or to ancient language states. The development of training corpora will be accompanied and consolidated by the development and implementation of a novel process for evaluating the readability of output texts and the costs of producing new training data for HTR. HTRomance is complementary to editing or data mining projects: the models produced are likely to be used to obtain the textual data needed for editing or text mining.
OncoLab (Contrat PIA (AMI santé numérique), 2022-2026): The aim of the project is to make cancer data from health institutions accessible to all stakeholders involved for research and innovation purposes. The data at hand will be standardised and structured, in particular by extracting information from textual documents.
DAdaNMT (Sorbonne Emergence, 2022-2023): The aim of this project is to investigate domain adaptation for neural machine translation. We will be exploring the adaptation of models to specific, low-resource domains as well as training models for multiple domains.
Gallic(orpor)a (BNF Datalab, 2021-2022): Consolidate and apply a processing chain for ancient Gallica documents in long diachrony, from the first French manuscripts to revolutionary prints.
DataCatalogue (Convention (MIC), 2021-2024): The project aims at contributing to the proper transition between a basic digitalisation of cultural heritage content and the actual usage of the corresponding content within a "collection as data" perspective. To achieve this, we experiment with new methods for extracting the logical structure of scanned (and OCRed) catalogues and standardise their content for publication towards curators, researchers, or wider users.
NER4archives (Convention (MIC, Archives Nationales), 2020-2024): The project focuses on named entity recognition and disambiguation on data of the Archives Nationales de France (AN). The NER task is applied to the XML/EAD resources and consists in fine-tuning a spaCy-based Transformer. A spaCy wrapper of the entity-fishing package is applied for entity disambiguation. Moreover, the entities are disambiguated against the Authorities made available by the AN, by leveraging RDF graph manipulation, string-matching algorithms, and an application of CrossEncoders. The idea is to merge this approach with a structure-based approach relying on GNNs, which was partially implemented.
PRAIRIE (3IA, 2019-2024): The PRAIRIE Institute (PaRis AI Research InstitutE) is one of the four French Institutes of Artificial Intelligence, which were created as part of the national French initiative on AI announced by President Emmanuel Macron on May 29, 2018. PRAIRIE’s objective is to become within five years a world leader in AI research and higher education, with an undeniable impact on economy and technology at the French, European and global levels. It brings together academic members (“PRAIRIE chairs”) who excel at research and education in both the core methodological areas and the interdisciplinary aspects of AI, and industrial members that are major actors in AI at the global level and a very strong group of international partners.
DAHN (Convention (MIC, Archives Nationales), 2019-2022): Digitalisation and computational exploitation of archives of historical interest.
Nénufar (DGLFLF & Huma-Num (CORLI, CAHIER), 2019-2019): The project is intended to digitise and exploit the early editions (beginning of the 20th century) of the Petit Larousse dictionary. ALMAnaCH is involved in the automatic extraction of the dictionary content by means of the GROBID-dictionary and in defining a TEI-compliant interchange format for all results.
LECTAUREP (Convention (MIC, Archives Nationales), 2018-2021): Development of a platform for the transcription, reading and automatic analysis of notarial deeds present in the National Archives.
OPALINe (PIA, 2017-2020): Development of tools for the accessibility of digital books for visually impaired people.
Matériaux Anciens et Patrimoniaux (DIM, 2017-2021): The DIM « Matériaux anciens et patrimoniaux » (MAP) is a region-wide research network. Its singularity relies on a close collaboration between human sciences, experimental sciences such as physics and chemistry, scientific ecology and information sciences, while integrating socio-economical partners from the cultural heritage environment. Based on its research, development and valorization potential, we expect such an interdisciplinary network to raise the Ile-de-France region up to a world-top position as far as heritage sciences and research on ancient materials are concerned.
EFL (LabEx, 2010-2024): Empirical foundations of linguistics, including computational linguistics and natural language processing. ALMAnaCH’s predecessor team ALPAGE was one of the partner teams of this LabEx, which gathers a dozen teams within and around Paris whose research interests include one or more aspects of linguistics. Several ALMAnaCH members are now “individual members” of the LabEx EFL. B. Sagot serves as deputy head (and former head) of one of the scientific strands of the LabEx, namely strand 6 dedicated to language resources. B. Sagot and D. Seddah are (co-)heads of a number of scientific “operations” within strands 6, 5 (“computational semantic analysis”) and 2 (“experimental grammar”). Main collaborations are related to language resource development (strands 5 and 6), syntactic and semantic parsing (strand 5, especially with LIPN [CNRS and U. Paris 13]) and computational morphology (strands 2 and 6, especially with CRLAO [CNRS and Inalco] and LLF [CNRS and Paris-Diderot]).

International projects

BigScience (Informal initiative, 2021-2022): This collaboration aims at fostering discussions and reflections around the research questions surrounding large language models (capabilities, limitations, potential improvements, bias, ethics, environmental impact, role in the general AI/cognitive research landscape) as well as the challenges around creating and sharing such models and datasets for research purposes and among the research community. The collaborative tasks involve creating, sharing and evaluating a large multilingual dataset and a large multilingual generative language model. An uncommonly large compute budget was allocated for these collaborative tasks (several million GPU hours on several thousand GPUs, in particular on the French public cluster Jean Zay).
NLP Resources for Analyzing Reactions to Major Events in Hebrew and French Social Media (PHC Maïmonide, 2018-2019): Building NLP resources for analyzing reactions to major events in Hebrew and French social media.
MCM-NL (ANR-NSF, 2016-2020): Exploring correlations between neuroimaging data (fMRI, EEG) and data from NLP tools (mostly parsers). The data come from “Le Petit Prince” read in French and English, and parsed with different parsers.