Software and Resources

Language models

CamemBERT

Neural BERT-like language model for French

PAGnol

Neural GPT-based language model for French

GAPeron

A fully open suite of language models for French, English, and code, designed to advance transparency and reproducibility in large-scale model training.

ModernCamemBERT

ModernCamemBERT is a French language model pretrained on a large corpus of 1T tokens of high-quality French text. It is the French version of the ModernBERT model.

FrELMo

ELMo language model for French

MRELMo

ELMo language models for 5 mid-resource languages (Bulgarian, Catalan, Danish, Finnish, Indonesian)

CamemBERTa

A DeBERTa v3-based French language model

MANTa-LM

A differentiable tokenizer trained end-to-end with the language model.

CamemBERT-bio-gliner

Neural GLiNER-like language model for the French biomedical domain

D'AlemBERT

Neural BERT-like language model for Early Modern French

CamemBERT-bio

Neural BERT-like language model for the French biomedical domain

CharacterBERT-UGC

A CharacterBERT language model for North-African Arabizi and French user-generated content

Bloom

Open large multilingual language model

CamemBERTav2

A state-of-the-art French pretrained language model based on the DeBERTaV3 architecture.

CamemBERTv2

An updated version of the CamemBERT pretrained language model for French

Raw corpora

OSCAR

Huge multilingual web-based corpus

goclassy

Asynchronous concurrent pipeline for classifying Common Crawl

Ungoliant

High-performance pipeline that provides tools to build corpus generation pipelines from Common Crawl.

mOSCAR

Large-scale multilingual, multimodal (text-image) web-crawled corpus

Speech corpora

SpeechMatrix

Speech parallel corpus mined from VoxPopuli

Expresso

A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

HTR and OCR

HTR-United

HTR-United is an open GitHub ecosystem designed to share training data for HTR and OCR tasks

CATMuS Medieval (Dataset)

Large-scale diverse dataset for handwritten text recognition of medieval manuscripts

WikiCremma

Dataset for HTR training on Contemporary French

LADaS

LADaS (Layout Analysis Dataset with SegmOnto) is a diachronic diageneric layout analysis dataset (16th-21st c.)

KaMI-Lib

KaMI-lib is a Python package for evaluating transcription models, agnostic to the HTR or OCR engine used

eScriptorium Documentation

Open and collaborative documentation for eScriptorium

CATMuS Medieval (Model)

Handwritten Text Recognition model for medieval manuscripts in Latin scripts

HTRomance

Ground-truth for training HTR models

eScriptorium

Web application for manual, semi-automatic, and automatic segmentation and transcription of printed or handwritten text documents, with the possibility of training or reusing transcription models.

Kraken

Kraken is software that can be used to train and utilize models for transcribing, segmenting, and annotating printed or handwritten documents, regardless of language.

Machine translation

DiscEvalMT

Contrastive test sets for the evaluation of discourse phenomena in English-to-French machine translation

PFSMB

FR-EN parallel corpus of noisy user-generated content

PMUMT

Annotated FR-EN parallel corpus of noisy user-generated content

DiaBLa

Parallel dataset of English-French bilingual dialogues

SONAR

SONAR (Sentence-level multimOdal and laNguage-Agnostic Representations) is a multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders

T-modules

Approach to cross-modal transfer between speech and text for translation tasks

SWELLS

SWELLS makes it possible to assess, in a controlled way, the ability of language models to assimilate specific aspects of an unknown language on the basis of grammar book excerpts added to their prompts.

ACReFOSC

Generates fine-tuning datasets for preference optimization in machine translation.

VGAMT

A multimodal machine translation model

CoMMuTE

A contrastive evaluation dataset for multimodal (text-image) machine translation.

RoCS-MT

Robust Challenge Set for Machine Translation

Text simplification

ACCESS

Controllable Text Simplification Model

ASSET

Text Simplification Evaluation Dataset

EASSE

Text Simplification Evaluation Library

tseval

Text Simplification Evaluation Library

Lexicons

WOLF

Free Wordnet for French

Alexina

Morphological (and sometimes syntactic) lexicons (including the Lefff)

OFrLex-modifier

Online user interface to collaboratively modify and check the OFrLex lexicon

EtymDB

Etymological database extracted from Wiktionary

UDLexicons

Multilingual collection of morphological lexicons

Standardisation

SSK

Collection of research use case scenarios illustrating best practices in Digital Humanities and Heritage research

Treebanks

Sequoia corpus

French corpus with surface and deep syntactic annotations

FQB

Multi-layered treebank made of questions for French

FSMB

French social media bank

Narabizi Treebank

A multi-layered treebank for the Arabic dialect spoken in North Africa and written in Latin script

Parsing

FRMG

A large-coverage meta-grammar for French

SYNTAX

Lexical and syntactic parser generator

DyALog

Environment for building tabular parsers and programs

Mgwiki

Linguistic Wiki for FRMG

dyalog-sr

Transition-based parser built on top of DyALog

ELMoLex

Neural parsing system developed for ALMAnaCH's submission to the CoNLL-18 multilingual parsing shared task

Shallow processing and tagging

SxPipe

Shallow language pipeline

GROBID-Dictionaries

GROBID module for structuring digitised lexical resources and entry-based documents

GROBID

Library for extracting, parsing and re-structuring raw documents

MElt

Statistical part-of-speech tagger

entity-fishing

Entity recognition and disambiguation

grobid-medical-report

GROBID module for extracting and restructuring medical reports from PDF documents into encoded XML/TEI documents

ModFr-norm

Normalisation of Modern (17th c.) French

DESIR-CodeSprint-TrackA-TextMining

A tool for extracting scholarly documents and visualizing the results on PDF files using GROBID.

nerdKid

NerdKid is a tool for grouping Wikidata entities into 27 classes (e.g., ANIMAL, LOCATION, MEDIA, PERSON).

CCASS-sim

Similarity detection tool for legal texts from the Cour de Cassation

D'AlemBERT NER

NER model for Early Modern French

D'AlemBERT POS

POS tagger for Early Modern French

ocDI

Occitan dialect identification models

Industrial software

vera

Automatic analysis of answers to open-ended questions in employee surveys

Enqi

feats2notes

Generation of notes from structured data

Other annotated corpora

VerDI project release

Omission detection tool for journalistic content.

FreEM-corpora

Corpora and NLP tools for Early Modern French (16th-18th c.)

3MT_French Dataset

3 Minutes Thesis Corpus

Counter dataset

An open-source pseudonymized dataset aimed at facilitating research on radicalization detection with NER annotations. It is the first publicly available multilingual dataset for radicalization detection, gathered from diverse sources.

HaSCoSVa

A collection of Spanish tweets annotated for hate speech towards immigrants across two different Spanish-speaking regions.

NeWMe

A corpus of annotated instances of Word Meaning Negotiation (sequences in conversation where speakers discuss word meaning) from existing oral and written conversational corpora.

SPOT

SPOT (Stopping Points in Online Threads) is a French corpus of 43k Facebook comments annotated for the presence of stopping points (critical interventions)

CUBANSPVARIETY

A Cuban Spanish variety identification dataset with common example annotations developed to facilitate more accurate detection of Cuban and Caribbean Spanish varieties. This is the first dataset dedicated to identifying the Cuban (or any other Caribbean) Spanish variety.

OcWikiDialects

OcWikiDialects is a corpus derived from Occitan Wikipedia, featuring diverse metadata, including dialect annotations.