Software and Resources
Language models
CamemBERT
Neural BERT-like language model for French
PAGnol
Neural GPT-based language model for French
GAPeron
A fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training.
ModernCamemBERT
ModernCamemBERT is a French language model pretrained on a large corpus of 1T tokens of high-quality French text. It is the French version of the ModernBERT model.
FrELMo
ELMo language model for French
MRELMo
ELMo language models for 5 mid-resource languages (Bulgarian, Catalan, Danish, Finnish, Indonesian)
CamemBERTa
A DeBERTa v3-based French language model
MANTa-LM
A differentiable tokenizer trained end-to-end with the language model.
CamemBERT-bio-gliner
Neural GLiNER-like language model for the French biomedical domain
D'AlemBERT
Neural BERT-like language model for Early Modern French
CamemBERT-bio
Neural BERT-like language model for the French biomedical domain
CharacterBERT-UGC
A CharacterBERT language model for North-African Arabizi and French user-generated content
Bloom
Open large multilingual language model
CamemBERTav2
A state-of-the-art French pretrained language model, based on the DeBERTaV3 architecture.
CamemBERTv2
An updated version of the CamemBERT pretrained language model for French
Raw corpora
OSCAR
Huge multilingual web-based corpus
goclassy
Asynchronous concurrent pipeline for classifying Common Crawl
Ungoliant
High-performance pipeline that provides tools for generating corpora from Common Crawl.
mOSCAR
Large-scale multilingual, multimodal (text-image) web-crawled corpus
Speech corpora
SpeechMatrix
Speech parallel corpus mined from VoxPopuli
Expresso
A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
HTR and OCR
HTR-United
HTR-United is an open GitHub ecosystem designed to share training data for HTR and OCR tasks
CATMuS Medieval (Dataset)
Large-scale diverse dataset for handwritten text recognition of medieval manuscripts
WikiCremma
Dataset for HTR training on Contemporary French
LADaS
LADaS (Layout Analysis Dataset with SegmOnto) is a diachronic diageneric layout analysis dataset (16th-21st c.)
KaMI-Lib
KaMI-lib is an engine-agnostic Python package for evaluating HTR and OCR transcription models
eScriptorium Documentation
Open and collaborative documentation for eScriptorium
CATMuS Medieval (Model)
Handwritten Text Recognition model for medieval manuscripts in Latin scripts
HTRomance
Ground-truth for training HTR models
eScriptorium
Web application for manual, semi-automatic, and automatic segmentation and transcription of printed or handwritten text documents, with the possibility of training or reusing transcription models.
Kraken
Kraken is software that can be used to train and utilize models for transcribing, segmenting, and annotating printed or handwritten documents, regardless of language.
Machine translation
DiscEvalMT
Contrastive test sets for the evaluation of discourse phenomena in English-to-French machine translation
PFSMB
FR-EN parallel corpus of noisy user-generated content
PMUMT
Annotated FR-EN parallel corpus of noisy user-generated content
DiaBLa
Parallel dataset of English-French bilingual dialogues
SONAR
SONAR (Sentence-level multimOdal and laNguage-Agnostic Representations) is a multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders
T-modules
Approach to cross-modal transfer between speech and text for translation tasks
SWELLS
SWELLS makes it possible to assess, in a controlled way, the ability of language models to assimilate specific aspects of an unknown language on the basis of grammar book excerpts added to their prompts.
ACReFOSC
Generates fine-tuning datasets for preference optimization in machine translation.
VGAMT
A multimodal machine translation model
CoMMuTE
A contrastive evaluation dataset for multimodal (text-image) machine translation.
RoCS-MT
Robust Challenge Set for Machine Translation
Text simplification
ACCESS
Controllable Text Simplification Model
ASSET
Text Simplification Evaluation Dataset
EASSE
Text Simplification Evaluation Library
tseval
Text Simplification Evaluation Library
Lexicons
WOLF
Free Wordnet for French
Alexina
Morphological (and sometimes syntactic) lexicons (including the Lefff)
OFrLex-modifier
Online user interface to collaboratively modify and check the OFrLex lexicon
EtymDB
Etymological database extracted from Wiktionary
UDLexicons
Multilingual collection of morphological lexicons
Standardisation
SSK
Collection of research use case scenarios illustrating best practices in Digital Humanities and Heritage research
Treebanks
Sequoia corpus
French corpus with surface and deep syntactic annotations
FQB
Multi-layered treebank made of questions for French
FSMB
French social media bank
Narabizi Treebank
A multi-layered treebank for the Arabic dialect spoken in North Africa and written in Latin script
Parsing
FRMG
A large-coverage meta-grammar for French
SYNTAX
Lexical and syntactic parser generator
DyALog
Environment for building tabular parsers and programs
Mgwiki
Linguistic Wiki for FRMG
dyalog-sr
Transition-based parser built on top of DyALog
ELMoLex
Neural parsing system developed for ALMAnaCH's submission to the CoNLL-18 multilingual parsing shared task
Shallow processing and tagging
SxPipe
Shallow language pipeline
GROBID-Dictionaries
GROBID module for structuring digitised lexical resources and entry-based documents
GROBID
Library for extracting, parsing and re-structuring raw documents
MElt
Statistical part-of-speech tagger
entity-fishing
Entity recognition and disambiguation
grobid-medical-report
GROBID module for extracting and restructuring medical reports from PDF documents into encoded XML/TEI documents
ModFr-norm
Normalisation of Modern (17th c.) French
DESIR-CodeSprint-TrackA-TextMining
A tool for extracting scholarly documents and visualizing the results on PDF files using GROBID.
nerdKid
NerdKid is a tool for grouping Wikidata entities into 27 classes (e.g., ANIMAL, LOCATION, MEDIA, PERSON).
CCASS-sim
Similarity detection tool for legal texts from the Cour de Cassation
D'AlemBERT NER
NER model for Early Modern French
D'AlemBERT POS
POS tagger for Early Modern French
ocDI
Occitan dialect identification models
Industrial software
vera
Automatic analysis of answers to open-ended questions in employee surveys
feats2notes
Generation of notes from structured data
Other annotated corpora
VerDI project release
Omission detection tool for journalistic content.
FreEM-corpora
Corpora and NLP tools for Early Modern French (16th-18th c.)
3MT_French Dataset
3 Minutes Thesis Corpus
Counter dataset
An open-source pseudonymized dataset aimed at facilitating research on radicalization detection with NER annotations. It is the first publicly available multilingual dataset for radicalization detection, gathered from diverse sources.
HaSCoSVa
A collection of Spanish tweets annotated for hate speech towards immigrants across two different Spanish-speaking regions.
NeWMe
A corpus of annotated instances of Word Meaning Negotiation (sequences in conversation where speakers discuss word meaning) from existing oral and written conversational corpora.
SPOT
SPOT (Stopping Points in Online Threads) is a French corpus of 43k Facebook comments annotated for the presence of stopping points (critical interventions)
CUBANSPVARIETY
A Cuban Spanish variety identification dataset with common example annotations developed to facilitate more accurate detection of Cuban and Caribbean Spanish varieties. This is the first dataset dedicated to identifying the Cuban (or any other Caribbean) Spanish variety.
OcWikiDialects
OcWikiDialects is a corpus derived from Occitan Wikipedia, featuring diverse metadata, including dialect annotations.