ELMo language models for 5 mid-resource languages (Bulgarian, Catalan, Danish, Finnish, Indonesian)
D'AlemBERT
Neural BERT-like language model for Early Modern French
CamemBERTa
A DeBERTa v3-based French language model
CamemBERT-bio
Neural BERT-like language model for the French biomedical domain
MANTa-LM
A differentiable tokenizer trained end-to-end with the language model.
CharacterBERT-UGC
A CharacterBERT language model for North-African Arabizi and French user-generated content
Bloom
Open large multilingual language model
Raw corpora
OSCAR
Huge multilingual web-based corpus
goclassy
Asynchronous concurrent pipeline for classifying Common Crawl
Ungoliant
High-performance pipeline that provides tools to build corpus generation pipelines from Common Crawl.
mOSCAR
Large-scale multilingual, multimodal (text-image) web-crawled corpus
Speech corpora
SpeechMatrix
Speech parallel corpus mined from VoxPopuli
Expresso
A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
HTR and OCR
KaMI-Lib
KaMI-lib is an HTR and OCR engine-agnostic Python package for evaluating transcription models
HTR-United
HTR-United is an open GitHub ecosystem designed to share training data for HTR and OCR tasks
WikiCremma
Dataset for HTR training on Contemporary French
eScriptorium Documentation
Open and collaborative documentation for eScriptorium
HTRomance
Ground-truth for training HTR models
CATMuS Medieval (Dataset)
Large-scale diverse dataset for handwritten text recognition of medieval manuscripts
CATMuS Medieval (Model)
Handwritten Text Recognition model for medieval manuscripts in Latin scripts
Machine translation
DiscEvalMT
Contrastive test sets for the evaluation of discourse phenomena in English-to-French machine translation
PFSMB
FR-EN parallel corpus of noisy user-generated content
PMUMT
FR-EN annotated parallel corpus of noisy user-generated content
DiaBLa
Parallel dataset of English-French bilingual dialogues
VGAMT
A multimodal machine translation model
CoMMuTE
A contrastive evaluation dataset for multimodal (text-image) machine translation.
RoCS-MT
Robust Challenge Set for Machine Translation
SONAR
SONAR (Sentence-level multimOdal and laNguage-Agnostic Representations) is a multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders
T-modules
Approach to cross-modal transfer between speech and text for translation tasks
Text simplification
ACCESS
Controllable Text Simplification Model
ASSET
Text Simplification Evaluation Dataset
EASSE
Text Simplification Evaluation Library
tseval
Text Simplification Evaluation Library
Lexicons
WOLF
Free Wordnet for French
Alexina
Morphological (and sometimes syntactic) lexicons (including the Lefff)
OFrLex-modifier
Online user interface to collaboratively modify and check the OFrLex lexicon
EtymDB
Etymological database extracted from Wiktionary
UDLexicons
Multilingual collection of morphological lexicons
Standardisation
SSK
Collection of research use case scenarios illustrating best practices in Digital Humanities and Heritage research
Standardization Survival Kit
Collection of research use case scenarios illustrating best practices in Digital Humanities and Heritage research
Treebanks
Sequoia corpus
French corpus with surface and deep syntactic annotations
FQB
Multi-layered treebank made of questions for French
FSMB
French social media bank
Narabizi Treebank
A multi-layered treebank for the Arabic dialect spoken in North Africa and written in Latin script
Parsing
FRMG
A large-coverage meta-grammar for French
SYNTAX
Lexical and syntactic parser generator
DyALog
Environment for building tabular parsers and programs
Mgwiki
Linguistic Wiki for FRMG
dyalog-sr
Transition-based parser built on top of DyALog
ELMoLex
Neural parsing system developed for ALMAnaCH's submission to the CoNLL-18 multilingual parsing shared task
Shallow processing and tagging
SxPipe
Shallow language pipeline
GROBID-Dictionaries
GROBID module for structuring digitised lexical resources and entry-based documents
GROBID
Library for extracting, parsing and re-structuring raw documents
entity-fishing
Entity recognition and disambiguation
MElt
Statistical part-of-speech tagger
grobid-medical-report
GROBID module for extracting and restructuring medical reports from PDF documents into encoded XML/TEI documents
DESIR-CodeSprint-TrackA-TextMining
A tool for extracting scholarly documents and visualizing the results on PDF files using GROBID.
ModFr-norm
Normalisation of Modern (17th c.) French
nerdKid
NerdKid is a tool for grouping Wikidata entities into 27 classes (e.g., ANIMAL, LOCATION, MEDIA, PERSON).
CCASS-sim
Similarity detection tool for legal texts from the Cour de Cassation
D'AlemBERT NER
NER model for Early Modern French
D'AlemBERT POS
POS tagger for Early Modern French
Industrial software
vera
Automatic analysis of answers to open-ended questions in employee surveys
Enqi
feats2notes
Generation of notes from structured data
Other annotated corpora
VerDI project release
Omission detection tool for journalistic content.
FreEM-corpora
Corpora and NLP tools for Early Modern French (16th-18th c.)