ALMAnaCH lab
Inria project-team
Software and Resources
Language models
CamemBERT
Neural BERT-like language model for French
PAGnol
Neural GPT-based language model for French
FrELMo
ELMo language model for French
MRELMo
ELMo language models for 5 mid-resource languages (Bulgarian, Catalan, Danish, Finnish, Indonesian)
CamemBERTa
A DeBERTa v3-based French language model
D'AlemBERT
Neural BERT-like language model for Early Modern French
CamemBERT-bio
Neural BERT-like language model for the French biomedical domain
MANTa-LM
A robust T5-like model based on a neural tokenizer
Lexicons
WOLF
Free Wordnet for French
Alexina
Morphological (and sometimes syntactic) lexicons (including the Lefff)
EtymDB
Etymological database extracted from Wiktionary
OFrLex-modifier
UDLexicons
Multilingual collection of morphological lexicons
Raw corpora
OSCAR
Huge multilingual web-based corpus
goclassy
Asynchronous concurrent pipeline for classifying Common Crawl
Ungoliant
Asynchronous concurrent pipeline for classifying Common Crawl
Text simplification
ACCESS
Controllable Text Simplification Model
ASSET
Text Simplification Evaluation Dataset
EASSE
Text Simplification Evaluation Library
tseval
Text Simplification Evaluation Library
HTR and OCR
KaMI-Lib
Python package for evaluating transcription models, agnostic to the HTR or OCR engine used
HTR-United
Open GitHub ecosystem designed to share training data for HTR and OCR tasks
WikiCremma
Dataset for HTR training on Contemporary French
Machine translation
DiscEvalMT
Contrastive test sets for the evaluation of discourse phenomena in English-to-French machine translation
PFSMB
FR-EN parallel corpus of noisy user-generated content
PMUMT
FR-EN annotated parallel corpus of noisy user-generated content
DiaBLa
Parallel dataset of English-French bilingual dialogues
VGAMT
Multimodal machine translation model
Treebanks
FSMB
French social media bank
FQB
Multi-layered treebank made of questions for French
Sequoia corpus
French corpus with surface and deep syntactic annotations
Parsing
FRMG
A large-coverage meta-grammar for French
dyalog-sr
Transition-based parser built on top of DyALog
DyALog
Environment for building tabular parsers and programs
ELMoLex
Neural parsing system developed for ALMAnaCH's submission to the CoNLL-18 multilingual parsing shared task
Mgwiki
Linguistic Wiki for FRMG
SYNTAX
Lexical and syntactic parser generator
Shallow processing and tagging
GROBID
Library for extracting, parsing and re-structuring raw documents
GROBID-Dictionaries
GROBID module for structuring digitised lexical resources and entry-based documents
SxPipe
Shallow language pipeline
entity-fishing
Entity recognition and disambiguation
MElt
Statistical part-of-speech tagger
CCASS-sim
Similarity detection tool for legal texts from the Cour de Cassation
D'AlemBERT POS
POS tagger for Early Modern French
D'AlemBERT NER
NER model for Early Modern French
DESIR-CodeSprint-TrackA-TextMining
A tool for extracting scholarly documents and visualizing the results on PDF files using GROBID.
grobid-medical-report
GROBID module for extracting and restructuring medical reports from PDF documents into encoded XML/TEI documents
ModFr-norm
Normalisation of Modern (17th c.) French
nerdKid
Tool for grouping Wikidata entities into 27 classes (e.g., ANIMAL, LOCATION, MEDIA, PERSON)
Standardisation
SSK
Collection of research use case scenarios illustrating best practices in Digital Humanities and Heritage research
SSK (fr) / Standardization Survival Kit (en)
Industrial software
Enqi
vera
Automatic analysis of answers to open-ended questions in employee surveys
Other annotated corpora
VerDI project release
FreEM-corpora
Corpora and NLP tools for Early Modern French (16th-18th c.)