Publications

Explore our publications on the HAL archive

2022

PhD theses and Habilitations

Clémentine Fourrier. 2022. Neural Approaches to Historical Word Reconstruction. PhD thesis. Université PSL (Paris Sciences & Lettres).

In historical linguistics, cognates are words that descend in direct line from a common ancestor, called their proto-form, and therefore are representative of their respective languages' evolutions through time, as well as of the relations between these languages synchronically. As they reflect the phonetic history of the languages they belong to, they allow linguists to better determine all manner of synchronic and diachronic linguistic relations (etymology, phylogeny, sound correspondences). Cognates of related languages tend to be linked through systematic phonetic correspondence patterns, which neural networks could well learn to model, being especially good at learning latent patterns. In this dissertation, we seek to methodically study the applicability of machine-translation-inspired neural networks to historical word prediction, relying on the surface similarity of the two tasks. We first create an artificial dataset inspired by the phonetic and phonotactic rules of Romance languages, which allows us to vary task complexity and data size in a controlled environment, and therefore to identify whether and under which conditions neural networks are applicable. We then extend our work to real datasets (after having updated an etymological database to gather a sufficient amount of data), study the transferability of our conclusions to real data, then the applicability of a number of data augmentation techniques to the task, in order to mitigate low-resource situations. We finally investigate our best models, multilingual neural networks, in more detail. We first confirm that, on the surface, they seem to capture language relatedness information and phonetic similarity, confirming prior work.
We then discover, by probing them, that the information they store is actually more complex: our multilingual models encode a phonetic language model, and learn enough latent historical information to allow decoders to reconstruct the (unseen) proto-form of the studied languages as well as or better than bilingual models trained specifically on the task. This latent information is likely the explanation for the success of multilingual methods in previous works.
Pedro Ortiz Suarez. 2022. A Data-driven Approach to Natural Language Processing for Contemporary and Historical French. PhD thesis. Sorbonne Université.

In recent years, neural methods for Natural Language Processing (NLP) have consistently and repeatedly improved the state of the art in a wide variety of NLP tasks. One of the main contributing reasons for this steady improvement is the increased use of transfer learning techniques. These methods consist in taking a pre-trained model and reusing it, with little to no further training, to solve other tasks. Even though these models have clear advantages, their main drawback is the amount of data that is needed to pre-train them. The lack of availability of large-scale data previously hindered the development of such models for contemporary French, and even more so for its historical states. In this thesis, we focus on developing corpora for the pre-training of these transfer learning architectures. This approach proves to be extremely effective, as we are able to establish a new state of the art for a wide range of tasks in NLP for contemporary, medieval and early modern French as well as for six other contemporary languages. Furthermore, we are able to determine, not only that these models are extremely sensitive to pre-training data quality, heterogeneity and balance, but we also show that these three features are better predictors of the pre-trained models' performance in downstream tasks than the pre-training data size itself. In fact, we determine that the importance of the pre-training dataset size was largely overestimated, as we are able to repeatedly show that such models can be pre-trained with corpora of a modest size.

Journal articles

Robin Algayres, Tristan Ricoul, Julien Karadayi, Hugo Laurençon, Salah Zaiem, Abdelrahman Mohamed, Benoît Sagot and Emmanuel Dupoux. 2022. DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon. Transactions of the Association for Computational Linguistics 10, pages 1051–1065. The MIT Press.

Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state of the art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.
Tu Anh Nguyen, Benoît Sagot and Emmanuel Dupoux. 2022. Are Discrete Units Necessary for Spoken Language Modeling? IEEE Journal of Selected Topics in Signal Processing 16, pages 1415–1423.

Recent work in spoken language modeling shows the possibility of learning a language unsupervisedly from raw audio without any text labels. The approach relies first on transforming the audio into a sequence of discrete units (or pseudo-text) and then training a language model directly on such pseudo-text. Is such a discrete bottleneck necessary, potentially introducing irreversible errors in the encoding of the speech signal, or could we learn a language model without discrete units at all? In this work, we study the role of discrete versus continuous representations in spoken language modeling. We show that discretization is indeed essential for good results in spoken language modeling. We show that discretization removes linguistically irrelevant information from the continuous features, helping to improve language modeling performances. On the basis of this study, we train a language model on the discrete units of the HuBERT features, reaching new state-of-the-art results in the lexical, syntactic and semantic metrics of the Zero Resource Speech Challenge 2021 (Track 1-Speech Only).
Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindřich Helcl and Alexandra Birch. 2022. Survey of Low-Resource Machine Translation. Computational Linguistics 48, pages 673–732. The MIT Press.

We present a survey covering the state of the art in low-resource machine translation (MT) research. There are currently around 7,000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available. We present a summary of this topical research field and provide a description of the techniques evaluated by researchers in several recent shared tasks in low-resource MT.
Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Balli, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal and Mofetoluwa Adeyemi. 2022. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics 10, pages 50–72. The MIT Press.

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.

Conference proceedings

Syrielle Montariol, Arij Riabi and Djamé Seddah. 2022. Multilingual Auxiliary Tasks Training: Bridging the Gap between Languages for Zero-Shot Transfer of Hate Speech Detection Models. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pages 347–363. Association for Computational Linguistics. Online.

Zero-shot cross-lingual transfer learning has been shown to be highly challenging for tasks involving a lot of linguistic specificities or when a cultural gap is present between languages, such as in hate speech detection. In this paper, we highlight this limitation for hate speech detection in several domains and languages using strict experimental settings. Then, we propose to train on multilingual auxiliary tasks -- sentiment analysis, named entity recognition, and tasks relying on syntactic information -- to improve zero-shot transfer of hate speech detection models across languages. We show how hate speech detection models benefit from a cross-lingual knowledge proxy brought by auxiliary tasks fine-tuning and highlight these tasks' positive impact on bridging the hate speech linguistic and cultural gap between languages.
Syrielle Montariol, Étienne Simon, Arij Riabi and Djamé Seddah. 2022. Fine-tuning and Sampling Strategies for Multimodal Role Labeling of Entities under Class Imbalance. In Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations, pages 55–65. Association for Computational Linguistics. Dublin, Ireland.

We propose our solution to the multimodal semantic role labeling task from the CONSTRAINT'22 workshop. The task aims at classifying entities in memes into classes such as "hero" and "villain". We use several pre-trained multi-modal models to jointly encode the text and image of the memes, and implement three systems to classify the role of the entities. We propose dynamic sampling strategies to tackle the issue of class imbalance. Finally, we perform qualitative analysis on the representations of the entities.
Jesujoba O Alabi, Lydia Nishimwe, Benjamin Muller, Camille Rey, Benoît Sagot and Rachel Bawden. 2022. Inria-ALMAnaCH at the WMT 2022 shared task: Does Transcription Help Cross-Script Machine Translation? In Proceedings of the Seventh Conference on Machine Translation. Association for Computational Linguistics. Abu Dhabi, United Arab Emirates.

This paper describes the Inria ALMAnaCH team submission to the WMT 2022 general translation shared task. Participating in the language directions {cs,ru,uk}→en and cs↔uk, we experiment with the use of a dedicated Latin-script transcription convention aimed at representing all Slavic languages involved in a way that maximises character- and word-level correspondences between them as well as with the English language. Our hypothesis was that bringing the source and target language closer could have a positive impact on machine translation results. We provide multiple comparisons, including bilingual and multilingual baselines, with and without transcription. Initial results indicate that the transcription strategy was not successful, resulting in lower results than baselines. We nevertheless submitted our multilingual, transcribed models as our primary systems, and in this paper provide some indications as to why we got these negative results.
Paul-Ambroise Duquenne, Hongyu Gong, Benoît Sagot and Holger Schwenk. 2022. T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. Abu Dhabi, United Arab Emirates.

We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks. Multilingual speech and text are encoded in a joint fixed-size representation space. Then, we compare different approaches to decode these multimodal and multilingual fixed-size representations, enabling zero-shot translation between languages and modalities. All our models are trained without the need for cross-modal labeled translation data. Despite a fixed-size representation, we achieve very competitive results on several text and speech translation tasks. In particular, we outperform the state of the art for zero-shot speech translation on Must-C. We also introduce the first results for zero-shot direct speech-to-speech and text-to-speech translation.
Louis Martin, Angela Fan, Éric Villemonte de la Clergerie, Antoine Bordes and Benoît Sagot. 2022. MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1651–1664. European Language Resources Association. Marseille, France.

Progress in sentence simplification has been hindered by a lack of labeled parallel simplification data, particularly in languages other than English. We introduce MUSS, a Multilingual Unsupervised Sentence Simplification system that does not require labeled simplification data. MUSS uses a novel approach to sentence simplification that trains strong models using sentence-level paraphrase data instead of proper simplification data. These models leverage unsupervised pretraining and controllable generation mechanisms to flexibly adjust attributes such as length and lexical complexity at inference time. We show that this paraphrase data can be mined in any language from Common Crawl using semantic sentence embeddings, thus removing the need for labeled data. We evaluate our approach on English, French, and Spanish simplification benchmarks and closely match or outperform the previous best supervised results, despite not using any labeled simplification data. We push the state of the art further by incorporating labeled simplification data.
Robin Algayres, Adel Nabli, Benoît Sagot and Emmanuel Dupoux. 2022. Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning. In Proceedings of the 23rd Annual Conference of the International Speech Communication Association, pages 2123–2127. Incheon, South Korea.

We introduce a simple neural encoder architecture that can be trained using an unsupervised contrastive learning objective which gets its positive samples from data-augmented k-Nearest Neighbors search. We show that when built on top of recent self-supervised audio representations [1, 2, 3], this method can be applied iteratively and yield competitive SSEs as evaluated on two tasks: query-by-example of random sequences of speech, and spoken term discovery. On both tasks our method pushes the state of the art by a significant margin across 5 different languages. Finally, we establish a benchmark on a query-by-example task on the LibriSpeech dataset to monitor future improvements in the field.
Simon Gabay, Pedro Ortiz Suarez, Rachel Bawden, Alexandre Bartz, Philippe Gambette and Benoît Sagot. 2022. Le projet FREEM : ressources, outils et enjeux pour l'étude du français d'Ancien Régime (The FREEM project: Resources, tools and challenges for the study of Ancien Régime French). In Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale, pages 154–165. ATALA. Avignon, France.

Despite their undoubted quality, the resources and tools available for the analysis of Ancien Régime French are no longer able to meet the challenges of research in linguistics and literature for this period. After having precisely defined the chronological framework, we present the corpora made available and the results obtained with them for several NLP tasks, fundamental to the study of language and literature.
Arij Riabi, Syrielle Montariol and Djamé Seddah. 2022. Tâches Auxiliaires Multilingues pour le Transfert de Modèles de Détection de Discours Haineux (Multilingual Auxiliary Tasks for Zero-Shot Cross-Lingual Transfer of Hate Speech Detection). In Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale, pages 413–423. ATALA. Avignon, France.

The task of detecting hateful content is difficult, as it requires extensive cultural and contextual knowledge; the knowledge required varies, among other things, with the language of the speaker or the target of the content. However, annotated data for specific domains and languages are often absent or limited. This is where data in other languages can be exploited; but because of these variations, cross-lingual transfer is often difficult. In this article, we highlight this limitation for several domains and languages and show the positive impact of learning multilingual auxiliary tasks (sentiment analysis, named entity recognition, and tasks relying on morpho-syntactic information) on the zero-shot cross-lingual transfer of hate speech detection models, in order to bridge this cultural gap.
Benjamin Muller, Antonios Anastasopoulos, Benoît Sagot and Djamé Seddah. 2022. Quand être absent de mBERT n'est que le commencement : Gérer de nouvelles langues à l'aide de modèles de langues multilingues (When Being Unseen from mBERT is just the Beginning : Handling New Languages With Multilingual Language Models). In Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale, pages 450–451. ATALA. Avignon, France.

Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied for unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high resource languages whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. We show that transliterating those languages significantly improves the potential of large-scale multilingual language models on downstream tasks. This result provides a promising direction towards making these massively multilingual models useful for a new set of unseen languages.
Thibault Charmet, Inès Cherichi, Matthieu Allain, Urszula Czerwinska, Amaury Fouret, Benoît Sagot and Rachel Bawden. 2022. Complex Labelling and Similarity Prediction in Legal Texts: Automatic Analysis of France's Court of Cassation Rulings. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4754–4766. European Language Resources Association. Marseille, France.

Detecting divergences in the application of the law (where the same legal text is applied differently by two rulings) is an important task. It is the mission of the French Cour de Cassation. The first step in the detection of divergences is to detect similar cases, which is currently done manually by experts. They rely on summarised versions of the rulings (syntheses and keyword sequences), which are currently produced manually and are not available for all rulings. There is also a high degree of variability in the keyword choices and the level of granularity used. In this article, we therefore aim to provide automatic tools to facilitate the search for similar rulings. We do this by (i) providing automatic keyword sequence generation models, which can be used to improve the coverage of the analysis, and (ii) providing measures of similarity based on the available texts and augmented with predicted keyword sequences. Our experiments show that the predictions improve correlations of automatically obtained similarities against our specially collected human judgments of similarity.
Francesco De Toni, Christopher Akiki, Javier De La Rosa, Clémentine Fourrier, Enrique Manjavacas, Stefan Schweter and Daniel Van Strien. 2022. Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pages 75–83. Association for Computational Linguistics. virtual+Dublin.

In this work, we explore whether the recently demonstrated zero-shot abilities of the T0 model extend to Named Entity Recognition for out-of-distribution languages and time periods. Using a historical newspaper corpus in 3 languages as test-bed, we use prompts to extract possible named entities. Our results show that a naive approach for prompt-based zero-shot multilingual Named Entity Recognition is error-prone, but highlights the potential of such an approach for historical languages lacking labeled datasets. Moreover, we also find that T0-like models can be probed to predict the publication date and language of a document, which could be very relevant for the study of historical texts.
Clémentine Fourrier and Syrielle Montariol. 2022. Caveats of Measuring Semantic Change of Cognates and Borrowings using Multilingual Word Embeddings. In Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change, pages 97–112. Association for Computational Linguistics. Dublin, Ireland.

Cognates and borrowings carry different aspects of etymological evolution. In this work, we study semantic change of such items using multilingual word embeddings, both static and contextualised. We underline caveats identified while building and evaluating these embeddings. We release both said embeddings and a newly-built historical words lexicon, containing typed relations between words of varied Romance languages.
Clémentine Fourrier and Benoît Sagot. 2022. Probing Multilingual Cognate Prediction Models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3786–3801. Association for Computational Linguistics. Dublin, Ireland.

Character-based neural machine translation models have become the reference models for cognate prediction, a historical linguistics task. So far, all linguistic interpretations about latent information captured by such models have been based on external analysis (accuracy, raw results, errors). In this paper, we investigate what probing can tell us about both models and previous interpretations, and learn that though our models store linguistic and diachronic information, they do not achieve it in previously assumed ways.
Simon Gabay, Pedro Ortiz Suarez, Alexandre Bartz, Alix Chagué, Rachel Bawden, Philippe Gambette and Benoît Sagot. 2022. From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3367–3374. European Language Resources Association. Marseille, France.

Language models for historical states of language are becoming increasingly important to allow the optimal digitisation and analysis of old textual sources. Because these historical states are at the same time more complex to process and more scarce in the corpora available, specific efforts are necessary to train natural language processing (NLP) tools adapted to the data. In this paper, we present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries). We present the FreEMmax corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on FreEMmax. We evaluate the usefulness of D'AlemBERT by fine-tuning it on a part-of-speech tagging task, outperforming previous work on the test set. Importantly, we find evidence for the transfer learning capacity of the language model, since its performance on lesser-resourced time periods appears to have been boosted by the more resourced ones. We release D'AlemBERT and the open-sourced subpart of the FreEMmax corpus.
Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou, Philippe Gambette, Benoît Sagot and Simon Gabay. 2022. Automatic Normalisation of Early Modern French. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3354–3366. European Language Resources Association. Marseille, France.

Spelling normalisation is a useful step in the study and analysis of historical language texts, whether it is manual analysis by experts or automatic analysis using downstream natural language processing (NLP) tools. Not only does it help to homogenise the variable spelling that often exists in historical texts, but it also facilitates the use of off-the-shelf contemporary NLP tools, if contemporary spelling conventions are used for normalisation. We present FREEMnorm, a new benchmark for the normalisation of Early Modern French (from the 17th century) into contemporary French and provide a thorough comparison of three different normalisation methods: ABA, an alignment-based approach, and MT approaches (both statistical and neural), including extensive parameter searching, which is often missing in the normalisation literature.
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng-Xin Yong, Harshit Pandey, Michael Mckenna, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf and Alexander M. Rush. 2022. Multitask Prompted Training Enables Zero-Shot Task Generalization. In Proceedings of the The Tenth International Conference on Learning Representations. Online.

Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language model training (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping general natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts using varying natural language. These prompted datasets allow for benchmarking the ability of a model to perform completely unseen tasks specified in natural language. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several datasets, often outperforming models 16× its size. Further, our model attains strong performance on a subset of tasks from the BIG-Bench benchmark, outperforming models 6× its size. All prompts and trained models are available at https://github.com/bigscience-workshop/promptsource/ and https://huggingface.co/bigscience/T0pp.
Julien Abadji, Pedro Ortiz Suarez, Laurent Romary and Benoît Sagot. 2022. Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4344–4355. European Language Resources Association. Marseille, France.

The need for large raw corpora has dramatically increased in recent years with the introduction of transfer learning and semi-supervised learning methods to Natural Language Processing. While there have been some recent attempts to manually curate the amount of data necessary to train large language models, the main way to obtain this data is still through automatic web crawling. In this paper we take the existing multilingual web corpus OSCAR and its pipeline Ungoliant, which extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and automatic annotations in order to produce a new document-oriented version of OSCAR that could prove more suitable to pre-train large generative language models as well as hopefully other applications in Natural Language Processing and Digital Humanities.

Communications

Simon Gabay, Rachel Bawden, Benoît Sagot and Philippe Gambette. 2022. Vers l'étude linguistique sur données artificielles. In Variation(s) en français. Nancy, France.

For decades now, several disciplines have been accustomed to working on so-called "synthetic" rather than "real" data, that is, on data generated by a computational simulation reflecting the real world. Our presentation proposes to experiment with this method in diachronic linguistics through the generation of pseudo-historical corpora. We revisit this approach, from both a methodological and a technical point of view, taking as a case study the graphical variation of French and its evolution during the Ancien Régime.
Aurélia Rostaing and Hugo Scheithauer. 2022. LectAuRep (2018-2021) : Projet de lecture automatique de répertoires de notaires. In Segmenter et annoter les images : déconstruire pour reconstruire. Paris, France.

You Zuo, Houda Mouzoun, Samir Ghamri Doudane, Kim Gerdes and Benoît Sagot. 2022. Patent Classification using Extreme Multi-label Learning: A Case Study of French Patents. In SIGIR 2022 - PatentSemTech workshop - 3rd Workshop on Patent Text Mining and Semantic Technologies. Madrid, Spain.

Most previous patent classification methods have treated the task as a general text classification task, and others have tried to implement XML (extreme multi-label learning) methods designed to handle vast numbers of classes. However, they focus only on the IPC subclass level, which has fewer than 700 labels and is far from "extreme." This paper presents a French patents corpus, INPI-CLS, extracted from the INPI internal database. It contains all parts of patent texts (title, abstract, claims, description) published from 2002 to 2021, with IPC labels at all levels. We test different XML methods and other classification models at the subclass and group levels of the INPI-CLS dataset, with about 600 and 7k labels respectively, demonstrating the validity of the XML approach for patent classification.
You Zuo, Yixuan Li, Alma Parias García and Kim Gerdes. 2022. Technological taxonomies for hypernym and hyponym retrieval in patent texts. In ToTh 2022 - Terminology & Ontology: Theories and applications. Chambéry, France.

This paper presents an automatic approach to creating taxonomies of technical terms based on the Cooperative Patent Classification (CPC). The resulting taxonomy contains about 170k nodes in 9 separate technological branches and is freely available. We also show that a Text-to-Text Transfer Transformer (T5) model can be fine-tuned to generate hypernyms and hyponyms with relatively high precision, confirming the manually assessed quality of the resource. The T5 model opens the taxonomy to any new technological terms for which a hypernym can be generated, thus making the resource updateable with new terms, an essential feature for the constantly evolving field of technological terminology.
Nathan Godey, Roman Castagné, Eric Villemonte de La Clergerie and Benoît Sagot. 2022. MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates.

Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in important flaws that degrade the models' downstream performance and robustness. In this work, we propose MANTa, a Module for Adaptive Neural TokenizAtion. MANTa is a differentiable tokenizer trained end-to-end with the language model. The resulting system offers a trade-off between the expressiveness of byte-level models and the speed of models trained using subword tokenization. In addition, our tokenizer is highly explainable since it produces an explicit segmentation of sequences into blocks. We evaluate our pretrained model on several English datasets from different domains as well as on synthetic noise. We find that MANTa improves robustness to character perturbations and out-of-domain data. We then show that MANTa performs comparably to other models on the general-domain GLUE benchmark. Finally, we show that it is considerably faster than strictly byte-level models.
Laurent Romary and Hugo Scheithauer. 2022. DataCatalogue : enjeux et réalisations. In Un outil numérique pour interroger les catalogues de vente : le projet DataCatalogue. Paris, France.

Ghazi Felhi, Joseph Le Roux and Djamé Seddah. 2022. Exploiting Inductive Bias in Transformers for Unsupervised Disentanglement of Syntax and Semantics with VAEs. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5763–5776. Association for Computational Linguistics. Seattle, United States.

Aurélia Rostaing and Hugo Scheithauer. 2022. Enrichir le patrimoine écrit archivistique grâce aux technologies numériques : Ingénierie du projet LectAuRep (Lecture automatique de répertoires). In DHNord 2022 - Travailler en Humanités Numériques : collaborations, complémentarités et tensions. Online, France.

Floriane Chiffoleau and Hugo Scheithauer. 2022. From a collection of documents to a published edition: how to use an end-to-end publication pipeline. In TEI 2022 - Text Encoding Initiative 2022 Conference. Newcastle, United Kingdom.

The goal of the workshop is to demonstrate how a corpus can be processed for publication with TEI Publisher. Participants will experiment with a ready-to-use solution that provides easy and quick publication of a corpus, and will get tips and shortcuts to help speed up the creation of a digital edition. By the end of the session, participants will have a visualization of their respective corpora, with transformed text and original image displayed side by side, showing what can be achieved when working with TEI in the context of an end-to-end publication pipeline.
Ariane Pinche, Kelly Christensen and Simon Gabay. 2022. Between automatic and manual encoding. In TEI 2022 conference : Text as data. Newcastle, United Kingdom.

Cultural heritage institutions today aim to digitise their collections of prints and manuscripts (Bermès 2020) and are generating more and more digital images (Gray 2009). To enrich these images, many institutions work with standardised formats such as IIIF, preserving as much of the source's information as possible. To take full advantage of textual documents, an image alone is not enough. Thanks to automatic text recognition technology, it is now possible to extract images' content on a large scale. The TEI seems to provide the perfect format to capture both an image's formal and textual data (Janès et al. 2021). However, this poses a problem. To ensure compatibility with a range of use cases, TEI XML files must guarantee IIIF or RDF exports and therefore must be based on strict data structures that can be automated. But a rigid structure contradicts the basic principles of philology, which require maximum flexibility to cope with various situations. The solution proposed by the Gallic(orpor)a project attempted to deal with such a contradiction, focusing on French historical documents produced between the 15th and the 18th c. It aims to enrich the digital facsimiles distributed by the French National Library (BnF).
Alix Chagué, Hugo Scheithauer, Lucas Terriel, Floriane Chiffoleau and Yves Tadjo-Takianpi. 2022. Take a sip of TEI and relax: a proposition for an end-to-end workflow to enrich and publish data created with automatic text recognition. In Digital Humanities 2022 : Responding to Asian Diversity. Tokyo, Japan.

Loïc Grobol, Mathilde Regnault, Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary and Benoit Crabbé. 2022. BERTrade: Using Contextual Embeddings to Parse Old French. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1104–1113. European Language Resources Association. Marseille, France.

The successes of contextual word embeddings learned by training large-scale language models, while remarkable, have mostly occurred for languages where significant amounts of raw texts are available and where annotated data in downstream tasks have a relatively regular spelling. Conversely, it is not yet completely clear if these models are also well suited for lesser-resourced and more irregular languages. We study the case of Old French, which is in the interesting position of having relatively limited amount of available raw text, but enough annotated resources to assess the relevance of contextual word embedding models for downstream NLP tasks. In particular, we use POS-tagging and dependency parsing to evaluate the quality of such models in a large array of configurations, including models trained from scratch from small amounts of raw text and models pre-trained on other languages but fine-tuned on Medieval French data.
Alix Chagué and Thibault Clérice. 2022. Sharing HTR datasets with standardized metadata: the HTR-United initiative. In Documents anciens et reconnaissance automatique des écritures manuscrites. Paris, France.

Simon Gabay, Rachel Bawden, Philippe Gambette, Jonathan Poinhos, Eleni Kogkitsidou and Benoît Sagot. 2022. Le changement linguistique au XVIIe s. : nouvelles approches scriptométriques. In CMLF 2022 - 8e Congrès Mondial de Linguistique Française 138, pages 02006.1–14. EDP Sciences. Orléans, France.

Linguistic change in 17th c. France: new scriptometric approaches. The end of the 17th c. remains a blind spot of research on the spelling system, despite the importance of this period for French, during which a strict norm, still (more or less) in place today, was created and imposed. Focusing on a practical rather than a theoretical approach, we propose to lay the foundations for a computational scriptometric study of early modern French and analyse the evolution of the spelling system over the 17th c. To do so, we measure and evaluate the distance between the early modern and the contemporary versions of the language, thanks to two automatic normalisers: one rule-based and the other neural.
Hugo Scheithauer. 2022. LectAuRep : Données d'archives en français des XIXe et XXe siècles. In Transkribus / eScriptorium : Transcrire, annoter et éditer numériquement des documents d'archives. Paris, France.

Alix Chagué. 2022. Corpus, méthodes et ressources pour la transcription automatique des documents manuscrits patrimoniaux francophones contemporains. In 89e Congrès de l'Acfas, Section 310 - Le numérique dans les sciences humaines : édition et visualisation. Montréal, Canada.

A five-minute summary of the doctoral research project entitled "Corpus, méthodes et ressources pour la transcription automatique des documents manuscrits patrimoniaux francophones contemporains", begun in November 2021 and awarded the 2022 Excellence Grant of the GREN. The talk placed the project in the context of the current availability of mainstream software for the automatic transcription of handwritten documents, and the lack of conceptual and methodological resources needed to take full advantage of it. One of the main difficulties discussed was the convergence of practices towards interoperable models and data.
Florence Clavaud, Laurent Romary, Pauline Charbonnier, Lucas Terriel, Gaetano Piraino and Vincent Verdese. 2022. NER4Archives (named entity recognition for archives) : Conception et réalisation d'un outil de détection, de classification et de résolution des entités nommées dans les instruments de recherche archivistiques encodés en XML/EAD. In Atelier Culture-INRIA. Pierrefitte sur Seine, France.

Hugo Scheithauer, Laurent Romary, Frédérique Duyrat and Federico Nurra. 2022. DataCatalogue : présentation du projet. In Atelier Culture-Inria. Pierrefitte-sur-Seine, France.

Presentation on the DataCatalogue project, jointly led by Inria, the National Library of France (BnF) and the National Institute for Art History (INHA), at the "journée Atelier culture-Inria," held at the Archives nationales on 03/22/2022.
Ghazi Felhi, Joseph Le Roux and Djamé Seddah. 2022. Towards Unsupervised Content Disentanglement in Sentence Representations via Syntactic Roles. In CtrlGen: Controllable Generative Modeling in Language and Vision. virtual, France.

Linking neural representations to linguistic factors is crucial in order to build and analyze NLP models interpretable by humans. Among these factors, syntactic roles (e.g. subjects, direct objects, ...) and their realizations are essential markers since they can be understood as a decomposition of predicative structures and thus the meaning of sentences. Starting from a deep probabilistic generative model with attention, we measure the interaction between latent variables and realizations of syntactic roles, and show that it is possible to obtain, without supervision, representations of sentences where different syntactic roles correspond to clearly identified different latent variables. The probabilistic model we propose is an Attention-Driven Variational Autoencoder (ADVAE). Drawing inspiration from Transformer-based machine translation models, ADVAEs enable the analysis of the interactions between latent variables and input tokens through attention. We also develop an evaluation protocol to measure disentanglement with regard to the realizations of syntactic roles. This protocol is based on attention maxima for the encoder and on disturbing individual latent variables for the decoder. Our experiments on raw English text from the SNLI dataset show that i) disentanglement of syntactic roles can be induced without supervision, ii) ADVAE separates more syntactic roles than classical sequence VAEs, iii) realizations of syntactic roles can be separately modified in sentences by mere intervention on the associated latent variables. Our work constitutes a first step towards unsupervised controllable content generation. The code for our work is publicly available.

Book chapters

Jack Bowers. 2022. Pathways and patterns of metaphor and metonymy in Mixtepec-Mixtec body-part terms. In The Grammar of Body-Part Expressions: A view from the Americas, pages 91–135. Roberto Zariquiey.

Other

Alix Chagué. 2022. Intelligence Artificielle et intelligence collective : des nouveaux eldorados pour rendre les textes patrimoniaux plus accessibles ?

Alix Chagué. 2022. Conditions de la mutualisation : les principes FAIR et HTR-United.

Preprints

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina Mcmillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco de Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro von Werra, Leon Weber, Long Phan, Loubna Ben Allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad A. 
Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor, Stanislav Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Elizabeth Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-Shaibani, Matteo Manica, Nihal Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, Stephen H. 
Bach, Taewoon Kim, Tali Bers, Thibault Fevry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiangru Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Yallow Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan Garrette, Deepak Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Genta Indra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar van der Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shani Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh Hajihosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos Muñoz Ferrandis, Danish Contractor, David Lansky, Davis David, Douwe Kiela, Duong A. 
Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu, Habib Rezanejad, Hessie Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jesse Passmore, Josh Seltzer, Julio Bonis Sanz, Karen Fort, Livia Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael Mckenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Ran An, Rasmus Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas Wang, Sourav Roy, Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo Yang, Zach Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Singh, Benjamin Beilharz, Bo Wang, Caio Brito, Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas Golde, Jose David Posada, Karthik Rangasai Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, Maria A Castillo, Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael Cullan, Michael Weinberg, Michiel de Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-Aroonsiri, Srishti Kumar, Stefan Schweter, Sushil Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada and Thomas Wolf. 2022. 
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. Preprint.

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
Yu Lu Liu, Rachel Bawden, Thomas Scialom, Benoît Sagot and Jackie Chi Kit Cheung. 2022. MaskEval: Weighted MLM-Based Evaluation for Text Summarization and Simplification. Preprint.

In text summarization and simplification, system outputs must be evaluated along multiple dimensions such as relevance, factual consistency, fluency, and grammaticality, and a wide range of possible outputs could be of high quality. These properties make the development of an adaptable, reference-less evaluation metric both necessary and challenging. We introduce MaskEval, a reference-less metric for text summarization and simplification that operates by performing masked language modeling (MLM) on the concatenation of the candidate and the source texts. It features an attention-like weighting mechanism to modulate the relative importance of each MLM step, which crucially allows it to be adapted to evaluate different quality dimensions. We demonstrate its effectiveness on English summarization and simplification in terms of correlations with human judgments, and explore transfer scenarios between the two tasks.
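The scoring scheme described in this abstract can be illustrated with a minimal sketch. This is a toy stand-in, not the authors' implementation: MaskEval uses a pretrained masked language model and learned attention-like weights, whereas here `token_prob` is a hypothetical callback supplying the probability a model would assign to each masked-out token, and the weights are passed in directly.

```python
import math

def weighted_mlm_score(candidate_tokens, source_tokens, token_prob, weights=None):
    """Score a candidate against its source: mask each token of the
    concatenated sequence in turn, ask the (stand-in) model for the
    probability of the true token, and return a weighted average of
    the log-probabilities."""
    sequence = candidate_tokens + source_tokens
    if weights is None:
        # uniform weighting when no attention-like weights are supplied
        weights = [1.0] * len(sequence)
    total = sum(weights)
    score = 0.0
    for i, (tok, w) in enumerate(zip(sequence, weights)):
        masked = sequence[:i] + ["[MASK]"] + sequence[i + 1:]
        score += w * math.log(token_prob(masked, i, tok))
    return score / total

# Toy stand-in "model": a fixed probability per token, ignoring context.
probs = {"the": 0.9, "cat": 0.5, "sat": 0.4}
prob_fn = lambda masked_seq, i, tok: probs.get(tok, 0.1)
score = weighted_mlm_score(["the", "cat"], ["the", "cat", "sat"], prob_fn)
```

In the real metric, the per-step weights would come from the learned weighting mechanism, which is what lets the same masked-LM scores be re-aggregated for different quality dimensions.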
Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoît Sagot, Abdelrahman Mohamed and Emmanuel Dupoux. 2022. Generative Spoken Dialogue Language Modeling. Preprint.

We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. It is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces naturalistic turn taking. Generation samples can be found at: https://speechbot.github.io/dgslm.
Thibault Clérice, Malamatenia Vlachou-Efstathiou and Alix Chagué. 2022. CREMMA Medii Aevi: Literary manuscript text recognition in Latin. Preprint.

This paper presents a novel segmentation and handwritten text recognition dataset for medieval Latin, from the 11th to the 16th century. It connects with the medieval French datasets as well as earlier Latin datasets by enforcing common guidelines. We provide our own additions to Ariane Pinche's Old French guidelines to deal with specific Latin cases. We also offer an overview of how we addressed the compilation of this dataset through the use of pre-existing resources. With a higher abbreviation ratio and a better representation of abbreviation marks, we offer new models that outperform the base Old French model on the Latin dataset, reaching readability levels on unknown manuscripts.
Floriane Chiffoleau and Anne Baillot. 2022. Le projet DAHN : une pipeline pour l'édition numérique de documents d'archives. Preprint.

Angelina Mcmillan-Major, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Francesco de Toni, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Zeerak Talat, Daniel van Strien and Yacine Jernite. 2022. Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources. Preprint.

In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficient documentation and tools for analysis. Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons. We present our development process; analyses of the resulting resource metadata, including distributions over languages, regions, and resource types; and our lessons learned in this endeavor.
Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot and Samson Tan. 2022. Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP. Preprint.

What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating those as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level modeling or byte-level processing? In this survey, we connect several lines of work from the pre-neural and neural eras by showing how hybrid approaches combining words and characters, as well as subword-based approaches based on learned segmentation, have been proposed and evaluated. We conclude that there is not, and likely never will be, a silver-bullet solution for all applications, and that thinking seriously about tokenization remains important for many applications.
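The byte-pair-encoding approach the survey takes as its starting point can be sketched in a few lines. This is a toy illustration, not code from the survey: given a frequency dictionary of words (each starting as a sequence of characters), it repeatedly merges the most frequent adjacent symbol pair.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a dict {word: frequency}."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
# The learned merges fuse the most frequent pairs first, e.g. "l"+"o", then "lo"+"w".
```

Subword tokenizers used in practice add details this sketch omits (end-of-word markers, byte-level fallback, tie-breaking rules), but the core learned-merge loop is the same.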

2021

PhD theses and Habilitations

Louis Martin. 2021. Automatic sentence simplification using controllable and unsupervised methods. PhD thesis. Sorbonne Université.

In this thesis we study the task of automatic sentence simplification. We first study the different methods used to evaluate simplification models, highlight several shortcomings of current approaches, and propose new contributions. We then propose to train sentence simplification models that can be adapted to the target user, allowing for greater simplification flexibility. Finally, we extend the scope of sentence simplification to several languages, by proposing methods that do not require annotated training data, but that nevertheless achieve very strong performance.

Journal articles

Frank Uiterwaal, Franco Niccolucci, Sheena Bassett, Steven Krauwer, Hella Hollander, Femmy Admiraal, Laurent Romary, George Bruseker, Carlo Meghini, Jennifer Edmond and Mark Hedges. 2021. From disparate disciplines to unity in diversity: How the PARTHENOS project has brought European humanities Research Infrastructures together. International Journal of Humanities and Arts Computing 15, pages 101–116. Edinburgh University Press.

Since the first ESFRI roadmap in 2006, multiple humanities Research Infrastructures (RIs) have been set up all over the European continent, supporting archaeologists (ARIADNE), linguists (CLARIN-ERIC), Holocaust researchers (EHRI), cultural heritage specialists (IPERION-CH) and others. These examples only scratch the surface of the breadth of research communities that have benefited from close cooperation in the European Research Area. While each field developed discipline-specific services over the years, common themes can also be distinguished. All humanities RIs address, in varying degrees, questions around research data management, the use of standards and the desired interoperability of data across disciplinary boundaries. This article sheds light on how the cluster project PARTHENOS developed pooled services and shared solutions for its audience of humanities researchers, RI managers and policymakers. In a time where the convergence of existing infrastructure is becoming ever more important – with the construction of a European Open Science Cloud as an audacious, ultimate goal – we hope that our experiences inform future work and provide inspiration on how to exploit synergies in interdisciplinary, transnational, scientific cooperation.
Rachel Bawden. 2021. [Book Review] Understanding Dialogue: Language Use and Social Interaction. Computational Linguistics. Massachusetts Institute of Technology Press (MIT Press).

Luca Foppiano, Sae Dieb, Akira Suzuki, Pedro Baptista de Castro, Suguru Iwasaki, Azusa Uzuki, Miren Garbine Esparza Echevarria, Yan Meng, Kensei Terashima, Laurent Romary, Yoshihiko Takano and Masashi Ishii. 2021. SuperMat: Construction of a linked annotated dataset from superconductors-related publications. Science and Technology of Advanced Materials: Methods 1. Taylor & Francis.

A growing number of papers are published in the area of superconducting materials science. However, novel text and data mining (TDM) processes are still needed to efficiently access and exploit this accumulated knowledge, paving the way towards data-driven materials design. Herein, we present SuperMat (Superconductor Materials), an annotated corpus of linked data derived from scientific publications on superconductors, which comprises 142 articles, 16052 entities, and 1398 links that are characterised into six categories: the names, classes, and properties of materials; links to their respective superconducting critical temperature (Tc); and parametric conditions such as applied pressure or measurement methods. The construction of SuperMat resulted from a fruitful collaboration between computer scientists and material scientists, and its high quality is ensured through validation by domain experts. The quality of the annotation guidelines was ensured by satisfactory Inter Annotator Agreement (IAA) between the annotators and the domain experts. SuperMat includes the dataset, annotation guidelines, and annotation support tools that use automatic suggestions to help minimise human errors.
Naomi Truan and Laurent Romary. 2021. Building, Encoding, and Annotating a Corpus of Parliamentary Debates in XML-TEI: A Cross-Linguistic Account. Journal of the Text Encoding Initiative. TEI Consortium.

This data paper introduces an integrative and comprehensive method for the linguistic annotation of parliamentary discourse. Initially conceived as a documentation for a specific and rather small-scale research project, the annotation scheme takes into account national specificities and is geared to proposing an annotation scheme that is both highly standardised and adaptable to other research contexts. The paper reads as a specific application of the Text Encoding Initiative (TEI) framework applied to a subset of parliamentary debates. This strategy has two main applications: first, to develop a model for the encoding of parliamentary corpora by providing a systematic way of annotating both elements within the text (e.g. turns, incidents, interruptions) and the metadata associated with it (e.g. variables pertaining to the speaker or the speech event); second, to provide a cross-linguistic empirical basis for further annotation projects.

Conference proceedings

José Carlos Rosales Núñez, Djamé Seddah and Guillaume Wisniewski. 2021. Understanding the Impact of UGC Specificities on Translation Quality. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 189–198. Association for Computational Linguistics. Online.

This work takes a critical look at the evaluation of user-generated content automatic translation, the well-known specificities of which raise many challenges for MT. Our analyses show that measuring the average-case performance using a standard metric on a UGC test set falls far short of giving a reliable image of the UGC translation quality. That is why we introduce a new data set for the evaluation of UGC translation in which UGC specificities have been manually annotated using a fine-grained typology. Using this data set, we conduct several experiments to measure the impact of different kinds of UGC specificities on translation quality, more precisely than previously possible.
José Carlos Rosales Núñez, Guillaume Wisniewski and Djamé Seddah. 2021. Noisy UGC Translation at the Character Level: Revisiting Open-Vocabulary Capabilities and Robustness of Char-Based Models. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 199–211. Association for Computational Linguistics. Online.

This work explores the capacities of character-based Neural Machine Translation to translate noisy User-Generated Content (UGC) with a strong focus on exploring the limits of such approaches to handle productive UGC phenomena, which, almost by definition, cannot be seen at training time. Within a strict zero-shot scenario, we first study the detrimental impact on translation performance of various user-generated content phenomena on a small annotated dataset we developed, and then show that such models are indeed incapable of handling unknown letters, which leads to catastrophic translation failure once such characters are encountered. We further confirm this behavior with a simple, yet insightful, copy task experiment and highlight the importance of reducing the vocabulary size hyper-parameter to increase the robustness of character-based models for machine translation.
Ghazi Felhi, Joseph Le Roux and Djamé Seddah. 2021. Challenging the Semi-Supervised VAE Framework for Text Classification. In Second Workshop on Insights from Negative Results in NLP (colocated with EMNLP). Association for Computational Linguistics. Punta Cana, Dominican Republic.

Semi-Supervised Variational Autoencoders (SSVAEs) are widely used models for data efficient learning. In this paper, we question the adequacy of the standard design of sequence SSVAEs for the task of text classification as we exhibit two sources of overcomplexity for which we provide simplifications. These simplifications to SSVAEs preserve their theoretical soundness while providing a number of practical advantages in the semi-supervised setup where the result of training is a text classifier. These simplifications are the removal of (i) the Kullback-Leibler divergence from its objective and (ii) the fully unobserved latent variable from its probabilistic model. These changes relieve users from choosing a prior for their latent variables, make the model smaller and faster, and allow for a better flow of information into the latent variables. We compare the simplified versions to standard SSVAEs on 4 text classification tasks. On top of the above-mentioned simplifications, experiments show a speed-up of 26%, while keeping equivalent classification scores. The code to reproduce our experiments is public.
Arij Riabi, Benoît Sagot and Djamé Seddah. 2021. Can Character-based Language Models Improve Downstream Task Performances In Low-Resource And Noisy Language Scenarios? In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 423–436. Association for Computational Linguistics. Online.

Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results on a much larger data set of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability settings.
Lana Yeganova, Dina Wiemann, Mariana Neves, Federica Vezzani, Amy Siu, Inigo Jauregi Unanue, Maite Oronoz, Nancy Mah, Aurélie Névéol, David Martinez, Rachel Bawden, Giorgio Maria Di Nunzio, Roland Roller, Philippe Thomas, Cristian Grozea, Olatz Perez-de-Viñaspre, Maika Vicente Navarro and Antonio Jimeno Yepes. 2021. Findings of the WMT 2021 Biomedical Translation Shared Task: Summaries of Animal Experiments as New Test Set. In Proceedings of the Sixth Conference on Machine Translation, pages 664–683. Association for Computational Linguistics. Online.

In the sixth edition of the WMT Biomedical Task, we addressed a total of eight language pairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian, English/Italian, and English/Basque. Further, our tests were composed of three types of textual test sets. New to this year, we released a test set of summaries of animal experiments, in addition to the test sets of scientific abstracts and terminologies. We received a total of 107 submissions from 15 teams from 6 countries.
Lionel Tadonfouet Tadjou, Fabrice Bourge, Tiphaine Marie, Laurent Romary and Eric Villemonte de La Clergerie. 2021. Building A Corporate Corpus For Threads Constitution. In Student Research Workshop associated with the International Conference on Recent Advances in Natural Language Processing (RANLP'2021). Online, Bulgaria.

In this paper we describe the process of building a corporate corpus that will be used as a reference for modelling and computing threads from conversations generated using communication and collaboration tools. The overall goal of the reconstruction of threads is to be able to provide value to the collaborator in various use cases, such as highlighting the important parts of a running discussion, reviewing the upcoming commitments or deadlines, etc. Since, to our knowledge, there is no available corporate corpus for the French language which could allow us to address this problem of thread constitution, we present here a method for building such corpora, including the different aspects and steps which allowed the creation of a pipeline to pseudo-anonymise data. Such a pipeline is a response to the constraints induced by the General Data Protection Regulation (GDPR) in Europe and compliance with the secrecy of correspondence.
Simon Gabay, Barbara Topalov, Caroline Corbières, Lucie Rondeau Du Noyer, Béatrice Joyeux-Prunel and Laurent Romary. 2021. Automating Artl@s–extracting data from exhibition catalogues. In EADH 2021 - Second International Conference of the European Association for Digital Humanities. Krasnoyarsk, Russia.

Catalogues, which have been published for centuries, are an extremely precious resource for scholars. Using the Artl@s database as an example, where exhibition catalogues are transformed into a georeferenced database, we question the possibility of an (almost) automatic transformation of pdfs into semantically annotated data. To do so, we present and analyse the graphic organisation of exhibition catalogues, before exploring a possible modeling into TEI (involving possible enhancement of the guidelines).
Julien Abadji, Pedro Javier Ortiz Suárez, Laurent Romary and Benoît Sagot. 2021. Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus. In CMLC 2021 - 9th Workshop on Challenges in the Management of Large Corpora. Limerick / Virtual, Ireland.

Since the introduction of large language models in Natural Language Processing, large raw corpora have played a crucial role in Computational Linguistics. However, most of these large raw corpora are either available only for English or not available to the general public due to copyright issues. Nevertheless, there are some examples of freely available multilingual corpora for training Deep Learning NLP models, such as the OSCAR and Paracrawl corpora. However, they have quality issues, especially for low-resource languages. Moreover, recreating or updating these corpora is very complex. In this work, we try to reproduce and improve the goclassy pipeline used to create the OSCAR corpus. We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data. Also, unlike OSCAR, the metadata information is at the document level. We release our pipeline under an open source license and publish the corpus under a research-only license.
Syrielle Montariol and Alexandre Allauzen. 2021. Transport Optimal pour le Changement Sémantique à partir de Plongements Contextualisés. In TALN 2021 - Traitement Automatique des Langues Naturelles, pages 235–244. ATALA. Lille / Virtuel, France.

Several methods for detecting semantic change using contextualised word embeddings have recently emerged. They enable a fine-grained analysis of change in word usage by aggregating contextualised embeddings into clusters that reflect the different usages of a word. We propose a new method based on optimal transport. We evaluate it on several annotated corpora, showing a gain in precision over other methods based on contextualised embeddings, and illustrate it on a corpus of newspaper articles.
Benjamin Muller, Antonios Anastasopoulos, Benoît Sagot and Djamé Seddah. 2021. When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 448–462. Association for Computational Linguistics. Online.

Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied for unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high resource languages whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. We show that transliterating those languages significantly improves the potential of large-scale multilingual language models on downstream tasks. This result provides a promising direction towards making these massively multilingual models useful for a new set of unseen languages.
Clémentine Fourrier, Rachel Bawden and Benoît Sagot. 2021. Can Cognate Prediction Be Modelled as a Low-Resource Machine Translation Task? In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 847–861. Association for Computational Linguistics. Online.

Cognate prediction is the task of generating, in a given language, the likely cognates of words in a related language, where cognates are words in related languages that have evolved from a common ancestor word. It is a task for which little data exists and which can aid linguists in the discovery of previously undiscovered relations. Previous work has applied machine translation (MT) techniques to this task, based on the tasks' similarities, without, however, studying their numerous differences or optimising architectural choices and hyper-parameters. In this paper, we investigate whether cognate prediction can benefit from insights from low-resource MT. We first compare statistical MT (SMT) and neural MT (NMT) architectures in a bilingual setup. We then study the impact of employing data augmentation techniques commonly seen to give gains in low-resource MT: monolingual pretraining, backtranslation and multilinguality. Our experiments on several Romance languages show that cognate prediction behaves only to a certain extent like a standard low-resource MT task. In particular, MT architectures, both statistical and neural, can be successfully used for the task, but using supplementary monolingual data is not always as beneficial as using additional language data, contrary to what is observed for MT.
Benjamin Muller, Yanai Elazar, Benoît Sagot and Djamé Seddah. 2021. First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2214–2231. Association for Computational Linguistics. Online.

Multilingual pretrained language models have demonstrated remarkable zero-shot cross-lingual transfer capabilities. Such transfer emerges by fine-tuning on a task of interest in one language and evaluating on a distinct language, not seen during the fine-tuning. Despite promising results, we still lack a proper understanding of the source of this transfer. Using a novel layer ablation technique and analyses of the model's internal representations, we show that multilingual BERT, a popular multilingual language model, can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor. While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little importance on the transfer and can be reinitialized during fine-tuning. We present extensive experiments with three distinct tasks, seventeen typologically diverse languages and multiple domains to support our hypothesis.
Rute Costa, Ana Salgado, Anas Fahad Khan, Sara Carvalho, Laurent Romary, Bruno Almeida, Margarida Ramos, Mohamed Khemakhem, Raquel Silva and Toma Tasovac. 2021. MORDigital: The Advent of a New Lexicographical Portuguese Project. In eLex 2021 - Seventh biennial conference on electronic lexicography. Brno, Czech Republic.

MORDigital is a newly funded Portuguese lexicographical project that aims to produce high-quality and searchable digital versions of the first three editions (1789; 1813; 1823) of the Diccionario da Lingua Portugueza by António de Morais Silva, preserving and making accessible this important work of European heritage. This paper will describe the current state of the art, the project, its objectives and the methodology proposed, the latter of which is based on a rigorous linguistic analysis and will also include steps necessary for the ontologisation of knowledge contained in and relating to the text. A section will be dedicated to the project's various domains of investigation. The output of the project will be made available via a dedicated platform.
Antoine Gérard, Benoît Sagot and Emilie Pons. 2021. Le Traitement Automatique des Langues au service du vin. In Dataquitaine 2021 - IA, Recherche Opérationnelle & Data Science. Bordeaux / Virtual, France.

In this presentation, we describe a fruitful collaboration between the Inria research institute and Winespace, a Bordeaux-based startup. We focus on the semantic analysis of wine-tasting notes with the aim of recommending wines with similar characteristics.
Farid Arthaud, Rachel Bawden and Alexandra Birch. 2021. Few-shot learning through contextual data augmentation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1049–1062. Association for Computational Linguistics. Online.

Machine translation (MT) models used in industries with constantly changing topics, such as translation or news agencies, need to adapt to new data to maintain their performance over time. Our aim is to teach a pre-trained MT model to translate previously unseen words accurately, based on very few examples. We propose (i) an experimental setup allowing us to simulate novel vocabulary appearing in human-submitted translations, and (ii) corresponding evaluation metrics to compare our approaches. We extend a data augmentation approach using a pre-trained language model to create training examples with similar contexts for novel words. We compare different fine-tuning and data augmentation approaches and show that adaptation on the scale of one to five examples is possible. Combining data augmentation with randomly selected training sentences leads to the highest BLEU score and accuracy improvements. Impressively, with only 1 to 5 examples, our model reports better accuracy scores than a reference system trained with on average 313 parallel examples.

Communications

Alix Chagué. 2021. CREMMA : Une infrastructure mutualisée pour la reconnaissance d'écritures manuscrites et la patrimonialisation numérique. In Sciences du patrimoine - sciences du texte. Confrontation des méthodes. Paris, France.

Hugo Scheithauer, Alix Chagué, Aurélia Rostaing, Lucas Terriel, Laurent Romary, Marie-Françoise Limon-Bonnet, Benjamin Davy, Gaetano Piraino, Franck Beltrami, Danis Habib, Nathalie Denis and Marc Durand. 2021. Production d'un modèle affiné de reconnaissance d'écriture manuscrite avec eScriptorium et évaluation de ses performances. In Les Futurs Fantastiques - 3e Conférence Internationale sur l'Intelligence Artificielle appliquée aux Bibliothèques, Archives et Musées, AI4LAM. Paris, France.

For this workshop, participants will take part in the fine-tuning of a handwritten text recognition (HTR) model with eScriptorium. Fine-tuning a model means retraining an initial generic model with a new dataset in order to specialize it in a particular domain.
Hugo Scheithauer, Alix Chagué and Laurent Romary. 2021. From eScriptorium to TEI Publisher. In Brace your digital scholarly edition!. Berlin, Germany.

Lucas Terriel. 2021. Atelier : Production d'un modèle affiné de reconnaissance d'écriture manuscrite avec eScriptorium et évaluation de ses performances. Évaluer son modèle HTR/OCR avec KaMI (Kraken as Model Inspector). In Les Futurs Fantastiques - 3e Conférence Internationale sur l'Intelligence Artificielle appliquée aux Bibliothèques, Archives et Musées. Paris, France.

Pauline Charbonnier, Lucas Terriel, Florence Clavaud, Laurent Romary, Gaetano Piraino and Vincent Verdese. 2021. NER4Archives (named entity recognition for archives) : méthodes et outils semi-automatiques pour reconnaître les entités nommées dans les instruments de recherche archivistiques encodés en XML/EAD. In Les Futurs Fantastiques - 3e Conférence Internationale sur l'Intelligence Artificielle appliquée aux Bibliothèques, Archives et Musées. Paris, France.

Alix Chagué and Aurélia Rostaing. 2021. LECTAUREP : Lecture Automatique des Répertoires de Notaires Parisiens. In Fantastic Futures 2021 / Futures Fantastiques 2021. Paris, France.

Alix Chagué and Aurélia Rostaing. 2021. LECTAUREP: Paris Notary Record Books Automated Reading. In Fantastic Futures 2021 / Futures Fantastiques 2021. Paris, France.

Floriane Chiffoleau, Anne Baillot and Manon Ovide. 2021. A TEI-based publication pipeline for historical egodocuments - the DAHN project. In Next Gen TEI, 2021 - TEI Conference and Members' Meeting. Virtual, United States.

Alix Chagué, Thibault Clérice and Laurent Romary. 2021. HTR-United : Mutualisons la vérité de terrain ! In DHNord2021 - Publier, partager, réutiliser les données de la recherche : les data papers et leurs enjeux. Lille, France.

Hugo Scheithauer, Alix Chagué, Simon Gabay, Laurent Romary, Juliette Janes and Claire Jahan. 2021. From page to content – which TEI representation for HTR output? In Next Gen TEI, 2021 - TEI Conference and Members' Meeting. Wheaton (virtual), United States.

Alexandre Bartz, Juliette Janes, Laurent Romary, Philippe Gambette, Rachel Bawden, Pedro Javier Ortiz Suárez, Benoît Sagot and Simon Gabay. 2021. Expanding the content model of annotationBlock. In Next Gen TEI, 2021 - TEI Conference and Members' Meeting. Virtual, United States.

Simon Gabay, Philippe Gambette, Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou and Benoît Sagot. 2021. Variation graphique dans les documents d'Ancien Régime : Nouvelles approches scriptométriques. In Journée d'étude : « Pour une histoire de la langue ‘par en bas' : textes privés et variation des langues dans le passé ». Paris, France.

Jean-Damien Généro, Alix Chagué, Victoria Le Fourner and Marie Puren. 2021. Transcribing and editing digitized sources on work in the textile industry. In Rémunérations et usages du temps des hommes et des femmes dans le textile en France de la fin du XVIIe au début du XXe siècle. Lyon, France.

Historians have been using digital tools for several decades. The Time-Us project is part of this long tradition, developing experimental methods for the automatic transcription (OCR) and structuring (XML) of handwritten archival documents and book collections. The sets chosen to illustrate this work are the minutes of the Conseil des prud'hommes de Paris (1847-1848, 1858, 1878) and the monographs of the Ouvriers des deux mondes (1857-1913, 1930). Two stages will be presented. The first is the process of analysing and reproducing logical structures (minutes of the labour court hearings and sections of the monographs), conducted on a ridge between the machine (automation of tasks) and the human hand (manual verifications and corrections). The second is the extraction of textile-related information from the monographs and its availability to researchers. Finally, proposals will be made regarding the possible uses of digital technology in research programs.
Simon Gabay and Pedro Javier Ortiz Suárez. 2021. A dataset for automatic detection of places in (early) modern French texts. In Proceedings of the 50th Annual North American Society for Seventeenth-Century French Literature Conference. Online.

Alix Chagué and Floriane Chiffoleau. 2021. An accessible and transparent pipeline for publishing historical egodocuments. In WPIP21 - What's Past is Prologue: The NewsEye International Conference. Virtual, Austria.

The automatization of the processing of documents oriented towards online publication and exploration by the humanities increases the rapidity of treatments like transcription, but it should also be an opportunity to make the experimentation and the resulting corpora sustainable and reusable. The DAHN project (Dispositif de soutien à l’Archivistique et aux Humanités Numériques) relies on a joint interdisciplinary collaboration between Inria, the EHESS and the University of Le Mans. By taking the example of egodocuments, the project aims to create a ready-to-use digital and scientific publishing pipeline going from the material archive to an online publication. In this presentation, we introduce our method and guidelines for the processing of non-digital-native textual documents using open-source and easily hackable tools that guarantee visibility across an accessible pipeline, thus challenging the notions of a black box or scattered tools which tend to be hard to maintain in the long run.
Alix Chagué and Aurélia Rostaing. 2021. Présentation du projet Lectaurep (Lecture automatique de répertoires). In Atelier sur la transcription des écritures manuscrites - BnF DataLab. Paris, France.

Arij Riabi, Thomas Scialom, Rachel Keraron, Benoît Sagot, Djamé Seddah and Jacopo Staiano. 2021. Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Punta Cana, Dominican Republic.

Coupled with the availability of large scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of those datasets are in English, and the performances of state-of-the-art multilingual models are significantly lower when evaluated on non-English data. Due to high data collection costs, it is not realistic to obtain annotated data for each language one desires to support. We propose a method to improve the Cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method allows us to significantly outperform the baselines trained on English data only. We report a new state-of-the-art on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).

Tech reports

Julien Launay, Elena Tommasone, Baptiste Pannier, François Boniface, Amélie Chatelain, Alessandro Cappelli, Iacopo Poli and Djamé Seddah. 2021. PAGnol: An Extra-Large French Generative Model. Technical report.

Access to large pre-trained models of varied architectures, in many different languages, is central to the democratization of NLP. We introduce PAGnol, a collection of French GPT models. Using scaling laws, we efficiently train PAGnol-XL (1.5B parameters) with the same computational budget as CamemBERT, a model 13 times smaller. PAGnol-XL is the largest model trained to date for the French language. We plan to train increasingly large and better-performing versions of PAGnol, exploring the capabilities of French extreme-scale models. For this first release, we focus on the pre-training and scaling calculations underlying PAGnol. We fit a scaling law for compute for the French language, and compare it with its English counterpart. We find the pre-training dataset significantly conditions the quality of the outputs, with common datasets such as OSCAR leading to low-quality offensive text. We evaluate our models on discriminative and generative tasks in French, comparing to other state-of-the-art French and multilingual models, and reaching the state of the art in the abstract summarization task. Our research was conducted on the public GENCI Jean Zay supercomputer, and our models up to the Large are made publicly available.
Toma Tasovac, Laurent Romary, Erzsébet Tóth-Czifra and Irena Marinski. 2021. Lexicographic Data Seal of Compliance. Technical report.

Other

Alix Chagué. 2021. Comment faire lire des gribouillis à mon ordinateur ?

Preprints

Laurent Romary. 2021. Normes et patrimoine numérique. Preprint.

Thomas Scialom, Louis Martin, Jacopo Staiano, Eric Villemonte de La Clergerie and Benoît Sagot. 2021. Rethinking Automatic Evaluation in Sentence Simplification. Preprint.

Automatic evaluation remains an open research question in Natural Language Generation. In the context of Sentence Simplification, this is particularly challenging: the task by nature requires replacing complex words with simpler ones that share the same meaning. This limits the effectiveness of n-gram based metrics like BLEU. Going hand in hand with the recent advances in NLG, new metrics have been proposed, such as BERTScore for Machine Translation. In summarization, the QuestEval metric proposes to automatically compare two texts by questioning them. In this paper, we first propose a simple modification of QuestEval allowing it to tackle Sentence Simplification. We then extensively evaluate the correlations w.r.t. human judgement for several metrics including the recent BERTScore and QuestEval, and show that the latter obtains state-of-the-art correlations, outperforming standard metrics like BLEU and SARI. More importantly, we also show that a large part of the correlations are actually spurious for all the metrics. To investigate this phenomenon further, we release a new corpus of evaluated simplifications, this time not generated by systems but written by humans. This allows us to remove the spurious correlations and draw very different conclusions from the original ones, resulting in a better understanding of these metrics. In particular, we raise concerns about the very low correlations of most traditional metrics. Our results show that the only significant measure of Meaning Preservation is our adaptation of QuestEval.
Alix Chagué and Floriane Chiffoleau. 2021. An accessible and transparent pipeline for publishing historical egodocuments. Preprint.

The automatization of the processing of documents oriented towards online publication and exploration by the humanities increases the rapidity of treatments like transcription, but it should also be an opportunity to make the experimentation and the resulting corpora sustainable and reusable. The DAHN project (Dispositif de soutien à l’Archivistique et aux Humanités Numériques) relies on a joint interdisciplinary collaboration between Inria, the EHESS and the University of Le Mans. By taking the example of egodocuments, the project aims to create a ready-to-use digital and scientific publishing pipeline going from the material archive to an online publication. In this presentation, we introduce our method and guidelines for the processing of non-digital-native textual documents using open-source and easily hackable tools that guarantee visibility across an accessible pipeline, thus challenging the notions of a black box or scattered tools which tend to be hard to maintain in the long run.
Benjamin Muller, Yanai Elazar, Benoît Sagot and Djamé Seddah. 2021. First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT. Preprint.

Multilingual pretrained language models have demonstrated remarkable zero-shot cross-lingual transfer capabilities. Such transfer emerges by fine-tuning on a task of interest in one language and evaluating on a distinct language, not seen during the fine-tuning. Despite promising results, we still lack a proper understanding of the source of this transfer. Using a novel layer ablation technique and analyses of the model's internal representations, we show that multilingual BERT, a popular multilingual language model, can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor. While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little importance on the transfer and can be reinitialized during fine-tuning. We present extensive experiments with three distinct tasks, seventeen typologically diverse languages and multiple domains to support our hypothesis.
Benjamin Muller, Benoît Sagot and Djamé Seddah. 2021. Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi. Preprint.

Building natural language processing systems for non-standardized and low-resource languages is a difficult challenge. The recent success of large-scale multilingual pretrained language models provides new modeling tools to tackle this. In this work, we study the ability of multilingual language models to process an unseen dialect. We take user-generated North-African Arabic as our case study, a resource-poor dialectal variety of Arabic with frequent code-mixing with French and written in Arabizi, a non-standardized transliteration of Arabic to Latin script. Focusing on two tasks, part-of-speech tagging and dependency parsing, we show in zero-shot and unsupervised adaptation scenarios that multilingual language models are able to transfer to such an unseen dialect, specifically in two extreme cases: (i) across scripts, using Modern Standard Arabic as a source language, and (ii) from a distantly related language, unseen during pretraining, namely Maltese. Our results constitute the first successful transfer experiments on this dialect, thus paving the way for the development of an NLP ecosystem for resource-scarce, non-standardized and highly variable vernacular languages.
Louis Martin, Angela Fan, Eric Villemonte de La Clergerie, Antoine Bordes and Benoît Sagot. 2021. Multilingual Unsupervised Sentence Simplification. Preprint.

Progress in Sentence Simplification has been hindered by the lack of supervised data, particularly in languages other than English. Previous work has aligned sentences from original and simplified corpora such as English Wikipedia and Simple English Wikipedia, but this limits corpus size, domain, and language. In this work, we propose using unsupervised mining techniques to automatically create training corpora for simplification in multiple languages from raw Common Crawl web data. When coupled with a controllable generation mechanism that can flexibly adjust attributes such as length and lexical complexity, these mined paraphrase corpora can be used to train simplification systems in any language. We further incorporate multilingual unsupervised pretraining methods to create even stronger models and show that by training on mined data rather than supervised corpora, we outperform the previous best results. We evaluate our approach on English, French, and Spanish simplification benchmarks and reach state-of-the-art performance with a totally unsupervised approach. We will release our models and code to mine the data in any language included in Common Crawl.
Jack Bowers, Axel Herold, Laurent Romary and Toma Tasovac. 2021. TEI Lex-0 Etym – towards terse recommendations for the encoding of etymological information. Preprint.

The present paper describes the etymological component of the TEI Lex-0 initiative, which aims at defining a terser subset of the TEI guidelines for the representation of etymological features in dictionary entries. Going beyond the basic provision of etymological mechanisms in the TEI guidelines, TEI Lex-0 Etym proposes a systematic representation of etymological and cognate descriptions by means of embedded constructs based on the <etym> (for etymologies) and <cit> (for etymons and cognates) elements. In particular, given that all the potential contents of etymons are highly analogous to those of dictionary entries in general, the contents presented herein heavily re-use many of the corresponding features and constraints introduced in other components of TEI Lex-0 for the encoding of etymologies and etymons. The TEI Lex-0 Etym model is also closely aligned with ISO 24613-3 on modelling etymological data and the corresponding TEI serialisation available in ISO 24613-4.

2020

PhD theses and Habilitations

Mohamed Khemakhem. 2020. Standard-based Lexical Models for Automatically Structured Dictionaries. PhD thesis. Université Paris Cité.

Dictionaries can be considered the most comprehensive reservoir of human knowledge, carrying not only the lexical description of words in one or more languages, but also the common awareness of a certain community about every known piece of knowledge in a given time frame. Print dictionaries are the principal resources enabling the documentation and transfer of such knowledge. They already exist in abundant numbers, and new ones are continuously compiled, even with the recent strong move to digital resources. However, a majority of these dictionaries, even when available digitally, are still not fully structured, due to the absence of scalable methods and techniques that can cover the variety of the corresponding material. Moreover, the relatively few existing structured resources offer limited exchange and query alternatives, given the discrepancy of their data models and formats. In this thesis we address the task of parsing lexical information in print dictionaries through the design of computer models that enable their automatic structuring. Solving this task goes hand in hand with finding a standardised output for these models, to guarantee maximum interoperability among resources and usability for downstream tasks. First, we present different classifications of dictionary resources to delimit the category of print dictionaries we aim to process. Second, we introduce the parsing task by providing an overview of the processing challenges and a study of the state of the art. Then, we present a novel approach based on top-down parsing of the lexical information. We also outline the architecture of the resulting system, called GROBID-Dictionaries, and the methodology we followed to close the gap between the conception of the system and its applicability to real-world scenarios. After that, we draw the landscape of the leading standards for structured lexical resources.
In addition, we provide an analysis of two ongoing initiatives, TEI Lex-0 and LMF, that aim at unifying the modelling of lexical information in print and electronic dictionaries. On that basis, we present a serialisation format that is in line with the schemes of the two standardisation initiatives and fits the approach implemented in our parsing system. After presenting the parsing and standardised serialisation facets of our lexical models, we provide an empirical study of their performance and behaviour. The investigation is based on a specific machine learning setup and a series of experiments carried out with a selected pool of varied dictionaries. In this study we present different approaches to feature engineering and exhibit the strengths and limits of the best resulting models. We also dedicate two series of experiments to exploring the scalability of our models with regard to the processed documents and the employed machine learning technique. Finally, we sum up this thesis by presenting the major conclusions and opening new perspectives for extending our investigations in a number of research directions for parsing entry-based documents.
Jack Bowers. 2020. Language Documentation and Standards in Digital Humanities: TEI and the documentation of Mixtepec-Mixtec. PhD thesis. École Pratique des Hautes Études.

This dissertation concerns a language documentation project covering the Mixtepec-Mixtec variety of Mixtec (ISO 639-3: mix). Mixtepec-Mixtec is an Oto-Manguean language spoken by roughly 9,000–10,000 people in San Juan Mixtepec Municipality in the Juxtlahuaca district of Oaxaca, Mexico, and by several thousand speakers living in Baja California, Tlaxiaco, and Santiago Juxtlahuaca. There are also significant populations in the United States, most notably in California, around Santa Maria and Oxnard, as well as in Oregon, Florida, and Arkansas. The core facets of the work are: the creation of a body of linguistic resources for the MIX language and community; the evaluation of the current tools, standards and practices used in language documentation; and an account of how the TEI and related XML technologies can be used as the primary encoding, metadata, and annotation format for multi-dimensional linguistic projects, including under-resourced languages. The concrete resources produced are: a multilingual TEI dictionary; a collection of audio recordings published and archived on Harvard Dataverse; a corpus of texts derived from a combination of spoken language transcriptions and texts encoded and annotated in TEI; as well as linguistic and lexicographic descriptions and analyses of the Mixtepec-Mixtec language. Due to the array of different data and resources produced, this project has components that fall equally within the fields of digital humanities, language documentation, language description and corpus linguistics. Because of this overlapping relevance, and in the process of attempting to carry out this work in line with best practices in each sub-field, this work addresses the need to further bring together the intersecting interests, technologies, practices and standards relevant to, and used in, each of these related fields.
Loïc Grobol. 2020. Coreference resolution for spoken French. PhD thesis. Université Sorbonne Nouvelle - Paris 3.

A coreference chain is the set of linguistic expressions, or mentions, that refer to the same entity or discourse object in a given document. Coreference resolution consists in detecting all the mentions in a document and partitioning their set into coreference chains. Coreference chains play a central role in the consistency of documents and interactions, and their identification has applications in many other fields of natural language processing that rely on an understanding of language, such as information extraction, question answering or machine translation. Natural language processing systems that perform this task exist for many languages, but none for French, which until recently suffered from a lack of suitable annotated resources, and none for spoken language. In this thesis, we aim to fill this gap by designing a coreference resolution system for spoken French. To this end, we propose a knowledge-poor system based on an end-to-end neural network architecture, which obviates the need for the preprocessing pipelines common in existing systems, while maintaining performance comparable to the state of the art. We then propose extensions to that baseline, augmenting our system with external knowledge obtained from resources and preprocessing tools designed for written French. Finally, we propose a new standard representation for coreference annotation in corpora of written and spoken languages, and demonstrate its use in a new version of ANCOR, the first coreference corpus of spoken French.

Journal articles

Xinying Chen and Kim Gerdes. 2020. Dependency Distances and Their Frequencies in Indo-European Language. Journal of Quantitative Linguistics, pages 1–20. Taylor & Francis (Routledge).

The present study investigates the relationship between two features of dependencies, namely dependency distances and dependency frequencies. The study is based on the analysis of a parallel dependency treebank that includes 10 Indo-European languages. Two corresponding random dependency treebanks are generated as baselines for comparison. After computing the values of dependency distances and their frequencies in these treebanks, for each language we fit four functions, namely quadratic, exponent, logarithm, and power-law functions, to its original and random datasets. The preliminary result shows that there is a relation between the two dependency features for all 10 Indo-European languages. The relation can be further formalized as a power-law function, which can distinguish the observed data from randomly generated datasets.
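The power-law fit described in this abstract can be sketched in a few lines: a power law f = a·d^b is linear in log-log space, so an ordinary least-squares line fit on the logged values recovers both parameters. The (distance, frequency) pairs below are toy values, not the paper's treebank counts.

```python
import math

# Toy (distance, frequency) pairs standing in for treebank counts.
# A power law f = a * d**b becomes a line in log-log space:
#   log f = log a + b * log d,
# so fitting a least-squares line to the logs recovers a and b.
pairs = [(d, 1000.0 * d ** -1.5) for d in range(1, 9)]  # exact power law, b = -1.5

xs = [math.log(d) for d, _ in pairs]
ys = [math.log(f) for _, f in pairs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)
print(round(a, 3), round(b, 3))  # recovers a = 1000.0 and b = -1.5
```

On real data the observed and random treebanks would yield visibly different (a, b) estimates, which is what lets the fitted function distinguish the two.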
Laurent Romary. 2020. Découpler gestion des manuscrits de publication et évaluation par les pairs : la plateforme de gestion de revues Épisciences. I2D -- Information, données & documents. A.D.B.S.

Based on an original model, the Épisciences platform, which currently hosts 15 journals, provides a complete tool for managing a journal, hosting it, and disseminating its contents. It hosts open-access journals (epi-journals) and handles the submission of articles to these journals via deposit in an open archive such as HAL. Documentation professionals play a decisive supporting role here.
Andrea Bertino, Luca Foppiano, Laurent Romary and Pierre Mounier. 2020. Leveraging Concepts in Open Access Publications. Journal of Data Mining and Digital Humanities 2019. Episciences.org.

This paper addresses the integration of a Named Entity Recognition and Disambiguation (NERD) service within a group of open access (OA) publishing digital platforms and considers its potential impact on both research and scholarly publishing. The software powering this service, called entity-fishing, was initially developed by Inria in the context of the EU FP7 project CENDARI and provides automatic entity recognition and disambiguation using the Wikipedia and Wikidata data sets. The application is distributed with an open-source licence, and it has been deployed as a web service in DARIAH's infrastructure hosted by the French Huma-Num. In the paper, we focus on the specific issues related to its integration on five OA platforms specialized in the publication of scholarly monographs in the social sciences and humanities (SSH), as part of the work carried out within the EU H2020 project HIRMEOS (High Integration of Research Monographs in the European Open Science infrastructure). In the first section, we give a brief overview of the current status and evolution of OA publications, considering specifically the challenges that OA monographs are encountering. In the second part, we show how the HIRMEOS project aims to face these challenges by optimizing five OA digital platforms for the publication of monographs from the SSH and ensuring their interoperability. In sections three and four we give a comprehensive description of the entity-fishing service, focusing on its concrete applications in real use cases, together with some further ideas on how to exploit the generated annotations. We show that entity-fishing annotations can improve both the research and the publishing process. In the final section, we briefly present further possible application scenarios that could be made available through infrastructural projects.
Luca Foppiano and Laurent Romary. 2020. Entity-fishing: a DARIAH entity recognition and disambiguation service. Journal of the Japanese Association for Digital Humanities 5, pages 22–60. Japanese Association for Digital Humanities.

This paper presents an attempt to provide a generic named-entity recognition and disambiguation module (NERD) called entity-fishing as a stable online service that demonstrates the possible delivery of sustainable technical services within DARIAH, the European digital research infrastructure for the arts and humanities. Deployed as part of the national infrastructure Huma-Num in France, this service provides an efficient state-of-the-art implementation coupled with standardised interfaces allowing easy deployment in a variety of potential digital humanities contexts. Initially developed in the context of the FP7 EU project CENDARI, the software was well received by the user community and continued to be further developed within the H2020 HIRMEOS project, where several open access publishers have integrated the service into their collections of published monographs as a means to enhance retrieval and access. entity-fishing implements entity extraction as well as disambiguation against Wikipedia and Wikidata entries. The service is accessible through a REST API, which offers easy and seamless integration, a language-independent and stable convention, and a widely used service-oriented architecture (SOA) design. Input and output data are carried over a query data model with a defined structure, providing the flexibility to support the processing of partially annotated text or the repartition of text over several queries. The interface implements a variety of functionalities, such as language recognition, sentence segmentation and modules for accessing and looking up concepts in the knowledge base. The API itself integrates more advanced contextual parametrisation and ranked outputs, allowing for resilient integration in various possible use cases. The entity-fishing API has been used as a concrete use case to draft the experimental stand-off proposal, which has been submitted for integration into the TEI guidelines.
The representation is also compliant with the Web Annotation Data Model (WADM). In this paper we aim to describe the functionalities of the service as a reference contribution to the subject of web-based NERD services: we detail the workflow from input to output and unpack each building block in the processing flow. In addition, with a more academic approach, we provide a transversal schema of the different components, taking into account non-functional requirements in order to facilitate the discovery of bottlenecks, hotspots and weaknesses. We also describe the underlying knowledge base, which is built on the basis of Wikipedia and Wikidata content. We conclude the paper by presenting our solution for the service deployment: how and which resources were allocated. The service has been in production since Q3 2017, and was extensively used by the H2020 HIRMEOS partners during the integration with the publishing platforms.
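The query data model described in this abstract can be illustrated with a minimal client-side sketch; note that the field names and the `/service/disambiguate` endpoint path below are assumptions to be checked against the deployed instance's documentation, not an authoritative client.

```python
import json

# Hedged sketch of an entity-fishing query payload; the field names and the
# endpoint path mentioned below are assumptions, not a verified client API.
def build_query(text: str, lang: str = "en") -> dict:
    return {
        "text": text,
        "language": {"lang": lang},  # may be omitted to let the service detect the language
        "entities": [],              # optional pre-annotated mentions (partially annotated text)
    }

query = build_query("Paris is the capital of France.")
print(json.dumps(query, indent=2))

# A client would then POST the payload as a form field, for example:
#   requests.post(base_url + "/service/disambiguate",
#                 files={"query": (None, json.dumps(query))})
```

Because input and output share one query data model, a long document can be split into several such payloads and the per-query annotations merged afterwards.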

Conference proceedings

Hila Gonen, Ganesh Jawahar, Djamé Seddah and Yoav Goldberg. 2020. Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 538–555. Association for Computational Linguistics. Online.

The problem of comparing two bodies of text and searching for words that differ in their usage between them arises often in digital humanities and computational social science. This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large. However, these methods often require extensive filtering of the vocabulary to perform well, and, as we show in this work, produce unstable, and hence less reliable, results. We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word. The method is simple, interpretable and stable. We demonstrate its effectiveness in 9 different setups, considering different corpus splitting criteria (age, gender and profession of tweet authors, time of tweet) and different languages (English, French and Hebrew).
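The neighbor-based idea can be sketched with toy vectors: build one embedding space per corpus, take each word's k nearest neighbors in each space, and flag words whose two neighbor sets barely overlap. The vocabulary and 2-d vectors below are illustrative stand-ins for embeddings trained on real corpora, not the paper's data.

```python
import math

# Neighbor-overlap sketch: no vector space alignment is needed because only
# each word's nearest-neighbor *set* within its own space is compared.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def neighbors(word, space, k):
    others = [w for w in space if w != word]
    return set(sorted(others, key=lambda w: -cosine(space[word], space[w]))[:k])

def stability(word, space1, space2, k=2):
    # larger overlap => more stable usage across the two corpora
    return len(neighbors(word, space1, k) & neighbors(word, space2, k))

# Toy spaces: "cell" drifts toward "phone" in corpus B, "biology" stays put.
space_a = {"cell": (1.0, 0.0), "biology": (0.9, 0.1), "gene": (0.85, 0.15),
           "membrane": (0.8, 0.2), "phone": (0.0, 1.0)}
space_b = {"cell": (0.0, 1.0), "biology": (0.9, 0.1), "gene": (0.85, 0.15),
           "membrane": (0.8, 0.2), "phone": (0.1, 1.0)}

print(stability("cell", space_a, space_b))     # 0: neighbor sets disjoint (usage changed)
print(stability("biology", space_a, space_b))  # 1: neighbors shared (usage stable)
```

Ranking the whole vocabulary by this score surfaces the words whose usage changed most, without the alignment step that makes projection-based methods unstable.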
Gaël Guibon, Marine Courtin, Kim Gerdes and Bruno Guillaume. 2020. When Collaborative Treebank Curation Meets Graph Grammars. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 5291–5300. European Language Resources Association. Marseille, France.

In this paper we present Arborator-Grew, a collaborative annotation tool for treebank development. Arborator-Grew combines the features of two preexisting tools: Arborator and Grew. Arborator is a widely used collaborative graphical online dependency treebank annotation tool. Grew is a tool for graph querying and rewriting specialized in structures needed in NLP, i.e. syntactic and semantic dependency trees and graphs. Grew also has an online version, Grew-match, where all Universal Dependencies treebanks in their classical, deep and surface-syntactic flavors can be queried. Arborator-Grew is a complete redevelopment and modernization of Arborator, replacing its own internal database storage by a new Grew API, which adds a powerful query tool to Arborator's existing treebank creation and correction features. This includes complex access control for parallel expert and crowd-sourced annotation, tree comparison visualization, and various exercise modes for teaching and training of annotators. Arborator-Grew opens up new paths of collectively creating, updating, maintaining, and curating syntactic treebanks and semantic graph banks.
Pedro Javier Ortiz Suárez, Yoann Dupont, Gaël Lejeune and Tian Tian. 2020. SinNer@Clef-Hipe2020 : Sinful adaptation of SotA models for Named Entity Recognition in French and German. In CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. Thessaloniki / Virtual, Greece.

In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on Named Entity Processing in old newspapers. The challenge proposed various tasks for three languages; among them, we focused on Named Entity Recognition in French and German texts. The best system we proposed ranked third for these two languages; it uses FastText embeddings and ELMo language models (FrELMo and German ELMo). We show that combining several word representations enhances the quality of the results for all NE types and that sentence segmentation has an important impact on the results.
Robin Algayres, Mohamed Salah Zaiem, Benoît Sagot and Emmanuel Dupoux. 2020. Evaluating the Reliability of Acoustic Speech Embeddings. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH 2020), pages 4621–4625.

Speech embeddings are fixed-size acoustic representations of variable-length speech sequences. They are increasingly used for a variety of tasks ranging from information retrieval to unsupervised term discovery and speech segmentation. However, there is currently no clear methodology to compare or optimize the quality of these embeddings in a task-neutral way. Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods, ranging from supervised to fully unsupervised, and using different loss functions (autoencoders, correspondence autoencoders, Siamese networks). Then we use ABX and MAP to predict performance on a new downstream task: the unsupervised estimation of the frequencies of speech segments in a given corpus. We find that overall, ABX and MAP correlate with one another and with frequency estimation. However, substantial discrepancies appear in the fine-grained distinctions across languages and/or embedding methods. This makes it unrealistic at present to propose a task-independent silver-bullet method for computing the intrinsic quality of speech embeddings. There is a need for more detailed analysis of the metrics currently used to evaluate such embeddings.
Tanti Kristanti and Laurent Romary. 2020. DeLFT and entity-fishing : Tools for CLEF HIPE 2020 Shared Task. In CLEF 2020 - Conference and Labs of the Evaluation Forum 2696. CEUR. Thessaloniki / Virtual, Greece.

This article presents an overview of our approaches and results during our participation in the CLEF HIPE 2020 NERC-COARSE-LIT and EL-ONLY tasks for English and French. For these two tasks, we use two systems: 1) DeLFT, a deep learning framework for text processing; 2) entity-fishing, a generic named-entity recognition and disambiguation service deployed in the technical framework of Inria.
Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot and Lucia Specia. 2020. ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4668–4679. Association for Computational Linguistics. Online.

In order to simplify a sentence, human editors perform multiple rewriting transformations: they split it into several shorter sentences, paraphrase words (i.e. replace complex words or phrases with simpler synonyms), reorder components, and/or delete information deemed unnecessary. Despite this varied range of possible text alterations, current models for automatic sentence simplification are evaluated using datasets that are focused on a single transformation, such as lexical paraphrasing or splitting. This makes it impossible to understand the ability of simplification models in more realistic settings. To alleviate this limitation, this paper introduces ASSET, a new dataset for assessing sentence simplification in English. ASSET is a crowdsourced multi-reference corpus where each simplification was produced by executing several rewriting transformations. Through quantitative and qualitative experiments, we show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task. Furthermore, we motivate the need for developing better methods for automatic evaluation using ASSET, since we show that current popular metrics may not be suitable when multiple simplification transformations are performed.
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de La Clergerie, Djamé Seddah and Benoît Sagot. 2020. CamemBERT: a Tasty French Language Model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219. Online.

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models, in all languages except English, very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web-crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web-crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best-performing model, CamemBERT, reaches or improves the state of the art in all four downstream tasks.
Djamé Seddah, Farah Essaidi, Amal Fethi, Matthieu Futeral, Benjamin Muller, Pedro Javier Ortiz Suárez, Benoît Sagot and Abhishek Srivastava. 2020. Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1139–1150. Association for Computational Linguistics. Online.

We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent usage of code-switching. Made of 1500 sentences, fully annotated in morpho-syntax and Universal Dependencies syntax, with full translation at both the word and the sentence levels, this treebank is made freely available. It is supplemented with 50k unlabeled sentences collected from Common Crawl and web-crawled data using intensive data-mining techniques. Preliminary experiments demonstrate its usefulness for POS tagging and dependency parsing. We believe that what we present in this paper is useful beyond the low-resource language community. This is the first time that enough unlabeled and annotated data is provided for an emerging user-generated content dialectal language with rich morphology and code-switching, making it a challenging test-bed for the most recent NLP approaches.
Pedro Javier Ortiz Suárez, Laurent Romary and Benoît Sagot. 2020. A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1703–1714. Association for Computational Linguistics. Online.

We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.
Clémentine Fourrier. 2020. Évolution phonologique des langues et réseaux de neurones : travaux préliminaires (Sound change and neural networks: preliminary experiments ). In Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 3 : Rencontre des Étudiants Chercheurs en Informatique pour le TAL, pages 110–122. ATALA et AFCP. Nancy, France.

Cognate prediction is a key task in historical linguistics that presents a number of similarities with machine translation. However, although neural methods are now widespread in machine translation, they are still largely unused in historical linguistics. In this paper, we study the performance of neural methods (more specifically encoder-decoder networks) for the task of cognate prediction. We focus in particular on the types of data that can be used for this task, and compare the performance of statistical and neural methods. We show that sound correspondences can only be learned using cognate datasets, and that statistical and neural methods seem to have complementary strengths and weaknesses regarding what they learn about the data.
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Benoît Sagot and Djamé Seddah. 2020. Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l'hétérogénéité des données d'entrainement (CamemBERT Contextual Language Models for French: Impact of Training Data Size and Heterogeneity). In Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles, pages 54–65. ATALA et AFCP. Nancy, France.

Contextual word embeddings have become ubiquitous in Natural Language Processing. Until recently, most available models were trained on English data or on the concatenation of corpora in multiple languages. This made the practical use of models in all languages except English very limited. The recent release of monolingual versions of BERT (Devlin et al., 2019) for French established a new state-of-the-art for all evaluated tasks. In this paper, based on experiments on CamemBERT (Martin et al., 2019), we show that pretraining such models on highly variable datasets leads to better downstream performance compared to models trained on more uniform data. Moreover, we show that a relatively small amount of web crawled data (4GB) leads to downstream performances as good as a model pretrained on a corpus two orders of magnitude larger (138GB).
Murielle Fabre, Pedro Javier Ortiz Suárez, Benoît Sagot and Éric Villemonte de La Clergerie. 2020. French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus. In CMLC-8 - 8th Workshop on the Challenges in the Management of Large Corpora. Marseille, France.

This paper describes and compares the impact of different types and sizes of training corpora on language models like ELMo. By asking the fundamental question of quality versus quantity, we evaluate four French corpora for training on parsing scores, POS-tagging and named-entity recognition downstream tasks. The paper studies the relevance of a new corpus, CaBeRnet, featuring a representative range of language usage, including a balanced variety of genres (oral transcriptions, newspapers, popular magazines, technical reports, fiction, academic texts), in oral and written styles. We hypothesize that a linguistically representative and balanced corpus will allow the language model to be more efficient and representative of a given language and will therefore yield better evaluation scores on different evaluation sets and tasks.
Louis Martin, Éric Villemonte de La Clergerie, Benoît Sagot and Antoine Bordes. 2020. Controllable Sentence Simplification. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4689–4698. European Language Resources Association. Marseille, France.

Text simplification aims at making a text easier to read and understand by simplifying grammar and structure while keeping the underlying information identical. It is often considered an all-purpose generic task where the same simplification is suitable for all; however, multiple audiences can benefit from simplified text in different ways. We adapt a discrete parametrization mechanism that provides explicit control on simplification systems based on Sequence-to-Sequence models. As a result, users can condition the simplifications returned by a model on attributes such as length, amount of paraphrasing, lexical complexity and syntactic complexity. We also show that carefully chosen values of these attributes allow out-of-the-box Sequence-to-Sequence models to outperform their standard counterparts on simplification benchmarks. Our model, which we call ACCESS (as shorthand for AudienCe-CEntric Sentence Simplification), establishes the state of the art at 41.87 SARI on the WikiLarge test set, a +1.42 improvement over the best previously reported score.
Clémentine Fourrier and Benoît Sagot. 2020. Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB-2.0. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3207–3216. European Language Resources Association. Marseille, France.

Diachronic lexical information was mostly used in its natural field, historical linguistics, until recently, when promising but not yet conclusive applications to low resource languages machine translation started extending its usage to NLP. There is therefore a new need for fine-grained, large-coverage and accurate etymological lexical resources. In this paper, we propose a set of guidelines to generate such resources, for each step of the life-cycle of an etymological lexicon: creation, update, evaluation, dissemination, and exploitation. To illustrate the guidelines, we introduce EtymDB 2.0, an etymological database automatically generated from the Wiktionary, which contains 1.8 million lexemes, linked by more than 700,000 fine-grained etymological relations, across 2,536 living and dead languages. We also introduce use cases for which EtymDB 2.0 could represent a key resource, such as phylogenetic tree generation, low resource machine translation and medieval languages study.
Gaël Guibon and Benoît Sagot. 2020. OFrLex: A Computational Morphological and Syntactic Lexicon for Old French. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3217–3225. European Language Resources Association. Marseille, France.

In this paper we describe our work on the development and enrichment of OFrLex, a freely available, large-coverage morphological and syntactic Old French lexicon. We rely on several heterogeneous language resources to extract structured and exploitable information. The extraction follows a semi-automatic procedure with substantial manual steps to respond to difficulties encountered while aligning lexical entries from distinct language resources. OFrLex aims at improving natural language processing tasks on Old French such as part-of-speech tagging and dependency parsing. We provide quantitative information on OFrLex and discuss its reliability. We also describe and evaluate a semi-automatic, word-embedding-based lexical enrichment process aimed at increasing the accuracy of the resource. Results of this extension technique will be manually validated in the near future, a step that will take advantage of OFrLex's viewing, searching and editing interface, which is already accessible online.
Fahad Khan, Laurent Romary, Ana Salgado, Jack Bowers, Mohamed Khemakhem and Toma Tasovac. 2020. Modelling Etymology in LMF/TEI: The Grande Dicionário Houaiss da Língua Portuguesa Dictionary as a Use Case. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3172–3180. European Language Resources Association. Marseille, France.

In this article, we will introduce two of the new parts of the new multi-part version of the Lexical Markup Framework (LMF) ISO standard, namely Part 3 of the standard (ISO 24613-3), which deals with etymological and diachronic data, and Part 4 (ISO 24613-4), which consists of a TEI serialisation of all of the prior parts of the model. We will demonstrate the use of both standards by describing the LMF encoding of a small number of examples taken from a sample conversion of the reference Portuguese dictionary Grande Dicionário Houaiss da Língua Portuguesa, part of a broader experiment comprising the analysis of different, heterogeneously encoded, Portuguese lexical resources. We present the examples in the Unified Modelling Language (UML) and also in a couple of cases in TEI.
Pedro Javier Ortiz Suárez, Yoann Dupont, Benjamin Muller, Laurent Romary and Benoît Sagot. 2020. Establishing a New State-of-the-Art for French Named Entity Recognition. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4631–4638. European Language Resources Association. Marseille, France.

The French TreeBank developed at the University Paris 7 is the main source of morphosyntactic and syntactic annotations for French. However, it does not include explicit information related to named entities, which are among the most useful types of information for several natural language processing tasks and applications. Moreover, no large-scale French corpus with named entity annotations contains referential information, which complements the type and the span of each mention with an indication of the entity it refers to. We have manually annotated the French TreeBank with such information, after an automatic pre-annotation step. We sketch the underlying annotation guidelines and we provide a few figures about the resulting annotations.
Clémentine Fourrier and Benoît Sagot. 2020. Comparing Statistical and Neural Models for Learning Sound Correspondences. In LT4HALA 2020 - First Workshop on Language Technologies for Historical and Ancient Languages. Marseille, France.

Cognate prediction and proto-form reconstruction are key tasks in computational historical linguistics that rely on the study of sound change regularity. Solving these tasks appears to be very similar to machine translation, though methods from that field have barely been applied to historical linguistics. Therefore, in this paper, we investigate the learnability of sound correspondences between a proto-language and daughter languages for two machine-translation-inspired models, one statistical, the other neural. We first carry out our experiments on plausible artificial languages, without noise, in order to study the role of each parameter on the algorithms' respective performance under almost perfect conditions. We then study real languages, namely Latin, Italian and Spanish, to see if those performances generalise well. We show that both model types manage to learn sound changes despite data scarcity, although the best performing model type depends on several parameters such as the size of the training data, the ambiguity, and the prediction direction.
Simon Gabay, Lucie Rondeau Du Noyer and Mohamed Khemakhem. 2020. Selling autograph manuscripts in 19th c. Paris: digitising the Revue des Autographes. In IX Convegno AIUCD. Milan, Italy.

In Paris, the manuscript market emerged in the early 1820s. Fixed-price catalogues and auction catalogues are regularly published, describing each document in detail. Such descriptions being highly formalised, it is possible to extract and structure them (almost) automatically, and thus create a database of manuscripts sold in 19th-c. Paris.

Communications

Mohamed Khemakhem, Simon Gabay, Béatrice Joyeux-Prunel, Laurent Romary, Léa Saint-Raymond and Lucie Rondeau Du Noyer. 2020. Information Extraction Workflow for Digitised Entry-based Documents. In DARIAH Annual event 2020. Zagreb / Virtual, Croatia.

Book chapters

Benoît Sagot. 2020. A new PIE root *h1er ‘(to be/become) dark red'. In Loanwords and Substrata 164.

Romain Garnier and Benoît Sagot. 2020. New results on a centum substratum in Greek: the Lydian connection. In Loanwords and Substrata 164.

Jennifer Edmond, Frank Fischer, Laurent Romary and Toma Tasovac. 2020. 9. Springing the Floor for a Different Kind of Dance. In Digital Technology and the Practices of Humanities Research, pages 207–234. Open Book Publishers.

Jennifer Edmond and Laurent Romary. 2020. 3. Academic Publishing. In Digital Technology and the Practices of Humanities Research, pages 49–80. Open Book Publishers.

Anne Baillot. 2020. Zahlenwahn oder Textliebe? Digitale Philologie als Disziplin und als Weltanschauung. In Machines/Maschinen. Les machines dans l'espace germanique: de l'automate de Kempelen à Kraftwerk. Presses Universitaires de Rennes.

Tech reports

Floriane Chiffoleau. 2020. Rapport d'avancement sur le projet DAHN (avec le soutien du MESRI). Technical report.

Other

Laurent Romary. 2020. Eléments de sciences ouvertes.

Lucas Terriel. 2020. Le saviez-vous ? Les répertoires de notaires ne sont pas seulement des images numérisées !

This post provides an overview of the data associated with the documents of LectAuRep (Automatic reading of notarial registers), a project coordinated by Inria (ALMAnaCH project-team) and the French National Archives, which consists in applying handwritten text recognition techniques to notarial registers. This post is part of a larger reflection on the creation of a TEI pivot format to centralize the metadata associated with documents and those generated during image processing with the eScriptorium transcription platform.
Jean-Damien Généro. 2020. Le corpus des Ouvriers des deux mondes : des images et des URLs.

While archival documents occupy a preponderant place in the Time us project, they do not make up the whole of its documentation. Printed materials are also present, in the form of three substantial dossiers: the collection of the early Lyon press, various printed matter on the textile industry in nineteenth-century France, and the corpus of the Ouvriers des deux mondes. The Ouvriers des deux mondes are sociological surveys divided into 3 series and 126 monographs. Initiated by the sociologist Frédéric Le Play (1806-1882), the publication was carried out by the Société internationale des études pratiques d'économie sociale from 1857 to 1928 and comprises a total of 13 volumes, all of which can now be consulted on the Internet Archive website. In this post, we look at the transcription files of these volumes and at the link between them and the original digitised images. The lse od2m script, written by Alix Chagué, had automatically segmented and transcribed the images, then encoded and structured the resulting raw text in XML-TEI; the output consisted of 13 XML files. These "source" files were then split into 222 XML files corresponding to as many logical divisions of the volumes: the monographs, of course, but also the introductions, tables of contents and other paratextual elements. Verification steps made it possible to reduce the number of files to 192.
Alix Chagué, Lucas Terriel and Laurent Romary. 2020. Des images au texte : LECTAUREP, un projet de reconnaissance automatique d'écriture.

Laurent Romary. 2020. Les données de la recherche.

As part of Open Access Week, a presentation of current developments in research data management, notably in the context of the Ministry's open science plan.
Laurent Romary. 2020. Multilingual content management and standards with a view on AI developments.

Laurent Romary. 2020. An editorial and technical journey into Post Publication Peer Review (PPPR).

Laurent Romary. 2020. TEI guidelines: born to be open.

Open science has never been so high on the research agendas, and this is true in all fields, ranging from the so-called hard sciences to the humanities. In this respect, those who have been dealing with the TEI guidelines for years, whether as users or designers of the standard, have experienced an environment which has always been open by construction and which fosters openness for projects based upon its principles. We outline the main issues related to open science in the current scholarly landscape, whether political or technical, and show the various aspects where the TEI environment has been seminal in setting up an open agenda that may enlighten the humanities at large in terms of good practices for, e.g., managing, documenting or disseminating scholarly sources and methods.

Preprints

Benjamin Muller, Antonis Anastasopoulos, Benoît Sagot and Djamé Seddah. 2020. When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models. Preprint.

Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied for unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high-resource languages, whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. Transliterating those languages very significantly improves the ability of large-scale multilingual language models on downstream tasks.
Erzsébet Tóth-Czifra and Laurent Romary. 2020. The Heritage Data Reuse Charter: from principles to research workflows. Preprint.

There is a growing need to establish domain- or discipline-specific approaches to research data sharing workflows. A defining feature of data and data workflows in the arts and humanities domain is their dependence on cultural heritage sources hosted and curated in museums, libraries, galleries and archives. A major difficulty when scholars interact with heritage data is that the nature of the cooperation between researchers and Cultural Heritage Institutions (henceforth CHIs) is often constrained by structural and legal challenges, but even more by uncertainties as to the expectations of both parties. The Heritage Data Reuse Charter aims to address these by designing a common environment that will enable all the relevant actors to work together to connect and improve access to heritage data and to make transactions related to the scholarly use of cultural heritage data more visible and transparent. As a first step, a wide range of stakeholders in the cultural heritage and research sectors agreed upon a set of generic principles, summarized in the Mission Statement of the Charter, that can serve as a baseline governing the interactions between CHIs, researchers and data centres. This was followed by a long and thorough validation process related to these principles, through surveys and workshops. As a second step, we now put forward a questionnaire template tool that helps researchers and CHIs translate the six core principles into specific research project settings. It contains questions about access to data, provenance information, preferred citation standards, hosting responsibilities, etc., on the basis of which the parties can arrive at mutual reuse agreements that could serve as a starting point for FAIR-by-construction data management, right from the project planning/application phase. The questionnaire template and the resulting mutual agreements can be flexibly applied to projects of different scales and in platform-independent ways. Institutions can embed them into their own exchange protocols, while researchers can add them to their Data Management Plans. As such, they can show evidence of responsible and fair handling of cultural heritage data, and of fair (but also FAIR) research data management practices that are based on partnership with the holding institution.

2019

Journal articles

Romain Garnier and Benoît Sagot. 2019. Metathesis of Proto-Indo-European Sonorants. Münchener Studien zur Sprachwissenschaft 73, pages 29–53. Verlag J.H. Röll GmbH.

Detlef Reineke and Laurent Romary. 2019. Bridging the gap between SKOS and TBX. edition - Die Fachzeitschrift für Terminologie 19. Deutscher Terminologie-Tag e.V. (DTT).

This article provides an in-depth comparison and a proposal for mapping between Simple Knowledge Organization System (SKOS) and TermBase eXchange (TBX), two important exchange standards within the knowledge and terminology landscape. The attempt to develop an interface or conversion routine between SKOS and TBX is rooted in a strong demand in the language and knowledge industries for resource leverage, and is based on the premise that the two formalisms are governed by similar data models, namely the description of concepts (rather than words).
Laurent Romary and Charles Riondet. 2019. Towards multiscale archival digital data. Umanistica digitale. AIUCD - Associazione per l'Informatica Umanistica e la Cultura Digitale.

In this paper, we would like to present some ideas on the use of archival standards in various contexts that exemplify the complexity of such standards and provide users with innovative ways to handle EAD content. Our main idea is that researchers, cultural heritage institutions, archival portals and standards maintenance bodies could greatly benefit from a multiscale modelling of archival data, but also from multiscale representations and documentations. A first step is on its way to being achieved in the domain of the management of heterogeneous archival sources in one single environment, namely a federated portal, as in EHRI. We built a methodology based on a specification and customisation method inspired by the long-standing experience of the Text Encoding Initiative (TEI) community. In the TEI framework, one has the possibility of defining project-specific subsets or extensions of the TEI guidelines while maintaining both the technical (XML schemas) and editorial (documentation) specification within a single framework. Using the same framework for EAD data allows us to express precise content-oriented rules combined with some interesting possibilities of integrating the human-readable documentation in the validation process.

Conference proceedings

Laurent Romary. 2019. The place of lexicography in (computer) science. In The Future of Academic Lexicography: Linguistic Knowledge Codification in the Era of Big Data and AI. Leiden, Netherlands.

Luca Foppiano, Laurent Romary, Masashi Ishii and Mikiko Tanifuji. 2019. Automatic Identification and Normalisation of Physical Measurements in Scientific Literature. In DocEng '19 - ACM Symposium on Document Engineering 2019, pages 1–4. ACM Press. Berlin, Germany.

We present Grobid-quantities, an open-source application for extracting and normalising measurements from scientific and patent literature. Tools of this kind, aiming to understand and make unstructured information accessible, represent the building blocks for large-scale Text and Data Mining (TDM) systems. Grobid-quantities is a module built on top of Grobid [6] [13], a machine learning framework for parsing and structuring PDF documents. Designed to process large quantities of data, it provides a robust implementation accessible in batch mode or via a REST API. The machine learning engine architecture follows the cascade approach, where each model is specialised in the resolution of a specific task. The models are trained using the CRF (Conditional Random Field) algorithm [12] for extracting quantities (atomic values, intervals and lists), units (such as length, weight) and different value representations (numeric, alphabetic or scientific notation). Identified measurements are normalised according to the International System of Units (SI). Thanks to its stable recall and reliable precision, Grobid-quantities has been integrated as the measurement-extraction engine in various TDM projects, such as Marve (Measurement Context Extraction from Text), for extracting semantic measurements and meaning in Earth Science [10]. At the National Institute for Materials Science in Japan (NIMS), it is used in an ongoing project to discover new superconducting materials. Normalised materials characteristics (such as critical temperature, pressure) extracted from scientific literature are a key resource for materials informatics (MI) [9].
Benjamin Muller, Benoit Sagot and Djamé Seddah. 2019. Enhancing BERT for Lexical Normalization. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 297–306. Association for Computational Linguistics. Hong Kong, China.

Language model-based pre-trained representations have become ubiquitous in natural language processing. They have been shown to significantly improve the performance of neural models on a great variety of tasks. However, it remains unclear how useful those general models can be in handling non-canonical text. In this article, focusing on User Generated Content (UGC) in a resource-scarce scenario, we study the ability of BERT (Devlin et al., 2018) to perform lexical normalisation. Our contribution is simple: by framing lexical normalisation as a token prediction task, by enhancing its architecture and by carefully fine-tuning it, we show that BERT can be a competitive lexical normalisation model without the need of any UGC resources aside from 3,000 training sentences. To the best of our knowledge, this is the first work on adapting and analysing the ability of this model to handle noisy UGC data.
Hervé Bohbot, Francesca Frontini, Fahad Khan, Mohamed Khemakhem and Laurent Romary. 2019. Nénufar: Modelling a Diachronic Collection of Dictionary Editions as a Computational Lexical Resource. In ELEX 2019: smart lexicography. Sintra, Portugal.

The Petit Larousse Illustré (PLI) is a monolingual French dictionary which has been published every year since the 1906 edition and which is therefore a fundamental testimony of the evolution of the French language. As a consequence of the pre-1948 editions of the PLI entering the public domain in 2018, the Nénufar (“Nouvelle édition numérique de fac-similés de référence”) project was launched at the Praxiling laboratory in Montpellier with the aim of digitising these editions and making them available electronically. The project is still ongoing; various selected editions per decade are going to be fully digitised (so far the 1906, 1924 and 1925 editions have been completed), with changes backtracked and dated to specific years.
Lucie Rondeau Du Noyer, Simon Gabay, Mohamed Khemakhem and Laurent Romary. 2019. Scaling up Automatic Structuring of Manuscript Sales Catalogues. In TEI 2019: What is text, really? TEI and beyond. Graz, Austria.

Manuscript Sales Catalogues (MSC) are highly important for authenticating documents and studying the reception of authors. Their regular publication throughout Europe since the beginning of the 19th c. has consequently raised interest in scaling up the means for automatically structuring their contents. Following successful first encoding tests with GROBID-Dictionaries [1,2] on a single MSC collection [3], we aim in this paper to present the results of more advanced tests of the system’s capacity to handle a larger corpus with MSCs of different dealers, and therefore multiple layouts.
Fernando Alva-Manchego, Louis Martin, Carolina Scarton and Lucia Specia. 2019. EASSE: Easier Automatic Sentence Simplification Evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 49–54. Association for Computational Linguistics. Hong Kong, China.

We introduce EASSE, a Python package aiming to facilitate and standardise automatic evaluation and comparison of Sentence Simplification (SS) systems. EASSE provides a single access point to a broad range of evaluation resources: standard automatic metrics for assessing SS outputs (e.g. SARI), word-level accuracy scores for certain simplification transformations, reference-independent quality estimation features (e.g. compression ratio), and standard test data for SS evaluation (e.g. TurkCorpus). Finally, EASSE generates easy-to-visualise reports on the various metrics and features above and on how a particular SS output fares against reference simplifications. Through experiments, we show that these functionalities allow for better comparison and understanding of the performance of SS systems.
Mathilde Regnault, Sophie Prévost and Éric Villemonte de La Clergerie. 2019. Challenges of language change and variation: towards an extended treebank of Medieval French. In TLT 2019 - 18th International Workshop on Treebanks and Linguistic Theories. Paris, France.

In order to automatically extend a treebank of Old French (9th–13th c.) with new texts in Old and Middle French (14th–15th c.), we need to adapt tools for syntactic annotation. However, these stages of French are subject to great variation, and parsing historical texts remains an issue. We chose to adapt a symbolic system, the French Metagrammar (FRMG), and to develop a lexicon comparable to the Lefff lexicon for Old and Middle French. The final goal of our project is to model the evolution of the language through the whole period of Medieval French (9th–15th c.).
Benoit Crabbé, Murielle Fabre and Christophe Pallier. 2019. Variable beam search for generative neural parsing and its relevance for the analysis of neuro-imaging signal. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1150–1160. Association for Computational Linguistics. Hong Kong, China.

This paper describes a method of variable beam size inference for Recurrent Neural Network Grammars (RNNG), drawing inspiration from sequential Monte Carlo methods such as particle filtering. The paper studies the relevance of such methods for speeding up the computations of direct generative parsing with RNNG. But it also studies the potential cognitive interpretation of the underlying representations built by the search method (beam activity) through analysis of neuro-imaging signal.
Géraldine Walther and Benoît Sagot. 2019. Morphological complexities. In 16th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Florence, Italy.

Jack Bowers, Mohamed Khemakhem and Laurent Romary. 2019. TEI Encoding of a Classical Mixtec Dictionary Using GROBID- Dictionaries. In ELEX 2019: Smart Lexicography. Sintra, Portugal.

This paper presents the application of GROBID-Dictionaries (Khemakhem et al. 2017, Khemakhem et al. 2018a, Khemakhem et al. 2018b, Khemakhem et al. 2018c), an open-source machine learning system for automatically structuring print dictionaries in digital format into TEI (Text Encoding Initiative), to a historical lexical resource of Colonial Mixtec, 'Voces del Dzaha Dzahui', published by the Dominican fray Francisco Alvarado in the year 1593. The GROBID-Dictionaries application was applied to a reorganized and modernized version of the historical resource published by Jansen and Perez Jiménez (2009). The TEI dictionary produced will be integrated into a language documentation project dealing with Mixtepec-Mixtec (ISO 639-3: mix) (Bowers & Romary, 2017, 2018a, 2018b), an under-resourced indigenous language native to the Juxtlahuaca district of Oaxaca, Mexico.
Marco Dinarelli and Loïc Grobol. 2019. Modèles neuronaux hybrides pour la modélisation de séquences : le meilleur de trois mondes. In TALN-RECITAL 2019 - 26ème Conférence sur le Traitement Automatique des Langues Naturelles. Toulouse, France.

We propose a neural architecture combining the main characteristics of the most successful neural models of recent years: bidirectional RNNs, encoder-decoder architectures, and the Transformer model. Evaluation on three sequence labelling tasks yields results that are close to the state of the art for all tasks, and better for some of them, showing the pertinence of this hybrid architecture for this kind of task.
Loïc Grobol. 2019. Neural Coreference Resolution with Limited Lexical Context and Explicit Mention Detection for Oral French. In Second Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC19). Minneapolis, United States.

We propose an end-to-end coreference resolution system obtained by adapting neural models that have recently improved the state of the art on the OntoNotes benchmark, making them applicable to other paradigms for this task. We report the performance of our system on ANCOR, a corpus of transcribed oral French, for which it constitutes a new baseline with proper evaluation.
Benoît Sagot. 2019. Développement d'un lexique morphologique et syntaxique de l'ancien français. In 26ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN). Toulouse, France.

In this paper we describe our work on the development of a large-scale morphological and syntactic lexicon of Old French for natural language processing. We rely on dictionary and lexical resources, from which the extraction of structured and exploitable information required specific developments. In addition, matching information from these different sources posed difficulties. We provide quantitative information on the resulting lexicon, and discuss its reliability in its current version and the prospects for improvement allowed by the existence of a first version, in particular through the automatic analysis of textual data.
Pedro Javier Ortiz Suárez, Benoît Sagot and Laurent Romary. 2019. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache. Cardiff, United Kingdom.

Common Crawl is a considerably large, heterogeneous multilingual corpus comprising documents crawled from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files, each of which contains many documents written in a wide variety of languages. Even though each document has a metadata block associated with it, this metadata lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can easily be reapplied to any kind of heterogeneous corpus and parameterised for a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.
Mathilde Regnault. 2019. Adaptation d'une métagrammaire du français contemporain au français médiéval. In TALN-RECITAL 2019 - 26e édition de la conférence TALN (Traitement Automatique des Langues Naturelles) et 21e édition de la conférence jeunes chercheur·euse·s RECITAL. Toulouse, France.

Adapting an existing metagrammar for Contemporary French to Medieval French. Medieval French is characterized by strong language variation. Our purpose is to extend a corpus of Old French annotated with dependency syntax with new texts of this period, and to add texts in Middle French. In order to achieve this, we want to adapt existing tools instead of training a parser with annotated data. In this article, we present the state of the art for this project and our solution: adapting the French Metagrammar (FRMG) to earlier states of the language.
Ganesh Jawahar, Benoît Sagot and Djamé Seddah. 2019. What does BERT learn about the structure of language? In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy.

BERT is a recent language representation model that has surprisingly performed well in diverse language understanding benchmarks. This result indicates the possibility that BERT networks capture structural information about language. In this work, we provide novel support for this claim by performing a series of experiments to unpack the elements of English language structure learned by BERT. We first show that BERT's phrasal representation captures phrase-level information in the lower layers. We also show that BERT's intermediate layers encode a rich hierarchy of linguistic information, with surface features at the bottom, syntactic features in the middle and semantic features at the top. BERT turns out to require deeper layers when long-distance dependency information is required, e.g. to track subject-verb agreement. Finally, we show that BERT representations capture linguistic information in a compositional way that mimics classical, tree-like structures.
Laurent Romary, Mohamed Khemakhem, Fahad Khan, Jack Bowers, Nicoletta Calzolari, Monte George, Mandy Pet and Piotr Bański. 2019. LMF Reloaded. In AsiaLex 2019: Past, Present and Future. Istanbul, Turkey.

Lexical Markup Framework (LMF), or ISO 24613 [1], is a de jure standard that provides a framework for modelling and encoding lexical information in retrodigitised print dictionaries and NLP lexical databases. An in-depth review is currently underway within the standardisation subcommittee ISO-TC37/SC4/WG4 to find a more modular, flexible and durable follow-up to the original LMF standard published in 2008. In this paper we present some of the major improvements which have so far been implemented in the new version of LMF.
Anas Fahad Khan, Hervé Bohbot, Francesca Frontini, Mohamed Khemakhem and Laurent Romary. 2019. Historical Dictionaries as Digital Editions and Connected Graphs: the Example of Le Petit Larousse Illustré. In Digital Humanities 2019. Utrecht, Netherlands.

Marco Dinarelli and Loïc Grobol. 2019. Seq2Biseq: Bidirectional Output-wise Recurrent Neural Networks for Sequence Modelling. In CICLing 2019 - 20th International Conference on Computational Linguistics and Intelligent Text Processing. La Rochelle, France.

During the last couple of years, Recurrent Neural Networks (RNNs) have reached state-of-the-art performance on most sequence modelling problems. In particular, the sequence-to-sequence model and the neural CRF have proved to be very effective in this domain. In this article, we propose a new RNN architecture for sequence labelling, leveraging gated recurrent layers to take arbitrarily long contexts into account, and using two decoders operating forward and backward. We compare several variants of the proposed solution and their performance to the state of the art. Most of our results are better than the state of the art or very close to it, and thanks to the use of recent technologies, our architecture can scale to corpora larger than those used in this work.
Jack Bowers and Laurent Romary. 2019. TEI and the Mixtepec-Mixtec corpus: data integration, annotation and normalization of heterogeneous data for an under-resourced language. In 6th International Conference on Language Documentation and Conservation (ICLDC). Honolulu, United States.

Communications

Alix Chagué, Victoria Le Fourner, Manuela Martini and Éric Villemonte de La Clergerie. 2019. Deux siècles de sources disparates sur l'industrie textile en France : comment automatiser les traitements d'un corpus non-uniforme ? In Colloque DHNord 2019 « Corpus et archives numériques ». Lille, France.

Murielle Fabre, Yoann Dupont and Éric Villemonte de La Clergerie. 2019. Syntactic Parsing versus MWEs: What can fMRI signal tell us. In PARSEME-FR 2019 consortium meeting. Blois, France.

Yixuan Li, Kim Gerdes and Chuanming Dong. 2019. Character-level Annotation for Chinese Surface-Syntactic Universal Dependencies. In Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019), pages 216–226. Association for Computational Linguistics. Paris, France.

This paper presents a new schema to annotate Chinese Treebanks on the character level. The original Universal Dependencies (UD) and Surface-Syntactic Universal Dependencies (SUD) projects provide token-level resources with rich morphosyntactic language details. However, without any commonly accepted word definition for Chinese, the dependency parsing always faces the dilemma of word segmentation. Therefore we present a character-level annotation schema integrated into the existing Universal Dependencies schema as an extension.
Kim Gerdes, Sylvain Kahane and Xinying Chen. 2019. Rediscovering Greenberg's Word Order Universals in UD. In UDW, Universal Dependencies Workshop 2019, Syntaxfest. Paris, France.

This paper discusses an empirical refoundation of selected Greenbergian word order universals based on a data analysis of the Universal Dependencies project. The nature of the data we work on allows us to extract rich details for testing well-known typological universals and therefore constitutes a valuable basis for validating Greenberg's universals. Our results show that we can refine some Greenbergian universals in a more empirical and accurate way by means of a data-driven typological analysis.
Bernard Caron, Marine Courtin, Kim Gerdes and Sylvain Kahane. 2019. A Surface-Syntactic UD Treebank for Naija. In TLT 2019, Treebanks and Linguistic Theories, Syntaxfest. Paris, France.

This paper presents a syntactic treebank for spoken Naija, an English pidgincreole, which is rapidly spreading across Nigeria. The syntactic annotation is developed in the Surface-Syntactic Universal Dependency annotation scheme (SUD) (Gerdes et al., 2018) and automatically converted into UD. We present the workflow of the treebank development for this under-resourced language. A crucial step in the syntactic analysis of a spoken language consists in manually adding a markup onto the transcription, indicating the segmentation into major syntactic units and their internal structure. We show that this so-called "macrosyntactic" markup improves parsing results. We also study some iconic syntactic phenomena that clearly distinguish Naija from English.
Xinying Chen and Kim Gerdes. 2019. The relation between dependency distance and frequency. In Quasy 2019, Quantitative Syntax 2019, Syntaxfest. Paris, France.

This pilot study investigates the relationship between dependency distance and frequency based on the analysis of an English dependency treebank. The preliminary result shows that there is a non-linear relation between dependency distance and frequency. This relation can be further formalized as a power law function, which can be used to predict the distribution of dependency distance in a treebank.
José Carlos Rosales Nunez, Djamé Seddah and Guillaume Wisniewski. 2019. A Comparison between NMT and PBSMT Performance for Translating Noisy User-Generated Content. In The 22nd Nordic Conference on Computational Linguistics (NoDaLiDa'19). Turku, Finland.

This work compares the performance achieved by Phrase-Based Statistical Machine Translation systems (PBSMT) and attention-based Neural Machine Translation systems (NMT) when translating User-Generated Content (UGC), as encountered in social media, from French to English. We show that, contrary to what could be expected, PBSMT outperforms NMT when translating non-canonical inputs. Our error analysis uncovers the specificities of UGC that are problematic for sequential NMT architectures and suggests new avenues for improving NMT models.
Mohamed Khemakhem, Ioana Galleron, Geoffrey Williams, Laurent Romary and Pedro Javier Ortiz Suárez. 2019. How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures. In 19th annual Conference and Members' Meeting of the Text Encoding Initiative Consortium (TEI) -What is text, really? TEI and beyond. Graz, Austria.

Ganesh Jawahar and Djamé Seddah. 2019. Contextualized Diachronic Word Representations. In 1st International Workshop on Computational Approaches to Historical Language Change 2019 (colocated with ACL 2019). Florence, Italy.

Diachronic word embeddings play a key role in capturing interesting patterns about how language evolves over time. Most of the existing work focuses on studying corpora spanning several decades, which is understandably still not a possibility when working on social media-based user-generated content. In this work, we address the problem of studying semantic changes in a large Twitter corpus collected over five years, a much shorter period than is usually the norm in diachronic studies. We devise a novel attentional model, based on Bernoulli word embeddings, that is conditioned on contextual extra-linguistic (social) features such as network, spatial and socioeconomic variables associated with Twitter users, as well as topic-based features. We posit that these social features provide an inductive bias that helps our model to overcome the narrow time-span regime problem. Our extensive experiments reveal that our proposed model is able to capture subtle semantic shifts without being biased towards frequency cues, and also works well when certain contextual features are absent. Our model fits the data better than current state-of-the-art dynamic word embedding models and is therefore a promising tool to study diachronic semantic changes over small time periods.
Pedro Javier Ortiz Suárez, Laurent Romary and Benoît Sagot. 2019. Preparing the Dictionnaire Universel for Automatic Enrichment. In 10th International Conference on Historical Lexicography and Lexicology (ICHLL). Leeuwarden, Netherlands.

The Dictionnaire Universel (DU) is an encyclopaedic dictionary originally written by Antoine Furetière around 1676-78, later revised and improved by the Protestant jurist Henri Basnage de Beauval, who expanded and corrected it and included terms of arts, crafts and sciences. The aim of the BASNUM project is to digitize the DU in its second edition, rewritten by Basnage de Beauval, to analyse it with computational methods in order to better assess the importance of this work for the evolution of sciences and mentalities in the 18th century, and to contribute to the contemporary movement for creating innovative and data-driven computational methods for text digitization, encoding and analysis. Based on the experience acquired within the research group, an enrichment workflow based upon a series of Natural Language Processing processes is being set up to be applied to Basnage's work. This includes, among others, automatic identification of the dictionary structure (macro-, meso- and microstructure), named-entity recognition (in particular persons and locations), classification of dictionary entries, detection and study of polysemy markers, tracking and classification of quotation use (bibliographic references), and scoring semantic similarity between the DU and other dictionaries. The main challenges are the lack of available annotated data for training machine learning models, decreased accuracy when using modern pre-trained models due to the differences between present-day and 18th-century French, and unreliable or low-quality OCRisation. The paper describes methods that are useful to tackle these issues in order to prepare the DU for automatic enrichment going beyond what currently available tools like Grobid-dictionaries can do, thanks to the advent of deep learning NLP models. The paper also describes how these methods could be applied to other dictionaries or even other types of ancient texts.
Sheena Bassett, Leon Wessels, Steven Krauwer, Bente Maegaard, Hella Hollander, Femmy Admiraal, Laurent Romary and Frank Uiterwaal. 2019. Connecting the Humanities through Research Infrastructures. In 4th Digital Humanities in the Nordic Countries (DHN 2019). Copenhagen, Denmark.

Several Research Infrastructures (RIs) exist in the Humanities and Social Sciences, some of which, such as CLARIN, DARIAH and CESSDA, address specific areas of interest, i.e. linguistic studies, digital humanities and social science data archives. RIs are also unique in their scope and application, largely tailored to their specific community needs. However, commonalities do exist, and it is recognised that benefits are to be gained from these, such as efficient use of resources, enabling multi-disciplinary research and sharing good practices. As such, the bridging project PARTHENOS has worked closely with CLARIN and DARIAH as well as ARIADNE (archaeology), CENDARI (history), EHRI (holocaust studies) and E-RIHS (heritage science) to identify, develop and promote these commonalities. In this paper, we present some specific examples of cross-discipline and trans-border applications arising from joint RI collaboration, allowing for entirely new avenues of research.

Books

Kim Gerdes and Sylvain Kahane. 2019. Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019).

Book chapters

Kim Gerdes, Sylvain Kahane, Rachel Bawden, Julie Beliao, Éric Villemonte de La Clergerie and Ilaine Wang. 2019. Annotation tools for syntax. In Rhapsodie: A Prosodic and Syntactic Treebank for Spoken French. John Benjamins.

Sylvain Kahane, Paola Pietrandrea and Kim Gerdes. 2019. The annotation of list structures. In Rhapsodie: A Prosodic and Syntactic Treebank for Spoken French. John Benjamins.

Sylvain Kahane, Kim Gerdes and Rachel Bawden. 2019. The microsyntactic annotation. In Rhapsodie: A Prosodic and Syntactic Treebank for Spoken French. John Benjamins.

Laurent Romary and Jennifer Edmond. 2019. A Tangential View on Impact for the Arts and Humanities through the Lens of the DARIAH-ERIC. In Stay Tuned To The Future - Impact of the Research Infrastructures for Social Sciences and Humanities. Leo S. Olschki Editore.

The reflections in this chapter stem from the perspective of the DARIAH-ERIC, a distributed infrastructure for the arts and humanities. They explore how impact can take a variety of forms not always considered when the term is applied in a strictly technocratic sense, and the idea that focussing on the user of a research infrastructure may not describe an optimal relationship from an impact perspective. The chapter concludes by presenting three frames of reference in which an infrastructure like DARIAH can have impact: to foster excellence through impact on researchers, promote fluidity through impact on policymakers, and support efficiency through impact on our partner organisations.

Other

Laurent Romary. 2019. Traçabilité des données d'expérience pour les matériaux anciens et patrimoniaux.

Kim Gerdes, Bruno Guillaume, Sylvain Kahane and Guy Perrier. 2019. Pourquoi se tourner vers le SUD : L'importance de choisir un schéma d'annotation en dépendance surface-syntaxique.

Why you should turn to SUD: the importance of choosing a Surface-Syntactic dependency annotation scheme. The article promotes the Surface-Syntactic Universal Dependencies (SUD) annotation scheme to syntactic annotation projects, as an alternative to the standard Universal Dependencies (UD) scheme, particularly for oral or non-standard texts and for projects conducted for comparative and typological studies.
Yoann Dupont. 2019. Un corpus libre, évolutif et versionné en entités nommées du français.

A free, evolving and versioned French named entity recognition corpus. Annotated corpora are very costly resources to produce because of the high human effort they imply. Once released, they are rarely modified and tend not to evolve through time. In this article, we present a free and evolving corpus annotated for named entity recognition, based on French Wikinews articles from 2016 to 2018, for a total of 1191 articles. We briefly describe the annotation guidelines before comparing our corpus to various corpora of a comparable nature. We also give an intra-annotator agreement to provide an estimation of the stability of the annotation, as well as the overall process used to develop the corpus.
Murielle Fabre, Benoit Crabbe and Christophe Pallier. 2019. Variable beam search for generative neural parsing and its fit with neuro-imaging signal.

Murielle Fabre, Shohini Bhattasali, Christophe Pallier and John Hale. 2019. Modeling Conventionalization and Predictability in Multi-Word Expressions at Brain-level.

While linguistic expressions have traditionally been binarized as compositional or non-compositional in the absence of compositional linguistic analysis, Multi-word Expressions (MWEs) demonstrate finer-grained degrees of conventionalization and predictability in psycholinguistics, which can be quantified through computational Association Measures such as Point-wise Mutual Information and Dice's Coefficient. In this study, fMRI recordings of naturalistic narrative comprehension are used to investigate to what extent these computational measures, and the underlying cognitive processes they may reflect, are observable during on-line naturalistic sentence processing.
Abhishek Srivastava, Benjamin Muller and Djamé Seddah. 2019. Unsupervised Learning for Handling Code-Mixed Data: A Case Study on POS Tagging of North-African Arabizi Dialect.

Pretrained language model representations are now ubiquitous in Natural Language Processing. In this work, we present first results in adapting such models to out-of-domain textual data. Using part-of-speech (POS) tagging as our case study, we analyze the ability of BERT to model a complex North-African dialect, Arabizi. Dialectal Arabic is a variation of Classical Arabic that varies from one region to another and is only spoken orally; Darija is the variety spoken in the Maghreb (Algeria, Tunisia, Morocco). Arabizi is the name given to the transliteration of dialectal Arabic into Latin script, mostly found online; its key property is high variability: no fixed spelling, morphological or syntactic norms, strong influence from foreign languages, and code-switching between French and Darija. We run our experiments on the released base multilingual version of BERT (Devlin et al. 2018), which was trained on a concatenation of the Wikipedias of 104 languages and has never seen any Arabizi; visibly, Arabizi is related to French in BERT's embedding space. Starting from 9,000 sentences collected by Cotterell et al. (2014) and using keyword scraping, we collect 1 million user-generated sentences comprising French, English and Arabizi, out of which our language identifier filters 200k Arabizi sentences (94% F1 score). As the first bottleneck in analyzing such a dialect is the lack of annotated resources, we develop a CoNLL-U treebank that includes part-of-speech tags, dependencies and translations for 1,500 sentences (originally posted on Facebook, in the Echorouk newspaper, etc.). We also train a clustering lexical normalizer using edit and word2vec distances, which degrades downstream POS tagging performance. Final accuracy on the test set, averaged over 5 runs: baseline (UDPipe) 73.7; baseline + normalization (UDPipe) 72.4; BERT + POS tuning 77.3; BERT + POS tuning + normalization (UDPipe) 69.9; BERT + unsupervised domain fine-tuning (MLM objective on the 200k Arabizi sentences) + POS tuning 78.3. In summary, multilingual BERT can be used to build a decent part-of-speech tagger with a reasonable amount of annotated data, and unsupervised adaptation improves downstream POS tagging performance by one point.
Laurent Romary. 2019. The TEI as a modeling infrastructure: TEI beyond the TEI realms.

Whereas the Text Encoding Initiative (TEI) has become the reference standard for encoding textual material of all kinds in the humanities, the power of the underlying TEI modelling infrastructure to deal with professional document management scenarios, or even non-TEI based vocabularies, deserves more attention. The aim of my presentation will be to show concrete projects where I have contributed to using the ODD (One Document Does it all) specification language of the TEI for such applications as the management of patent documents, the modelling of lexical resources, or the integration of heterogeneous archival descriptions in the EAD (Encoded Archival Description) standard. Starting from an introduction of the TEI as a standard, I will try to conclude on its potential bright future as a real infrastructure for the humanities.
Anas Alaoui M'Darhri, Vincent Baillet, Bastien Bourineau, Alessio Calantropio, Gabriella Carpentiero, Medhi Chayani, Livio de Luca, Iwona Dudek, Bruno Dutailly, Hélène Gautier, Eleonora Grilli, Valentin Grimaud, Christoph Hoffmann, Adeline Joffres, Nenad Jončić, Michel Jordan, Justin J.L. Kimball, Adeline Manuel, Patrick Mcinerney, Imanol Muñoz Pandiella, Ariane Néroulidis, Erica Nocerino, Anthony Pamart, Costas Papadopoulos, Marco Potenziani, Emilie Saubestre, Roberto Scopigno, Dorian Seillier, Sarah Tournon-Valiente, Martina Trognitz, Jean-Marc Vallet and Chiara Zuanni. 2019. Share - Publish - Store - Preserve. Methodologies, Tools and Challenges for 3D Use in Social Sciences and Humanities.

Through this White Paper, which gathers contributions from experts in 3D data as well as professionals concerned with the interoperability and sustainability of 3D research data, the PARTHENOS project aims at highlighting some of the current issues they have to face, with possible discipline-specific points, and potential practices and methodologies for dealing with these issues. During the workshop, several tools addressing these issues were introduced and confronted with the participants' experiences; this White Paper now intends to go further by also integrating participants' feedback and suggestions for potential improvements. Therefore, even if the focus is put on specific tools, the main goal is to contribute to the development of standardized good practices related to the sharing, publication, storage and long-term preservation of 3D data.
Laurent Romary, Damien Biabiany, Klaus Illmayer, Marie Puren, Charles Riondet, Dorian Seillier and Lionel Tadjou. 2019. SSK by example - Make your Arts and Humanities research go standard.

Preprints

Laurent Romary, Dorian Seillier and Erzsébet Tóth-Czifra. 2019. Reuse agreement template between Cultural Heritage Institutions and researchers. Preprint.

A defining feature of data and data workflows in the arts and humanities domain is their dependence on cultural heritage sources hosted and curated in museums, libraries, galleries and archives. A major difficulty when scholars interact with heritage data is that the cooperation between researchers and Cultural Heritage Institutions (henceforth CHIs) is often constrained by structural and legal challenges, and even more by uncertainties as to the expectations of both parties. This recognition led several European organizations, such as APEF, CLARIN, Europeana and E-RIHS, to come together and join forces under the governance of DARIAH to set up principles and mechanisms for improving the conditions for the use and re-use of cultural heritage data issued by cultural heritage institutions and studied and enriched by researchers. As a first step in this joint effort, the Heritage Data Reuse Charter (https://datacharter.hypotheses.org/) establishes 6 basic principles for improving the use and re-use of cultural heritage resources by researchers, and for helping all the relevant actors to work together to connect and improve access to heritage data. These are: Reciprocity, Interoperability, Citability, Openness, Stewardship and Trustworthiness. As a further step in translating these principles into actual data workflows, the survey below serves as a template to frame exchanges around cultural heritage data by enabling Cultural Heritage Institutions, infrastructure providers and researchers to clarify their goals at the beginning of a project and to specify access to data, provenance information, preferred citation standards, hosting responsibilities, etc., on the basis of which the parties can arrive at mutual reuse agreements that could serve as a starting point for FAIR-by-construction data management, right from the project planning/application phase.
In practice, the survey below can be flexibly applied in platform-independent ways in exchange protocols between Cultural Heritage Institutions and researchers; institutions that sign the Charter could use it (and expect such surveys to be used) in their own exchange protocols. Another direction for future developments is to set up a platform dedicated to such exchanges. Researchers, on the other hand, are encouraged to contact CHIs during the initial stages of their project in order to explain their plans and work out the details of the transaction together. This mutual declaration can later be a powerful component of their Data Management Plans, as it shows evidence of responsible and fair conduct with cultural heritage data, and of fair (but also FAIR) research data management practices based on partnership with the holding institution. As enclosing a Research Data Management Plan with grant applications is becoming a more and more common requirement among research funders, we need to raise funders' awareness of the fact that such bi- or trilateral agreements and data reuse declarations among researchers, CHIs and infrastructure providers are crucial domain-specific components of FAIR data management.
Laurent Romary. 2019. Archives de hier et de demain. Preprint.

Alix Chagué, Victoria Le Fourner, Manuela Martini and Eric Villemonte de La Clergerie. 2019. Deux siècles de sources disparates sur l'industrie textile en France : comment automatiser les traitements d'un corpus non-uniforme ? Preprint.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de La Clergerie, Djamé Seddah and Benoît Sagot. 2019. CamemBERT: a Tasty French Language Model. Preprint.

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models—in all languages except English—very limited. Aiming to address this issue for French, we release CamemBERT, a French version of the Bi-directional Encoders for Transformers (BERT). We measure the performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and downstream applications for French NLP.
Louis Martin, Benoît Sagot, Éric Villemonte de La Clergerie and Antoine Bordes. 2019. Controllable Sentence Simplification. Preprint.

Text simplification aims at making a text easier to read and understand by simplifying grammar and structure while keeping the underlying information identical. It is often considered an all-purpose generic task where the same simplification is suitable for all; however multiple audiences can benefit from simplified text in different ways. We adapt a discrete parametrization mechanism that provides explicit control on simplification systems based on Sequence-to-Sequence models. As a result, users can condition the simplifications returned by a model on parameters such as length, amount of paraphrasing, lexical complexity and syntactic complexity. We also show that carefully chosen values of these parameters allow out-of-the-box Sequence-to-Sequence models to outperform their standard counterparts on simplification benchmarks. Our model, which we call ACCESS (as shorthand for AudienCe-CEntric Sentence Simplification), increases the state of the art to 41.87 SARI on the WikiLarge test set, a +1.42 gain over previously reported scores.
Charlotte Rochereau, Benoît Sagot and Emmanuel Dupoux. 2019. Modeling German Verb Argument Structures: LSTMs vs. Humans. Preprint.

LSTMs have proven very successful at language modeling. However, it remains unclear to what extent they are able to capture complex morphosyntactic structures. In this paper, we examine whether LSTMs are sensitive to verb argument structures. We introduce a German grammaticality dataset in which ungrammatical sentences are constructed by manipulating case assignments (e.g. substituting nominative by accusative or dative). We find that LSTMs are better than chance at detecting incorrect argument structures, and slightly worse than humans tested on the same dataset. Surprisingly, LSTMs are contaminated by heuristics not found in humans, like a preference toward nominative noun phrases. In other respects they show human-like results, like biases for particular orders of case assignments.
Alba Málaga Sabogal and Serge Troubetzkoy. 2019. Unique ergodicity for infinite area Translation Surfaces. Preprint.

We consider infinite staircase translation surfaces with varying step sizes. For typical step sizes we show that the translation flow is uniquely ergodic in almost every direction. Our results also hold for typical configurations of the Ehrenfest wind-tree model endowed with the Hausdorff topology. In contrast, we show that the translation flow on a periodic translation surface cannot be uniquely ergodic in any direction.
Jack Bowers. 2019. Language Documentation and Standards in Digital Humanities: TEI and the documentation of Mixtepec-Mixtec. Preprint.

This project concerns an ongoing language documentation project covering the Mixtepec-Mixtec variety of Mixtec (iso 639-3: mix). Mixtepec-Mixtec is an Oto-Manguean language spoken by roughly 9,000–10,000 people in the Juxtlahuaca district of Oaxaca, and parts of the Guerrero and Puebla states of Mexico, as well as by communities living in California, Oregon, Washington and Arkansas. The primary facets of the work are to: create an open source body of reusable and extensible multimedia language resources encoded in TEI XML; create multilingual translations (English and Spanish); annotate the content according to sound theoretical linguistic principles; use the above in order to further the knowledge of all aspects of the language itself within the fields of linguistics and lexicography by producing empirical corpus-based descriptions and analyses of various aspects of the language’s features; and demonstrate and evaluate the application of encoding and description standards on a collection of lexical and knowledge resources for an under-resourced non-Indo-European language. In addition to providing a lasting and reusable set of resources for the MIX language, this work also aims to make strides towards bridging the gap between lexicography, language documentation, theoretical linguistics, computational linguistics and digital humanities.

2018

PhD theses and Habilitations

Benoît Sagot. 2018. Informatiser le lexique. Habilitation à diriger des recherches. Sorbonne Université.

Journal articles

Charles Riondet and Laurent Romary. 2018. The Standardization Survival Kit: for a Wider Use of Metadata Standards within Arts and Humanities. Archives et Bibliothèques de Belgique - Archief- en Bibliotheekwezen in België 106, pages 55–62. Archief.

Jack Bowers and Laurent Romary. 2018. Bridging the Gaps between Digital Humanities, Lexicography, and Linguistics: A TEI Dictionary for the Documentation of Mixtepec-Mixtec. Dictionaries: Journal of the Dictionary Society of North America 39, pages 79–106. Dictionary Society of North America.

This paper discusses the digital dictionary component in an ongoing language documentation project for the Mixtepec-Mixtec language (iso 639-3: mix). Mixtepec-Mixtec (Sa'an Savi 'rain language') is an Oto-Manguean language spoken by roughly 9,000–10,000 people in the Juxtlahuaca district of Oaxaca, Mexico. Creating a digital dictionary for an under-resourced language entails a number of challenges that require unique and nuanced encoding solutions in which a delicate balance between the linguistic content, data structure, potential linked resources, and editorial metadata must be found. Herein we demonstrate how we use TEI to create a reusable, extensible, and machine-readable language resource, with an emphasis on how our solutions, using a combination of novel and established TEI dictionary structures, enable us to address our specific needs for Mixtepec-Mixtec and also provide a relevant roadmap for similar under-resourced language projects.
Laurent Romary and Charles Riondet. 2018. EAD-ODD: A solution for project-specific EAD schemes. Archival Science. Springer Verlag.

This article tackles the issue of integrating heterogeneous archival sources in one single data repository, namely the EHRI portal, whose aim is to support Holocaust research by providing online access to information about dispersed sources relating to the Holocaust (http://portal.ehri-project.eu). In this case, the problem at hand is to combine data coming from a network of archives in order to create an interoperable data space which can be used to search for, retrieve and disseminate content in the context of archival-based research. The central aspect of the work described in this paper is the assessment of the role of the Encoded Archival Description (EAD) standard as the basis for achieving the tasks described above. We have worked out a strategy of defining specific customizations of EAD that can be used at various stages of the process of integrating heterogeneous sources. We have developed a methodology based on a specification and customization method inspired by the extensive experience of the Text Encoding Initiative (TEI) community. In the TEI framework, one has the possibility to model specific subsets or extensions of the TEI guidelines while maintaining both the technical (XML schemas) and editorial (documentation) content within a single framework. This work leads us to anticipate that the method we have developed may be of wider interest within similar environments and, we believe, for the future maintenance of the EAD standard itself.
Sacha Beniamine, Olivier Bonami and Benoît Sagot. 2018. Inferring inflection classes with description length. Journal of Language Modelling 5, pages 465–525. Institute of Computer Science, Polish Academy of Sciences, Poland.

We discuss the notion of an inflection class system, a traditional ingredient of the description of inflection systems of nontrivial complexity. We distinguish systems of microclasses, which partition a set of lexemes in classes with identical behavior, and systems of macroclasses, which group lexemes that are similar enough in a few larger classes. On the basis of the intuition that macroclasses should contribute to a concise description of the system, we propose one algorithmic method for inferring macroclasses from raw inflectional paradigms, based on minimisation of the description length of the system under a given strategy of identifying morphological alternations in paradigms. We then exhibit classifications produced by our implementation on French and European Portuguese conjugation data and argue that they constitute an appropriate systematisation of traditional classifications. To arrive at such a convincing systematisation, it was crucial for us to use a local approach to inflection class similarity (based on pairwise comparisons of paradigm cells) rather than a global approach (based on the simultaneous comparison of all cells). We conclude that it is indeed possible to infer inflectional macroclasses objectively.
Alba Málaga Sabogal and Serge Troubetzkoy. 2018. Infinite ergodic index of the Ehrenfest wind-tree model. Communications in Mathematical Physics 358, pages 995–1006. Springer Verlag.

The set of all possible configurations of the Ehrenfest wind-tree model endowed with the Hausdorff topology is a compact metric space. For a typical configuration we show that the wind-tree dynamics has infinite ergodic index in almost every direction. In particular, some ergodic theorems can be applied to show that if we start with a large number of initially parallel particles, their directions decorrelate as the dynamics evolve, answering a question posed by the Ehrenfests.

Conference proceedings

Jack Bowers and Philip Stöckle. 2018. TEI and Bavarian dialect resources in Austria: updates from the DBÖ and WBÖ. In Second workshop on Corpus-Based Research in the Humanities (CRH-2) 1. Gerastree proceedings. Vienna, Austria.

In our paper, we present a large historical database of Bavarian dialects (from the Dictionary of Bavarian Dialects in Austria) and its conversion from handwritten paper slips via TUSTEP into TEI-XML, while elaborating on the topics discussed by Bowers [2] with regard to the enhancement of its contents. While the original purpose of the digitisation was to facilitate the writing of dictionary articles, our current TEI database will be used as a corpus from which materials are gathered both to write print dictionary articles and to serve as the basis for a web-based lexicographic information system. Herein we trace the different steps that have already been taken to create our current digital database from a legacy data collection, discuss the challenges we are still facing, and describe the approaches we are taking and considering to address such challenges.
Jack Bowers, Axel Herold and Laurent Romary. 2018. TEI Lex-0 Etym: towards terse(r) recommendations for the encoding of etymological information. In TEI Conference and Members' Meeting. Tokyo, Japan.

Jack Bowers and Laurent Romary. 2018. Encoding Mixtepec-Mixtec Etymology in TEI. In TEI Conference and Members' Meeting. Tokyo, Japan.

Marco Dinarelli and Loïc Grobol. 2018. Modélisation d'un contexte global d'étiquettes pour l'étiquetage de séquences dans les réseaux neuronaux récurrents. In Journée commune AFIA-ATALA sur le Traitement Automatique des Langues et l'Intelligence Artificielle pendant la onzième édition de la plate-forme Intelligence Artificielle (PFIA 2018). Nancy, France.

During the last few years Recurrent Neural Networks (RNN) have reached state-of-the-art performances on most sequence modeling problems. In particular the sequence-to-sequence model and the neural CRF have proved very effective on this class of problems. In this paper we propose an alternative RNN for sequence labelling, based on label embeddings and memory networks, which makes it possible to take arbitrarily long contexts into account. Our results are better than those of state-of-the-art models in most cases, and close to them in all cases. Moreover, our solution is simpler than the best models in the literature.
Louis Martin, Samuel Humeau, Pierre-Emmanuel Mazaré, Antoine Bordes, Éric Villemonte de La Clergerie and Benoît Sagot. 2018. Reference-less Quality Estimation of Text Simplification Systems. In 1st Workshop on Automatic Text Adaptation (ATA). Tilburg, Netherlands.

The evaluation of text simplification (TS) systems remains an open challenge. As the task has common points with machine translation (MT), TS is often evaluated using MT metrics such as BLEU. However, such metrics require high quality reference data, which is rarely available for TS. TS has the advantage over MT of being a monolingual task, which allows for direct comparisons to be made between the simplified text and its original version. In this paper, we compare multiple approaches to reference-less quality estimation of sentence-level text simplification systems, based on the dataset used for the QATS 2016 shared task. We distinguish three different dimensions: grammaticality, meaning preservation and simplicity. We show that n-gram-based MT metrics such as BLEU and METEOR correlate the most with human judgment of grammaticality and meaning preservation, whereas simplicity is best evaluated by basic length-based metrics.
Ganesh Jawahar, Benjamin Muller, Amal Fethi, Louis Martin, Éric Villemonte de La Clergerie, Benoît Sagot and Djamé Seddah. 2018. ELMoLex: Connecting ELMo and Lexicon features for Dependency Parsing. In CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Brussels, Belgium.

In this paper, we present the details of the neural dependency parser and the neural tagger submitted by our team 'ParisNLP' to the CoNLL 2018 Shared Task on parsing from raw text to Universal Dependencies. We augment the deep Biaffine (BiAF) parser (Dozat and Manning, 2016) with novel features to perform competitively: we utilize an in-domain version of ELMo features (Peters et al., 2018), which provide context-dependent word representations, and disambiguated, embedded morphosyntactic features from lexicons (Sagot, 2018), which complement the existing feature set. Henceforth, we call our system 'ELMoLex'. In addition to incorporating character embeddings, ELMoLex leverages pre-trained word vectors, ELMo and morphosyntactic features (whenever available) to correctly handle rare or unknown words, which are prevalent in languages with complex morphology. ELMoLex ranked 11th by the Labeled Attachment Score metric (70.64%) and the Morphology-aware LAS metric (55.74%), and 9th by the Bilexical dependency metric (60.70%). In an extrinsic evaluation setup, ELMoLex ranked 7th for the Event Extraction and Negation Resolution tasks and 11th for the Opinion Analysis task by F1 score.
Andrea Bertino, Luca Foppiano, Laurent Romary and Pierre Mounier. 2018. Leveraging Concepts in Open Access Publications. In PUBMET 2018 - 5th Conference on Scholarly Publishing in the Context of Open Science. Zadar, Croatia.

Aim: This paper addresses the integration of a Named Entity Recognition and Disambiguation (NERD) service within a group of open access (OA) publishing digital platforms and considers its potential impact on both research and scholarly publishing. This application, called entity-fishing, was initially developed by Inria in the context of the EU FP7 project CENDARI (Lopez et al., 2014) and provides automatic entity recognition and disambiguation against Wikipedia and Wikidata. Distributed under an open-source licence, it was deployed as a web service in the DARIAH infrastructure hosted by the French Huma-Num. Methods: In this paper, we focus on the specific issues related to its integration on five OA platforms specialized in the publication of scholarly monographs in social sciences and humanities as part of the work carried out within the EU H2020 project HIRMEOS (High Integration of Research Monographs in the European Open Science infrastructure). Results and Discussion: In the following sections, we give a brief overview of the current status and evolution of OA publications and how HIRMEOS aims to contribute to this. We then give a comprehensive description of the entity-fishing service, focusing on its concrete applications in real use cases together with some further possible ideas on how to exploit the generated annotations. Conclusions: We show that entity-fishing annotations can improve both the research and the publishing process. Entity-fishing annotations can be used to achieve a better and quicker understanding of the specific and disciplinary language of certain monographs and so encourage non-specialists to use them. In addition, a systematic implementation of the entity-fishing service can be used by publishers to generate thematic indexes within book collections to allow better cross-linking and query functions.
Loïc Grobol, Frédéric Landragin and Serge Heiden. 2018. XML-TEI-URS: using a TEI format for annotated linguistic resources. In CLARIN Annual Conference 2018. Pisa, Italy.

This paper discusses XML-TEI-URS, a recently introduced TEI-compliant XML format for the annotation of referential phenomena in arbitrary corpora. We describe our experiments on using this format in different contexts, assess its perceived strengths and weaknesses, compare it with other similar efforts and suggest improvements to ease its use as a standard for the distribution of interoperable annotated linguistic resources.
Maëlle Brassier, Alexis Puret, Augustin Voisin-Marras and Loïc Grobol. 2018. Classification par paires de mention pour la résolution des coréférences en français parlé interactif. In Conférence jointe CORIA-TALN-RJC 2018. Rennes, France.

Mention-pair classification for coreference resolution on spontaneous spoken French. This paper presents the first experiments conducted by our laboratory (LIFAT) on the question of coreference resolution on spontaneous spoken French. We have developed a mention-pair classifier, trained on the ANCOR French coreference corpus, which is based on various classification techniques, among which support vector machines (SVM). The paper details several experimental studies investigating factors (classification model, interactivity degree, nature of the coreference…) that affect the performance of the system.
Mohamed Khemakhem, Laurent Romary, Simon Gabay, Hervé Bohbot, Francesca Frontini and Giancarlo Luxardo. 2018. Automatically Encoding Encyclopedic-like Resources in TEI. In The annual TEI Conference and Members Meeting. Tokyo, Japan.

Mohamed Khemakhem, Carmen Brando, Laurent Romary, Frédérique Mélanie-Becquet and Jean-Luc Pinol. 2018. Fueling Time Machine: Information Extraction from Retro-Digitised Address Directories. In JADH2018 “Leveraging Open Data”. Tokyo, Japan.

Djamé Seddah, Éric Villemonte de La Clergerie, Benoît Sagot, Hector Martinez Alonso and Marie Candito. 2018. Cheating a Parser to Death: Data-driven Cross-Treebank Annotation Transfer. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, pages 4535–4539. European Language Resources Association (ELRA). Miyazaki, Japan.

We present an efficient and accurate method for transferring annotations between two different treebanks of the same language. This method led to the creation of a new instance of the French Treebank (Abeillé et al., 2003), which follows the Universal Dependency annotation scheme and which was proposed to the participants of the CoNLL 2017 Universal Dependency parsing shared task (Zeman et al., 2017). Strong results from an evaluation on our gold standard (94.75% LAS, 99.40% UAS on the test set) demonstrate the quality of this new annotated data set and validate our approach.
Benoît Sagot. 2018. A multilingual collection of CoNLL-U-compatible morphological lexicons. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, pages 1861–1867. Miyazaki, Japan.

We introduce UDLexicons, a multilingual collection of morphological lexicons that follow the guidelines and format of the Universal Dependencies initiative. We describe the three approaches we use to create 53 morphological lexicons covering 38 languages, based on existing resources. These lexicons, which are freely available, have already proven useful for improving part-of-speech tagging accuracy in state-of-the-art architectures.
Amir More, Özlem Çetinoğlu, Çağri Çöltekin, Nizar Habash, Benoît Sagot, Djamé Seddah, Dima Taji and Reut Tsarfaty. 2018. CoNLL-UL: Universal Morphological Lattices for Universal Dependency Parsing. In 11th Language Resources and Evaluation Conference. Miyazaki, Japan.

Following the development of the universal dependencies (UD) framework and the CoNLL 2017 Shared Task on end-to-end UD parsing, we address the need for a universal representation of morphological analysis which on the one hand can capture a range of different alternative morphological analyses of surface tokens, and on the other hand is compatible with the segmentation and morphological annotation guidelines prescribed for UD treebanks. We propose the CoNLL universal lattices (CoNLL-UL) format, a new annotation format for word lattices that represent morphological analyses, and provide resources that obey this format for a range of typologically different languages. The resources we provide are harmonized with the two-level representation and morphological annotation in their respective UD v2 treebanks, thus enabling research on universal models for morphological and syntactic parsing, in both pipeline and joint settings, and presenting new opportunities in the development of UD resources for low-resource languages.
Loïc Grobol, Isabelle Tellier, Éric Villemonte de La Clergerie, Marco Dinarelli and Frédéric Landragin. 2018. ANCOR-AS: Enriching the ANCOR Corpus with Syntactic Annotations. In LREC 2018 - 11th edition of the Language Resources and Evaluation Conference. Miyazaki, Japan.

This paper presents ANCOR-AS, an enriched version of the ANCOR corpus. This version adds syntactic annotations in addition to the existing coreference and speech transcription ones. This corpus is also released in a new TEI-compliant XML format.
Mohamed Khemakhem, Axel Herold and Laurent Romary. 2018. Enhancing Usability for Automatically Structuring Digitised Dictionaries. In GLOBALEX workshop at LREC 2018. Miyazaki, Japan.

The last decade has seen rapid growth in the number of NLP tools made available to the community. Yet the usability of several e-lexicography tools represents a serious obstacle for researchers with little or no background in computer science. We present in this paper our efforts to overcome this issue in the case of a machine learning system for the automatic segmentation and semantic annotation of digitised dictionaries. Our approach is based on limiting the burdens of managing the tool's setup in different execution environments and lightening the complexity of the training process. We show how this goal can be reached through the adaptation of existing functionalities and through the use of out-of-the-box software deployment technology. We also report on the community's feedback after exposing the new setup to real users of different professional backgrounds.

Communications

Laurent Romary and Toma Tasovac. 2018. TEI Lex-0: A Target Format for TEI-Encoded Dictionaries and Lexical Resources. In TEI Conference and Members' Meeting. Tokyo, Japan.

Achieving consistent encoding within a given community of practice has been a recurrent issue for the TEI Guidelines. The topic is of particular importance for lexical data if we think of the potential wealth of content we could gain from pooling together the information available in the variety of highly structured, historical and contemporary lexical resources. Still, the encoding possibilities offered by the Dictionaries chapter in the Guidelines are too numerous and too flexible to guarantee sufficient interoperability and a coherent model for searching, visualising or enriching multiple lexical resources. Following the spirit of TEI Analytics [Zillig, 2009], developed in the context of the MONK project, TEI Lex-0 aims at establishing a target format to facilitate the interoperability of heterogeneously encoded lexical resources. This is important both in the context of building lexical infrastructures as such [Ermolaev and Tasovac, 2012] and in the context of developing generic TEI-aware tools such as dictionary viewers and profilers. The format itself should not necessarily be one which is used for editing or managing individual resources, but one to which they can be univocally transformed to be queried, visualised, or mined in a uniform way. We are also aiming to stay as aligned as possible with the TEI subset developed in conjunction with the revision of the ISO LMF (Lexical Markup Framework) standard, so that coherent design guidelines can be provided to the community (cf. [Romary, 2015]).
The paper will provide an overview of the various domains covered by TEI Lex-0 and the main decisions that were taken over the last 18 months: constraining the general structure of a lexical entry; offering mechanisms to overcome the limits of <entry> when used in retro-digitized dictionaries (by allowing, for instance, <pc> and <lbl> as children of <entry>); systematizing the representation of morpho-syntactic information [Bański et al., 2017]; providing a strict <sense>-based encoding of sense-related information; deprecating <hom>; dealing with internal and external references in dictionary entries; providing more advanced encodings of etymology (see submission by Bowers, Herold and Romary); as well as defining technical constraints on the systematic use of @xml:id at different levels of the dictionary microstructure. The activity of the group has already led to changes in the Guidelines in response to specific GitHub tickets.
David Lindemann, Mohamed Khemakhem and Laurent Romary. 2018. Retro-digitizing and Automatically Structuring a Large Bibliography Collection. In European Association for Digital Humanities (EADH) Conference. Galway, Ireland.

Marie Puren, Alix Chagué, Manuela Martini, Éric Villemonte de La Clergerie and Charles Riondet. 2018. Creating gold data to understand the gender gap in the French textile trades (17th–20th century). Time-Us project. In Digital Humanities 2018 “Puentes/Bridges”. Mexico, Mexico.

Marie Puren, Dorian Seillier, Charles Riondet and Lionel Tadjou. 2018. Le Standardization Survival Kit (SSK). In Rencontres de la TGIR Huma-Num. Ecully, France.

Marie Puren, Charles Riondet, Laurent Romary, Dorian Seillier and Lionel Tadjou. 2018. The Standardization Survival Kit (SSK). In Digital Humanities Benelux 2018. Amsterdam, Netherlands.

Romain Garnier and Benoît Sagot. 2018. New results on a centum substratum in Greek: the Lydian connection. In International Colloquium on Loanwords and Substrata in Indo-European languages. Limoges, France.

Benoît Sagot. 2018. A new PIE root *h1er ‘(to be) dark red, dusk red': drawing the line between inherited and borrowed words for ‘red(ish)', ‘pea', ‘ore', ‘dusk' and ‘love' in daughter languages. In International Colloquium on Loanwords and Substrata in Indo-European languages. Limoges, France.

Hajer Maraoui, Kais Haddar and Laurent Romary. 2018. Segmentation tool for hadith corpus to generate TEI encoding. In 4th International Conference on Advanced Intelligent Systems and Informatics (AISI'18). Cairo, Egypt.

A segmentation tool for a hadith corpus is necessary to prepare the TEI hadith encoding process. In this context, we aim to develop a tool allowing the segmentation of hadith text from the Sahih al-Bukhari corpus. To achieve this objective, we start by identifying the different hadith structures. Then, we develop an automatic processing tool for hadith segmentation. This tool will be integrated in a prototype supporting the TEI encoding process. The experimentation and evaluation of this tool are based on the Sahih al-Bukhari corpus. The obtained results were encouraging despite some flaws related to exceptional cases of hadith structure.
Hervé Bohbot, Francesca Frontini, Giancarlo Luxardo, Mohamed Khemakhem and Laurent Romary. 2018. Presenting the Nénufar Project: a Diachronic Digital Edition of the Petit Larousse Illustré. In GLOBALEX 2018 - Globalex workshop at LREC2018, pages 1–6. Miyazaki, Japan.

This paper presents the Nénufar project, which aims to make several successive (free of copyright up to 1948) editions of the French Petit Larousse Illustré dictionary available in a digitised format. The corpus of digital editions will be made publicly available via a web-based querying interface, as well as distributed in a machine-readable format, TEI Lex-0.

Book chapters

Tobias Blanke, Conny Kristel and Laurent Romary. 2018. Crowds for Clouds: Recent Trends in Humanities Research Infrastructures. In Cultural Heritage Digital Tools and Infrastructures. Routledge.