ALMAnaCH organises regular seminars in NLP and digital humanities. Everyone is welcome!
Sign up to email@example.com to receive seminar announcements.
Seminars were organised by Djamé Seddah until July 2021 and since then are organised by Rachel Bawden.
The (Undesired) Attenuation of Human Biases by Multilinguality.
NLP Beyond the Top-100 Languages
Abstract: The availability of large multilingual pre-trained language models has opened up exciting pathways for developing NLP technologies for languages with scarce resources. In this talk I will summarize some of my group's recent work on the challenges of handling new, unseen languages through finetuning, proposing a phylogeny-based adapter solution. Last, as data is paramount for extending into new languages, I will discuss issues relating to data requirements and data representativeness.
Bio: Antonios Anastasopoulos is an Assistant Professor in Computer Science at George Mason University. He received his PhD in Computer Science from the University of Notre Dame with a dissertation on "NLP for Endangered Languages Documentation" and then did a postdoc at the Language Technologies Institute at Carnegie Mellon University. His research is on natural language processing with a focus on low-resource settings, endangered languages, and cross-lingual learning, and is currently funded by the National Science Foundation, the National Endowment for the Humanities, the DoD, Google, Amazon, Meta, and the Virginia Research Investment Fund.
DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon
Abstract: Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.
Roadmap to universal hate speech detection
Abstract: An increasing propagation of hate speech has been detected on social media platforms (e.g., Twitter) where (pseudo-)anonymity enables people to target others without being recognized or easily traced. While this societal issue has attracted many studies in the NLP community, it comes with three important challenges. Hate speech detection models should be fair, work on every language, and consider the whole context (e.g., imagery). Solving these challenges will revolutionize the field of hate speech detection and help create a "universal" model. In this talk, I will present my contributions in this area along with my take on future directions.
Bio: Debora Nozza is an Assistant Professor in Computing Sciences at Bocconi University. She was recently awarded a €120,000 grant from Fondazione Cariplo for her project MONICA, which will focus on monitoring coverage, attitudes, and accessibility of Italian measures in response to COVID-19. Her research interests mainly focus on Natural Language Processing, specifically on the detection and counter-acting of hate speech and algorithmic bias on Social Media data in multilingual context. She was one of the organizers of the task on Automatic Misogyny Identification (AMI) at Evalita 2018 and Evalita 2020, and one of the organizers of the HatEval Task 5 at SemEval 2019 on multilingual detection of hate speech against immigrants and women in Twitter.
Modeling Decentralized Group Coordination at Large Scale
Abstract: Understanding collective decision making at a large scale and elucidating how community organization and community dynamics shape collective behavior are at the heart of social science research. Communities are multi-faceted, complex and dynamic. In this talk I will present two approaches for learning community representations: a generic representation that could be used as an exploratory tool to find nuanced similarities between communities, and a task-oriented representation. Both representations combine multiple types of signals - textual and contextual, e.g., the (social) network structure and community dynamics. I will show how this multifaceted model can accurately predict large-scale collective decision-making in a distributed environment. We demonstrate the applicability of our model through Reddit's r/place - a large-scale online experiment in which millions of users, self-organized in thousands of communities, clashed and collaborated in an effort to realize their agenda.
Bio: Dr. Oren Tsur is an Assistant Professor (Senior Lecturer) at the Department of Software and Information Systems Engineering at Ben Gurion University in Israel where he heads the NLP and Social Dynamics Lab (NASLAB) and the newly founded interdisciplinary Research Center for Cyber Policy and Politics (a web page and a logo are coming soon :). His work combines Machine Learning, Natural Language Processing (NLP), Social Dynamics, and Complex Networks. Specifically, Oren’s work ranges from sentiment analysis to modeling speakers’ language preferences, hate-speech detection, community dynamics, and adversarial influence campaigns. Oren serves as an (S)Area Chair, editor and Senior Program Committee member in venues like ACL, EMNLP, WSDM and ICWSM and as a reviewer for journals ranging from TACL to PNAS and Nature. Oren’s work was published in top NLP and Web Science venues, most recently AAAI-22 and WWW-22.
A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation
Abstract: Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out in these datasets. This is primarily because many widely spoken languages are not well represented on the web and therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to optimally leverage existing pre-trained models to create low-resource translation systems for 16 African languages. We focus on two questions: 1) How can pre-trained models be used for languages not included in the initial pre-training? and 2) How can the resulting translation models effectively transfer to new domains? To answer these questions, we create a new African news corpus covering 16 languages, of which eight languages are not part of any existing evaluation dataset. We demonstrate that the most effective strategy for transferring both to additional languages and to additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data.
Cross-lingual RRG Parsing
Abstract: The work presented in this talk is joint work with Kilian Evang, Jakub Waszczuk, Kilu von Prince, Tatiana Bladier and Simon Petitjean. We consider the task of parsing low-resource languages in a scenario where parallel English data and also a limited seed of annotated sentences in the target language are available, as for example in bootstrapping parallel treebanks. We focus on constituency parsing using Role and Reference Grammar (RRG), a theory that has so far been understudied in computational linguistics but that is widely used in typological research, i.e., in particular in the context of low-resource languages. Starting from an existing RRG parser, we propose two strategies for low-resource parsing: first, we extend the parsing model into a cross-lingual parser, exploiting the parallel data in the high-resource language and unsupervised word alignments by providing internal states of the source-language parser to the target-language parser. Second, we adopt self-training, thereby iteratively expanding the training data, starting from the seed, by including the most confident new parses in each round. Both in simulated scenarios and with a real low-resource language (Daakaka), we find substantial and complementary improvements from both self-training and cross-lingual parsing. Moreover, we also experimented with using gloss embeddings in addition to token embeddings in the target language, and this also improves results. Finally, starting from what we have for Daakaka, we also consider parsing a related language (Dalkalaen) where glosses and English translations are available but no annotated trees at all, i.e., a no-resource scenario wrt. syntactic annotations. We start with a cross-lingual parser trained on Daakaka with glosses and use self-training to adapt it to Dalkalaen. The results are surprisingly good.
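The self-training strategy mentioned in this abstract can be pictured with a minimal Python sketch. It is not the speakers' parser: the one-dimensional nearest-centroid classifier, the function names (`fit_centroids`, `predict`, `self_train`) and the margin-based confidence are all assumptions chosen to keep the example small. Only the loop structure (train on the seed, label the pool, keep the most confident predictions, retrain) reflects the idea described above.

```python
def fit_centroids(labeled):
    """Toy 'model': the mean of the 1-D inputs seen for each label."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return (label, confidence); confidence is the distance margin
    between the nearest and the second-nearest centroid (hypothetical choice)."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, 0.0
    margin = abs(x - centroids[ranked[1]]) - abs(x - centroids[best])
    return best, margin


def self_train(seed, unlabeled, per_round=2, rounds=3):
    """Iteratively grow the training data from the seed by adding the
    most confident automatic predictions in each round."""
    labeled = list(seed)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        centroids = fit_centroids(labeled)
        scored = [(predict(centroids, x), x) for x in pool]
        # Most confident predictions first.
        scored.sort(key=lambda item: item[0][1], reverse=True)
        for (label, _conf), x in scored[:per_round]:
            labeled.append((x, label))
            pool.remove(x)
    return fit_centroids(labeled), labeled


seed = [(0.0, "a"), (10.0, "b")]
centroids, labeled = self_train(seed, [1.0, 9.0, 4.0, 6.0])
```

In this toy run the points closest to the seed examples are labelled first, and the less clear-cut points are only labelled in the second round, once the model has been retrained on the enlarged set.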
Knowledge Acquisition for Natural Language Processing
Abstract: Building AI systems involves acquiring background knowledge about the world. Historically, knowledge has first been encoded by experts directly into AI systems, then later acquired from massive amounts of data and statistical cues. Ongoing research, such as AI2's Aristo project, is aiming to acquire knowledge from declarative, textbook-style language. I present three distinct approaches to knowledge acquisition for natural language processing systems. (1) Under certain conditions, natural language processing models can understand and reason with rules and facts stated in natural language. (2) SchemaBlocks is an interface to elicit common sense knowledge from humans, based on event chains mined from text, or on free-form textual descriptions of scenarios of interest. (3) For the task of template extraction, finding the right formulation for natural language prompts is key to fully exploiting the knowledge contained in large-scale, pretrained language models. These three approaches differ in how knowledge is represented, how humans are involved, and if so, what their expertise is.
Revisiting Populations in Multi-agent Communication
Abstract: Despite evidence from cognitive sciences that larger groups of speakers tend to develop more structured languages in human communication, vanilla scaling up of populations has failed to yield significant benefits in emergent multi-agent communication. In this talk, I will reassess the validity of the standard protocol used to train these populations. Informed by an analysis of the population-level communication objective at the equilibrium, we advocate for an alternate population-level training paradigm for referential games based on the idea of "partitioning" the agents into sender-receiver pairs and limiting co-adaptation across pairs. We show that this results in optimizing a different objective at the population level, where agents maximize (1) their respective "internal" communication accuracy and (2) some measure of alignment between agents. In experiments, we find that agents trained in partitioned populations are able to communicate successfully with new agents which they have never interacted with and tend to develop a shared language. Moreover, we observe that larger populations tend to develop languages that are more compositional, which aligns better with existing work in sociolinguistics.
Bioinformatics-inspired methods for text corpora analysis
Abstract: In this talk, I will show how computer-assisted textual analysis can benefit from approaches developed in bioinformatics, more precisely comparative genomics and phylogenetics. I will quickly introduce a few problems in this field and show how algorithms developed to solve them can be adapted to textual data, highlighting similarities but also differences. Text comparison, as well as other text processing tasks, may benefit from ideas coming from the alignment of biological sequences at the nucleotide or gene level. More specifically, the idea of having a reference genome was useful in order to quickly build a database of poems by Marceline Desbordes-Valmore, allowing to explore, for example, the musical adaptations of her poetic works. Methods developed to reconstruct the tree of life, or to compare phylogenetic trees, can also be used to visualise texts, or to evaluate whether a chronological signal can be observed in the result of a hierarchical clustering of texts.
Contributors of these works include J.-C. Bontemps, L. Bulteau, A. Chaschina, E. Kogkitsidou, N. Lechevrel, D. Legallois, C. Martineau, T. Poibeau, J. Poinhos, O. Seminck, C. Trotot and J. Véronis.
No prerequisite in biology is required to attend this seminar.
Processing Natural Language to Extract, Analyze and Generate Knowledge and Arguments from Texts
Abstract: The long-term goal of the Natural Language Processing research area is to make computers/machines as intelligent as human beings in understanding and generating language, and thus able to speak, to make deductions, to ground on common knowledge, to answer, to debate, to support humans in decision making, to explain, and to persuade. Natural language understanding can come in many forms. In my research career so far I have put effort into investigating some of these forms, strongly connected to the actions I would like intelligent artificial systems to be able to perform. In my presentation, I will focus on some of the research challenges that I believe stand in the way of reaching this ambitious goal: (1) the detection of argumentative structures and the prediction of the relations among them in different textual resources such as political debates, medical texts, and social media content; (2) the detection of abusive language, taking advantage of both network analysis and the content of short-text messages on online platforms to detect cyberbullying phenomena.
Controllable Text Generation - Controlling Style and Content
Abstract: The 21st century is witnessing a major shift in the way people interact with technology and Natural Language Generation (NLG) is playing a central role. Users of smartphones and smart home devices now expect their gadgets to be aware of their social context, and to produce natural language responses in interactions. The talk provides deep learning solutions to control style and content in NLG. To control style, the talk presents two novel solutions: the Back-Translation approach and the Tag and Generate approach. To control content, the talk dives deep into understanding the task of document-grounded generation and proposes novel solutions for the task. The talk further presents a multi-stage prompting approach that uses pre-trained large language models for the knowledge-grounded dialogue response generation task.
Translatorship attribution with strong confounders and also how to make friends between TEI and NLP
Abstract: The first part of this talk is about translatorship attribution in the context of 19th century literary translations. The main challenge in translatorship attribution is the presence of confounding variables such as the genre or the style of the original author. I will discuss different regularization strategies and informed use of features. Additionally, I will present a novel approach that takes into account both the original and the translation.
The second part of the talk is about the technical prerequisites for conducting the aforementioned research on translatorship attribution. I will show how we created and published training data for OCR in an unclear copyright setting and how to conveniently use NLP methods on TEI-encoded documents with the help of the Standoffconverter Python package.
- Postponed -
Scaling NMT to Hundreds of Languages
Abstract: There are more than 7000 languages in the world, but only about 100 are currently handled by MT and other multilingual NLP tasks. While there is a lot of success in unsupervised MT, parallel data remains a very useful resource to train NMT systems.
A popular approach to mine for parallel data is to compare sentences in a multilingual embedding space and to decide whether they are parallel or not based on a threshold. In this talk, we present new techniques, based on a teacher-student framework, to train multilingual sentence encoders which were successfully applied to several low resource languages.
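The threshold-based mining idea described above can be illustrated with a small Python sketch. This is only a toy under stated assumptions: the hand-written vectors stand in for multilingual sentence embeddings, the `mine_pairs` function name and the 0.9 threshold are hypothetical, and plain cosine similarity is used rather than the teacher-student encoders or any scoring scheme presented in the talk.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mine_pairs(src_vecs, tgt_vecs, threshold=0.9):
    """For each source sentence, keep its nearest target sentence
    if their similarity reaches the threshold."""
    pairs = []
    for i, u in enumerate(src_vecs):
        best_j, best_sim = None, -1.0
        for j, v in enumerate(tgt_vecs):
            sim = cosine(u, v)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None and best_sim >= threshold:
            pairs.append((i, best_j, best_sim))
    return pairs


# Toy "embeddings": the third source sentence has no close counterpart.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.0, 2.0], [-1.0, 0.0]]
mined = mine_pairs(src, tgt, threshold=0.9)
```

Raising the threshold trades recall for precision: fewer candidate pairs are kept, but those that remain are more likely to be true translations.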
Analyzing Transformers Representations. A Linguistic Perspective
Abstract: Transformers have become a key component in many NLP models, arguably because of their capacity to uncover contextualized distributed representation of tokens from raw texts. Many works have striven to analyze these representations to find out whether they are consistent with models derived from linguistic theories and how they could explain their ability to solve an impressive number of NLP tasks.
In this talk, I will present two series of experiments falling within this line of research and aiming at highlighting the information flows within a Transformer network. The first series of experiments focuses on the long distance agreement task (e.g. between a verb and its subject), one of the most popular methods to assess neural networks’ ability to encode syntactic information. I will present several experimental results showing that transformers are able to build an abstract, high-level sentence representation rather than solely capturing surface statistical regularities. In a second series of experiments, I will use a controlled set of examples to investigate how gender information circulates in an encoder-decoder architecture considering both probing techniques as well as interventions on the internal representations used in the MT system.
Joint work with Bingzhi Li, Benoit Crabbé, Lichao Zhu, Nicolas Bailler and François Yvon
Multimodal and Multilingual Embeddings for Large-Scale Speech Mining
Abstract: We present an approach to encode a speech signal into a fixed-size representation which minimizes the cosine loss with the existing massively multilingual LASER text embedding space. Sentences are close in this embedding space, independently of their language and modality, either text or audio. Using a similarity metric in that multimodal embedding space, we perform mining of audio in German, French, Spanish and English from Librivox against billions of sentences from Common Crawl. This yielded more than twenty thousand hours of aligned speech translations. To evaluate the automatically mined speech/text corpora, we train neural speech translation systems for several language pairs. Adding the mined data achieves significant improvements in the BLEU score on the CoVoST2 and the MUST-C test sets with respect to a very competitive baseline. Our approach can also be used to directly perform speech-to-speech mining, without the need to first transcribe or translate the data. We obtain more than one thousand three hundred hours of aligned speech in French, German, Spanish and English. This speech corpus has the potential to boost research in speech-to-speech translation, which suffers from scarcity of natural end-to-end training data. All the mined multimodal corpora will be made freely available.
EMBEDDIA project and selected applications
Abstract: Newsrooms increasingly use and rely on AI tools for automatic text processing. However, these are mostly developed for major languages and that limitation continues to be a challenge. New tools allowing high quality transformations between languages and tools specifically adapted to low-resource environments are urgently needed. EMBEDDIA is a Horizon 2020 funded project consisting of a large European consortium of partners from academia, media and technology, which seeks to address this challenge. During the talk, we will overview the main achievements of the project and present some of the newly developed tools for comment filtering, keyword extraction and viewpoint detection.
Harnessing text generation
Abstract: Text generation is an active area of Natural Language Processing (NLP) research, covering tasks such as dialogue generation, machine translation (MT), summarisation, and story generation. Despite progress in current NLP methods (for example, powerful language generation models such as GPT-3), this task remains a challenge when the validity of outputs is crucial. This talk covers my work on the generation of synthetic medical text to address the data availability bottleneck for Biomedical NLP. I will also talk about my work on the exploration of supervised and unsupervised rewards for text generation with Reinforcement Learning and my work in simultaneous MT, which applies to incomplete source text and where the optimal integration of visual information is crucial to generate adequate outputs.
T7: Tech-Taxonomy with a Text To Text Transfer Transformer
Abstract: In this seminar, we will first explain why we need a terminological taxonomy for drafting and editing technological texts. Then we will explain how such a taxonomy can be compiled from existing ontologies and how different models such as TransE, LSTM, Transformers can be trained on a taxonomy to predict hypernyms and hyponyms. We will also demonstrate how this can eventually help to curate and extend the database, and thus be used in applications of paraphrase generation and text drafting.
This project has been carried out in cooperation between LISN (CNRS) and qatent.com at Inria’s Startup Studio.
Data quality for low-resource MT
Abstract: In this talk I will present the findings of a collaborative audit of multilingual corpora, with special attention for low-resourced languages. We will discuss the challenges that come with building such corpora, and the risks of using them without inspection. With a case study on a subset of African languages I will illustrate the implications of building machine translation on low-quality parallel data.
Practical Proposals for the Digital Edition of Modern French Texts
Abstract: Nearly a century ago, the literature of the Grand Siècle missed its encounter with Romance philology, which has not been without consequence for the quality of editions of texts that are nonetheless described as "classics": it is crucial that this mistake not be repeated with computational philology. Extending the well-known tradition of the Instructions pour la publication and other Règles pour l’édition, we wish to share some proposals for the digital edition of modern French texts. By presenting the processing pipeline we are developing, we will endeavour to give a practical dimension to our theoretical reflections on the ecdotic renewal that we are calling for.
Modelling, Synthesis and Editable Representation of Sign Languages
Abstract: Sign languages are fully fledged languages, gestural rather than vocal. The work presented here concerns their computational processing, a research field still in its infancy. Three strands will be presented, beginning with the formal representation of sign languages. We present the construction of an approach and a model (AZee) which, among other things, enables their synthesis by a virtual signer (3D avatar), the subject of the second strand. By way of outlook, a final part addresses the question of an editable form for the language, which has no written form. By observing spontaneous graphical productions by signers putting discourse in their language on paper, we were able to relate them to results obtained with AZee. We believe this opens a path towards defining an intuitive graphical representation system for sign language, and perhaps towards developing a writing system for it.
Since the study of sign languages is far more recent than that of their spoken or written counterparts, linguistic knowledge about them is more limited, and their automatic processing is only possible in an interdisciplinary way, advancing jointly on the linguistic and computational fronts. Advances in formal representation thus have implications for, or contrasts with, linguistics, and we will highlight some of them.
Self-Supervised Representation Learning for Pre-training Speech Systems
Abstract: Self-supervised learning from large amounts of unlabeled data has been successfully explored for image processing and natural language processing. Since 2019, recent work has also investigated self-supervised representation learning from speech, and has been notably successful in improving performance on downstream tasks such as speech recognition. This work suggests that it is possible to reduce dependence on labeled data for building speech systems through acoustic representation learning. In this talk I will present an overview of these recent approaches to self-supervised learning from speech and show my own investigations into using them in an end-to-end automatic speech translation (AST) task for which the size of training data is generally limited.
LECTAUREP : Lecture Automatique des Répertoires
Abstract: This talk gives a progress report on the LECTAUREP project, in which the ALMAnaCH team and the Archives Nationales have been collaborating since 2018. The aim of the project is to facilitate access to the very large corpus of registers of deeds of Parisian notaries through automatic transcription of handwriting and text mining. Beyond data collection, this collaboration is an opportunity to explore the methodological and infrastructural implications of such projects. (joint work with Laurent Romary)
Natural Language Generation: Training, Inference & Evaluation
Abstract: Recent advances in the field of natural language generation are undoubtedly impressive. Yet, little has changed from training and inference to evaluation. Models are learned with Teacher Forcing, inferred via Beam Search, and evaluated with BLEU or ROUGE. However, these algorithms suffer from many well-known limitations. How can these limitations be overcome? In this talk, we will present recently proposed methods that could be part of the solution, paving the way for a better NLG.
Joint work with Jacopo Staiano
CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters
Abstract: Due to the compelling improvements brought by BERT, many recent representation models adopted the Transformer architecture as their main building block, consequently inheriting the wordpiece tokenization system despite it not being intrinsically linked to the notion of Transformers. While this system is thought to achieve a good balance between the flexibility of characters and the efficiency of full words, using predefined wordpiece vocabularies from the general domain is not always suitable, especially when building models for specialized domains (e.g., the medical domain). Moreover, adopting a wordpiece tokenization shifts the focus from the word level to the subword level, making the models conceptually more complex and arguably less convenient in practice. For these reasons, we propose CharacterBERT, a new variant of BERT that drops the wordpiece system altogether and uses a Character-CNN module instead to represent entire words by consulting their characters. We show that this new model improves the performance of BERT on a variety of medical domain tasks while at the same time producing robust, word-level and open-vocabulary representations.
Joint work with Olivier Ferret, Thomas Lavergne, Hiroshi Noji, Pierre Zweigenbaum and Junichi Tsujii
Learning Sound Correspondences: What about Neural Networks?
Abstract: Cognate and proto-form prediction are key tasks in computational historical linguistics, which rely heavily on the identification of sound correspondences and could help low-resource translation. In the last two decades, combinations of sequence alignment, statistical models and clustering methods have emerged to try to solve these tasks. But where are the neural networks? In this talk, I will present my ongoing research investigating the learnability of sound correspondences between a proto-language and daughter languages by neural models.
I will introduce: (i) MEDeA, a Multiway Encoder Decoder Architecture inspired by NMT, (ii) EtymDB2.0, the etymological database that we updated to generate much needed data, (iii) our experiments on plausible artificial languages as well as on real languages.
Limits, open questions and current trends in Transfer Learning for NLP
Abstract: This talk is a subjective walk through my favorite papers and research directions in late-2019/early-2020. I’ll roughly cover the topics of model size and computational efficiency, model evaluation, fine-tuning, out-of-domain generalization, sample efficiency, common sense and inductive biases. The talk is adapted from the sessions I gave in early 2020 at the NLPL Winter School.
Can multilingual BERT transfer to an Out-of-Distribution dialect? A case study on North African Arabizi
Abstract: Building natural language processing systems for highly variable and low resource languages is a hard challenge. The recent success of large-scale multilingual pretrained language models provides us with new modeling tools to tackle it. In this talk, I will present my ongoing research in testing the ability of the multilingual version of BERT to model an unseen dialect. We take user-generated North African Arabic text as our case study. We show in different scenarios that multilingual language models are able to transfer to an unseen dialect, specifically in two extreme cases: across script (Arabic to Latin) and from Maltese, a distantly related language, unseen during pretraining.
Joint work with Benoît Sagot and Djamé Seddah.
Neural Semantic Role Labeling for French FrameNet With Deep Syntactic Information
Abstract: A recent graph-based neural architecture for semantic role labeling (SRL) developed by He et al. (2018) jointly predicts argument spans, predicates and the relations between them without using gold predicates as input features. Although it works well on PropBank-style data, this architecture makes some systematic mistakes when used on a more semantically oriented resource such as French FrameNet.
We adapt the system of He et al. (2018) to semantic role prediction for French FrameNet. In contrast to He et al. (2018), we do not predict the full spans of the arguments directly, but implement a two-step pipeline that first predicts the syntactic heads of the argument spans and then reconstructs the full spans using surface and deep syntax. While the idea of reconstructing the argument spans using syntactic information is not new, the novelty of our work lies in using deep syntactic dependency relations for the full span recovery. We obtain deep syntactic information using symbolic conversion rules described in Michalon et al. (2016). We present the results of the ongoing semantic role labeling experiments for French FrameNet and discuss the advantages and challenges of our approach.
Neural and Symbolic Representations of Speech and Language
Abstract: As end-to-end architectures based on neural networks became the tool of choice for processing speech and language, there has been increased interest in techniques for analyzing and interpreting the representations emerging in these models. A large array of analytical techniques have been proposed and applied to diverse architectures. Given that the developments in this field have been so fast, it is perhaps inevitable that some of them also turn out to be loose.
In this talk I firstly focus on one pitfall not always successfully avoided in work on neural representation analysis: the role of learning. In many cases non-trivial representations can be found in the activation patterns of randomly initialized, untrained neural networks. In past studies this phenomenon has not always been properly accounted for, which means that the results reported in them need to be reconsidered. Here I revisit the issue of the representations of phonology in neural models of spoken language.
Secondly I present two methods based on Representational Similarity Analysis (RSA) and Tree Kernels (TK) which allow us to directly quantify how strongly the information encoded in neural activation patterns corresponds to information represented by symbolic structures such as syntax trees. I first validate the methods on the case of a simple synthetic language for arithmetic expressions with clearly defined syntax and semantics, and show that they exhibit the expected pattern of results. I then apply these methods to correlate neural representations of English sentences with their constituency parse trees.
How to choose the test set size? Some observations on the evaluation of PoS taggers on the Universal Dependencies project
Abstract: This presentation questions the usual framework of statistical learning, in which the test and training sets are fixed arbitrarily and independently of the model considered. Taking the evaluation of PoS taggers on the UD project as an example, we show that, in many cases, it is possible to use smaller test sets than those generally available without hurting evaluation quality, and that the examples that have been 'saved' can be added to the training set to improve system performance, especially in the context of domain adaptation.
Crowdsourcing: a magnifying mirror on manual annotation
Abstract: Manual corpus annotation is at the heart of today's natural language processing: it not only provides the examples used to train machine learning tools, but also serves as the reference in evaluation campaigns. It is, in effect, where linguistics has taken refuge in the field. Yet it remains largely understudied. Approaching the subject through the prism of gamified crowdsourcing means looking at its hardest points in a magnifying mirror. The essential questions of production quality, of biases linked to tooling, and of annotator expertise are indeed magnified by numbers and distance. This magnifying effect makes experiments more complex, but also pushes us to imagine original solutions, which enrich thinking about traditional manual annotation and put the annotator back at the heart of the process.
Where’s My Head? Definition, Dataset and Models for Numeric Fused-Heads Identification and Resolution
Abstract: In this talk, I will describe our ongoing work on fused-heads. We provide the first computational treatment of fused-head (FH) constructions, focusing on numeric fused-heads (NFHs). FH constructions are noun phrases (NPs) in which the head noun is missing and is said to be “fused” with its dependent modifier. This missing information is implicit and is important for sentence understanding. The missing references are easily filled in by humans but pose a challenge for computational models. We frame the handling of FHs as a two-stage process: identification of the FH construction and resolution of the missing head. We explore the NFH phenomenon in large corpora of English text and create (1) a dataset and a highly accurate method for NFH identification; (2) a crowd-sourced dataset of 10k examples (1M tokens) for NFH resolution; and (3) a neural baseline for the NFH resolution task.
Adapting an Existing French Metagrammar for Old and Middle French
Abstract: Although many texts in Old French (9th-13th c.) and Middle French (14th-15th c.) are now available, only a few of them are annotated with dependency syntax. Our goal is to extend the already existing data, the Old French treebank SRCMF “Syntactic Reference Corpus of Medieval French” (Prévost and Stein 2013) to obtain an annotated corpus of one million words also covering Middle French.
These stages of French are subject to strong variation (language evolution, dialects, forms and domains) and are characterised by free word order as well as null subjects. To deal with these difficulties, we have opted for the formalism of metagrammars (Candito 1999), which offers a modular, constraint-based representation of syntactic phenomena through classes. More precisely, we are adapting the French Metagrammar (FRMG) (Villemonte de la Clergerie 2005) for Old and Middle French, because there are enough similarities between these stages of French. In this talk, we will present the processing chain developed by the ALMAnaCH team and the choices we made to adapt the metagrammar to earlier stages of a language.
Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them
Abstract: Word embeddings are widely used in the NLP community for a vast range of tasks. It was shown that word embeddings derived from text corpora reflect gender biases in society. This phenomenon is pervasive and consistent across different word embedding models, causing serious concern. Several recent works tackle this problem, and propose methods for significantly reducing this gender bias in word embeddings, demonstrating convincing results. However, we argue that this removal is superficial. While the bias is indeed substantially reduced according to the provided bias definition, the actual effect is only hiding the bias, not removing it. The gender bias information is still reflected in the distances between "gender-neutralized" words in the debiased embeddings, and can be recovered from them. We present a series of experiments to support this claim, for two debiasing methods. We conclude that existing bias removal techniques are insufficient, and should not be trusted for providing gender-neutral modeling.
French MultiWord Expressions representation and parsing
Abstract: Many NLP tasks, such as natural language understanding, require a representation of syntax and semantics in texts. MultiWord Expressions (MWEs), which can be described as sets of (not necessarily contiguous) tokens that exhibit some idiosyncratic properties (Baldwin and Kim, 2010), are, to quote Sag et al. (2001), "a pain in the neck for NLP". MWEs are difficult to predict as their syntactic behavior tends to be unpredictable: they can have an irregular internal syntax and a non-compositional meaning. MWE-aware NLP systems are also hard to evaluate because, until the recent PARSEME COST initiative (Savary et al., 2017), only a few corpora were annotated with MWEs (Laporte et al., 2008). I will first present my previous work on named entity recognition (Dupont et al., 2017), showing how named entities are related to MWEs, before delving deeper into MWEs.
I will present in more detail why they are a challenge and how we can represent them using metagrammars (Savary et al., 2018), more precisely within the FRMG framework of de la Clergerie (2010).
Joint work with Eric Villemonte de la Clergerie and Yannick Parmentier
Tackling Machine Translation of Noisy Text
Abstract: Despite their recent success, neural machine translation systems have proven to be brittle in the face of non-standard inputs that are far from their training domain. This is particularly salient for the kind of noisy, user-generated content ubiquitous on social media and the internet in general.
In this talk I will present MTNT, our first step towards remedying this situation: a testbed for Machine Translation of Noisy Text. MTNT consists of parallel Reddit comments in three languages (English, French, Japanese) exhibiting a large number of typos, grammar errors, instances of code switching and more. I will discuss the challenges of the collection process, preliminary MT experiments and the outlook for future work (with a sneak peek at ongoing follow-up research).
A Multi-Source Trainable Parser with Deep Contextualized Lexical Representations with Case Studies
Abstract: In this talk, we describe a multi-source trainable parser developed at Lattice for the CoNLL 2018 Shared Task (Multilingual Parsing from Raw Text to Universal Dependencies). The main characteristic of our work is the encoding of three different modes of contextual information for parsing: (i) treebank feature representations, (ii) multilingual word representations, and (iii) ELMo representations obtained via unsupervised learning from external resources. In the talk, we further investigate parsing low-resource languages with very small training corpora, using multilingual word embeddings and annotated corpora of larger languages. The study demonstrates that specific language combinations enable improved dependency parsing compared to previous work, allowing for wider reuse of pre-existing resources when parsing low-resource languages. The study also explores the question of whether contemporary contact languages or genetically related languages would be the most fruitful starting point for multilingual parsing scenarios.
Predictive processing in lexical and syntactic acquisition
Abstract: There is a general consensus in the field of language acquisition that infants use syntactic context to bootstrap their learning of the meaning of words. This is known as the syntactic bootstrapping hypothesis. For example, toddlers use the distributional information that articles tend to be followed by nouns (e.g., "la balle"), and pronouns tend to be followed by verbs (e.g., "elle saute"), to infer whether a novel word is likely to refer to an object or an action (e.g., "la dase" is likely to refer to an object and not an action). Previous modeling studies show that the distribution of syntactic contexts in the input is indeed a reliable cue to class membership. Thus, models that rely on frequent contexts show good categorization of unfamiliar words into nouns and verbs. My talk will focus on the question of whether children and infants can keep track of changes in the distribution of structures in their input, and update their predictions accordingly. I will present experimental results from my own studies with children, and suggest how we could model such effects.
Cooking entities at low heat: a receipt for entity disambiguation in scientific publications
Abstract: Entity ambiguity is a frequently encountered problem in digital publication libraries. Author/organisation disambiguation is one of the best-known use cases, but there are others. We present “entity-cooking”: a generic, machine learning-based framework for entity matching and disambiguation. Developed with the help of Patrice Lopez, it is a tool offering a reusable entity disambiguation engine that requires only “minimal” adaptations and is independent of any specific domain. Lightweight by design, it provides a standardised REST API and supports XML-TEI or PDF (via Grobid) as input data. This project started in 2016; as of today we have implemented an author/organisation disambiguation solution, we have produced a manually annotated corpus (including affiliation references) and we are investigating its application to geographical location and toponym resolution (SemEval 2019, Task 12).
The Web and Its Publics
Abstract: In this seminar, we will discuss the social and political consequences of the organization of digital media. We will consider the limits of a simplistic reading of the power-law distribution of online visibility and the hopes raised by the thematic clustering and the dynamism of the Web. We will also study the risks that these dynamics entail, exploring the causes of the recent proliferation of 'junk news'.
Building a Treebank for Naija, the English-based Creole of Nigeria.
Abstract: As an example of treebank development without pre-existing language-specific NLP tools, we will present the ongoing work of constructing a 750,000-word treebank for Naija. The annotation project, part of the NaijaSynCor ANR project, has a social dimension because the language, which is not fully recognized as such by the speakers themselves, is not yet institutionalized in any way. Yet Naija, spoken by close to 100 million speakers, could play an important role in the nation-building process of Nigeria. We will briefly present a few particularities of Naija such as serial verbs, reduplications, and emphatic adverbial particles. We used a bootstrapping process of manual annotation and parser training to enhance and speed up the annotation process. The annotation is done in the Surface-Syntactic Universal Dependencies scheme (SUD), which allows seamless transformation into Universal Dependencies (UD) by means of Grew (http://grew.fr/), a rule-based graph rewriting system. We will present the different tools involved in this process, and we will show a few preliminary quantitative measures on the annotated sentences.
New Resources and Ideas for Semantic Parsing
Abstract: In this talk, I will give an overview of research being done at the University of Stuttgart on semantic parser induction and natural language understanding. The main topic, semantic parser induction, relates to the problem of learning to map input text to full meaning representations from parallel datasets. The resulting “semantic parsers” are often a core component in various downstream natural language understanding applications, including automated question-answering and generation systems. We look at learning within several novel domains and datasets being developed in Stuttgart (e.g., software documentation for text-to-code translation) and under various types of data supervision (e.g., learning from entailment, “polyglot” modeling, or learning from multiple datasets).
Bio: Kyle Richardson is a finishing PhD student at the University of Stuttgart (IMS), working on semantic parsing and various applications thereof. Prior to this, he was a researcher in the Intelligent Systems Lab at the Palo Alto Research Center (PARC), and holds a B.A. from the University of Rochester, USA. He’ll be joining the Allen Institute for AI in November.
Historical text normalization with neural networks
Abstract: With the increasing availability of digitized historical documents, interest in effective NLP tools for these documents is on the rise. The abundance of variant spellings, however, makes them challenging to work with for both humans and machines. For my PhD thesis, I worked on automatic normalization—mapping historical spellings to modern ones—as a possible approach to this problem. I looked at datasets of historical texts in eight different languages and evaluated normalization using rule-based, statistical, and neural approaches, with a particular focus on tuning a neural encoder–decoder model. In this talk, I will highlight what I learned from different perspectives: Why, what, and how to normalize? How do the different approaches compare and which one should I use? And what can we learn from this about neural networks that might be useful for other NLP tasks?
Text readability assessment for second language learners
Abstract: In this talk, I will present our work on readability assessment for texts aimed at second language (L2) learners. I will discuss the approaches to this task and the features that we use in the machine learning framework. One of the major challenges in this task is the lack of sufficiently large level-annotated datasets for L2 learners, as most models are aimed at and trained on the large amounts of text available for native English speakers. I will give an overview of methods for adapting models trained on larger native corpora to estimate text readability for L2 learners. Once the readability level of a text is assessed, the text can be adapted (e.g., simplified) to the level of the reader. The first step in this process is the identification of words and phrases in need of simplification or adaptation. This task is called Complex Word Identification (CWI), and it has recently attracted much attention. In the second part of the talk, I will discuss the approaches to CWI and present our winning submission to the CWI Shared Task 2018.
Circadian dynamics of language: how big data makes it possible to probe new scales
Abstract: Linguistics has studied language dynamics at scales ranging from a few decades to a few millennia. In recent years, some studies based on online media, notably forums, have looked at scales on the order of a year, or even a month. What about even smaller scales? Can we observe phenomena on the order of a day? Of an hour? While chronobiology has shown that our cognitive abilities vary according to circadian rhythms, little has been said about language. Using data from Twitter, we will show that it is possible to observe linguistic dynamics at new scales, and we will highlight the existence of circadian rhythms in lexical usage.
Quality Evaluation of Machine Translation into Arabic
Abstract: In machine translation, automatically obtaining a reliable assessment of translation quality is a challenging problem. Several techniques for automatically assessing translation quality for different purposes have been proposed, but these are mostly limited to strict string comparisons between the generated translation and translations produced by humans. This approach is too simplistic and ineffective for languages with flexible word order and rich morphology such as Arabic, a language for which machine translation evaluation is still an under-studied problem, despite posing many challenges. In this talk, I will first introduce AL-BLEU, a metric for Arabic machine translation evaluation that uses a rich set of morphological, syntactic and lexical features to extend the evaluation beyond exact matching. We show that AL-BLEU has a stronger correlation with human judgments than the state-of-the-art classical metrics. Then, I will present a more advanced study in which we explore the use of embeddings obtained from different levels of lexical and morpho-syntactic linguistic analysis and show that they improve MT evaluation into Arabic. Our results show that using a neural-network model with different input representations produces results that clearly outperform the state of the art for MT evaluation into Arabic, with an increase of roughly 75% in correlation with human judgments on a pairwise MT quality evaluation task.
Socioeconomic dependencies of linguistic patterns in Twitter: Correlation and learning
Abstract: Our usage of language is not solely reliant on cognition but is arguably determined by myriad external factors, leading to a global variability of linguistic patterns. This issue, which lies at the core of sociolinguistics and is backed by many small-scale studies on face-to-face communication, is addressed here by constructing a dataset combining the largest French Twitter corpus to date with detailed socioeconomic maps obtained from the national census in France. We show how key linguistic variables measured in individual Twitter streams depend on factors like socioeconomic status, location, time, and the social network of individuals. We found that (i) people of higher socioeconomic status, who are more active during the daytime, use a more standard language; (ii) the southern part of the country is more prone to using standard language than the northern one, while locally the variety or dialect used is determined by the spatial distribution of socioeconomic status; and (iii) individuals connected in the social network are closer linguistically than disconnected ones, even after the effects of status homophily have been removed. In the second part of the talk we will discuss how linguistic information and the detected correlations can be used to infer socioeconomic status.