ALMAnaCH Seminars

ALMAnaCH regularly organizes seminars on NLP and digital humanities.
Subscribe to the almanach-seminar@inria.fr mailing list to receive seminar announcements and the connection link.

The seminars were organized by Djamé Seddah until July 2021 and have since been organized by Rachel Bawden.

Upcoming...

20 December 2024 at 11:00 - Big Blue Button
Presentation by Caio Corro, INSA Rennes

Named-Entity Recognition: Resurrecting Old School Machine Learning in the Era of Deep Learning

N.B. This seminar will be held in French 🇫🇷.
Abstract: In this talk, I will show that we can bridge old-school methods (finite-state automata and k-means) with neural networks to achieve SOTA results.

First, I will present my EMNLP 2024 paper [1] on discontinuous named-entity recognition, an overlooked setting in the literature. SOTA methods are based on complex pipelines with intricate neural architectures. I will show that, using a finite-state automaton, we can build a word tagging method that achieves competitive experimental results while being 40x-50x faster than SOTA. Unlike previous attempts to use word tagging in this setting, the proposed approach guarantees the well-formedness of predictions.

Second, I will present our COLING 2025 paper [2] on few-shot learning for named-entity recognition. Many approaches in this setting are based on variants of nearest neighbor classification. Unfortunately, they cannot leverage unlabeled data. We propose a novel approach for semi-supervised few-shot learning based on joint k-means and subspace selection. For named-entity recognition, a difficulty arises from the fact that most words are tagged with O (outside a mention): when we include a large amount of unlabeled data, the model can easily collapse to assigning tag O for all words. To prevent this issue, we include a ratio-constraint in the fine-tuning step.

[1] A fast and sound tagging method for discontinuous named-entity recognition (Caio Corro) https://arxiv.org/abs/2409.16243
[2] Few-shot domain adaptation for named-entity recognition via joint constrained k-means and subspace selection (Ayoub Hammal, Benno Uthayasooriyar, Caio Corro) https://arxiv.org/abs/2412.00426
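
As a toy illustration of the automaton idea in [1] (a minimal sketch, not the paper's model): a transition table over BIO tags rejects ill-formed sequences such as an I following an O, and Viterbi decoding returns the best well-formed tag sequence. The per-token scores below are random stand-ins for what a neural encoder would produce.

```python
import numpy as np

TAGS = ["O", "B", "I"]
# Transition automaton over BIO tags: which tag may follow which.
# Forbidding "I" after "O" (and at sentence start) guarantees well-formedness.
ALLOWED = {"O": {"O", "B"}, "B": {"O", "B", "I"}, "I": {"O", "B", "I"}}
START = {"O", "B"}

def viterbi(scores):
    """scores: (n_words, n_tags) array; returns the best well-formed tag sequence."""
    n, t = scores.shape
    best = np.full((n, t), -np.inf)
    back = np.zeros((n, t), dtype=int)
    for j, tag in enumerate(TAGS):
        if tag in START:
            best[0, j] = scores[0, j]
    for i in range(1, n):
        for j, tag in enumerate(TAGS):
            for k, prev in enumerate(TAGS):
                cand = best[i - 1, k] + scores[i, j]
                if tag in ALLOWED[prev] and cand > best[i, j]:
                    best[i, j], back[i, j] = cand, k
    path = [int(np.argmax(best[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(back[i, path[-1]])
    return [TAGS[j] for j in reversed(path)]

rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(6, 3))))  # always a well-formed BIO sequence
```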

Past seminars

15 November 2024 at 11:00 - Big Blue Button
Presentation by Raphaël Baena, Imagine Group, École des Ponts ParisTech

A General Framework for Text Line Detection and Recognition

Abstract: In this seminar, I will present a quick overview of my PhD research on transfer learning and generalization, followed by a detailed discussion of our recent NeurIPS paper on General Detection-based Text Line Recognition (DTLR). DTLR is a novel approach for recognizing text lines, whether printed or handwritten, across diverse scripts, including Latin, Chinese, and ciphered characters.
Most HTR methods rely on autoregressive decoding, predicting characters one after the other; DTLR instead casts recognition as character detection. Our method shows strong results across various scripts, even those typically addressed by specialized techniques. In particular, we achieve state-of-the-art performance for Chinese script recognition on the CASIA v2 dataset and for cipher recognition on the Borg and Copiale datasets. Finally, I will highlight several collaborative applications and extensions of this work with historians.

11 October 2024 at 11:00 - Big Blue Button
Presentation by Marine Carpuat, University of Maryland, USA

Beyond Translation: Human-Centered NLP for Cross-Lingual Communication

Abstract: How can we develop NLP technology to effectively support cross-lingual communication, especially given recent progress in machine translation and multilingual language models? In this talk, I will present two main threads of work that aim to broaden the scope of machine translation to more directly support people's needs.
In the first thread, I'll consider the difficulty people face when weighing the potential benefits of machine translation against the risks it may pose. This difficulty arises because users—who typically do not speak either the input or output language—often cannot assess translation quality. I will present results from a human study in medical settings, which highlights the strengths and weaknesses of state-of-the-art quality estimation techniques.
Next, I'll discuss how even accurate translations can fail when users lack background knowledge that is implied in the source language. I will introduce techniques for automatically generating explicitations that explain missing context by considering cultural differences between source and target audiences.
Throughout, I will discuss ongoing research directions aimed at developing human-centered NLP approaches for cross-lingual communication.

Download the slides here:

4 July 2024 at 11:00 - Big Blue Button
Presentation by Paolo Rosso, Universitat Politècnica de València, ValgrAI

Beyond the detection of fake news and explicit hate speech: conspiracy theories and implicit hate speech with stereotypes, jokes and sarcasm

Abstract: The rise of social media has offered a fast and easy way for the propagation of disinformation and conspiracy theories. Despite the research attention it has received, disinformation detection remains an open problem, and users keep sharing texts that contain false statements. In this talk I will comment on some studies on the detection of conspiracy theories. In the framework of the PAN Lab, we recently organised a challenge to discriminate between conspiracy narratives and critical thinking. Finally, I will address the other side of harmful information: hate speech. I will present the work done to analyse misogyny and sexism, also in memes, and the work done in collaboration with the Spanish observatory against racism and xenophobia. Moreover, I will briefly present a study of the usage of stereotypes against immigrants by members of the Spanish Congress of Deputies. Hate speech is often conveyed covertly, employing stereotypes and figurative language devices such as irony or sarcasm. I will finally show how hurtful humour is often employed in social media to spread prejudice towards women and feminists, the LGBTIQ community, immigrants and racially discriminated people, and overweight people.

21 June 2024 at 11:00 - Big Blue Button
Presentation by Nicolas Rollet, Télécom Paris

Human-machine interaction, talking AI and ethnomethodology: the devil is in the details

N.B. This seminar will be held in French 🇫🇷.
Abstract: Since 2015, Nicolas Rollet has been studying human-machine interactions, whether or not the machines are equipped with AI. The ethnomethodological approach, and its interactional counterpart, Conversation Analysis, offer analytical tools for examining in detail, and as practical accomplishments:

- what is "social" in a social interaction
- in what sense humans orient to an artificial agent as a social partner
- how the body and vision serve as resources for organizing social activities
- what talking AIs "lack" in order to speak naturally.

To discuss these points, several fieldwork settings will be drawn upon: human-robot interactions, remote video interactions in an emergency service, and interactions in a prenatal ultrasound practice.

Bio: Nicolas Rollet holds a PhD in language sciences (ILPGA, Sorbonne Nouvelle Paris 3, 2012) and specializes in the study of interaction in ordinary and professional contexts such as family gatherings, musical rehearsals, the use of digital libraries, emergency medical interaction, human-robot interaction and prenatal ultrasound sessions (Télécom Paris, SAMU-Centre 15, CNRS, BNF, CNR114). His work falls within the framework of ethnomethodology and conversation analysis, combined with an ethnographic sensibility. He is interested, among other things, in how language is integrated with the body and in the integration of technical devices into complex social activities. He has also been a member of the Encyclopédie de la parole collective since its creation in 2007, and in that capacity has been involved in the production of numerous works: performances, shows, sound installations, lectures (Kunsten Festival, Festival d'Automne, Palais de Tokyo, Théâtre de Montreuil, MAMCO Genève, Festival des Arts de la parole Bordeaux, KAAT Yokohama...). He is also the author of several prose texts (Leo Scheer, Les Petits Matins, Argol).

31 May 2024 at 11:00 - Big Blue Button
Presentation by Marco Bronzini, University of Trento

Unveiling LLMs: The Evolution of Latent Representations in a Temporal Knowledge Graph

Abstract: Large Language Models (LLMs) demonstrate an impressive capacity to recall a vast range of common factual knowledge. However, unravelling the underlying reasoning of LLMs and explaining their internal mechanisms for exploiting this factual knowledge remain active areas of investigation.
Our work analyzes the factual knowledge encoded in the latent representation of LLMs when prompted to assess the truthfulness of factual claims.
We propose an end-to-end framework that jointly decodes the factual knowledge embedded in the latent space of LLMs from a vector space to a set of ground predicates and represents its evolution across the layers using a temporal knowledge graph. Our framework relies on the technique of activation patching which intervenes in the inference computation of a model by dynamically altering its latent representations.
Consequently, we neither rely on external models nor training processes.
We showcase our framework with local and global interpretability analyses using two claim verification datasets: FEVER and CLIMATE-FEVER. The local interpretability analysis exposes different kinds of latent errors, from representation errors to multi-hop reasoning errors. The global analysis, on the other hand, uncovers patterns in the underlying evolution of the model's factual knowledge (e.g., store-and-seek factual information).
By enabling graph-based analyses of the latent representations, this work represents a step towards the mechanistic interpretability of LLMs. https://arxiv.org/abs/2404.03623
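
Activation patching, the intervention the framework relies on, can be sketched in a few lines of PyTorch (a minimal illustration on a toy network, not the authors' code): cache one layer's activation on a source input, then re-run on a target input with a forward hook swapping the cached activation in.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
layer = model[1]  # we will patch the post-ReLU hidden state
source, target = torch.randn(1, 4), torch.randn(1, 4)

# 1) Run on the source input and cache the chosen layer's activation.
cache = {}
handle = layer.register_forward_hook(
    lambda m, inp, out: cache.setdefault("h", out.detach()))
model(source)
handle.remove()

# 2) Re-run on the target input; the hook's return value replaces the activation.
handle = layer.register_forward_hook(lambda m, inp, out: cache["h"])
patched = model(target)
handle.remove()

print(model(target))  # clean run on the target input
print(patched)        # same run with the source activation patched in
```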

Download the slides here:

5 April 2024 at 11:00 - Big Blue Button
Presentation by Philippe Blache, CNRS

Studying the brain signal associated with language processing in conversation: limits, state of our knowledge, perspectives

N.B. This seminar will be held in French 🇫🇷.
Abstract: Understanding how language works requires studying it in its entirety. When this question is posed from a cognitive perspective, it moreover becomes necessary to study language in its natural context, typically that of conversation. What are the mechanisms that allow two individuals to encode, transmit and decode information during this type of interaction? In this presentation I propose to focus on the study of the cerebral bases of interaction: does the brain signal we observe teach us anything about language processing? Methods based on electroencephalography (the easiest imaging technique to deploy) essentially consist either in studying the potentials evoked by a temporally localized phenomenon, or, dynamically, in using correlation functions between linguistic and brain signals. It is possible, for example, to observe a large negative potential in the case of a semantic incongruity, a decrease in the brain's 8-12 Hz frequency band associated with preparing the answer to a question, or a correlation between the acoustic envelope of speech and the brain's oscillatory dynamics. These observations are, however, very narrowly focused, and the question now posed is whether such correlations can be sought during natural conversation. The problem here is twofold: (1) the limits of linguistic prediction models adapted to spoken language, and (2) the limits of methods for processing the brain signal in natural conditions, where it is extremely noisy. In this presentation I will describe these problems in more detail, the state of our knowledge (including methodological knowledge) for processing this type of signal, and some research directions for moving forward.
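
As a hedged illustration of the last kind of correlation mentioned above (speech envelope vs. brain dynamics), the following numpy/scipy sketch correlates a Hilbert-transform envelope with a synthetic "EEG" signal; real EEG pipelines are of course far more involved.

```python
import numpy as np
from scipy.signal import hilbert

fs, seconds = 100, 10                       # 100 Hz sampling, 10 s of signal
t = np.arange(fs * seconds) / fs
rng = np.random.default_rng(0)

# Synthetic "speech": noise carrying a 4 Hz (syllable-rate) amplitude modulation.
envelope = 1.0 + np.sin(2 * np.pi * 4 * t)
speech = envelope * rng.normal(size=t.size)

# Synthetic "EEG": partly driven by the same envelope, plus noise.
eeg = 0.6 * envelope + rng.normal(size=t.size)

# Recover the envelope with the Hilbert transform, then correlate the signals.
recovered = np.abs(hilbert(speech))
r = np.corrcoef(recovered, eeg)[0, 1]
print(f"envelope/EEG correlation: r = {r:.2f}")
```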

15 March 2024 at 11:00 - Big Blue Button
Presentation by Alexandra Birch, University of Edinburgh

Translation and Large Language Models

Abstract: What is the future of translation research in the era of large language models? Brown et al. in 2020 showed that prompting GPT-3 with a few examples of translation could result in translations of higher quality than the SOTA supervised models at the time (into English, and only for French and German). Until this point, research on machine translation had been central to the field of natural language processing, often attracting the most submissions in annual NLP conferences and leading to many breakthroughs in the field. Since then, there has been enormous interest in models which can perform a wide variety of tasks, and interest in translation as a separate sub-field has somewhat diminished. However, translation remains a compelling and widely used technology. So what is the promise of LLMs for translation and how should we best use them? What opportunities do LLMs unlock and what challenges remain? How can the field of translation still contribute to NLP? I will touch on some of my own research but I focus on these broader questions.

Download the slides here:

9 February 2024 at 11:00 - Big Blue Button
[Collège de France] Presentation by Yann Le Cun, Meta, New York University

Objective-driven AI: towards machines that can learn, reason and plan

This is a live broadcast of a seminar given as part of Benoît Sagot's annual chair at the Collège de France. More information here

N.B. This seminar will be held in French 🇫🇷.
Abstract: How could machines learn as efficiently as humans and animals? How could machines learn how the world works and acquire common sense? How could machines learn to reason and plan?
Current AI architectures, such as large-scale auto-regressive language models, fall short. I will propose a modular cognitive architecture that may constitute a path towards answering these questions. The centrepiece of the architecture is a predictive world model that allows the system to predict the consequences of its actions and to plan sequences of actions that optimize a set of objectives. The objectives include guardrails that guarantee the system's controllability and safety. The world model uses a Hierarchical Joint Embedding Predictive Architecture (H-JEPA) trained with self-supervised learning. The JEPA architecture learns abstract representations of percepts that are simultaneously maximally informative and maximally predictable.

2 February 2024 at 11:00 - Big Blue Button
[Collège de France] Presentation by Philippe Blache, CNRS

To predict is to understand: a neuro-cognitive model of language based on prediction

This is a live broadcast of a seminar given as part of Benoît Sagot's annual chair at the Collège de France. More information here

N.B. This seminar will be held in French 🇫🇷.
Abstract: Mutual understanding during a conversation is an extremely fast and efficient process: we can process three words per second, often more. This observation, however, does not square with laboratory experiments showing that processing a single word can take up to a second. The speed of processing is explained by our ability to predict what the interlocutor is going to say, in a certain sense much as language models do. Today, there is no global model that integrates this prediction-based facilitation phenomenon into a classical architecture of language processing (from phonetics to semantics via syntax). I will present the foundations of such a model, which explains how shallow processes (facilitation effects) and deep processes (in case of difficulty) coexist. This architecture rests on a central mechanism, prediction, which I will describe from both a computational and a neurolinguistic point of view. The approach builds on results obtained within recent theories in cognitive science ("prediction-by-production") and neuroscience ("predictive coding"), which suggest that participants in a conversation use the same mechanism to produce and to understand speech.

26 January 2024 at 11:00 - Big Blue Button
[Collège de France] Presentation by Elena Cabrio, Université Côte d’Azur, Inria, CNRS, I3S

Automatic analysis of argumentation in political debates

This is a live broadcast of a seminar given as part of Benoît Sagot's annual chair at the Collège de France. More information here

N.B. This seminar will be held in French 🇫🇷.
Abstract: Political debates offer citizens a unique opportunity to gauge where political representatives stand on the most controversial topics of the day. Given the active contributions of the various actors in political life, these debates constitute a source of information that must be capitalized on in order to better understand societal dynamics. Given their inherently argumentative quality, these exchanges are a fitting application scenario for computational argument-extraction methods. Argument mining is a research area within natural language processing whose goal is the automatic extraction and identification of the argumentative structures of a natural-language text by means of computer programs. Analysing argumentative structures is a complex task that involves studying argumentation components and schemes, the relations between arguments, and counter-argumentation strategies. In this talk, I will detail the steps required to automate the analysis of political discourse using argument-mining methods. I will first present approaches dedicated to identifying argumentative structures and their relations. I will then describe the strategies deployed for automatically identifying fallacious arguments, notably through the analysis of different forms of argumentation and the detection of strategic manoeuvring in argumentative discourse.

19 January 2024 at 11:00 - Big Blue Button
[Collège de France] Presentation by Claire Gardent, CNRS

Text generation from knowledge

This is a live broadcast of a seminar given as part of Benoît Sagot's annual chair at the Collège de France. More information here

N.B. This seminar will be held in French 🇫🇷.
Abstract: Text generation can target different types of language and take different types of knowledge as input. In this presentation, I will show how to adapt neural language models to generate text from semantic representation graphs, from knowledge graphs and from multiple documents. The neural architectures presented will also illustrate how to generate, from the same source, texts either in twenty-one languages of the European Union or in so-called under-resourced languages such as Breton, Welsh and Irish. Finally, work on generating Wikipedia biographies from multiple documents will highlight the impact of data biases on the quality of the generated texts. The work presented was carried out within the xNLG AI chair (multilingual and multi-source text generation), co-funded by the ANR, Meta and the Grand-Est region.

12 January 2024 at 11:00 - Big Blue Button
[Collège de France] Presentation by François Yvon, CNRS

Massively multilingual neural machine translation

This is a live broadcast of a seminar given as part of Benoît Sagot's annual chair at the Collège de France. More information here

N.B. This seminar will be held in French 🇫🇷.
Abstract: The development of architectures exploiting "deep" neural learning methods in machine translation has led to a considerable increase in the acceptability and usability of machine-computed translations. These new architectures have also made it possible to build machine translation systems that go beyond the usual setting of translating a source-language text into a target-language text: direct speech translation, joint translation of text and images, and so on. In this talk, I will present one such system, designed to translate from multiple source languages into multiple target languages, highlighting the computational and linguistic benefits that these multilingual translation systems bring, in particular for translating from and into minority languages.

22 December 2023 at 11:00 - Big Blue Button
[Collège de France] Presentation by Emmanuel Dupoux, Meta, EHESS

Learning a language model from audio

This is a live broadcast of a seminar given as part of Benoît Sagot's annual chair at the Collège de France. More information here

N.B. This seminar will be held in French 🇫🇷.
Abstract: The spoken modality is the most natural channel for linguistic interactions, but current language technologies (NLP) are mostly based on writing, requiring large quantities of text to develop language models. Even voice assistants and speech translation systems use text as an intermediary, which is inefficient and restricts the technology to languages with substantial textual resources. Moreover, this neglects characteristics of speech such as rhythm and intonation. Yet children manage to learn their native language(s) well before learning to read or write.
In this presentation, we will discuss recent advances in audio representation learning that open the way to NLP applications built directly from speech, without any text. These models can capture the nuances of spoken language, including in dialogue. We will also discuss the technical challenges that remain to be addressed in order to reproduce a learning process approaching that of the human infant.

15 December 2023 at 11:00 - Big Blue Button
[Collège de France] Presentation by Guillaume Jacques, CNRS, EPHE

Two examples of the use of transducers in linguistics

This is a live broadcast of a seminar given as part of Benoît Sagot's annual chair at the Collège de France. More information here

N.B. This seminar will be held in French 🇫🇷.
Abstract: Transducers are a valuable tool for several distinct areas of linguistics. In morphology, they make it possible to produce explicit and consistent descriptions of morphological paradigms, for well-resourced languages as well as for languages with an oral tradition. In historical linguistics, they can be used to model sound changes and to automatically reconstruct proto-forms from attested languages. This presentation will illustrate both types of application and show the benefits they can bring to these disciplines.
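
As a toy illustration of the second application (modelling sound change), the sketch below applies an ordered cascade of invented, context-sensitive rewrite rules using plain regular expressions; a real implementation would use an FST toolkit such as HFST or foma, and real rules would come from historical reconstruction.

```python
import re

# Invented sound changes, applied in chronological order:
# 1) p > b between vowels   2) word-final a > e   3) k > ch before i
RULES = [
    (r"(?<=[aeiou])p(?=[aeiou])", "b"),
    (r"a$", "e"),
    (r"k(?=i)", "ch"),
]

def evolve(proto: str) -> str:
    """Run a proto-form through the ordered cascade of rewrite rules."""
    for pattern, replacement in RULES:
        proto = re.sub(pattern, replacement, proto)
    return proto

for word in ["kapa", "kipu", "paki"]:
    print(word, "->", evolve(word))  # kapa -> kabe, kipu -> chibu, paki -> pachi
```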

8 December 2023 at 11:00 - Big Blue Button
[Collège de France] Presentation by Daniel Stoekl Ben Ezra¹ & Jean-Baptiste Camps², ¹EPHE-PSL; ²École nationale des chartes, Université PSL

Some examples of applying NLP to the digital humanities

This is a live broadcast of a seminar given as part of Benoît Sagot's annual chair at the Collège de France. More information here

N.B. This seminar will be held in French 🇫🇷.
Abstract: Natural language processing and computational humanities: artificial intelligence in the service of the past
This talk will present use cases of natural language processing methods in the humanities, particularly in textual scholarship and the philology of ancient and medieval texts in French and Hebrew. We will begin with the use of text/image alignment techniques, which facilitate the supervised creation of ground-truth data for handwritten text recognition, help resolve abbreviations, and aid in reconstructing the copies of a single text. We will continue with the challenges posed by the normalization and lemmatization of older language stages, which exhibit substantial spelling variation, while showing how this can then serve for detecting intertextuality, or for applying stylometric methods to identify the authors of anonymous or disputed texts. Finally, we will show how natural language processing and artificial intelligence can be put to work in building and analysing vast corpora over a long diachrony, and how these can then be analysed using methods such as word and document embeddings or large language models to track major thematic developments over time.

17 November 2023 at 11:00 - Big Blue Button
Presentation by Sara Budts, University of Antwerp

Modelling the past: the use of digital text analysis techniques for historical research

Abstract: This seminar illustrates the benefits, caveats and shortcomings of using Natural Language Processing techniques to answer historical research questions, by means of two recent projects that sit at the interface between the digital and the historical.
The first project explores discursive patterns in lottery rhymes produced in the late medieval and early modern Low Countries, with a focus on the rhymes used by women. The lottery was a popular fundraising event in the Low Countries. Lottery rhymes, personal messages attached to the lottery tickets, provide a valuable source for historians. We collected more than 11,000 digitized short texts from five lotteries held from 1446 to 1606 and used GysBERT, a language model of historical Dutch, to identify distinctly male and female discourses in the lottery rhymes corpus. Although the model pointed us to some interesting patterns, it also showed that male and female lottery rhymes do not radically differ from each other. This is consistent with insights from premodern women's history, which stress that women worked within societal, and in this case literary, conventions, sometimes subverting them, sometimes adapting them, sometimes adopting them unchanged. This research results from a collaboration with Marly Terwisscha van Scheltinga and Jeroen Puttevils.
The second project is more practical in nature and addresses the design and implementation of a Named Entity Recognition (NER) system for the Johnson Letters, a correspondence of about 800 letters written by and to the English merchant John Johnson, all dated between 1542 and 1552. Due to the historical nature and relatively small size of the dataset, the letters required a tailored approach to NER-tagging. After manually annotating about 100 letters as ground truth, we set up experiments with Conditional Random Field (CRF) models as well as fine-tuned transformer-based models using the bert-base-NER, hmBERT, and MacBERTh pre-trained language models. Results were compared across all model types. CRF models performed competitively, with combined sampling techniques proving effective for named entities with few training examples. Fine-tuned bert-base-NER and hmBERT models performed better than MacBERTh models, despite the latter language model's pre-training on EModE data. This project was carried out in collaboration with MA student Patrick Quick.
Drawing on insights from these two projects, the talk will conclude with a brief discussion of the usefulness of NLP-methodologies for historical research more generally.

Download the slides here:

20 October 2023 at 11:00 - Big Blue Button
Presentation by Biswesh Mohapatra, Inria

Conversational Grounding in Dialog Systems

Abstract: In linguistics, Clark and Brennan propose the concept of "common ground": the mutual knowledge and mutual assumptions that are essential for successful communication. This common ground is accumulated over the course of a conversation and is built via words, of course, but also through the use of other modalities: pointing to objects in the environment, nodding to indicate that one has understood, and staring at the speaker to indicate that one needs more information. This interactive process of building a common ground during a conversation, by making sure that the interlocutors have understood the information being exchanged, is called conversational grounding. Utterances carry an underlying uncertainty which is negotiated and resolved by the participants before the information is added to the shared common ground. Today's dialog systems use language models extensively for processing and generating utterances. However, previous work has shown a lack of conversational grounding capabilities in language models. In this talk, I'll provide an overview of conversational grounding, highlighting its significance and the challenges it presents. I will share our efforts in annotating datasets to study and model Grounding Acts and Grounding Units, then describe how we built test cases from these annotations and discuss our findings. The talk will conclude with a brief mention of our ongoing work on storing and representing grounded information.

16 June 2023 at 11:00 - Big Blue Button
Presentation by David Bamman, University of California, Berkeley

Measuring Representation in Culture

Abstract: Much work in cultural analytics has examined questions of representation in narrative, whether through the deliberate process of watching movies or reading books and counting the people who appear on screen, or by developing algorithmic measuring devices to do so at scale. In this talk, I'll explore the use of NLP and computer vision to capture the diversity of representation in both contemporary literature and film, along with the challenges and opportunities that arise in this process. This includes not only the legal and policy challenges of working with copyrighted materials, but also the opportunities that arise for aligning current methods in NLP with the diversity of representation we see in contemporary narrative; toward this end, I'll highlight models of referential gender that align characters in fiction with the pronouns used to describe them (he/she/they/xe/ze/etc.) rather than inferring an unknowable gender identity.

28 April 2023 at 11:00 - Big Blue Button
Presentation by Caio Corro, LISN, Université Paris-Saclay

Graph-based semantic parsing, compositional generalization and loss functions

N.B. This seminar will be held in French 🇫🇷.
Abstract: Semantic parsing aims to transform a natural language utterance into a structured representation that can be easily manipulated by software (for example to query a database). As such, it is a central task in human-computer interfaces. It has recently been observed that sequence-to-sequence models struggle in settings that require compositional generalization. On the contrary, previous work has shown that span-based parsers are more robust.
In this talk, I will present our work on reentrancy-free semantic parsing. We proposed a novel graph-based formulation of this problem which addresses the search-space issue of span-based models. We proved that both MAP inference and latent tag anchoring (required for weakly-supervised learning) are NP-hard problems. Therefore, we developed approximation algorithms based on combinatorial optimization techniques. Our approach delivers novel state-of-the-art results on standard benchmarks that test for compositional generalization.
Finally, I will discuss the theoretical limitations of token-separable losses that are commonly used in the literature (and in this work!) to bypass expensive computation of the log-partition function, followed by ongoing work on weakly-supervised learning.
The research presented in this talk is joint work with Alban Petit and Emile Chapuis.

Download the slides here:

31 March 2023 at 11:00 - Big Blue Button
Presentation by Nathan Godey & Roman Castagné, Inria (ALMAnaCH)

MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling

Abstract: Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in important flaws that degrade the models' downstream performance and robustness. In this work, we propose MANTa, a Module for Adaptive Neural TokenizAtion. MANTa is a differentiable tokenizer trained end-to-end with the language model. The resulting system offers a trade-off between the expressiveness of byte-level models and the speed of models trained using subword tokenization. In addition, our tokenizer is highly explainable since it produces an explicit segmentation of sequences into blocks. We evaluate our pre-trained model on several English datasets from different domains as well as on synthetic noise. We find that MANTa improves robustness to character perturbations and out-of-domain data. We then show that MANTa performs comparably to other models on the general-domain GLUE benchmark. Finally, we show that it is considerably faster than strictly byte-level models.

17 March 2023 at 11:00 - Big Blue Button
Presentation by Jitao Xu, NetEase Youdao

Writing in two languages: Neural machine translation as an assistive bilingual writing tool

Abstract: In a globalized world, more and more situations arise where people need to express themselves in a foreign language. However, for many people, writing in a foreign language is not an easy task. Users may therefore seek assistance from computer-aided translation or writing technologies. Existing studies have mainly focused on generating texts only in the foreign language. We suggest that showing corresponding texts in the user's mother tongue can also help users verify the composed texts with synchronized bitexts. In this work, we study techniques to build bilingual writing assistant systems that allow free composition in both languages and display synchronized monolingual texts in the two languages. We introduce two types of simulated interactive systems. The first solution allows users to compose mixed-language texts, which are then translated into their monolingual counterparts. We propose a dual decoder Transformer model to simultaneously produce texts in two languages. The second design aims to extend commercial online translation systems by letting users freely alternate between the two languages at will. We introduce a general bilingual synchronization task and experiment with autoregressive and non-autoregressive synchronization systems.
The demo can be found here: https://github.com/jmcrego/BiSync

Download the slides here:

24 February 2023 at 11:00 - Big Blue Button
Presentation by Thibault Clérice, PSL, ENS, Lattice

Lemmatization and semantic classification in a Latin corpus over a long diachrony

N.B. This seminar will be held in French 🇫🇷.
Abstract: In the history of societies and languages, building thematic corpora is one of the most time-consuming tasks: while searching for occurrences of explicit terms takes little time, tracking down figurative forms quickly becomes difficult to carry through. Addressing this problem raises three issues: (1) corpus acquisition; (2) the grammatical interpretation of the corpus; and finally (3) sentence classification. While the acquisition of Latin corpora raises problems specific to ancient corpora (pre-18th century), their morphosyntactic analysis is a major challenge, as the language is complex both in synchrony (morphological richness) and in diachrony (spelling variation; influence of Greek, Hebrew and later Germanic languages). We will then present an experiment on the semantic detection of sexuality in a Latin corpus spanning the 3rd century BCE to the 9th century CE, using contemporary classification techniques (CNNs, RNNs, etc.). By varying the characteristics of the training data (size, explicit vs. implicit, etc.), we show that some of these architectures yield promising results and could support the production of thematic corpora.

Download the slides here:

17 February 2023 at 11:00 - Big Blue Button
Presentation by Perceval Wajsbürt & Romain Bey, APHP

Tools for processing clinical reports in clinical data warehouses

N.B. This seminar will be held in French 🇫🇷.
Abstract: Textual medical reports are a rich source of information, but they can be difficult to exploit owing to the variety of extraction needs and the large amount of data held in clinical data warehouses (CDWs). Moreover, the algorithms deployed to process these data can produce different results depending on their implementation, whereas reproducibility is critical in research and medicine. We present our work on EDS-NLP, an open-source library for French clinical natural language processing (NLP). Its aim is to offer a simple framework for processing large quantities of textual data, to provide efficient and well-tested algorithms, and to simplify the sharing of NLP algorithms via GitHub. The library offers several customizable features such as text cleaning, the extraction of various variables, dates and terminology synonyms, and attribute detection (negation, family context, hypothesis, etc.). We also present our clinical text pseudonymization project as a demonstration of this work. Finally, since obtaining good-quality text is a critical step in exploiting CDW reports, we present our modelling work on text body extraction and the EDS-PDF library, which aims to facilitate text extraction from clinical PDF documents.

Download the slides here:

3 February 2023 at 11:00 - Big Blue Button
Presentation by Chloé Clavel, Institut Polytechnique de Paris, Telecom-Paris, Social Computing Team

Socio-conversational AI: integrating the socio-emotional component in neural models

Abstract: A single lack of social tact on the part of a conversational system (chatbot, voice assistant, social robot) can cause the user's trust in and engagement with the interaction to drop. This lack of social intelligence affects the willingness of a large audience to view conversational systems as acceptable. To understand the state of the user, the affective/social computing research community has drawn on research in artificial intelligence and the social sciences. In recent years, however, the trend has shifted towards a monopoly of deep learning methods, which are quite powerful but opaque, greedy for annotated data, and less suitable for integrating social science knowledge. I will present the research we are doing within the Social Computing team at Telecom-Paris to develop machine/deep learning models for modelling the social component of interactions. In particular, I will focus on research aimed at improving the explainability of the models as well as their transferability to new data and new socio-emotional phenomena.

Download the slides here:

13 January 2023 at 11:00 - Big Blue Button
Presentation by Liane Guillou, University of Edinburgh

Temporality and Modality in Entailment Graph Learning

Abstract: The ability to recognise textual entailment and paraphrase is crucial in many downstream tasks, including Open-domain Question Answering from text. Entailment Graphs, constructed via unsupervised learning techniques over large text corpora, provide a solution to learning and encoding this information. Entailment Graphs comprise nodes representing linguistic predicates and edges representing the entailment relations between them.
Despite recent progress, methods for learning Entailment Graphs produce many spurious entailment relations. In this talk I propose the incorporation of temporal information and the linguistic modality of predicates as signals for refining the entailment graph learning process. I will show that these phenomena are useful in disentangling highly correlated but contradictory predicates, such as "winning" and "losing".
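
The underlying data structure is easy to picture; in this hedged sketch (toy predicates, not the learned graphs from the talk), nodes are predicates, directed edges are entailments, and queries follow edges transitively:

```python
from collections import deque

# Toy entailment graph: an edge p -> q reads "p entails q".
EDGES = {
    "win(x, y)":  ["play(x, y)", "compete-in(x, y)"],
    "lose(x, y)": ["play(x, y)", "compete-in(x, y)"],
    "play(x, y)": ["participate-in(x, y)"],
}

def entails(p: str, q: str) -> bool:
    """Breadth-first search: does p entail q, directly or transitively?"""
    seen, queue = {p}, deque([p])
    while queue:
        current = queue.popleft()
        if current == q:
            return True
        for successor in EDGES.get(current, []):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False

print(entails("win(x, y)", "participate-in(x, y)"))  # True, via play(x, y)
print(entails("win(x, y)", "lose(x, y)"))  # False: correlated but contradictory
```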

Bio: Liane Guillou is a postdoctoral researcher at the University of Edinburgh, and a member of the EdinburghNLP group. The central themes of her research are designing NLP systems with a strong awareness of linguistic context and developing novel evaluation datasets to challenge these systems. Liane's current research interests include Entailment Graphs, uncertainty detection, and Machine Translation evaluation. Liane was awarded her PhD from the University of Edinburgh in 2016, for her thesis on Incorporating Pronoun Function into Statistical Machine Translation. She also holds an MSc in Artificial Intelligence from the University of Edinburgh, and a BSc in Computer Science from the University of Warwick. Liane has previously worked as a data scientist on detecting hate speech and threats of violence on social media, as a postdoctoral researcher at LMU Munich, and as a visiting researcher at the University of Uppsala.

Download the slides here:

16 December 2022 at 11:00 - Big Blue Button
Presentation by Cristina España-Bonet, DFKI GmbH, Germany

The (Undesired) Attenuation of Human Biases by Multilinguality

Abstract: Some human preferences are universal. The odor of vanilla is perceived as pleasant all around the world. We expect neural models trained on human texts to exhibit these kinds of preferences, i.e. human biases, but we show that this is not always the case. In this talk, I will explore 16 static and contextual embedding models in 9 languages and, when possible, compare them under similar training conditions. I will also introduce and motivate CA-WEAT, multilingual culturally aware tests to quantify biases, and compare them to previous English-centric tests. Our experiments confirm that monolingual static embeddings do exhibit human biases, but values differ across languages, being far from universal. Biases are less evident in contextual models, to the point that the original human associations might be reversed. Multilinguality proves to be another variable that attenuates and even reverses the effect of the bias, especially in contextual multilingual models. In order to explain this variance among models and languages, we examine the effect of asymmetries in the training corpus, departures from isomorphism in multilingual embedding spaces and discrepancies in the testing measures between languages.
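
A WEAT-style association score of the kind the CA-WEAT tests build on is simple to compute; the sketch below uses random stand-in vectors (a real test would use actual embeddings and curated word lists):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in embeddings; a real test would load vectors from a trained model.
emb = {w: rng.normal(size=50)
       for w in ["flower", "insect", "pleasant", "unpleasant"]}

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(word, A, B):
    """WEAT association s(w, A, B): mean cosine to A minus mean cosine to B."""
    return (np.mean([cos(emb[word], emb[a]) for a in A])
            - np.mean([cos(emb[word], emb[b]) for b in B]))

A, B = ["pleasant"], ["unpleasant"]
for target in ["flower", "insect"]:
    print(target, round(association(target, A, B), 3))
```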

Download the slides here:

18 November 2022 at 11:00 - Big Blue Button
Presentation by Antonis Anastasopoulos, George Mason University

NLP Beyond the Top-100 Languages

Abstract: The availability of large multilingual pre-trained language models has opened up exciting pathways for developing NLP technologies for languages with scarce resources. In this talk I will summarize some of my group's recent work on the challenges of handling new, unseen languages through finetuning, proposing a phylogeny-based adapter solution. Lastly, as data is paramount for extending into new languages, I will discuss issues relating to data requirements and data representativeness.

Bio: Antonios Anastasopoulos is an Assistant Professor in Computer Science at George Mason University. He received his PhD in Computer Science from the University of Notre Dame with a dissertation on "NLP for Endangered Languages Documentation" and then did a postdoc at the Language Technologies Institute at Carnegie Mellon University. His research is on natural language processing with a focus on low-resource settings, endangered languages, and cross-lingual learning, and is currently funded by the National Science Foundation, the National Endowment for the Humanities, the DoD, Google, Amazon, Meta, and the Virginia Research Investment Fund.

Download the slides here:

21 October 2022 at 11:00 - Big Blue Button
Presentation by Robin Algayres, ENS/PSL, Inria Paris and Meta AI Research

DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

Abstract: Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state of the art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.

7 October 2022 at 11:00 - Big Blue Button
Presentation by Debora Nozza, Bocconi University

Roadmap to universal hate speech detection

Abstract: An increasing propagation of hate speech has been detected on social media platforms (e.g., Twitter), where (pseudo-)anonymity enables people to target others without being recognized or easily traced. While this societal issue has attracted many studies in the NLP community, it comes with three important challenges. Hate speech detection models should be fair, work for every language, and consider the whole context (e.g., imagery). Solving these challenges will revolutionize the field of hate speech detection and help create a "universal" model. In this talk, I will present my contributions in this area along with my takes on future directions.

Bio: Debora Nozza is an Assistant Professor in Computing Sciences at Bocconi University. She was recently awarded a €120,000 grant from Fondazione Cariplo for her project MONICA, which will focus on monitoring coverage, attitudes, and accessibility of Italian measures in response to COVID-19. Her research interests mainly focus on Natural Language Processing, specifically on the detection and counteracting of hate speech and algorithmic bias on social media data in a multilingual context. She was one of the organizers of the task on Automatic Misogyny Identification (AMI) at Evalita 2018 and Evalita 2020, and one of the organizers of the HatEval Task 5 at SemEval 2019 on multilingual detection of hate speech against immigrants and women in Twitter.

19 August 2022 at 11:00 - Inria, Paris
Presentation by Oren Tsur, NLP and Social Dynamics Lab, Ben Gurion University

Modeling Decentralized Group Coordination at Large Scale

Abstract: Understanding collective decision making at large scale, and elucidating how community organization and community dynamics shape collective behavior, are at the heart of social science research. Communities are multi-faceted, complex and dynamic. In this talk I will present two approaches for learning community representations: a generic representation that can be used as an exploratory tool to find nuanced similarities between communities, and a task-oriented representation. Both representations combine multiple types of signals, textual and contextual, e.g., the (social) network structure and community dynamics. I will show how this multifaceted model can accurately predict large-scale collective decision-making in a distributed environment. We demonstrate the applicability of our model through Reddit's r/place, a large-scale online experiment in which millions of users, self-organized in thousands of communities, clashed and collaborated in an effort to realize their agenda.

Bio: Dr. Oren Tsur is an Assistant Professor (Senior Lecturer) at the Department of Software and Information Systems Engineering at Ben Gurion University in Israel, where he heads the NLP and Social Dynamics Lab (NASLAB) and the newly founded interdisciplinary Research Center for Cyber Policy and Politics (a web page and a logo are coming soon :). His work combines Machine Learning, Natural Language Processing (NLP), Social Dynamics, and Complex Networks. Specifically, Oren's work ranges from sentiment analysis to modeling speakers' language preferences, hate-speech detection, community dynamics, and adversarial influence campaigns. Oren serves as a (Senior) Area Chair, editor and Senior Program Committee member in venues like ACL, EMNLP, WSDM and ICWSM, and as a reviewer for journals ranging from TACL to PNAS and Nature. Oren's work has been published in top NLP and Web Science venues, most recently AAAI-22 and WWW-22.

8 July 2022 at 11:00 - Big Blue Button
Presentation by David Ifeoluwa Adelani, Saarland University & Masakhane NLP

A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation

Abstract: Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out of these datasets. This is primarily because many widely spoken languages are not well represented on the web and are therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to optimally leverage existing pre-trained models to create low-resource translation systems for 16 African languages. We focus on two questions: 1) How can pre-trained models be used for languages not included in the initial pre-training? and 2) How can the resulting translation models effectively transfer to new domains? To answer these questions, we create a new African news corpus covering 16 languages, of which eight languages are not part of any existing evaluation dataset. We demonstrate that the most effective strategy for transferring both to additional languages and to additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data.

17 June 2022 at 11:00 - Big Blue Button
Presentation by Laura Kallmeyer, Heinrich-Heine-Universität Düsseldorf

Cross-lingual RRG Parsing

Abstract: The work presented in this talk is joint work with Kilian Evang, Jakub Waszczuk, Kilu von Prince, Tatiana Bladier and Simon Petitjean. We consider the task of parsing low-resource languages in a scenario where parallel English data and a limited seed of annotated sentences in the target language are available, as for example when bootstrapping parallel treebanks. We focus on constituency parsing using Role and Reference Grammar (RRG), a theory that has so far been understudied in computational linguistics but that is widely used in typological research, in particular in the context of low-resource languages. Starting from an existing RRG parser, we propose two strategies for low-resource parsing: first, we extend the parsing model into a cross-lingual parser, exploiting the parallel data in the high-resource language and unsupervised word alignments by providing internal states of the source-language parser to the target-language parser. Second, we adopt self-training, iteratively expanding the training data, starting from the seed, by including the most confident new parses in each round. Both in simulated scenarios and with a real low-resource language (Daakaka), we find substantial and complementary improvements from both self-training and cross-lingual parsing. Moreover, we also experimented with using gloss embeddings in addition to token embeddings in the target language, and this also improves results. Finally, starting from what we have for Daakaka, we also consider parsing a related language (Dalkalaen) where glosses and English translations are available but no annotated trees at all, i.e., a no-resource scenario with respect to syntactic annotations. We start with a cross-lingual parser trained on Daakaka with glosses and use self-training to adapt it to Dalkalaen. The results are surprisingly good.

3 June 2022 at 11:00 - Big Blue Button
Presentation by Nils Holzenberger, Center for Language and Speech Processing, Johns Hopkins University

Knowledge Acquisition for Natural Language Processing

Abstract: Building AI systems involves acquiring background knowledge about the world. Historically, knowledge was first encoded by experts directly into AI systems, then later acquired from massive amounts of data and statistical cues. Ongoing research, such as AI2's Aristo project, is aiming to acquire knowledge from declarative, textbook-style language. I present three distinct approaches to knowledge acquisition for natural language processing systems. (1) Under certain conditions, natural language processing models can understand and reason with rules and facts stated in natural language. (2) SchemaBlocks is an interface to elicit common sense knowledge from humans, based on event chains mined from text, or on free-form textual descriptions of scenarios of interest. (3) For the task of template extraction, finding the right formulation for natural language prompts is key to fully exploiting the knowledge contained in large-scale, pretrained language models. These three approaches differ in how knowledge is represented, how humans are involved, and, if so, what their expertise is.

20 May 2022 at 11:00 - Big Blue Button
Presentation by Paul Michel, ENS

Revisiting Populations in Multi-agent Communication

Abstract: Despite evidence from cognitive sciences that larger groups of speakers tend to develop more structured languages in human communication, vanilla scaling up of populations has failed to yield significant benefits in emergent multi-agent communication. In this talk, I will reassess the validity of the standard protocol used to train these populations. Informed by an analysis of the population-level communication objective at the equilibrium, we advocate for an alternate population-level training paradigm for referential games based on the idea of "partitioning" the agents into sender-receiver pairs and limiting co-adaptation across pairs. We show that this results in optimizing a different objective at the population level, where agents maximize (1) their respective "internal" communication accuracy and (2) some measure of alignment between agents. In experiments, we find that agents trained in partitioned populations are able to communicate successfully with new agents which they have never interacted with and tend to develop a shared language. Moreover, we observe that larger populations tend to develop languages that are more compositional, which aligns better with existing work in sociolinguistics.

22/04/22 à 11:00 - Big Blue Button
22 avr. 2022 à 11:00
Big Blue Button
Présentation par Philippe Gambette, LIGM, Université Gustave Eiffel, CNRS

Bioinformatics-inspired methods for text corpora analysis

Résumé : In this talk, I will show how computer-assisted textual analysis can benefit from approaches developed in bioinformatics, more precisely in comparative genomics and phylogenetics. I will quickly introduce a few problems in this field and show how algorithms developed to solve them can be adapted to textual data, highlighting similarities but also differences. Text comparison, as well as other text-processing tasks, may benefit from ideas coming from the alignment of biological sequences at the nucleotide or gene level. More specifically, the idea of having a reference genome was useful for quickly building a database of poems by Marceline Desbordes-Valmore, making it possible to explore, for example, the musical adaptations of her poetic works. Methods developed to reconstruct the tree of life, or to compare phylogenetic trees, can also be used to visualise texts, or to evaluate whether a chronological signal can be observed in the result of a hierarchical clustering of texts.

Contributors of these works include J.-C. Bontemps, L. Bulteau, A. Chaschina, E. Kogkitsidou, N. Lechevrel, D. Legallois, C. Martineau, T. Poibeau, J. Poinhos, O. Seminck, C. Trotot and J. Véronis.

No prerequisite in biology is required to attend this seminar.
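
As a toy illustration of the transfer from bioinformatics to text mentioned above, the classic Needleman-Wunsch global alignment algorithm can be applied to word sequences instead of nucleotide sequences (scoring values here are arbitrary):

def align(a, b, gap=-1, match=1, mismatch=-1):
    # Global alignment score of token sequences a and b by dynamic programming.
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # (mis)match
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    return score[n][m]

print(align("le long poème".split(), "le poème".split()))  # -> 1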

18/03/22 à 11:00 - Google meet
18 mars 2022 à 11:00
Google meet
Présentation par Elena Cabrio, Université Côte d’Azur, CNRS & Inria (Wimmics project-team)

Processing Natural Language to Extract, Analyze and Generate Knowledge and Arguments from Texts

Résumé : The long-term goal of the Natural Language Processing research area is to make computers/machines as intelligent as human beings in understanding and generating language, thus being able to speak, make deductions, ground on common knowledge, answer, debate, support humans in decision making, explain and persuade. Natural language understanding can come in many forms. In my research career so far, I have put effort into investigating some of these forms, strongly connected to the actions I would like intelligent artificial systems to be able to perform. In my presentation, I will focus on some of the research challenges that I believe stand in the way of reaching this ambitious goal: (1) the detection of argumentative structures and the prediction of the relations among them in textual resources such as political debates, medical texts and social media content; (2) the detection of abusive language, taking advantage of both network analysis and the content of short text messages on online platforms to detect cyberbullying phenomena.

4/03/22 à 17:00 - Google meet
4 mars 2022 à 17:00
Google meet
Présentation par Shrimai Prabhumoye, NVIDIA

Controllable Text Generation - Controlling Style and Content

Résumé : The 21st century is witnessing a major shift in the way people interact with technology, and Natural Language Generation (NLG) is playing a central role. Users of smartphones and smart home devices now expect their gadgets to be aware of their social context and to produce natural language responses in interactions. This talk presents deep learning solutions for controlling style and content in NLG. To control style, it presents two novel solutions: a Back-Translation approach and a Tag-and-Generate approach. To control content, it dives deep into understanding the task of document-grounded generation and proposes novel solutions for it. The talk further presents a multi-stage prompting approach for using pretrained large language models in the knowledge-grounded dialogue response generation task.

4/02/22 à 11:00 - Google meet
4 févr. 2022 à 11:00
Google meet
Présentation par David Lassner, Technische Universität Berlin & BIFOLD

Translatorship attribution with strong confounders and also how to make friends between TEI and NLP

Résumé : The first part of this talk is about translatorship attribution in the context of 19th-century literary translations. The main challenge in translatorship attribution is the presence of confounding variables such as the genre or the style of the original author. I will discuss different regularization strategies and the informed use of features. Additionally, I will present a novel approach that takes into account both the original and the translation.
The second part of the talk is about the technical prerequisites for conducting the aforementioned research on translatorship attribution. I will show how we created and published training data for OCR in an unclear copyright setting, and how to conveniently use NLP methods on TEI-encoded documents with the help of the Standoffconverter Python package.

7/01/22 à 11:00 - Google meet
7 janv. 2022 à 11:00
Google meet
Présentation par Holger Schwenk, Meta AI (formerly Facebook AI Research)

Scaling NMT to Hundreds of Languages

Résumé : There are more than 7,000 languages in the world, but only about 100 are currently handled by MT and other multilingual NLP tasks. While there has been a lot of success in unsupervised MT, parallel data remains a very useful resource for training NMT systems.

A popular approach to mining parallel data is to compare sentences in a multilingual embedding space and to decide whether they are parallel based on a threshold. In this talk, we present new techniques, based on a teacher-student framework, for training multilingual sentence encoders, which were successfully applied to several low-resource languages.
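
A bare-bones sketch of the threshold-based mining idea described above; real pipelines use margin-based scoring and approximate nearest-neighbour search to scale to billions of sentences, so this is an illustration, not the production method:

import numpy as np

def mine(src_emb, tgt_emb, threshold=0.8):
    # L2-normalise so that the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                       # all pairwise cosines
    best = sims.argmax(axis=1)               # best target for each source
    keep = sims[np.arange(len(src)), best] >= threshold
    return [(i, j) for i, j in zip(np.where(keep)[0], best[keep])]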

17/12/21 à 11:00 - Google meet
17 déc. 2021 à 11:00
Google meet
Présentation par Guillaume Wisniewski, Université de Paris, LLF, CNRS

Analyzing Transformers' Representations: A Linguistic Perspective

Résumé : Transformers have become a key component in many NLP models, arguably because of their capacity to uncover contextualized distributed representations of tokens from raw texts. Many works have striven to analyze these representations to find out whether they are consistent with models derived from linguistic theories, and how they could explain Transformers' ability to solve an impressive number of NLP tasks.

In this talk, I will present two series of experiments falling within this line of research and aiming to highlight the information flows within a Transformer network. The first series of experiments focuses on the long-distance agreement task (e.g. between a verb and its subject), one of the most popular methods for assessing neural networks' ability to encode syntactic information. I will present several experimental results showing that Transformers are able to build an abstract, high-level sentence representation rather than solely capturing surface statistical regularities. In the second series of experiments, I will use a controlled set of examples to investigate how gender information circulates in an encoder-decoder architecture, using both probing techniques and interventions on the internal representations of the MT system.

Joint work with Bingzhi Li, Benoit Crabbé, Lichao Zhu, Nicolas Bailler and François Yvon
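
The generic probing recipe underlying experiments of this kind, as a minimal sketch: a simple classifier is trained to predict a linguistic property (say, the subject's number) from a frozen model's hidden states, and held-out accuracy indicates how accessibly that property is encoded. The states and labels arrays are assumed to be precomputed from the model under study:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe(states, labels):
    # states: (n_examples, hidden_dim) activations from one layer;
    # labels: the linguistic property to recover for each example.
    x_tr, x_te, y_tr, y_te = train_test_split(states, labels,
                                              test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return clf.score(x_te, y_te)   # probing accuracy on held-out examples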

19/11/21 à 11:00 - Google meet
19 nov. 2021 à 11:00
Google meet
Présentation par Paul-Ambroise Duquenne, Inria (ALMAnaCH)

Multimodal and Multilingual Embeddings for Large-Scale Speech Mining

Résumé : We present an approach to encoding a speech signal into a fixed-size representation that minimizes the cosine loss with the existing massively multilingual LASER text embedding space. Sentences are close in this embedding space independently of their language and modality, whether text or audio. Using a similarity metric in this multimodal embedding space, we mine audio in German, French, Spanish and English from Librivox against billions of sentences from Common Crawl. This yielded more than twenty thousand hours of aligned speech translations. To evaluate the automatically mined speech/text corpora, we train neural speech translation systems for several language pairs. Adding the mined data achieves significant improvements in BLEU score on the CoVoST2 and MuST-C test sets with respect to a very competitive baseline. Our approach can also be used to perform speech-to-speech mining directly, without the need to first transcribe or translate the data. We obtain more than one thousand three hundred hours of aligned speech in French, German, Spanish and English. This speech corpus has the potential to boost research in speech-to-speech translation, which suffers from a scarcity of natural end-to-end training data. All the mined multimodal corpora will be made freely available.
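
The core of the teacher-student objective described in the abstract, sketched in PyTorch: the speech encoder (student) is trained so that its fixed-size output minimizes the cosine loss with the frozen LASER embedding (teacher) of the corresponding transcript. speech_encoder and the batch tensors are placeholders, not the actual implementation:

import torch.nn.functional as F

def cosine_loss_step(speech_encoder, speech_batch, laser_text_emb, optimizer):
    optimizer.zero_grad()
    speech_emb = speech_encoder(speech_batch)            # (batch, dim)
    # Teacher embeddings are frozen; only the student receives gradients.
    loss = (1 - F.cosine_similarity(speech_emb,
                                    laser_text_emb.detach(), dim=-1)).mean()
    loss.backward()
    optimizer.step()
    return loss.item()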

8/11/21 à 11:00 - Google meet
8 nov. 2021 à 11:00
Google meet
Présentation par Senja Pollak & Matej Martinc, Jožef Stefan Institute (Ljubljana, Slovenia)

EMBEDDIA project and selected applications

Résumé : Newsrooms increasingly use and rely on AI tools for automatic text processing. However, these tools are mostly developed for major languages, and that limitation continues to be a challenge. New tools allowing high-quality transformations between languages, and tools specifically adapted to low-resource environments, are urgently needed. EMBEDDIA is a Horizon 2020-funded project bringing together a large European consortium of partners from academia, media and technology to address this challenge. During the talk, we will give an overview of the main achievements of the project and present some of the newly developed tools for comment filtering, keyword extraction and viewpoint detection.

5/11/21 à 11:00 - Google meet
5 nov. 2021 à 11:00
Google meet
Présentation par Julia Ive, Queen Mary University of London

Harnessing text generation

Résumé : Text generation is an active area of Natural Language Processing (NLP) research, covering tasks such as dialogue generation, machine translation (MT), summarisation and story generation. Despite the progress of current NLP methods (for example, powerful language generation models such as GPT-3), the task remains a challenge when the validity of outputs is crucial. This talk covers my work on the generation of synthetic medical text to address the data availability bottleneck in biomedical NLP. I will also talk about my work exploring supervised and unsupervised rewards for text generation with Reinforcement Learning, and my work on simultaneous MT, which operates on incomplete source text and where the optimal integration of visual information is crucial for generating adequate outputs.

22/10/21 à 11:00 - Google meet
22 oct. 2021 à 11:00
Google meet
Présentation par You Zuo¹ & Kim Gerdes², ¹Inria; ²LISN, U. Paris-Saclay & ISS, Inria

T7: Tech-Taxonomy with a Text To Text Transfer Transformer

Résumé : In this seminar, we will first explain why we need a terminological taxonomy for drafting and editing technological texts. We will then explain how such a taxonomy can be compiled from existing ontologies and how different models, such as TransE, LSTMs and Transformers, can be trained on a taxonomy to predict hypernyms and hyponyms. We will also demonstrate how this can eventually help to curate and extend the database, and thus be used in applications such as paraphrase generation and text drafting.
This project has been carried out in cooperation between LISN (CNRS) and qatent.com at Inria’s Startup Studio.
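
TransE, one of the models named above, treats a relation as a translation in embedding space: a triple (head, relation, tail) is plausible when head + relation ≈ tail. A minimal sketch, assuming trained embeddings are given, of how a "hypernym" relation vector could rank candidate hypernyms:

import numpy as np

def transe_score(head, relation, tail):
    # Higher (less negative) means a more plausible triple.
    return -np.linalg.norm(head + relation - tail)

def rank_hypernyms(term_vec, hypernym_rel, candidates):
    # candidates maps candidate terms to their embedding vectors.
    scored = {c: transe_score(term_vec, hypernym_rel, v)
              for c, v in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)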

2/07/21 à 16:00 - Jitsi
2 juil. 2021 à 16:00
Jitsi
Présentation par Julia Kreutzer, Google Research

Data quality for low-resource MT

Résumé : In this talk, I will present the findings of a collaborative audit of multilingual corpora, with special attention to low-resource languages. We will discuss the challenges that come with building such corpora, and the risks of using them without inspection. With a case study on a subset of African languages, I will illustrate the implications of building machine translation systems on low-quality parallel data.

7/05/21 à 11:00 - Zoom
7 mai 2021 à 11:00
Zoom
Présentation par Simon Gabay, University of Geneva

Propositions pratiques pour l’édition numérique des textes français modernes

N.B. Ce séminaire aura lieu en français 🇫🇷.
Résumé : Nearly a century ago, the literature of the Grand Siècle missed its encounter with Romance philology, which was not without consequence for the quality of the editions of texts nonetheless described as "classics": it is crucial that this mistake not be repeated with computational philology. Extending the celebrated tradition of the Instructions pour la publication and other Règles pour l'édition, we wish to share a few proposals for the digital edition of modern French texts. By presenting the processing chain we are developing, we will endeavour to give a practical dimension to our theoretical reflections on the ecdotic renewal we are calling for.

16/04/21 à 11:00 - Zoom
16 avr. 2021 à 11:00
Zoom
Présentation par Michael Filhol, LISN

Modélisation, synthèse et représentation éditable des langues des signes

N.B. Ce séminaire aura lieu en français 🇫🇷.
Résumé : Sign languages are full-fledged languages, gestural and non-vocal. The work presented here concerns their computational processing, a research field still in its early days. Three strands will be presented, starting with the formal representation of sign languages. We present the construction of an approach and a model (AZee) which, among other things, enables their synthesis by a virtual signer (a 3D avatar), the subject of the second strand. As an opening, a final part addresses the question of an editable form for the language, which has no written form. By observing spontaneous graphical productions by signers putting the discourse of their language down on paper, we were able to relate them to outputs of AZee. We believe this offers a path towards an intuitive graphical representation system for sign language, and perhaps even towards developing a writing system for it.
Since the study of sign languages is much more recent than that of their spoken or written counterparts, linguistic knowledge about them is more limited, and their automatic processing is only possible through interdisciplinary work that advances jointly on the linguistic and computational fronts. Advances in formal representation thus have implications for, or contrasts with, linguistics, and we will highlight some of them.

12/03/21 à 11:00 - Zoom
12 mars 2021 à 11:00
Zoom
Présentation par Laurent Besacier, Naver Labs

Self-Supervised Representation Learning for Pre-training Speech Systems

Résumé : Self-supervised learning using huge amounts of unlabeled data has been successfully explored for image processing and natural language processing. Since 2019, recent work has also investigated self-supervised representation learning from speech, with notable success in improving performance on downstream tasks such as speech recognition. This recent work suggests that it is possible to reduce dependence on labeled data for building speech systems through acoustic representation learning. In this talk, I will present an overview of these recent approaches to self-supervised learning from speech and show my own investigations into using them in an end-to-end automatic speech translation (AST) task, for which the size of the training data is generally limited.

26/02/21 à 11:00 - Zoom
26 févr. 2021 à 11:00
Zoom
Présentation par Alix Chagué, Inria (ALMAnaCH)

LECTAUREP : Lecture Automatique des Répertoires

N.B. Ce séminaire aura lieu en français 🇫🇷.
Résumé : This talk gives a progress update on the LECTAUREP project, in which the ALMAnaCH team and the Archives Nationales have been collaborating since 2018. The goal of the project is to facilitate access to the very large corpus of registers of Parisian notarial deeds by means of automatic handwritten text recognition and text mining. Beyond data collection, this collaboration is an opportunity to explore the methodological and infrastructural implications of such projects. (joint work with Laurent Romary)

18/12/20 à 11:00 - Zoom
18 déc. 2020 à 11:00
Zoom
Présentation par Thomas Scialom, Recital.AI

Natural Language Generation: Training, Inference & Evaluation

Résumé : Recent advances in the field of natural language generation are undoubtedly impressive. Yet little has changed in training, inference and evaluation: models are learned with teacher forcing, decoded via beam search, and evaluated with BLEU or ROUGE. However, these algorithms suffer from many well-known limitations. How can they be overcome? In this talk, we will present recently proposed methods that could be part of the solution, paving the way for better NLG.
Joint work with Jacopo Staiano
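
For reference, the textbook form of the beam search decoding mentioned above: at each step, only the beam_size highest-scoring partial sequences are kept. step_log_probs is a placeholder for a model's next-token log-probability function, assumed to return a token-to-log-probability dict:

def beam_search(step_log_probs, bos, eos, beam_size=4, max_len=50):
    beams = [([bos], 0.0)]                    # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            for token, logp in step_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        # Keep only the beam_size best partial sequences.
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0]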

20/11/20 à 11:00 - Zoom
20 nov. 2020 à 11:00
Zoom
Présentation par Hicham El Boukkouri, LIMSI

CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters

Résumé : Due to the compelling improvements brought by BERT, many recent representation models have adopted the Transformer architecture as their main building block, consequently inheriting the wordpiece tokenization system even though it is not intrinsically linked to the Transformer architecture. While this system is thought to achieve a good balance between the flexibility of characters and the efficiency of full words, using predefined wordpiece vocabularies from the general domain is not always suitable, especially when building models for specialized domains (e.g., the medical domain). Moreover, adopting wordpiece tokenization shifts the focus from the word level to the subword level, making the models conceptually more complex and arguably less convenient in practice. For these reasons, we propose CharacterBERT, a new variant of BERT that drops the wordpiece system altogether and instead uses a Character-CNN module to represent entire words by consulting their characters. We show that this new model improves the performance of BERT on a variety of medical-domain tasks while producing robust, word-level, open-vocabulary representations.
Joint work with Olivier Ferret, Thomas Lavergne, Hiroshi Noji, Pierre Zweigenbaum and Junichi Tsujii
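
A rough sketch of the Character-CNN idea in PyTorch: each word is a sequence of character embeddings that is convolved and max-pooled into a single open-vocabulary word vector. The sizes here are illustrative, not the paper's configuration:

import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    def __init__(self, n_chars=256, char_dim=16, out_dim=128, kernel=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, out_dim, kernel_size=kernel,
                              padding=kernel // 2)

    def forward(self, char_ids):              # (batch, word_len)
        x = self.char_emb(char_ids)           # (batch, word_len, char_dim)
        x = self.conv(x.transpose(1, 2))      # (batch, out_dim, word_len)
        return x.max(dim=2).values            # one vector per word

encoder = CharCNNWordEncoder()
word = torch.tensor([[ord(c) for c in "anticoagulant"]])
print(encoder(word).shape)                    # torch.Size([1, 128])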

6/03/20 à 11:30 -
6 mars 2020 à 11:30
Présentation par Clémentine Fourrier, Inria (ALMAnaCH)

Learning Sound Correspondences: What about Neural Networks?

Résumé : Cognate and proto-form prediction are key tasks in computational historical linguistics, which rely heavily on the identification of sound correspondences and could help low-resource translation. In the last two decades, a combination of sequence alignment, statistical models and clustering methods has emerged to try to solve these tasks. But where are the neural networks? In this talk, I will present my ongoing research investigating whether neural models can learn sound correspondences between a proto-language and its daughter languages.
I will introduce: (i) MEDeA, a Multiway Encoder Decoder Architecture inspired by NMT; (ii) EtymDB2.0, the etymological database that we updated to generate much-needed data; (iii) our experiments on plausible artificial languages as well as on real languages.

3/03/20 à 11:15 -
3 mars 2020 à 11:15
Présentation par Thomas Wolf, Hugging Face

Limits, open questions and current trends in Transfer Learning for NLP

Résumé : This talk is a subjective walk through my favorite papers and research directions in late 2019 and early 2020. I'll roughly cover the topics of model size and computational efficiency, model evaluation, fine-tuning, out-of-domain generalization, sample efficiency, common sense and inductive biases. The talk is adapted from the sessions I gave in early 2020 at the NLPL Winter School.

10/01/20 à 11:30 -
10 janv. 2020 à 11:30
Présentation par Benjamin Muller, Inria (ALMAnaCH)

Can multilingual BERT transfer to an Out-of-Distribution dialect? A case study on North African Arabizi

Résumé : Building natural language processing systems for highly variable and low-resource languages is a hard challenge. The recent success of large-scale multilingual pretrained language models provides us with new modeling tools to tackle this problem. In this talk, I will present my ongoing research testing the ability of the multilingual version of BERT to model an unseen dialect, taking user-generated North African Arabic text as our case study. We show in different scenarios that multilingual language models are able to transfer to an unseen dialect, specifically in two extreme cases: across scripts (Arabic to Latin) and from Maltese, a distantly related language unseen during pretraining.
Joint work with Benoît Sagot and Djamé Seddah.

13/12/19 à 11:30 -
13 déc. 2019 à 11:30
Présentation par Tatiana Bladier, University of Düsseldorf

Neural Semantic Role Labeling for French FrameNet With Deep Syntactic Information

Résumé : A recent graph-based neural architecture for semantic role labeling (SRL) developed by He et al. (2018) [3] jointly predicts argument spans, predicates and the relations between them without using gold predicates as input features. Although it works well on PropBank-style data, this architecture makes some systematic mistakes when used on a more semantically oriented resource such as French FrameNet [1].
We adapt He et al.'s (2018) [3] system to semantic role prediction for French FrameNet. In contrast to [3], we do not predict the full argument spans directly, but implement a two-step pipeline that first predicts the syntactic heads of the argument spans and then reconstructs the full spans using surface and deep syntax. While the idea of reconstructing argument spans using syntactic information is not new [2], the novelty of our work lies in using deep syntactic dependency relations for full span recovery. We obtain deep syntactic information using the symbolic conversion rules described in Michalon et al. (2016) [4]. We present the results of our ongoing semantic role labeling experiments for French FrameNet and discuss the advantages and challenges of our approach.
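
An illustrative sketch of the second pipeline step described above, under the simplifying assumption that the full argument span is the yield of the predicted head's subtree in a dependency tree; heads maps each token index to its governor (-1 for the root), and this is an assumption for illustration, not the paper's exact recovery rules:

def subtree_span(head_idx, heads):
    # Collect all tokens dominated by head_idx, then take the extremes.
    nodes = {head_idx}
    changed = True
    while changed:
        changed = False
        for tok, gov in enumerate(heads):
            if gov in nodes and tok not in nodes:
                nodes.add(tok)
                changed = True
    return min(nodes), max(nodes)

heads = [2, 2, -1, 4, 2]       # toy tree: 0,1 attach to 2; 3 to 4; 4 to 2
print(subtree_span(4, heads))  # -> (3, 4)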

8/11/19 à -
8 nov. 2019
Présentation par Grzegorz Chrupała, Tilburg University

Neural and Symbolic Representations of Speech and Language

Résumé : As end-to-end architectures based on neural networks have become the tool of choice for processing speech and language, there has been increased interest in techniques for analyzing and interpreting the representations emerging in these models. A large array of analytical techniques has been proposed and applied to diverse architectures. Given how fast this field has developed, it is perhaps inevitable that some of these techniques turn out to rest on loose foundations.
In this talk I first focus on one pitfall not always successfully avoided in work on neural representation analysis: the role of learning. In many cases, non-trivial representations can be found in the activation patterns of randomly initialized, untrained neural networks. Past studies have not always properly accounted for this phenomenon, which means that the results reported in them need to be reconsidered. Here I revisit the issue of the representation of phonology in neural models of spoken language.
Second, I present two methods based on Representational Similarity Analysis (RSA) and Tree Kernels (TK) which allow us to directly quantify how strongly the information encoded in neural activation patterns corresponds to information represented by symbolic structures such as syntax trees. I first validate the methods on a simple synthetic language for arithmetic expressions with clearly defined syntax and semantics, and show that they exhibit the expected pattern of results. I then apply these methods to correlate neural representations of English sentences with their constituency parse trees.
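
The essence of RSA in a few lines: build the pairwise dissimilarity structure of the same set of items in two representation spaces and correlate the two structures. In the talk's setting one side would come from neural activations and the other from a symbolic similarity such as a tree kernel; the sketch below assumes both sides are given as vector representations:

from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa(reps_a, reps_b):
    # Condensed vectors of pairwise cosine distances over the same items.
    dissim_a = pdist(reps_a, metric="cosine")
    dissim_b = pdist(reps_b, metric="cosine")
    # Rank correlation between the two dissimilarity structures.
    return spearmanr(dissim_a, dissim_b).correlation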

18/10/19 à -
18 oct. 2019
Présentation par Guillaume Wisniewski, LLF & Univ. Paris VII

How to choose the test set size? Some observations on the evaluation of PoS taggers on the Universal Dependencies project

Résumé : This presentation questions the usual framework of statistical learning, in which the train and test sets are fixed arbitrarily and independently of the model considered. Taking the evaluation of PoS taggers on the UD project as an example, we show that, in many cases, it is possible to consider smaller test sets than those generally available without hurting evaluation quality, and that the examples thus 'saved' can be added to the train set to improve system performance, especially in the context of domain adaptation.

11/10/19 à -
11 oct. 2019
Présentation par Karen Fort, Loria & Sorbonne Université

La production participative (crowdsourcing) : miroir grossissant sur l’annotation manuelle

Résumé : Manual corpus annotation is at the heart of today's Natural Language Processing: it not only provides the examples used to train machine-learning tools, it also serves as the reference in evaluation campaigns. It is, in fact, where linguistics has taken refuge in the field. Yet it remains largely under-studied. Approaching the subject through the prism of gamified crowdsourcing means looking at its hardest points in a magnifying mirror. The essential questions of production quality, tool-related biases and annotator expertise are magnified by the numbers involved and the distance. This magnifying effect complicates the experiments, but it also pushes us to devise original solutions, which enrich our thinking about traditional manual annotation and put the annotator back at the heart of the process.

11/04/19 à -
11 avr. 2019
Présentation par Yanai Elazar (joint work with Dr. Yoav Goldberg), Bar-Ilan University

Where’s My Head? Definition, Dataset and Models for Numeric Fused-Heads Identification and Resolution

Résumé : In this talk, I will describe our ongoing work on fused-heads. We provide the first computational treatment of fused-head constructions (FHs), focusing on numeric fused-heads (NFHs). FH constructions are noun phrases (NPs) in which the head noun is missing and is said to be "fused" with its dependent modifier. This missing information is implicit and is important for sentence understanding. The missing references are easily filled in by humans but pose a challenge for computational models. We pose the handling of FHs as a two-stage process: identification of the FH construction and resolution of the missing head. We explore the NFH phenomenon in large corpora of English text and create (1) a dataset and a highly accurate method for NFH identification; (2) a crowd-sourced dataset of 10k examples (1M tokens) for NFH resolution; and (3) a neural baseline for the NFH resolution task.

29/03/19 à -
29 mars 2019
Présentation par Mathilde Regnault, Lattice & Inria (ALMAnaCH)

Adapting an Existing French Metagrammar for Old and Middle French

Résumé : Although many texts in Old French (9th-13th c.) and Middle French (14th-15th c.) are now available, only a few of them are annotated with dependency syntax. Our goal is to extend the existing data, the Old French treebank SRCMF "Syntactic Reference Corpus of Medieval French" (Prévost and Stein 2013), to obtain an annotated corpus of one million words also covering Middle French.
These stages of French are subject to strong variation (language evolution, dialects, forms and domains) and are characterised by free word order as well as null subjects. To deal with these difficulties, we have opted for the formalism of metagrammars (Candito 1999), which offers a modular, constraint-based representation of syntactic phenomena through classes. More precisely, we are adapting the French Metagrammar (FRMG) (Villemonte de la Clergerie 2005) to Old and Middle French, since there are enough similarities between these stages of French. In this talk, we will present the processing chain developed by the ALMAnaCH team and our choices in adapting the metagrammar to earlier stages of the language.

12/02/19 à -
12 févr. 2019
Présentation par Hila Gonen, Bar-Ilan University

Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them

Résumé : Word embeddings are widely used in the NLP community for a vast range of tasks. It was shown that word embeddings derived from text corpora reflect gender biases in society. This phenomenon is pervasive and consistent across different word embedding models, causing serious concern. Several recent works tackle this problem, and propose methods for significantly reducing this gender bias in word embeddings, demonstrating convincing results. However, we argue that this removal is superficial. While the bias is indeed substantially reduced according to the provided bias definition, the actual effect is only hiding the bias, not removing it. The gender bias information is still reflected in the distances between "gender-neutralized" words in the debiased embeddings, and can be recovered from them. We present a series of experiments to support this claim, for two debiasing methods. We conclude that existing bias removal techniques are insufficient, and should not be trusted for providing gender-neutral modeling.
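
The spirit of one of these experiments, sketched: if debiasing had truly removed gender information, a simple classifier should fail to recover a word's (stereotypical) gender label from its debiased vector, so high cross-validated accuracy indicates the bias is still encoded. X (debiased word vectors) and y (gender labels from the paper's word lists) are assumed precomputed, and the classifier choice here is illustrative:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def gender_recoverable(X, y):
    # Mean 5-fold accuracy of recovering gender labels from debiased vectors;
    # accuracy well above chance suggests the bias was hidden, not removed.
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5).mean()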

25/01/19 à -
25 janv. 2019
Présentation par Yoann Dupont, LIFO, Orléans & Inria (ALMAnaCH)

French MultiWord Expressions representation and parsing

Résumé : Many NLP tasks, such as natural language understanding, require a representation of syntax and semantics in texts. MultiWord Expressions (MWEs), which can be described as sets of (not necessarily contiguous) tokens that exhibit idiosyncratic properties (Baldwin and Kim, 2010), are, to quote Sag et al. (2001), "a pain in the neck for NLP". MWEs are difficult to predict as their syntactic behavior tends to be unpredictable: they can have an irregular internal syntax and a non-compositional meaning. MWE-aware NLP systems are also hard to evaluate because, until the recent PARSEME COST initiative (Savary et al., 2017), there were only a few corpora annotated with MWEs (Laporte et al., 2008). I will first present my previous work on named entity recognition (Dupont et al., 2017), showing how it relates to MWEs, before delving deeper into MWEs.
I will present in more detail how they are a challenge, and how we can represent them using metagrammars (Savary et al., 2018), more precisely within the FRMG framework of de la Clergerie (2010).
Joint work with Eric Villemonte de la Clergerie and Yannick Parmentier

11/01/19 à -
11 janv. 2019
Présentation par Paul Michel, Neulab, Carnegie Mellon University

Tackling Machine Translation of Noisy Text

Résumé : Despite their recent success, neural machine translation systems have proven to be brittle in the face of non-standard inputs that are far from their training domain. This is particularly salient for the kind of noisy, user-generated content ubiquitous on social media and the internet in general.
In this talk I will present MTNT, our first step towards remedying this situation: a testbed for Machine Translation of Noisy Text. MTNT consists of parallel Reddit comments in three languages (English, French, Japanese) exhibiting a large number of typos, grammatical errors, code switching and more. I will discuss the challenges of the collection process, preliminary MT experiments and the outlook for future work (with a sneak peek at ongoing follow-up research).

8/12/18 à -
8 déc. 2018
Présentation par KyungTae Lim, Lattice (ENS)

A Multi-Source Trainable Parser with Deep Contextualized Lexical Representations with Case Studies

Résumé : In this talk, we describe a multi-source trainable parser developed at Lattice for the CoNLL 2018 Shared Task (Multilingual Parsing from Raw Text to Universal Dependencies). The main characteristic of our work is the encoding of three different modes of contextual information for parsing: (i) treebank feature representations, (ii) multilingual word representations, and (iii) ELMo representations obtained via unsupervised learning from external resources. In the talk, we investigate parsing low-resource languages with very small training corpora using multilingual word embeddings and the annotated corpora of larger languages. The study demonstrates that specific language combinations enable improved dependency parsing compared to previous work, allowing for wider reuse of pre-existing resources when parsing low-resource languages. It also explores the question of whether contemporary contact languages or genetically related languages are the most fruitful starting point for multilingual parsing scenarios.

16/11/18 à -
16 nov. 2018
Présentation par Naomi Havron, ENS Paris

Predictive processing in lexical and syntactic acquisition

Résumé : There is a general consensus in the field of language acquisition that infants use syntactic context to bootstrap their learning of the meaning of words. This is known as the syntactic bootstrapping hypothesis. For example, toddlers use the distributional information that articles tend to be followed by nouns (e.g., "la balle"), and pronouns tend to be followed by verbs (e.g., "elle saute"), to infer whether a novel word is likely to refer to an object or an action (e.g., "la dase" is likely to refer to an object and not an action). Previous modeling studies show that the distribution of syntactic contexts in the input is indeed a reliable cue to class membership. Thus, models that rely on frequent contexts show good categorization of unfamiliar words into nouns and verbs. My talk will focus on the question of whether children and infants can keep track of changes in the distribution of structures in their input, and update their predictions accordingly. I will present experimental results from my own studies with children, and suggest how we could model such effects.

26/10/18 à -
26 oct. 2018
Présentation par Luca Foppiano, Inria (ALMAnaCH)

Cooking entities at low heat: a recipe for entity disambiguation in scientific publications

Résumé : Entity ambiguity is a frequently encountered problem in digital publication libraries. Author/organisation disambiguation is one of the best-known use cases, but there are others. We present "entity-cooking", a generic, machine-learning-based framework for entity matching and disambiguation. Developed with the help of Patrice Lopez, it offers a reusable entity disambiguation engine requiring only minimal adaptations and independent of any specific domain. Lightweight by design, it provides a standardised REST API and supports XML-TEI or PDF (via Grobid) as input data. The project started in 2016; as of today, we have implemented an author/organisation disambiguation solution, produced a manually annotated corpus (including affiliation references), and we are investigating the application to geographical location and toponym resolution (SemEval 2019, task 12).

19/10/18 à -
19 oct. 2018
Présentation par Tommaso Venturini, Inria (ALMAnaCH)

Le Web et ses publics

Résumé : In this seminar, we will discuss the social and political consequences of the organization of digital media. We will consider the limits of a simplistic reading of the power-law distribution of online visibility, and the hopes raised by the thematic clustering and dynamism of the Web. We will also study the risks that these dynamics entail, exploring the causes of the recent proliferation of 'junk news'.

5/10/18 à -
5 oct. 2018
Présentation par Marine Courtin & Kim Gerdes, Univ. Paris 3 and CNRS

Building a Treebank for Naija, the English-based Creole of Nigeria.

Résumé : As an example of treebank development without pre-existing language-specific NLP tools, we will present the ongoing work of constructing a 750,000-word treebank for Naija. The annotation project, part of the NaijaSynCor ANR project, has a social dimension because the language, which is not fully recognized as such by the speakers themselves, is not yet institutionalized in any way. Yet Naija, spoken by close to 100 million speakers, could play an important role in the nation-building process of Nigeria. We will briefly present a few particularities of Naija such as serial verbs, reduplications and emphatic adverbial particles. We used a bootstrapping process of manual annotation and parser training to enhance and speed up the annotation process. The annotation is done in the Surface-Syntactic Universal Dependencies (SUD) scheme, which allows seamless transformation into Universal Dependencies (UD) by means of Grew (http://grew.fr/), a rule-based graph-rewriting system. We will present the different tools involved in this process and show a few preliminary quantitative measures on the annotated sentences.

24/09/18 à -
24 sept. 2018
Présentation par Kyle Richardson, IMS

New Resources and Ideas for Semantic Parsing

Résumé : In this talk, I will give an overview of research being done at the University of Stuttgart on semantic parser induction and natural language understanding. The main topic, semantic parser induction, relates to the problem of learning to map input text to full meaning representations from parallel datasets. The resulting "semantic parsers" are often a core component in various downstream natural language understanding applications, including automated question-answering and generation systems. We look at learning within several novel domains and datasets being developed in Stuttgart (e.g., software documentation for text-to-code translation) and under various types of data supervision (e.g., learning from entailment, "polyglot" modeling, or learning from multiple datasets).

Bio: Kyle Richardson is a finishing PhD student at the University of Stuttgart (IMS), working on semantic parsing and various applications thereof. Prior to this, he was a researcher in the Intelligent Systems Lab at the Palo Alto Research Center (PARC), and holds a B.A. from the University of Rochester, USA. He’ll be joining the Allen Institute for AI in November.

21/09/18 à -
21 sept. 2018
Présentation par Marcel Bollmann, University of Copenhagen, Department of Computer Science

Historical text normalization with neural networks

Résumé : With the increasing availability of digitized historical documents, interest in effective NLP tools for these documents is on the rise. The abundance of variant spellings, however, makes them challenging to work with for both humans and machines. For my PhD thesis, I worked on automatic normalization, i.e. mapping historical spellings to modern ones, as a possible approach to this problem. I looked at datasets of historical texts in eight different languages and evaluated normalization using rule-based, statistical and neural approaches, with a particular focus on tuning a neural encoder-decoder model. In this talk, I will highlight what I learned from different perspectives: Why, what, and how should we normalize? How do the different approaches compare, and which one should be used? And what can we learn from this about neural networks that might be useful for other NLP tasks?

4/05/18 à -
4 mai 2018
Présentation par Ekaterina Kochmar, Cambridge University, UK

Text readability assessment for second language learners

Résumé : In this talk, I will present our work on readability assessment for texts aimed at second language (L2) learners. I will discuss approaches to this task and the features that we use in the machine learning framework. One of the major challenges in this task is the lack of significantly sized level-annotated data for L2 learners, as most models are aimed at, and trained on, large amounts of text for native English speakers. I will give an overview of methods for adapting models trained on larger native corpora to estimate text readability for L2 learners. Once the readability level of a text has been assessed, the text can be adapted (e.g., simplified) to the level of the reader. The first step in this process is the identification of words and phrases in need of simplification or adaptation. This task, called Complex Word Identification (CWI), has recently attracted much attention. In the second part of the talk, I will discuss approaches to CWI and present our winning submission to the CWI Shared Task 2018.

12/01/18 à -
12 janv. 2018
Présentation par Jean-Philippe Magué, ENS Lyon

Dynamiques circadiennes du langage : comment les données massives permettent de sonder de nouvelles échelles

Résumé : Linguistics has studied language dynamics on scales ranging from a few decades to a few millennia. In recent years, studies based on online media, particularly forums, have looked at scales on the order of a year, or even a month. What about even smaller scales? Can we observe phenomena on the order of a day, or of an hour? While chronobiology has shown that our cognitive capacities vary according to circadian rhythms, little has been said about language. Using data from Twitter, we will show that it is possible to observe linguistic dynamics at new scales, and we will demonstrate the existence of circadian rhythms in lexical usage.

22/12/17 à -
22 déc. 2017
Présentation par Houda Bouamor, Carnegie Mellon University, Qatar

Quality Evaluation of Machine Translation into Arabic

Résumé : In machine translation, automatically obtaining a reliable assessment of translation quality is a challenging problem. Several techniques for automatically assessing translation quality for different purposes have been proposed, but these are mostly limited to strict string comparisons between the generated translation and translations produced by humans. This approach is too simplistic and ineffective for languages with flexible word order and rich morphology such as Arabic, for which machine translation evaluation is still an under-studied problem. In this talk, I will first introduce AL-BLEU, a metric for Arabic machine translation evaluation that uses a rich set of morphological, syntactic and lexical features to extend the evaluation beyond exact matching. We showed that AL-BLEU has a stronger correlation with human judgments than the state-of-the-art classical metrics. Then, I will present a more advanced study in which we explore the use of embeddings obtained from different levels of lexical and morpho-syntactic linguistic analysis and show that they improve MT evaluation into Arabic. Our results show that using a neural-network model with different input representations clearly outperforms the state of the art for MT evaluation into Arabic, with an increase of almost 75% in correlation with human judgments on the pairwise MT evaluation quality task.

13/11/17 à -
13 nov. 2017
Présentation par Jacobo Levy Abitbol & Márton Karsai, ENS Lyon, Inria Dante

Socioeconomic dependencies of linguistic patterns in Twitter: Correlation and learning

Résumé : Our usage of language is not solely reliant on cognition but is arguably determined by myriad external factors, leading to a global variability of linguistic patterns. This issue, which lies at the core of sociolinguistics and is backed by many small-scale studies on face-to-face communication, is addressed here by constructing a dataset combining the largest French Twitter corpus to date with detailed socioeconomic maps obtained from the national census in France. We show how key linguistic variables measured in individual Twitter streams depend on factors like socioeconomic status, location, time, and the social network of individuals. We find that (i) people of higher socioeconomic status, active to a greater degree during the daytime, use a more standard language; (ii) the southern part of the country is more prone to using standard language than the northern one, while locally the variety or dialect used is determined by the spatial distribution of socioeconomic status; and (iii) individuals connected in the social network are closer linguistically than disconnected ones, even after the effects of status homophily have been removed. In the second part of the talk, we will discuss how linguistic information and the detected correlations can be used to infer socioeconomic status.