2 July 2022,
by Lucy Havens, University of Edinburgh
At the end of June, I was finally able to attend a conference in person as a Ph.D. student: LREC! I’m grateful to SICSA, EFI, and the Informatics Graduate School for making it possible for me to attend in person this year, since the conference only takes place every two years. This year’s LREC, which stands for Language Resources and Evaluation, took place in Marseille, at le Palais du Pharo, a short walk from le Vieux Port (the Old Port) with lovely views of the Mediterranean Sea. In this post I share highlights from my experience.
In addition to attending the conference, I presented at one of the conference’s workshops, Perspectivist Approaches to Natural Language Processing (NLPerspectives Workshop), which took place the day before the conference began. The term “perspectivist” refers to the Perspectivist Data Manifesto, which questions the assumption that aggregated, annotated datasets can provide a “ground truth” for language models to learn from.
For anyone less familiar with machine learning: typically annotated datasets are created by having people label a collection of documents according to a set of instructions. After many people finish labelling the documents, researchers calculate the agreement and disagreement between their labels, hoping for low levels of disagreement, and then merge everyone’s labelled documents to create one final version of the dataset. Researchers then use the dataset to teach a language model to automatically identify what the people labelled in the documents (this approach to machine learning is called “supervised” learning).
The thing is, when everyone’s labelled data gets merged, the different perspectives that different people brought to the labelling process are erased. When labelling complex topics (such as gender bias, my own area of research), there may be more than one label that makes sense, depending on the perspective of the person reading a document. Researchers interested in perspectivist approaches are proposing ways to incorporate multiple people’s perspectives in a labelled dataset, and they advocate for publishing non-aggregated versions of labelled data so that people’s disagreeing labels can be analyzed.
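For readers who like to see ideas in code, here is a minimal Python sketch of the difference between aggregated and non-aggregated labels. Everything in it is made up for illustration: the documents, the annotator names, the two labels, and the function names are all hypothetical, and the agreement score is a simple pairwise percentage rather than a chance-corrected measure. It is not the method from any particular paper, just a small picture of what merging throws away.

```python
from collections import Counter
from itertools import combinations

# Hypothetical example data: three annotators label four documents.
annotations = {
    "doc1": {"ann_a": "biased", "ann_b": "biased", "ann_c": "biased"},
    "doc2": {"ann_a": "biased", "ann_b": "not_biased", "ann_c": "biased"},
    "doc3": {"ann_a": "not_biased", "ann_b": "not_biased", "ann_c": "not_biased"},
    "doc4": {"ann_a": "not_biased", "ann_b": "biased", "ann_c": "biased"},
}


def pairwise_agreement(annotations):
    """Fraction of annotator pairs, across all documents, that chose the same label."""
    agree = 0
    total = 0
    for labels in annotations.values():
        for first, second in combinations(labels.values(), 2):
            agree += first == second
            total += 1
    return agree / total


def majority_vote(annotations):
    """Aggregated view: keep only the most common label for each document."""
    return {
        doc: Counter(labels.values()).most_common(1)[0][0]
        for doc, labels in annotations.items()
    }


def label_distributions(annotations):
    """Non-aggregated view: keep a count of every label each document received."""
    return {doc: dict(Counter(labels.values())) for doc, labels in annotations.items()}


if __name__ == "__main__":
    print(f"Pairwise agreement: {pairwise_agreement(annotations):.2f}")
    print("Aggregated:", majority_vote(annotations))
    print("Non-aggregated:", label_distributions(annotations))
```

In the aggregated version, doc1 and doc2 look identical (both "biased"), even though one annotator disagreed on doc2. The non-aggregated version keeps that disagreement visible so it can be studied.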
In the paper I presented at the NLPerspectives Workshop, co-authored with my Ph.D. supervisors Ben Bach, Melissa Terras, and Bea Alex, we propose the use of text visualizations for exploring non-aggregated, annotated datasets. In order to train a language model, LOTS of data is needed, so it’s difficult to carefully read through every labeled document in a dataset to study patterns in people’s disagreeing and agreeing labels. Since data visualization relies on intuitive visual cues to facilitate data analysis at large scale, it’s well-suited to analyzing multiple versions of a collection of labeled documents!
During the workshop, Su Lin Blodgett gave a great talk reflecting on how perspectivism complements participatory approaches to research: perspectivism focuses on technical decisions, whereas previous participatory approaches have focused more on problem framing and evaluation! You can check out the full proceedings of the workshop here.
Julia Parish-Morris gave a keynote talk titled, “Language Resources for Charting Linguistic Diversity in Neuroexpansive Populations.” When I’d thought about diversity in the context of language resources in the past, my mind had gone to different languages and dialects (since I’m living in Scotland, some examples from there would be Lallans, Doric, Gaelic, and English). Parish-Morris’ talk broadened my thinking about language diversity, especially about how to represent diversity in recorded language versus written language. They pointed out how spoken language has a richness that written language does not, thanks to pauses, intonation, and the emotion behind spoken words. Parish-Morris emphasized that the goal in creating language resources for neuroexpansive populations isn’t to improve classification; it’s to improve our communication, something that we all could work on!
During the Q&A, someone in the audience asked Parish-Morris about the use of the word “neuroexpansive,” saying they were unable to find anything about the word when they googled it. Parish-Morris said they had adopted the word after a conversation with their brother, who is autistic and prefers “neuroexpansive” to “neurodiverse.” The ending “-expansive” focuses on broadening our conceptualization of the world’s population, whereas “-diverse” tends to focus on trying to include more people in predefined concepts.
It’s amazing how powerful one word can be to shift a person’s thinking! Moving forward, I’ll be trying to think more about expansion, rather than trying to fit people into categories I’ve already defined…
On the second day of the conference, I attended morning and afternoon presentations that caught my attention because of my current research interests. In the morning, I heard David Kletz present, “A Methodology for Building a Diachronic Dataset of Semantic Shifts and its Application to QC-FR-Diac-V1.0, a Free Reference for French.” I’m interested in the changing meanings of words over time (a.k.a. diachronic semantic change) in the context of gender bias, so I’m looking forward to reading Kletz’s paper.
In the afternoon, I heard two presentations about different aspects of work on the same project, “Automatic Normalisation of Early Modern French” and “From FREEM to D’AlemBERT: a Large Corpus and a Language Model for Early Modern French.” The “normalisation” in this work referred to what’s almost a translation task, converting Early Modern French (French from the 17th century, when word spellings were not yet fixed) to present-day French. I’m particularly interested in reading more about the language model development process, because the researchers ran numerous experiments to determine the best model set-up for their dataset of historical language (I anticipate I’ll need to do quite a bit of experimenting myself as I begin training a model on my annotated dataset for my Ph.D.).
On the last day of the conference, Steven Bird gave the Antonio Zampolli Prize Talk (he’d been awarded the prize earlier in the week). There was a common theme in his talk and Blodgett’s at the NLPerspectives Workshop: participatory methods. He recommended a book I’ll have to check out at the library, Orality and Literacy, and an article I’ll have to read called “Guiding Principles for Participatory Design-inspired Natural Language Processing.” Most striking about the Prize talk, though, was that Bird told a series of stories about his mistakes. The humility this demonstrated to me was inspiring and humbling in and of itself – Steven Bird is one of the developers of NLTK, a programming library I use ALL the time for analyzing text! He must have had plenty of successes to talk about. Instead, though, the message he left the audience with was what he learned from the mistakes he’d recounted: linguistic diversity is an opportunity to sit down together and talk, to develop trust across communities of people.
In the poster session that afternoon, this message came back to me in a conversation with Robert Pugh. I approached Pugh’s poster, titled “Universal Dependencies for Western Sierra Puebla Nahuatl,” while Pugh was in conversation with another person who was saying something about how they would have written a script to complete more of the work in Pugh’s project. The project had created a morpho-syntactically annotated corpus according to the Universal Dependencies framework, which serves as a standard schema for labeling the morphology and syntax of languages, which in turn makes it easier to represent different languages computationally. Pugh said there was so much variation in the project’s annotation work that writing the script probably would have taken longer than doing the work manually had. He went on to say (more importantly in my opinion!) that it was also valuable to spend time with the data up close, reading through the language attentively as a researcher trying to develop computational resources for the language.
Speaking as someone whose work often crosses disciplinary boundaries into the humanities, I appreciated Pugh stating the value of spending time with the language data. Sometimes it starts to feel like, though the value of this sort of work is taken as a given in the humanities, it needs to be extensively explained, and even fought for, in the computational sciences.
If you ask me, spending time with language in the form of people speaking it and in the form of data is something the language technology community should encourage more of!