Posts by Collection

news

portfolio

publications

Developing an Optimized UI for Traffic Incident Managers

Published in Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 2018

Recommended citation: Andrina Helgerson, Jamiahus Walton, Celia Loya, Christopher Kawell, Katherine Atwell, Quinn Monaghan, Lakshay Ahuja, Hesham Hassan, Stephen Gilbert, Anuj Sharma, "Developing an Optimized UI for Traffic Incident Managers." In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 2018.

Traffic Incident Managers (TIMs) coordinate first responders and help resolve traffic-related incidents. Currently, some use over fifteen different software applications with unique functionalities across three monitors to manage incidents, leading to redundant data entry, unnecessary task switching, and delayed responses. Forty hours of TIMs’ screens were recorded during their normal work hours at the Iowa Department of Transportation (DoT). The resulting task analysis from these videos greatly influenced the design of a simplified, web-based user interface (UI) prototype. The new UI offers a 42.9% reduction in the steps required to manage an incident by combining the functionality of the fifteen different applications used in the existing system into a single, structured UI. This research approach offers a UI model to other DoTs that can lead to faster and more effective incident management.

Where are we in discourse relation recognition?

Published in Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2021

Recommended citation: Katherine Atwell, Junyi Li, Malihe Alikhani, "Where are we in discourse relation recognition?" In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2021.

Discourse parsers recognize the intentional and inferential relationships that organize extended texts. They have had a great influence on a variety of NLP tasks as well as theoretical studies in linguistics and cognitive science. However, it is often difficult to achieve good results from current discourse models, largely due to the difficulty of the task, particularly recognizing implicit discourse relations. Recent developments in transformer-based models have shown great promise on these analyses, but challenges still remain. We present a position paper which provides a systematic analysis of state-of-the-art discourse parsers. We aim to examine the performance of current discourse parsing models via gradual domain shift: within the corpus, on in-domain texts, and on out-of-domain texts, and discuss the differences between the transformer-based models and the previous models in predicting different types of implicit relations, both inter- and intra-sentential. We conclude by describing several shortcomings of the existing models and discussing how future work should approach this problem.

The change that matters in discourse parsing: Estimating the impact of domain shift on parser error

Published in Findings of the Association for Computational Linguistics: ACL 2022, 2022

Recommended citation: Katherine Atwell, Anthony Sicilia, Seong Hwang, Malihe Alikhani, "The change that matters in discourse parsing: Estimating the impact of domain shift on parser error." In Findings of the Association for Computational Linguistics: ACL 2022, 2022.

Discourse analysis allows us to attain inferences of a text document that extend beyond the sentence level. The current performance of discourse models is very low on texts outside the training distribution’s coverage, diminishing the practical utility of existing models. There is a need for a measure that can inform us to what extent our model generalizes from the training to the test sample when these samples may be drawn from distinct distributions. While this can be estimated via distribution shift, we argue that this does not directly correlate with change in the observed error of a classifier (i.e., error-gap). Thus, we propose to use a statistic from the theoretical domain adaptation literature which can be directly tied to error-gap. We study the bias of this statistic as an estimator of error-gap both theoretically and through a large-scale empirical study of over 2400 experiments on 6 discourse datasets from domains including, but not limited to: news, biomedical texts, TED talks, Reddit posts, and fiction. Our results not only motivate our proposal and help us to understand its limitations, but also provide insight on the properties of discourse models and datasets which improve performance in domain adaptation. For instance, we find that non-news datasets are slightly easier to transfer to than news datasets when the training and test sets are very different. Our code and an associated Python package are available to allow practitioners to make more informed model and dataset choices.
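
For intuition only, the sketch below estimates a disagreement-based proxy for how far a source and target domain are apart: two hypotheses trained on disjoint halves of the source data are compared on unlabeled source and target samples, and the gap in their disagreement rates serves as a rough signal of divergence. This is a minimal illustration in the spirit of the domain adaptation literature, not the exact statistic proposed in the paper; all function names and model choices here are assumptions.

```python
# Illustrative sketch: a disagreement-based proxy for domain divergence.
# NOT the exact estimator proposed in the paper; names/models are assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def disagreement_rate(h1, h2, X):
    """Fraction of examples on which two classifiers disagree."""
    return float(np.mean(h1.predict(X) != h2.predict(X)))

def divergence_proxy(X_src, y_src, X_tgt, seed=0):
    """Train two hypotheses on disjoint halves of the source data and compare
    how much more they disagree on the target sample than on the source."""
    Xa, Xb, ya, yb = train_test_split(X_src, y_src, test_size=0.5, random_state=seed)
    h1 = LogisticRegression(max_iter=1000).fit(Xa, ya)
    h2 = LogisticRegression(max_iter=1000).fit(Xb, yb)
    return disagreement_rate(h1, h2, X_tgt) - disagreement_rate(h1, h2, X_src)
```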

Political ideology and polarization: A multi-dimensional approach

Published in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022

Recommended citation: Barea Sinno, Bernardo Oviedo, Katherine Atwell, Malihe Alikhani, Junyi Li, "Political ideology and polarization: A multi-dimensional approach." In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022.

Analyzing ideology and polarization is of critical importance in advancing our grasp of modern politics. Recent research has made great strides towards understanding the ideological bias (i.e., stance) of news media along the left-right spectrum. In this work, we instead take a novel and more nuanced approach to the study of ideology, based on its left or right position on the specific issue being discussed. Aligned with theoretical accounts in political science, we treat ideology as a multi-dimensional construct and introduce the first diachronic dataset of news articles whose ideological positions are annotated by trained political scientists and linguists at the paragraph level. We showcase that, by controlling for the author’s stance, our method allows for the quantitative and temporal measurement and analysis of polarization as a multi-dimensional ideological distance. We further present baseline models for ideology prediction, outlining a challenging task distinct from stance detection.

Studying the effect of moderator biases on the diversity of online discussions: A computational cross-linguistic study

Published in Proceedings of the Annual Meeting of the Cognitive Science Society, 2022

Recommended citation: Sabit Hassan, Katherine Atwell, Malihe Alikhani, "Studying the effect of moderator biases on the diversity of online discussions: A computational cross-linguistic study." In Proceedings of the Annual Meeting of the Cognitive Science Society, 2022.

The methods by which people harm others evolve with changes in, and in access to, technology. Several cognitive, linguistic, and behavioral theories have suggested that biased language use is correlated with dominance and can reduce the diversity and inclusivity of a community (e.g., Poteat et al., 2010). We present a cross-cultural and cross-linguistic study of moderators on Reddit in English, Arabic, and French. We collect and analyze a large Reddit moderation dataset and use machine learning models to study cognitive and behavioral differences in moderation across cultures. We then work with expert linguists who analyze and evaluate our results. Finally, we explore the implications of our models for studying how voices from different communities can be shut down when online content is not moderated properly. Our preliminary results reveal biases towards women and minority groups, and more broadly affirm our hypothesis that culture and topic of discussion bias moderation decisions.

PAC-Bayesian domain adaptation bounds for multiclass learners

Published in Uncertainty in Artificial Intelligence, 2022

Recommended citation: Anthony Sicilia, Katherine Atwell, Malihe Alikhani, Seong Hwang, "PAC-Bayesian domain adaptation bounds for multiclass learners." In Uncertainty in Artificial Intelligence, 2022.

Multiclass neural networks are a common tool in modern unsupervised domain adaptation, yet an appropriate theoretical description for their non-uniform sample complexity is lacking in the adaptation literature. To fill this gap, we propose the first PAC-Bayesian adaptation bounds for multiclass learners. We facilitate practical use of our bounds by also proposing the first approximation techniques for the multiclass distribution divergences we consider. For divergences dependent on a Gibbs predictor, we propose additional PAC-Bayesian adaptation bounds which remove the need for inefficient Monte-Carlo estimation. Empirically, we test the efficacy of our proposed approximation techniques as well as some novel design-concepts which we include in our bounds. Finally, we apply our bounds to analyze a common adaptation algorithm that uses neural networks.
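
For readers unfamiliar with the PAC-Bayesian setting, the classical single-domain bound below (McAllester/Maurer form) gives the flavor of the guarantees involved. It is shown here only as background and is not one of the multiclass adaptation bounds derived in the paper: R is the true risk of a hypothesis, R-hat the empirical risk on an i.i.d. sample of size m, P a prior and Q a posterior distribution over hypotheses.

```latex
% Classical single-domain PAC-Bayes bound, stated for background only;
% the paper's multiclass adaptation bounds extend this style of guarantee.
% With probability at least 1 - \delta over the sample, for all posteriors Q:
\mathbb{E}_{h \sim Q}\big[R(h)\big] \;\le\;
\mathbb{E}_{h \sim Q}\big[\widehat{R}(h)\big]
+ \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{m}}{\delta}}{2m}}
```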

APPDIA: A discourse-aware transformer-based style transfer model for offensive social media conversations

Published in Proceedings of the 29th International Conference on Computational Linguistics, 2022

Recommended citation: Katherine Atwell, Sabit Hassan, Malihe Alikhani, "APPDIA: A discourse-aware transformer-based style transfer model for offensive social media conversations." In Proceedings of the 29th International Conference on Computational Linguistics, 2022.

Using style-transfer models to reduce offensiveness of social media comments can help foster a more inclusive environment. However, there are no sizable datasets that contain offensive texts and their inoffensive counterparts, and fine-tuning pretrained models with limited labeled data can lead to the loss of original meaning in the style-transferred text. To address this issue, we provide two major contributions. First, we release the first publicly-available, parallel corpus of offensive Reddit comments and their style-transferred counterparts annotated by expert sociolinguists. Then, we introduce the first discourse-aware style-transfer models that can effectively reduce offensiveness in Reddit text while preserving the meaning of the original text. These models are the first to examine inferential links between the comment and the text it is replying to when transferring the style of offensive Reddit text. We propose two different methods of integrating discourse relations with pretrained transformer models and evaluate them on our dataset of offensive comments from Reddit and their inoffensive counterparts. Improvements over the baseline with respect to both automatic metrics and human evaluation indicate that our discourse-aware models are better at preserving meaning in style-transferred text when compared to the state-of-the-art discourse-agnostic models.
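
One simple way to make a pretrained sequence-to-sequence model "discourse-aware" is to prepend the predicted relation between a comment and its parent as a control token before fine-tuning; the sketch below illustrates that general idea with a Hugging Face BART checkpoint. The relation label set, the "facebook/bart-base" model choice, and the input format are assumptions for illustration, not necessarily the integration methods used in the paper.

```python
# Hypothetical sketch: conditioning a seq2seq style-transfer model on a
# discourse relation by prepending it as a special control token.
# Label set, base model, and input format are assumed, not the paper's exact setup.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

RELATIONS = ["<rel:contrast>", "<rel:elaboration>", "<rel:acknowledgement>"]  # assumed labels

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")
tokenizer.add_special_tokens({"additional_special_tokens": RELATIONS})
model.resize_token_embeddings(len(tokenizer))

def build_input(parent_comment, offensive_comment, relation):
    """Prepend the (predicted) discourse relation and the parent comment as context."""
    return f"{relation} {parent_comment} </s> {offensive_comment}"

# Fine-tuning on the parallel corpus would follow; untuned generation is shown
# only to demonstrate the input format.
inputs = tokenizer(build_input("Why would you say that?", "that is a dumb take",
                               "<rel:contrast>"), return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```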

The Role of Context and Uncertainty in Shallow Discourse Parsing

Published in Proceedings of the 29th International Conference on Computational Linguistics, 2022

Recommended citation: Katherine Atwell, Remi Choi, Junyi Jessy Li, Malihe Alikhani, "The Role of Context and Uncertainty in Shallow Discourse Parsing." In Proceedings of the 29th International Conference on Computational Linguistics, 2022.

Discourse parsing has proven to be useful for a number of NLP tasks that require complex reasoning. However, over a decade since the advent of the Penn Discourse Treebank, predicting implicit discourse relations in text remains challenging. There are several possible reasons for this, and we hypothesize that models should be exposed to more context, as it plays an important role in accurate human annotation; meanwhile, adding uncertainty measures can improve model accuracy and calibration. To thoroughly investigate this phenomenon, we perform a series of experiments to determine 1) the effects of context on human judgments, and 2) the effect of quantifying uncertainty with annotator confidence ratings on model accuracy and calibration (which we measure using the Brier score (Brier, 1950)). We find that including annotator accuracy and confidence improves model accuracy, and incorporating confidence in the model’s temperature function can lead to models with significantly better-calibrated confidence measures. We also find some insightful qualitative results regarding human and model behavior on these datasets.
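
For reference, the Brier score mentioned above is simply the mean squared error between predicted class probabilities and one-hot outcomes, and temperature scaling rescales logits by a single scalar before the softmax. The sketch below shows both in their standard textbook forms; it is not the confidence-aware temperature function introduced in the paper, and the function names are illustrative.

```python
# Standard Brier score and plain temperature scaling, shown for reference.
# The paper's confidence-aware temperature function modifies this basic idea.
import numpy as np

def brier_score(probs, labels, num_classes):
    """Mean squared error between predicted probabilities and one-hot labels."""
    onehot = np.eye(num_classes)[labels]           # shape (N, C)
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

def apply_temperature(logits, temperature):
    """Rescale logits by a scalar temperature before the softmax."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)           # numerical stability
    exp = np.exp(z)
    return exp / exp.sum(axis=1, keepdims=True)
```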

How people talk about each other: Modeling Generalized Intergroup Bias and Emotion

Published in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023

Recommended citation: Venkata Govindarajan, Katherine Atwell, Barea Sinno, Malihe Alikhani, David Beaver, Junyi Li, "How people talk about each other: Modeling Generalized Intergroup Bias and Emotion." In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023.

Current studies of bias in NLP rely mainly on identifying (unwanted or negative) bias towards a specific demographic group. While this has led to progress in recognizing and mitigating negative bias, and while having a clear notion of the targeted group is necessary, it is not always practical. In this work we extrapolate to a broader notion of bias, rooted in social science and psychology literature. We move towards predicting interpersonal group relationship (IGR), modeling the relationship between the speaker and the target in an utterance, using fine-grained interpersonal emotions as an anchor. We build and release a dataset of English tweets by US Congress members annotated for interpersonal emotion (the first of its kind), together with ‘found supervision’ for IGR labels; our analyses show that subtle emotional signals are indicative of different biases. While humans can perform better than chance at identifying IGR given an utterance, we show that neural models perform much better; furthermore, a shared encoding between IGR and interpersonal perceived emotion enabled performance gains in both tasks. Data and code for this paper are available at https://github.com/venkatasg/interpersonal-bias
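
The shared-encoding idea can be pictured as one text encoder feeding two classification heads, one for IGR and one for interpersonal emotion, trained jointly. The PyTorch sketch below is only an illustrative architecture under assumed dimensions and label counts; it is not the model from the paper.

```python
# Illustrative multi-task architecture: one shared encoder, two heads
# (intergroup relationship and interpersonal emotion). All sizes are assumed.
import torch
import torch.nn as nn

class SharedEncoderModel(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, n_igr=2, n_emotion=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.igr_head = nn.Linear(hidden, n_igr)          # in-group vs. out-group
        self.emotion_head = nn.Linear(hidden, n_emotion)  # fine-grained emotions

    def forward(self, token_ids):
        x = self.embed(token_ids)
        _, h = self.encoder(x)          # final hidden state as the shared encoding
        h = h.squeeze(0)
        return self.igr_head(h), self.emotion_head(h)

model = SharedEncoderModel()
tokens = torch.randint(0, 30522, (4, 32))                 # toy batch of token ids
igr_logits, emotion_logits = model(tokens)
loss = nn.functional.cross_entropy(igr_logits, torch.zeros(4, dtype=torch.long)) \
     + nn.functional.cross_entropy(emotion_logits, torch.zeros(4, dtype=torch.long))
```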

Multilingual Content Moderation: A Case Study on Reddit

Published in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023

Recommended citation: Meng Ye, Karan Sikka, Katherine Atwell, Sabit Hassan, Ajay Divakaran, Malihe Alikhani, "Multilingual Content Moderation: A Case Study on Reddit." In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023.

Content moderation is the process of flagging content based on pre-defined platform rules. There has been a growing need for AI moderators to safeguard users as well as protect the mental health of human moderators from traumatic content. While prior works have focused on identifying hateful/offensive language, they are not adequate for meeting the challenges of content moderation since 1) moderation decisions are based on violation of rules, which subsumes detection of offensive speech, and 2) such rules often differ across communities, which entails an adaptive solution. We propose to study the challenges of content moderation by introducing a multilingual dataset of 1.8 million Reddit comments spanning 56 subreddits in English, German, Spanish, and French. We perform extensive experimental analysis to highlight the underlying challenges and suggest related research problems such as cross-lingual transfer, learning under label noise (human biases), transfer of moderation models, and predicting the violated rule. Our dataset and analysis can help better prepare for the challenges and opportunities of automatic moderation.

Combining Discourse Coherence with Large Language Models for More Inclusive, Equitable, and Robust Task-Oriented Dialogue

Published in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, 2024

Recommended citation: [Combining Discourse Coherence with Large Language Models for More Inclusive, Equitable, and Robust Task-Oriented Dialogue](https://aclanthology.org/2024.lrec-main.314) (Atwell et al., LREC-COLING 2024)

Large language models (LLMs) are capable of generating well-formed responses, but using LLMs to generate responses on the fly is not yet feasible for many task-oriented systems. Modular architectures are often still required for safety and privacy guarantees on the output. We hypothesize that an offline generation approach using discourse theories, formal grammar rules, and LLMs can allow us to generate human-like, coherent text in a more efficient, robust, and inclusive manner within a task-oriented setting. To this end, we present the first discourse-aware multimodal task-oriented dialogue system that combines discourse theories with offline LLM generation. We deploy our bot as an app to the general public and keep track of the user ratings for six months. Our user ratings show an improvement from 2.8 to 3.5 out of 5 with the introduction of discourse coherence theories. We also show that our model reduces misunderstandings in the dialect of African-American Vernacular English from 93% to 57%. While terms of use prevent us from releasing our entire codebase, we release our code in a format that can be integrated into most existing dialogue systems.
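
Architecturally, "offline LLM generation" means candidate responses are produced and vetted ahead of deployment and then served from a fixed table at run time, so no live LLM call is needed. The sketch below shows that pattern in its simplest form; the function names and the vetting check are hypothetical stand-ins, not the deployed system's code or its discourse-theoretic machinery.

```python
# Hypothetical sketch of the offline-generation pattern: responses are produced
# and vetted before deployment, then served from a fixed table at run time.
import json

def vet(response):
    """Stand-in for offline checks (grammar rules, coherence, human review)."""
    return len(response) > 0 and "http" not in response

def build_response_table(dialogue_states, generate_with_llm):
    """Run the LLM offline for every dialogue state, keeping only vetted outputs."""
    table = {}
    for state in dialogue_states:
        response = generate_with_llm(state)   # offline LLM call
        if vet(response):
            table[state] = response
    return table

def respond(state, table, fallback="Sorry, could you rephrase that?"):
    """At run time the bot only looks up pre-approved text."""
    return table.get(state, fallback)

table = build_response_table(["greet", "ask_step"], lambda s: f"Canned reply for {s}")
with open("responses.json", "w") as f:
    json.dump(table, f)
print(respond("greet", table))
```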

Generating Signed Language Instructions in Large-Scale Dialogue Systems

Published in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), 2024

Recommended citation: [Generating Signed Language Instructions in Large-Scale Dialogue Systems](https://aclanthology.org/2024.naacl-industry.13) (Inan et al., NAACL 2024)

We introduce a goal-oriented conversational AI system enhanced with American Sign Language (ASL) instructions, presenting the first implementation of such a system on a worldwide multimodal conversational AI platform. Accessible through a touch-based interface, our system receives input from users and seamlessly generates ASL instructions by leveraging retrieval methods and cognitively based gloss translations. Central to our design is a sign translation module powered by Large Language Models, alongside a token-based video retrieval system for delivering instructional content from recipes and wikiHow guides. Our development process is deeply rooted in a commitment to community engagement, incorporating insights from the Deaf and Hard-of-Hearing community, as well as experts in cognitive and ASL learning sciences. The effectiveness of our signing instructions is validated by user feedback, achieving ratings on par with those of the system in its non-signing variant. Additionally, our system demonstrates exceptional performance in retrieval accuracy and text-generation quality, measured by metrics such as BERTScore. We have made our codebase and datasets publicly accessible at https://github.com/Merterm/signed-dialogue, and a demo of our signed instruction video retrieval system is available at https://huggingface.co/spaces/merterm/signed-instructions.
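
The token-based retrieval component can be pictured as a lookup from gloss tokens to pre-recorded sign videos, with fingerspelling as a fallback when a gloss is missing. The snippet below is a deliberately simplified, hypothetical illustration of that idea; the gloss inventory, file paths, and fallback behavior are all assumptions rather than the system's actual implementation.

```python
# Simplified, hypothetical illustration of token-based sign video retrieval:
# each gloss token maps to a pre-recorded clip, with fingerspelling as fallback.
GLOSS_TO_CLIP = {
    "PREHEAT": "clips/preheat.mp4",
    "OVEN": "clips/oven.mp4",
    "MIX": "clips/mix.mp4",
}

def retrieve_clips(gloss_sequence):
    """Return an ordered playlist of clips for a translated gloss sequence."""
    playlist = []
    for gloss in gloss_sequence:
        clip = GLOSS_TO_CLIP.get(gloss)
        playlist.append(clip if clip else f"fingerspell:{gloss}")
    return playlist

print(retrieve_clips(["PREHEAT", "OVEN", "350"]))
```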

The Importance of Including Signed Languages in Natural Language Processing

Published in Sign Language Machine Translation, 2024

Recommended citation: Kayo Yin, Katherine Atwell, Julie A Hochgesang, Malihe Alikhani. 2024. The Importance of Including Signed Languages in Natural Language Processing. In Sign Language Machine Translation. Springer Nature Switzerland.

Studying and Mitigating Biases in Sign Language Understanding Models

Published in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Recommended citation: Katherine Atwell, Danielle Bragg, and Malihe Alikhani. 2024. Studying and Mitigating Biases in Sign Language Understanding Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami. Association for Computational Linguistics.

Ensuring that the benefits of sign language technologies are distributed equitably among all community members is crucial. Thus, it is important to address potential biases and inequities that may arise from the design or use of these resources. Crowd-sourced sign language datasets, such as the ASL Citizen dataset, are great resources for improving accessibility and preserving linguistic diversity, but they must be used thoughtfully to avoid reinforcing existing biases. In this work, we utilize the rich information about participant demographics and lexical features present in the ASL Citizen dataset to study and document the biases that may result from models trained on crowd-sourced sign datasets. Further, we apply several bias mitigation techniques during model training, and find that these techniques reduce performance disparities without decreasing accuracy. With the publication of this work, we release the demographic information about the participants in the ASL Citizen dataset to encourage future bias mitigation work in this space.
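
A common family of mitigation techniques reweights the training loss so that under-represented demographic groups contribute proportionally more. The sketch below shows inverse-frequency group weighting as one hedged example of that family; it is not necessarily one of the specific techniques evaluated in the paper, and the function names are illustrative.

```python
# Illustrative bias-mitigation step: inverse-frequency reweighting of the training
# loss by demographic group. One common technique in this family; names are assumed.
import torch
import torch.nn.functional as F
from collections import Counter

def group_weights(group_ids):
    """Weight each example inversely to the frequency of its demographic group."""
    counts = Counter(group_ids)
    total = len(group_ids)
    return torch.tensor([total / (len(counts) * counts[g]) for g in group_ids])

def reweighted_loss(logits, labels, group_ids):
    """Cross-entropy loss with per-example group weights."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    return (group_weights(group_ids) * per_example).mean()
```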

Contextual ASR Error Handling with LLMs Augmentation for Goal-Oriented Conversational AI

Published in Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, 2025

Recommended citation: Yuya Asano, Sabit Hassan, Paras Sharma, Anthony B. Sicilia, Katherine Atwell, Diane Litman, and Malihe Alikhani. 2025. Contextual ASR Error Handling with LLMs Augmentation for Goal-Oriented Conversational AI. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, Abu Dhabi. Association for Computational Linguistics.

General-purpose automatic speech recognition (ASR) systems do not always perform well in goal-oriented dialogue. Existing ASR correction methods rely on prior user data or named entities. We extend correction to tasks that have no prior user data and exhibit linguistic flexibility, such as lexical and syntactic variations. We propose a novel context augmentation with a large language model and a ranking strategy that incorporates contextual information from the dialogue states of a goal-oriented conversational AI and its tasks. Our method ranks (1) n-best ASR hypotheses by their lexical and semantic similarity with context and (2) context by phonetic correspondence with ASR hypotheses. Evaluated in home improvement and cooking domains with real-world users, our method improves recall and F1 of correction by 34% and 16%, respectively, while maintaining precision and false positive rate. Users rated the system 0.8 to 1 point (out of 5) higher when our correction method worked properly, with no decrease due to false positives.
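
At a high level, the ranking strategy scores each ASR hypothesis against the dialogue context. The sketch below shows a simplified version of that idea using lexical overlap plus sentence-embedding similarity; the weights, the "all-MiniLM-L6-v2" embedder, and the omission of the phonetic component are all assumptions for illustration, not the paper's implementation.

```python
# Simplified illustration of context-aware reranking of n-best ASR hypotheses.
# Weights and the embedding model are assumed; the paper's phonetic matching
# of context against hypotheses is not reproduced here.
from difflib import SequenceMatcher
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def lexical_similarity(a, b):
    """Character-level overlap ratio between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rank_hypotheses(nbest, context, w_lex=0.5, w_sem=0.5):
    """Rank n-best ASR hypotheses by lexical and semantic similarity to the
    dialogue context (e.g., the current task step)."""
    ctx_emb = embedder.encode(context, convert_to_tensor=True)
    scored = []
    for hyp in nbest:
        sem = float(util.cos_sim(embedder.encode(hyp, convert_to_tensor=True), ctx_emb))
        lex = lexical_similarity(hyp, context)
        scored.append((w_lex * lex + w_sem * sem, hyp))
    return sorted(scored, reverse=True)

print(rank_hypotheses(["preheat the oven", "free heat the coven"],
                      "Step 2: preheat the oven to 350 degrees"))
```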

Measuring Bias and Agreement in Large Language Model Presupposition Judgments

Published in Findings of the Association for Computational Linguistics: ACL 2025, 2025

Recommended citation: Katherine Atwell, Mandy Simons, and Malihe Alikhani. 2025. Measuring Bias and Agreement in Large Language Model Presupposition Judgments. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna. Association for Computational Linguistics.

Identifying linguistic bias in text demands the identification not only of explicitly asserted content but also of implicit content including presuppositions. Large language models (LLMs) offer a promising automated approach to detecting presuppositions, yet the extent to which their judgments align with human intuitions remains unexplored. Moreover, LLMs may inadvertently reflect societal biases when identifying presupposed content. To empirically investigate this, we prompt multiple large language models to evaluate presuppositions across diverse textual domains, drawing from three distinct datasets annotated by human raters. We calculate the agreement between LLMs and human raters, and find several linguistic factors associated with fluctuations in human-model agreement. Our observations reveal discrepancies in human-model alignment, suggesting potential biases in LLMs, notably influenced by gender and political ideology.
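
Agreement between model and human judgments can be quantified with standard chance-corrected statistics; the snippet below computes Cohen's kappa with scikit-learn as one such measure. The labels are toy examples, and this is not a reproduction of the paper's exact metrics or data.

```python
# Chance-corrected agreement between LLM and human presupposition judgments,
# using Cohen's kappa as one standard measure. Labels below are toy examples.
from sklearn.metrics import cohen_kappa_score

human_labels = ["presupposed", "not_presupposed", "presupposed", "presupposed"]
model_labels = ["presupposed", "presupposed",     "presupposed", "not_presupposed"]

print(cohen_kappa_score(human_labels, model_labels))
```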

BASIL: Bayesian Assessment of Sycophancy in LLMs

Published in ArXiv, 2025

Recommended citation: Katherine Atwell, Pedram Heydari, Anthony Sicilia, and Malihe Alikhani. 2025. BASIL: Bayesian Assessment of Sycophancy in LLMs. https://arxiv.org/abs/2508.16846

Sycophancy (overly agreeable or flattering behavior) is critical to understand in the context of human-AI collaboration, especially in decision-making settings like health, law, and education. Existing methods for studying sycophancy in LLMs are either descriptive (they study behavior change when sycophancy is elicited) or normative (they provide a values-based judgment on behavior change). Together, these approaches help us understand the extent, and impacts, of sycophancy. However, existing normative approaches only apply to objective tasks where ground-truth data exists, ignoring the natural subjectivity in many NLP tasks. Drawing from behavioral economics and rational decision theory, we introduce a Bayesian framework to study the normative effects of sycophancy on rationality in LLMs, without requiring labeled ground truth. Using this interdisciplinary framework, we study sycophantic behavior in multiple LLM baselines across three different tasks, experimenting with various methods for eliciting sycophancy and obtaining probability judgments from LLMs. We find significant evidence of sycophancy in our experiments (7 of 8 baselines for one of our probing techniques), and observe that sycophancy is more likely to reduce rationality than to increase rationality in LLMs’ decisions when they are directly probed for probabilities (2 out of 4 baselines show significant increases overall).
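
To make the Bayesian framing concrete, here is a minimal sketch of the general idea (not BASIL's actual procedure): elicit the model's prior probability, apply an assumed likelihood for a piece of user pushback, elicit the model's revised probability, and compare the revision with the Bayes-consistent posterior. The numbers and likelihoods below are hypothetical.

```python
# Minimal illustration of a Bayes-consistency check, in the spirit of the
# paper's framework but NOT its actual procedure. All quantities are hypothetical.
def bayes_posterior(prior, lik_if_true, lik_if_false):
    """Posterior P(H | E) under an assumed likelihood model for the evidence E."""
    return (lik_if_true * prior) / (lik_if_true * prior + lik_if_false * (1 - prior))

prior_from_llm = 0.60        # model's elicited P(H) before user pushback
revised_from_llm = 0.25      # model's elicited P(H) after the user disagrees
consistent = bayes_posterior(prior_from_llm, lik_if_true=0.5, lik_if_false=0.5)

# With an uninformative likelihood (0.5 / 0.5) a rational agent should not move;
# the gap between the revision and the consistent posterior is a rough signal of
# sycophantic over-updating toward the user's stated opinion.
print(revised_from_llm - consistent)
```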

talks

Coreference Resolution

Published:

Gave a lecture on coreference resolution to a graduate NLP class.

Responsible Artificial Intelligence and Marginalized Communities

Published:

Together with Mert Inan, I present our dialogue research on American Sign Language and African-American Vernacular English, and describe future directions for improving the user experience of these populations when interacting with dialogue systems.

teaching
