Siri-ously biased: Computer science major examines language models showing negativity toward LGBTQ+ terms
While studying at his Vietnamese high school, Bang Nguyen ’22 wanted to pursue a liberal arts education in the U.S. that empowered him to combine his interests in math and social sciences. At The College of Wooster, he bridged the gap between the two, applying his technical skills as a computer science major and infusing the social sciences with minors in statistical and data sciences and communication studies.
“My research looks at the deeper layout of how technology replicates social biases against the LGBTQ+ community. Sometimes we are hurting members of queer identities, and it’s replicated in these technologies that we use every day.”
—Bang Nguyen ’22, I.S. Title: Queering NLP: A non-heteronormative approach to quantifying and investigating sentiment bias against LGBTQ+ identities in word embeddings
“Wooster was really generous with its financial aid package, but what resonated with me most was what I heard about the Independent Study and mentored research,” said Nguyen. “The interdisciplinary perspective has been helpful, and it culminates with my I.S. interest in computer science.”
Nguyen’s I.S. project examines the social issues within natural language processing (NLP) technologies that attempt to generate and understand human language. “My research looks at the deeper layout of how technology replicates social biases against the LGBTQ+ community,” said Nguyen. “Sometimes we are hurting members of queer identities, and it’s replicated in these technologies that we use every day.”
NLP is used in machines—including virtual assistants like Apple's Siri and Amazon's Alexa—to understand and respond to text or voice data. Their responses rely on a language model that converts each word of human language into a numerical vector, called a word embedding, that the machine can process.
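To make the idea concrete, here is a minimal sketch of how word embeddings work. The three-dimensional vectors below are invented purely for illustration (real models such as word2vec or GloVe learn hundreds of dimensions from large text corpora), but the comparison step—cosine similarity between vectors—is the standard one:

```python
import math

# Toy, hand-made "embeddings": each word maps to a small numeric vector.
# Real models learn these vectors from text; these values are illustrative only.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.9],
    "apple": [0.1, 0.9, 0.2],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Words used in similar contexts end up with similar vectors, so
# "king" sits closer to "queen" than to "apple" in this toy space.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["apple"]))
```

Because bias research like Nguyen's asks which words sit near which, this similarity measure is the basic tool for probing what a model has learned.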
Twitter and other platforms can block a post before it goes out if their technology flags its word embeddings as toxic. Nguyen said that society—and language—operates with a worldview that promotes heterosexuality as the normal or preferred sexual orientation, so when this bias is captured in word embeddings, words like 'gay' and 'lesbian' often get flagged as toxic.
“I looked at five specific identities in my research: lesbian, gay, bisexual, transgender, and queer,” said Nguyen. “I found biases against all five, and the word embeddings typically associate these with more negative words. ‘Gay,’ for example, is close to weird, ugly, dumb, stupid, and pervert, among other terms.”
He compared an existing language model trained on pre-2015 data with embeddings he trained himself on post-2015 data to see whether time has normalized the bias against the LGBTQ+ community. Nguyen found that the older model contained more extreme bias, while the newer model showed more inclusivity. He called that progress a reflection of social change and the LGBTQ+ experience. “It’s interesting that when machines learn these things, they replicate whatever the human does,” said Nguyen. “Do we want to use these biases, or can we find ways to remove them computationally?”
Kowshik Bhowmik, visiting instructor of computer science, mentored Nguyen throughout the study. He pointed out that formulating the research question was tricky, but his mentee came up with two hypotheses that captured the goal of the work very well.
“Existing research in the field mostly looks at bias from a heteronormative point of view, such as how are biases regarding genders (male/female) embedded in existing word embeddings,” said Bhowmik. “Bang, on the other hand, investigates whether words with negative sentiments are more closely related to LGBTQ+ terms than words with positive sentiments.”
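One simple way to quantify the kind of sentiment bias Bhowmik describes is to compare an identity term's average similarity to a set of negative-sentiment words against its average similarity to a set of positive-sentiment words, in the spirit of association tests used in embedding-bias research. The sketch below uses invented two-dimensional vectors and hypothetical word sets purely for illustration; it is not Nguyen's actual method:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sentiment_bias(identity_vec, positive_vecs, negative_vecs):
    """Mean similarity to negative-sentiment words minus mean similarity
    to positive-sentiment words; a positive score means the identity term
    leans toward the negative cluster."""
    mean_neg = sum(cosine(identity_vec, v) for v in negative_vecs) / len(negative_vecs)
    mean_pos = sum(cosine(identity_vec, v) for v in positive_vecs) / len(positive_vecs)
    return mean_neg - mean_pos

# Invented 2-D vectors for illustration only: the "biased" identity vector
# points toward the negative cluster; the "neutral" one sits between them.
positive = [[1.0, 0.1], [0.9, 0.2]]   # stand-ins for positive-sentiment words
negative = [[0.1, 1.0], [0.2, 0.9]]   # stand-ins for negative-sentiment words
biased_identity  = [0.2, 1.0]
neutral_identity = [0.7, 0.7]

print(sentiment_bias(biased_identity, positive, negative))   # positive: skews negative
print(sentiment_bias(neutral_identity, positive, negative))  # near zero: balanced
```

Running this test over real embeddings for terms like 'gay' or 'lesbian', against curated positive and negative word lists, is one way a study of this kind could score each identity term's bias.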
This I.S. project wasn’t Nguyen’s first venture into research. He invested in experiential learning opportunities across two summers in the Applied Methods and Research Experience (AMRE) program, which gives Wooster students the chance to serve as consultants for local companies. First, he worked as a consultant for a STEM success initiative at the College, analyzing data on students who took STEM courses. The second year, Nguyen worked on quality control processes for The Goodyear Tire & Rubber Company, where his team proposed a new system that uses machine learning to help Goodyear better control tire quality.
Through his I.S. and AMRE experiences, Nguyen built transferable research and writing skills that he’ll use in grad school and his career. He also knows the importance of clearly communicating technical knowledge.
“I want to make science more accessible,” said Nguyen. “Science is already complicated enough, and when you add social issues on top of that it gets worse.” Nguyen plans to focus his studies on fairness and non-discrimination in machine learning this fall when he joins the University of Notre Dame as a Ph.D. student in computer science and engineering.
Posted in Independent Study on July 15, 2022.