Su Lin Blodgett, a graduate student in computer science at the University of Massachusetts, recently presented a paper at the annual meeting of the Association for Computational Linguistics in Melbourne, Australia. Her research focuses on improving how English-language parsing tools handle the words, phrases, and alternate spellings used by millions of African Americans on social media.
Current natural language processing (NLP) tools are trained on mainstream American English and, as a result, perform poorly on Twitter, where text deviates from this standard in many ways, including non-standard spelling, punctuation, capitalization, syntax and hashtags, the paper's authors point out. They add that dialects such as African-American English, spoken by millions of individuals, contain language features not present in standard English.
Expanding the linguistic coverage of NLP tools to include minority and colloquial dialects would allow the thoughts and ideas of more individuals and groups to be included in areas such as opinion and sentiment analysis, Blodgett points out. For example, if a political campaign used a standard NLP tool to analyze opinions on Twitter but did not capture what African Americans are saying, the tool could miss a significant portion of the electorate and vastly misinterpret overall sentiment.