Wednesday, September 21, 2016

Teaching the Machines to Read

I started my training in psychology as a purely quantitative researcher, collecting numerical data that I could analyze with any number of statistical tests. But in my final years of grad school, I became interested in a different approach to data collection - text mining of online information - and a complementary analysis approach - natural language processing.

In 2007, I started working on a collaborative project between Loyola (my grad school) and Chicago Public Schools. Unlike my past research, this project drew on qualitative methods - looking for themes and patterns in narrative text. Over the years that followed, I worked on improving my knowledge of qualitative methods. In fact, that qualitative experience is (part of) what got me my job at the VA.

Around 2010, when I was working on my dissertation, I learned more about text mining and natural language processing - not enough to know how to actually do it, but enough to be dangerous (or, for my Dunning-Kruger fans, enough to know there was a lot to learn). I discovered a Python package called the Natural Language Toolkit (NLTK) and started teaching myself Python. I now know there are other languages that can do the same thing, but I still prefer Python for its readable syntax and data analysis capabilities.
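
For anyone curious what getting started with NLTK actually looks like, here's a minimal sketch - just tokenizing, part-of-speech tagging, and counting words in a made-up sentence (nothing from my actual dissertation data):

# A quick taste of NLTK (the Natural Language Toolkit). The sentence below
# is an invented example, not from any real dataset.
import nltk

# First run only: fetch the tokenizer and tagger models (cached afterwards).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "Pretrial publicity can shape how jurors interpret the evidence."

tokens = nltk.word_tokenize(text)   # split the sentence into word tokens
tagged = nltk.pos_tag(tokens)       # label each token with a part of speech
freq = nltk.FreqDist(w.lower() for w in tokens)  # simple word-frequency count

print(tagged)
print(freq.most_common(5))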

My plan was to use natural language processing (or NLP) for a large-scale content analysis of press coverage of criminal trials (pretrial publicity: the subject of my dissertation), but NLP shows up in many other applications. For instance, search engines use NLP to, in effect, teach computers to read. However, researchers usually train these systems on print sources like the Wall Street Journal, which represent Standard English but don't capture other ways people use language, like slang and dialect. (I've sketched a small example of that newswire bias after the quoted passage below.) But a recent study at the University of Massachusetts Amherst might signal a change:
Using only standard English has left out whole segments of society who use dialects and non-standard varieties of English, and the omission is increasingly problematic, say researchers Brendan O'Connor, an expert in natural language processing (NLP) at the University of Massachusetts Amherst, and Lisa Green, director of the campus' Center for Study of African-American Language. They recently collaborated with computer science doctoral student Su Lin Blodgett on a case study of dialect in online Twitter conversations among African Americans.

The authors believe their study has created the largest data set to date for studying African-American English from online communication, examining 59 million tweets from 2.8 million users.

The researchers identified "new phenomena that are not well known in the literature, such as abbreviations and acronyms used on Twitter, particularly those used by African-American speakers," notes Green. [A co-author] adds, "This is an example of the power of large-scale online data. The size of our data set lets us characterize the breadth and depth of language."
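
To make that Standard-English bias a little more concrete, here's a minimal sketch using NLTK's bundled Wall Street Journal sample (the treebank corpus). The back-off tagger chain and the made-up tweet at the end are my own illustration, not anything from the UMass study:

# Train a simple part-of-speech tagger on newswire text (NLTK's treebank
# sample is drawn from the Wall Street Journal), then watch it struggle
# with an invented tweet written in a non-standard variety of English.
import nltk
from nltk.corpus import treebank

nltk.download("treebank")  # first run only

tagged_sents = treebank.tagged_sents()
split = int(len(tagged_sents) * 0.9)
train, test = tagged_sents[:split], tagged_sents[split:]

# Bigram tagger that backs off to a unigram tagger, then to a default tag.
t0 = nltk.DefaultTagger("NN")
t1 = nltk.UnigramTagger(train, backoff=t0)
t2 = nltk.BigramTagger(train, backoff=t1)

print("Accuracy on held-out WSJ sentences:", t2.evaluate(test))

# Newswire training data doesn't prepare the tagger for tweets like this one.
print(t2.tag("ion even know what u talmbout lol".split()))
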
Not only can understanding different dialects improve the performance of search engines, it can also help with research examining public opinion (on anything from politics to preferences for soda packaging). In order to use language data in this type of research, you have to understand how language is actually being used by your sample (as opposed to how it "should" be used). That last statement is, of course, a matter of some debate in the linguistics community.
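
If you want to analyze language as people actually use it, your tools have to cope with it too. As one small example, NLTK includes a TweetTokenizer built for social media text; the tweet below is invented for illustration:

# NLTK also ships a tokenizer designed for social media text, which handles
# the @-handles, emoticons, and non-standard spellings that a general-purpose
# pipeline tends to mangle.
import nltk
from nltk.tokenize import TweetTokenizer, word_tokenize

nltk.download("punkt")  # needed by word_tokenize; first run only

tweet = "@bestie ion even kno why they trippin lol :-) #fridayvibes"

standard = word_tokenize(tweet)  # general-purpose tokenizer
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
social = tknzr.tokenize(tweet)   # tweet-aware tokenizer

print(standard)  # tends to break '@bestie' and ':-)' into fragments
print(social)    # keeps '#fridayvibes' and ':-)' intact, drops the handle

The tweet-aware tokenizer keeps hashtags and emoticons as single tokens, which matters when those are exactly the signals you want to study.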
