By Anna Boiko-Weyrauch
It’s been nothing but unrequited love between computational linguists and journalists, until now. For years, linguists have parsed the English language by examining news articles, Noah Smith, an associate professor at Carnegie Mellon University, said at the NewsCamp::Text as Data workshop on Thursday morning.
“You may not know this,” he said, “but there’s a creepy field out there that watches what you do, gathers your articles and then goes and does science with it.”
The media finally showed some love and paid attention to Smith’s research a few years ago. His team analyzed the sentiment of Twitter posts containing the word “jobs,” and the results mirrored a Gallup poll on consumer confidence. In the chart projected at the front of the room, the blue line representing tweets followed the squiggles of the Gallup poll.
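For readers curious what such an analysis might look like in practice, here is a minimal sketch in Python. It is not Smith's team's method, which this article does not describe. It assumes a simple word-counting approach: for each day, count the tweets mentioning "jobs" that contain a positive word, divide by the count that contain a negative word, and smooth the result. The word lists, the `daily_sentiment` and `moving_average` names, and the `(day, text)` input format are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy word lists; a real analysis would need a much larger,
# carefully vetted lexicon.
POSITIVE = {"good", "great", "hiring", "growth", "love"}
NEGATIVE = {"bad", "lost", "layoffs", "cuts", "hate"}


def daily_sentiment(tweets, keyword="jobs"):
    """Return {day: positive_count / negative_count} for tweets with keyword.

    `tweets` is an iterable of (day, text) pairs. Days with no negative
    tweets are skipped to avoid dividing by zero.
    """
    pos = defaultdict(int)
    neg = defaultdict(int)
    for day, text in tweets:
        words = set(text.lower().split())
        if keyword not in words:
            continue  # only tweets that mention the keyword are counted
        if words & POSITIVE:
            pos[day] += 1
        if words & NEGATIVE:
            neg[day] += 1
    return {day: pos[day] / neg[day] for day in sorted(neg) if neg[day] > 0}


def moving_average(values, window=3):
    """Smooth a list of numbers with a trailing moving average."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

The smoothed daily ratios could then be plotted next to a poll series to see whether the two lines move together. Splitting on whitespace and matching single words is crude, which is exactly the kind of shortcut Smith's warnings later in the talk apply to.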
Smith demonstrated a number of other examples of using text as data in Thursday’s presentation, ranging from more accessible analyses to “rocket science” featuring a diagram of Greek symbols.
His team also analyzed Twitter data by geographic region and plotted the results on a map of the United States, Smith said. Remember, folks: if you tweet from your phone, the tweet often includes your GPS coordinates. The researchers discovered regional topics, both in what people say and in the way they use language.
For example: tacos are important in Southern California, and cabs dominate tweets in NYC. No surprises there. But did you know that the expression “;p” — which is winking and sticking your tongue out at the same time — is something people mostly do in Boston?
Laughs aside, the biggest message Smith had for the conference was caution. A lot of people want to “take some text and predict something real about the world,” but the way to do it is complicated. He said there need to be higher-level tools for journalists to analyze text, so they don’t have to get caught up in the technical details. There are open-source tools out there (he said he did not feel comfortable recommending any one), and some might work better than others, but it depends on what you want them to do, he said.
Overall, interrogate your methods and remember that computer programming and data analysis are two different skills. Don’t necessarily trust your intuition when it comes to language, he said. Text analysis is a kind of engineering, so you have to evaluate your tools. Ask: “Is this analysis doing what I want it to do? How can I be sure?”
“These questions keep honest academics up at night,” Smith said.
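One modest way to act on that advice is to check a tool against a small set of examples labeled by hand. The sketch below is an illustration, not something shown at the workshop: `evaluate` and `naive_sentiment` are hypothetical names, and the naive classifier is deliberately simplistic to show how an intuitive rule can go wrong.

```python
def naive_sentiment(text):
    """A deliberately simplistic classifier: 'positive' if the word 'good' appears."""
    return "positive" if "good" in text.lower().split() else "negative"


def evaluate(predict, labeled_examples):
    """Compare a classifier's output with hand labels.

    Returns (accuracy, errors), where errors is a list of
    (text, expected_label, predicted_label) tuples for every miss.
    """
    correct = 0
    errors = []
    for text, expected in labeled_examples:
        predicted = predict(text)
        if predicted == expected:
            correct += 1
        else:
            errors.append((text, expected, predicted))
    accuracy = correct / len(labeled_examples) if labeled_examples else 0.0
    return accuracy, errors


# A tiny hand-labeled sample; a real check would use many more examples.
hand_labeled = [
    ("good news on jobs", "positive"),
    ("not good at all", "negative"),
    ("terrible report", "negative"),
    ("a good day", "positive"),
]

accuracy, errors = evaluate(naive_sentiment, hand_labeled)
```

Reading through `errors` is often more informative than the accuracy number alone. Here the rule trips on "not good at all," a reminder that intuition about language can mislead.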
Anna Boiko-Weyrauch is a graduate student at the University of Missouri's School of Journalism.