
Hadley Wickham explains data science for the perplexed

After teaching a full day of data science during NewsCamp on Thursday, Hadley Wickham on Friday morning presented a brief introduction to data science called “Data science for the perplexed. For everyone.”

Wickham is an Assistant Professor at Rice University and works for RStudio. He describes himself as a statistician by training, and he tried to offer some insight into the buzzword that is “data science.”

He first cited the unofficial and often-used description: “a blend of Red Bull-fueled hacking and espresso-inspired statistics.” He also referenced this often-used Venn diagram.

Data science isn’t just about hacking with big data, he said, because you must also care about statistics. Nor is it merely about statistics -- for one, it involves communicating your analysis.

What he can define better, he said, is what a data scientist is: someone who can ask and answer questions about and with data.

A good data scientist has to have a mix of curiosity and skepticism, Wickham said. Curiosity is key for developing good questions to ask of the data; skepticism is needed to make sure you aren’t over-curious and seeing things in the data that aren’t actually there.

The spectrum looks something like this:

Skepticism ___________________ Curiosity
Inferential statistics         Visualization

The first step in Wickham’s analysis is one any CAR reporter could recognize: cleaning -- getting the data from whatever crazy format it arrived in into a format you can analyze. From there, Wickham cycles through three steps: transformation, visualization, and modeling. 

I’m not going to attempt explaining the steps here, but you can see Wickham’s course slides in Dropbox. NewsCamp attendee Sisi Wei also made her notes available via Google Docs.

To cycle through these steps and perform the jobs of a data scientist, you must know how to program. Wickham outlined what he sees as the programming languages a data scientist should know:

The essentials: R, JavaScript and Python
For dealing with certain types of data: SQL, regex and XPath
Helpful for building tools that use data science: C/C++, Fortran, Scala
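
For readers unsure what the middle group is for, here is a minimal sketch in Python of the kind of data each one handles. It is not taken from Wickham's talk. The records, table name, phone numbers and XML snippet are all made up for illustration, and the XPath support used here is the limited subset built into Python's standard library.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Made-up records, for illustration only.
rows = [("Columbia", 120), ("Houston", 340), ("Columbia", 80)]

# SQL: ask questions of tabular data (here, an in-memory SQLite database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (city TEXT, fee INTEGER)")
conn.executemany("INSERT INTO permits VALUES (?, ?)", rows)
totals = conn.execute(
    "SELECT city, SUM(fee) FROM permits GROUP BY city ORDER BY city"
).fetchall()
conn.close()

# Regex: pull patterns, such as phone numbers, out of free text.
text = "Call 573-555-0100 or 713-555-0199 for details."
phones = re.findall(r"\d{3}-\d{3}-\d{4}", text)

# XPath: select specific nodes from an XML (or HTML-like) document.
doc = ET.fromstring(
    "<list><item type='a'>one</item><item type='b'>two</item></list>"
)
selected = [el.text for el in doc.findall(".//item[@type='b']")]

print(totals)
print(phones)
print(selected)
```

In short, SQL suits tables, regex suits messy text, and XPath suits nested documents such as scraped web pages.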
