After teaching a full day of data science during NewsCamp on Thursday, Hadley Wickham on Friday morning presented a brief introduction to data science called “Data science for the perplexed. For everyone.”
Wickham is an assistant professor at Rice University and also works for RStudio. He describes himself as a statistician by training, and he tried to offer some insight into the buzzword that is “data science.”
He first cited the unofficial and often-used description: “a blend of Red Bull-fueled hacking and espresso-inspired statistics.” He also referenced this often-used Venn diagram.
Data science isn’t just about hacking with big data, he said, because you must also care about statistics. It’s also not merely about statistics -- for one, it also involves communicating your analysis.
What he can define better, he said, is a data scientist: someone who can ask and answer questions about and with data.
A good data scientist has to have a mix of curiosity and skepticism, Wickham said. Curiosity is key for developing good questions to ask of the data; skepticism is needed to make sure you aren’t over-curious and seeing things in the data that aren’t actually there.
The spectrum looks something like this:
Skepticism (inferential statistics) ___________________ Curiosity (visualization)
The first step in Wickham’s analysis is one any CAR reporter could recognize: cleaning -- getting the data from whatever crazy format it arrived in into a format you can analyze. From there, Wickham cycles through three steps: transformation, visualization, and modeling.
I’m not going to attempt explaining the steps here, but you can see Wickham’s course slides in Dropbox. NewsCamp attendee Sisi Wei also made her notes available via Google Docs.
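For readers who want a rough feel for the workflow anyway, here is a minimal sketch of one pass through cleaning, transformation, visualization and modeling. It is not taken from Wickham's slides: the data is made up, the function names (`clean`, `transform`, `visualize`, `fit_line`) are hypothetical, and Python's standard library stands in for whatever tools he actually used.

```python
from collections import defaultdict

# Made-up raw records: inconsistent casing, stray whitespace, one unusable value.
raw_rows = [
    " Houston , 2010, 12 ",
    "houston,2011,15",
    "HOUSTON,2012,n/a",
    "Dallas,2010,7",
    "dallas ,2011, 9",
    "Dallas,2012,11",
]


def clean(rows):
    """Cleaning: normalize names, parse numbers, drop rows that cannot be parsed."""
    cleaned = []
    for row in rows:
        city, year, count = (field.strip() for field in row.split(","))
        if not count.isdigit():
            continue  # skip rows such as "n/a" rather than guessing a value
        cleaned.append({"city": city.title(), "year": int(year), "count": int(count)})
    return cleaned


def transform(records):
    """Transformation: group (year, count) pairs by city, ordered by year."""
    by_city = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["year"]):
        by_city[rec["city"]].append((rec["year"], rec["count"]))
    return dict(by_city)


def visualize(by_city):
    """Visualization: a crude text bar chart, one line per city and year."""
    lines = []
    for city, points in by_city.items():
        for year, count in points:
            lines.append(f"{city:<8}{year} {'#' * count}")
    return lines


def fit_line(points):
    """Modeling: ordinary least-squares slope and intercept for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


records = clean(raw_rows)
by_city = transform(records)
chart = visualize(by_city)
models = {city: fit_line(points) for city, points in by_city.items()}

print("\n".join(chart))
for city, (slope, intercept) in models.items():
    print(f"{city}: about {slope:.1f} more per year")
```

In practice the last three steps repeat: a chart or a model suggests a new question, which sends you back to transform the data again.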
To cycle through these steps and perform the jobs of a data scientist, you must know how to program. Wickham outlined what he sees as the programming languages a data scientist should know:
For dealing with certain types of data: SQL, regex and XPath
Helpful for building tools that use data science: C/C++, Fortran, Scala
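To give a sense of why SQL, regex and XPath each earn a place on that list, here is a small sketch with one toy task apiece. The table, text and XML are invented for illustration, and Python's standard library is used only as a convenient host for all three; Wickham did not present this code.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# SQL: aggregate tabular data (an in-memory SQLite table with made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE talks (speaker TEXT, minutes INTEGER)")
conn.executemany(
    "INSERT INTO talks VALUES (?, ?)",
    [("A", 30), ("B", 45), ("A", 15)],
)
totals = conn.execute(
    "SELECT speaker, SUM(minutes) FROM talks GROUP BY speaker ORDER BY speaker"
).fetchall()
conn.close()

# Regex: pull structured values (ISO-style dates) out of free text.
text = "Filed 2012-03-01, amended 2012-04-15."
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# XPath: select nodes from XML. Note that ElementTree supports only a
# subset of XPath, which is enough for this attribute filter.
doc = ET.fromstring(
    "<report><row type='total'>12</row><row type='detail'>5</row>"
    "<row type='total'>8</row></report>"
)
total_values = [int(node.text) for node in doc.findall(".//row[@type='total']")]

print(totals)
print(dates)
print(total_values)
```

Each one targets a different shape of data: SQL for tables, regex for patterns buried in text, and XPath for tree-structured documents such as XML or HTML.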