After teaching a full day of data science during NewsCamp on Thursday, Hadley Wickham on Friday morning presented a brief introduction to data science called “Data science for the perplexed. For everyone.”
Wickham is an assistant professor at Rice University and works for RStudio. He describes himself as a statistician by training, and he tried to offer some insight into the buzzword that is “data science.”
He first cited the unofficial, often-used description: “a blend of Red Bull-fueled hacking and espresso-inspired statistics.” He also referenced this often-used Venn diagram.
Data science isn’t just about hacking with big data, he said, because you must also care about statistics. Nor is it merely about statistics: for one thing, it also involves communicating your analysis.
What he can define better, he said, is a data scientist: someone who can ask and answer questions about and with data.
A good data scientist has to have a mix of curiosity and skepticism, Wickham said. Curiosity is key for developing good questions to ask of data; skepticism is needed to make sure you aren’t over-curious and seeing things in the data that aren’t actually there.
The spectrum looks something like this:
Skepticism (inferential statistics) ___________________ Curiosity (visualization)
The first step in Wickham’s analysis is one any CAR reporter could recognize: cleaning, or getting the data from whatever crazy format it arrived in into a format you can analyze. From there, Wickham cycles through three steps: transformation, visualization, and modeling.
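As a rough illustration of that workflow (not Wickham's own code), here is a minimal Python sketch of cleaning followed by one pass through transformation, visualization, and modeling. The records, field names, and function names are all invented for this example, and the "model" is just a least-squares line, chosen for brevity.

```python
# Hypothetical raw records, invented for illustration: messy strings as they
# might arrive from a spreadsheet export.
raw_rows = [
    {"city": " Houston ", "population": "2,100,000", "incidents": "420"},
    {"city": "Austin", "population": "850,000", "incidents": "153"},
    {"city": "dallas", "population": "1,250,000", "incidents": "n/a"},
    {"city": "El Paso", "population": "670,000", "incidents": "107"},
]


def clean(rows):
    """Cleaning: normalize names, parse numbers, drop rows that cannot be parsed."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({
                "city": row["city"].strip().title(),
                "population": int(row["population"].replace(",", "")),
                "incidents": int(row["incidents"]),
            })
        except ValueError:
            continue  # e.g. "n/a" in a numeric column
    return cleaned


def transform(rows):
    """Transformation: derive incidents per 100,000 residents."""
    return [
        dict(row, rate=row["incidents"] / row["population"] * 100_000)
        for row in rows
    ]


def visualize(rows):
    """Visualization: a crude text bar chart, one '#' per 2 incidents per 100,000."""
    return [f"{row['city']:<8} {'#' * round(row['rate'] / 2)}" for row in rows]


def model(rows):
    """Modeling: least-squares line of incidents against population."""
    xs = [row["population"] for row in rows]
    ys = [row["incidents"] for row in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


rows = transform(clean(raw_rows))
chart = visualize(rows)
slope, intercept = model(rows)
```

In practice the loop repeats: what the chart or the fitted line suggests usually sends you back to transform the data differently and look again.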
To cycle through these steps and perform the jobs of a data scientist, you must know how to program. Wickham outlined what he sees as the programming languages a data scientist should know:
For dealing with certain types of data: SQL, regex and XPath
Helpful for building tools that use data science: C/C++, Fortran, Scala
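To give a feel for the data-handling languages on that list, here is a small Python sketch that touches SQL, regex and XPath in turn, using only the standard library. The table, text, and XML are made up for illustration; none of it comes from the talk.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# SQL: total up rows in an in-memory SQLite database (hypothetical table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donations (donor TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO donations VALUES (?, ?)",
    [("Smith", 250.0), ("Jones", 100.0), ("Smith", 50.0)],
)
totals = conn.execute(
    "SELECT donor, SUM(amount) FROM donations GROUP BY donor ORDER BY donor"
).fetchall()
conn.close()

# Regex: pull ISO-style dates out of free text (invented sentence).
text = "Filed 2013-03-01; amended 2013-04-15."
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# XPath: ElementTree supports a limited XPath subset, enough to filter
# elements by attribute value.
doc = ET.fromstring(
    "<agencies>"
    "<agency type='city'><name>Parks</name></agency>"
    "<agency type='state'><name>Transit</name></agency>"
    "</agencies>"
)
city_names = [el.text for el in doc.findall("./agency[@type='city']/name")]
```

Each of the three is a small, specialized language for one shape of data: tables, text patterns, and tree-structured documents.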