I'm pleased to announce that my part in writing my book, Data Smart, is pretty much done. Phew! So that took about 8 full months to write. It's got 10 very thorough chapters on aspects of data science -- if you ever wondered how a Big M constraint in optimization is like a dead squirrel or how Megaman X is related to naive Bayes then this book is for you. The book will come out October 28, so preorder now. And if you'd like a sample chapter, just sign up for my newsletter, and I'll send you one shortly.
So! Everything is turned in, edited, spreadsheets are checked, R code is checked. I've even assembled a playlist to go along with the middle school dance floor clustering tutorial.
Now that my work is done, I thought I'd take a moment to reflect on the process of teaching data science and what it was like to use spreadsheets. I can honestly say that spreadsheets were a really nice way to teach data science concepts while at times also being slightly frustrating. What made spreadsheets so great for Data Smart?
1. At every step along the way in an algorithm, I could show exactly what was happening to the data. You get to see every state of your data. It's like seeing hotdogs being assembled from beef, to slurry, to wiener. An example:
Here's a screen shot from the chapter on boosting, where we're looking at how some decision stumps for predicting pregnant customers at a big box retailer are splitting apart the training data.
The stumps get evaluated based on their weighted error. And the winner gets selected.
The winning stump receives an alpha value for weighting its vote in the final ensemble model.
New weights are assigned to the training data and we select a new decision stump.
I love that approach. While it might be a bit staid, it's great for learning. And the feeling of doing all the steps in building a model and then later (in Chapter 10 of the book) building that same model in R and getting the same answer...that feeling is awesome. Why? Because you're not just some R data science script kiddie anymore. Sure, you're using the packages, but you now know exactly what Hyndman's forecast package is doing. That is cool.
2. Some algorithms just feel natural using the "drag the formula down to fill the cell" approach that you have in Excel. It's like an artisanal apply() function. ;-) For example, when looking at the error correction formulas for Holt-Winters, you can do a single time period, and then a second one, and then drag everything down. It feels a bit like induction.
3. Spreadsheets are great for teaching predictive modeling/forecasting, data mining/graphing, and optimization modeling. While many of the techniques are opaque in R when you use packages, if you do them by hand in R, they're actually pretty clear. Except for optimization. If you want to teach other modeling techniques plus optimization in R then you're kinda screwed, because all the optimization hooks in R just take a full-on constraint matrix and a right-hand side vector. Contrast this with Excel Solver, where you get to build constraints individually. It's totally better for teaching. Now, that said, Python has some nice hooks into optimization modeling that would be similar to Excel. Because spreadsheets are so nice for viewing and prepping data, laying out objective functions and constraints, and then optimizing, algorithms such as modularity maximization using branch and bound plus divisive clustering can be taught there, and they're actually easier to see than they would be in nearly any other environment. Plus, if you're careful you can actually cluster data better than even Gephi's native Louvain method implementation can. Bam!
4. Quite simply, I didn't need to teach any code in the book. Yes, in two places I have the reader record a macro of some clicks and then press the macro shortcut key a couple of times, but that's it. And actually watching this loop run using keypresses is in itself a valuable lesson for those who don't intuitively get how something like a Monte Carlo simulation works.
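For readers who would rather see the boosting steps from point 1 outside a spreadsheet, here is a minimal Python sketch of a single round: score each stump by weighted error, pick the winner, give it an alpha, and reweight the training rows. It assumes the textbook AdaBoost formulas (alpha = 0.5 * ln((1 - error) / error) and exponential reweighting), which may not match the book's spreadsheet cell for cell. The toy data, the thresholds, and the names `stump_predict` and `boosting_round` are all made up for illustration.

```python
import math

def stump_predict(x, threshold):
    """Hypothetical one-feature decision stump: +1 if x >= threshold, else -1."""
    return 1 if x >= threshold else -1

def boosting_round(xs, ys, weights, thresholds):
    """Run one round of (textbook) AdaBoost over a set of candidate stumps."""
    # Step 1: evaluate every stump by its weighted error and keep the best.
    best_threshold, best_error = None, float("inf")
    for t in thresholds:
        error = sum(w for x, y, w in zip(xs, ys, weights)
                    if stump_predict(x, t) != y)
        if error < best_error:
            best_threshold, best_error = t, error

    # Step 2: the winner gets an alpha that weights its vote in the ensemble.
    # The error is clamped so a perfect stump does not divide by zero.
    err = min(max(best_error, 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - err) / err)

    # Step 3: raise the weight of misclassified rows, lower the rest,
    # then renormalize so the weights sum to 1.
    new_weights = [w * math.exp(-alpha * y * stump_predict(x, best_threshold))
                   for x, y, w in zip(xs, ys, weights)]
    total = sum(new_weights)
    new_weights = [w / total for w in new_weights]
    return best_threshold, best_error, alpha, new_weights

# Toy training data (invented): one feature, labels in {-1, +1}.
xs = [1, 2, 3, 4, 5]
ys = [-1, -1, 1, -1, 1]
weights = [0.2] * 5
thresholds = [1.5, 2.5, 3.5, 4.5]

threshold, error, alpha, new_weights = boosting_round(xs, ys, weights, thresholds)
print(threshold, error, alpha, new_weights)
```

Feeding `new_weights` back in as `weights` gives the next round, which is the code version of dragging the same block of formulas down the sheet.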
So there are a few things I really enjoyed about using spreadsheets to teach data science. Where did the spreadsheets fail?
1. Visualization. Visualization in Excel is nice when there's native support for the particular type of graph you want. But if you want a fan chart or a correlogram with critical values marked, then things get slightly annoying. You can often graph what you need by doing formatting cartwheels. Grrrrr.
A correlogram with marked critical values. Doable but annoying in Excel.
2. Spreadsheets are ugly for true matrix math. The beauty of something like R versus Excel becomes most apparent not when performing boosting or bagging or clustering or any of these more complex things. The place where it's most glaring is in taking a t test in a multiple regression by hand. Why? Because you have to do matrix inversions on large portions of data in order to get the standard error of the regression coefficients. And that's just unattractive in Excel. Sure, Excel does it for you using the LINEST function, but I wanted to teach t tests from the ground up. R would have been better there.
3. Spreadsheets are occasionally slow. While Solver is awesome for teaching, its simplex and evolutionary algo implementations aren't going to blind anyone with speed. That's why in the book I recommend using OpenSolver plugged into Excel any time the reader can.
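To make the matrix math from point 2 concrete, here is a rough sketch of the regression t test done by hand in Python: invert X'X, use its diagonal to get the standard errors of the coefficients, and divide each coefficient by its standard error. It sticks to the standard library so the inversion itself is visible. The helper names (`invert`, `ols_t_stats`) and the tiny data set are invented for illustration, and the example uses just an intercept and one predictor so the numbers can be checked by hand. The same function should handle more columns.

```python
import math

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def invert(m):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols_t_stats(X, y):
    """Return coefficients, standard errors, and t statistics for y ~ X."""
    n, k = len(X), len(X[0])
    Xt = transpose(X)
    xtx_inv = invert(matmul(Xt, X))              # (X'X)^-1
    xty = matmul(Xt, [[v] for v in y])           # X'y as a column
    beta = [row[0] for row in matmul(xtx_inv, xty)]
    residuals = [yi - sum(b * x for b, x in zip(beta, row))
                 for row, yi in zip(X, y)]
    sigma2 = sum(r * r for r in residuals) / (n - k)   # residual variance
    se = [math.sqrt(sigma2 * xtx_inv[j][j]) for j in range(k)]
    t = [b / s for b, s in zip(beta, se)]
    return beta, se, t

# Invented data: a column of 1s for the intercept plus one predictor.
X = [[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta, se, t = ols_t_stats(X, y)
print(beta, se, t)
```

Each t statistic would then be compared against a t distribution with n - k degrees of freedom, which is the step LINEST hides.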
Anyway, I think that on balance the book is extremely powerful as a teaching tool, especially for a particular type of student...a student like me. Someone who has a deep-seated fear of script-kiddie-ness. Someone who needs to touch and see the data in order to believe. I am the Doubting Thomas of data scientists, but once I do work through a problem piece by piece, then I'm able to internalize a confidence in the technique. I know when and how to use it. Then and only then am I happy to stand on the shoulders of R packages and get work done.