[This post was also published on LinkedIn.]
Today I met a colleague in the marketing team in Ireland. He is taking one of these popular online Data Analytics courses and needed help with a cluster analysis assignment. He was stuck, so I offered to take a look.
The exercise involved clustering insurance customers in terms of risk factors. He got as far as running the k-means algorithm and creating some visualizations by following the instructor’s recipe. Then we got to a part where we looked at output like this:
(This is raw output from R.) These are cluster centers: here I have clustered 150 flowers into four groups, and the table shows the average of each measure for each group. The idea is that flowers in the same group are most similar to each other across these four measures.
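For readers who want to produce a table like this themselves, here is a minimal sketch. The post does not show the code behind the output, so this assumes the 150 flowers are R’s built-in iris dataset (150 rows, four numeric measures). The seed is arbitrary, and the names `measures` and `fit` are made up for illustration.

```r
# Assumption: the 150 flowers are R's built-in iris data (not confirmed in the post).
measures <- iris[, 1:4]          # keep the four numeric columns; Species is dropped

set.seed(42)                     # arbitrary seed; k-means starts from random centers
fit <- kmeans(measures, centers = 4, nstart = 25)

fit$centers                      # one row per cluster, one column per measure
table(fit$cluster)               # how many flowers landed in each group
```

The cluster numbers are arbitrary labels, so a different seed may give the same groups in a different order.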
His table had many more columns, and he was trying to make sense of the cluster centers to tell the story of the different groups. Not easy to do on the ugly R console. So I suggested throwing it in Excel and applying some conditional formatting to make it easier… but he did not know what that was. Turns out, he did not really know Excel.
And then it hit me.
EXCEL IS NOT SEXY
Take a look at some of the popular analytics training courses out there. Excel is neither a prerequisite nor part of the curriculum. There are promises of a hot job market and all kinds of charts showing analytics jobs as some of the highest paying. But no mention of the most famous and most successful analysis software on the planet.
For all the crap Microsoft gets, they’ve done at least one thing well: Excel is a damn great application. But the problem is, big data / machine learning / data analytics is such a sexy thing these days, and sadly there is no sex appeal in Excel. Among R instructors you will find many highly impressive PhDs and “data scientists” (I hate that title) who are deep into the cool things like Python, R, Hadoop, yada yada yada. At best, they don’t consider Excel to be that useful, and at worst they look down on it as an amateur playground.
CRAWL BEFORE YOU RUN
Ever since starting my R for Excel Users project, I’ve helped so many people who’ve been frustrated trying to learn R. They skip steps, like learning to build models before really knowing how to work with vectors and data frames. Or they get overwhelmed by learning things like matrices, which are likely not so useful to them anyway. Or, like my friend, they jump into R without having built a solid foundation working with data.
So my advice to him and other students: crawl before you run.
Go ahead, learn R, but take it one step at a time. Build your baseline. Make sure you can crush a PivotTable before trying to build multiple regression models. Create nice charts in Excel before you enter the black hole of visualizations in R (it’s a deep, never-ending hole with lots of goodies inside).
And don’t ignore Excel. Used together, R and Excel can supercharge your career in analytics.
By the way, this is how it looks with quick conditional formatting. Now it’s a bit easier to explain the groups:
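If you want to try this step yourself, one rough way to get the centers out of R and into Excel is a CSV file. The sketch below again assumes the built-in iris data stands in for the flowers; the file name `cluster_centers.csv` and the variable names are made up for illustration.

```r
# Assumption: built-in iris data stands in for the flowers; file name is hypothetical.
set.seed(42)                                   # arbitrary seed
fit <- kmeans(iris[, 1:4], centers = 4, nstart = 25)

centers <- round(fit$centers, 2)               # two decimals are easier to scan
out_file <- "cluster_centers.csv"              # written to the working directory
write.csv(centers, out_file)                   # row names become the cluster column
```

Open the file in Excel, select the numbers in one column, and pick Home > Conditional Formatting > Color Scales. Formatting one column at a time compares each measure across the groups, which is usually what you want when telling the story of each cluster.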
John – I agree with you 100%. Each word of this article and the title resonates with what I see around me. Also, it’s not just the analytics portion but even data collation and cleansing that are getting a makeover in Excel with new add-ins such as Power Query.
Great post! I agree that R can have a “steep learning curve” as they say in the program-learning world. Crawling is recommended before walking and running. Just like learning anything challenging, use easy progressions, right?
By the way, if anyone searches this post out for info on cluster analysis per se, there are many simpler-to-use (and equally powerful) stand-alone programs that perform detailed cluster algorithms. “PAUP” is one of those; find it at: https://paup.phylosolutions.com/