The Excel PivotTable is plain awesome. In this post I will show you how to make a PivotTable in R (kind of).
We will use two popular libraries,
reshape2. This exercise is doable with base R (aggregate(), apply() and others), but would leave much to be desired.
One important distinction to keep in mind is that while PivotTables are useful for both analysis and reporting, R is inherently not a great tool for reporting. For example, column and row totals are commonplace in Excel reports, but not in R. Other formatting elements like cascaded columns in PivotTables are also not practical in R. What you do get in R is the ability to summarize and reshape data frames in the same way you do with PivotTables.
We will replicate these three features:
- Basic by-group summaries with filters (sum, count, calculated fields, etc)
- Transpose columns and rows
- Advanced by-group summaries and manipulation
We will use data on marijuana prices, scraped from priceofweed.com by the folks at Mode Analytics. The data import code can be found at the end of this article.
I know nothing about marijuana, so this could be interesting…
Basic PivotTable Summaries
Here is a basic PivotTable:
It’s a three-step process with the
dplyr library in R:
# 1: filter to keep three states. basic_summ = filter(mprices, state %in% c("California", "New York", "Illinois")) # 2: set up data frame for by-group processing. basic_summ = group_by(basic_summ, quality, state) # 3: calculate the three summary metrics basic_summ = summarise(basic_summ, sum_amount = sum(amount), avg_ppo = mean(ppo), avg_ppo2 = sum(price) / sum(amount)) basic_summ
quality state sum_amount avg_ppo avg_ppo2 1 high quality California 5494.925 277.9640 226.9336 2 high quality Illinois 1538.014 376.3324 310.5967 3 high quality New York 2251.540 375.3289 306.3876 4 low quality California 356.148 275.0264 189.7413 5 low quality Illinois 118.911 227.0347 142.5184 6 low quality New York 170.510 349.5067 175.5440 7 medium quality California 6053.553 212.2781 165.8847 8 medium quality Illinois 1351.176 306.1785 218.0745 9 medium quality New York 2316.076 287.7946 222.6205
That’s it! The numbers match the Excel table. avg_ppo is the straight average of the ppo column, while avg_ppo2 is like a calculated field in a Pivot Table.
Note also, as I pointed out earlier, R is not a good tool for reporting, per se. You don’t get a beautifully formatted table as you could in Excel, though there may be packages to help with that.
Sometimes we need to restructure the data set for particular analyses. For example, with a PivotTable we might do this:
I moved the “quality” column to the top of the PivotTable. To transpose the data frame like that, we can use the
dcast() function from the reshape2 library.
dcast() can be used to turn a long data to wide format, while
melt() can be used for turning a wide data set to long:
basic_summ_t = dcast(basic_summ, state ~ quality, value.var = "avg_ppo2") basic_summ_t
state high quality low quality medium quality 1 California 226.9336 189.7413 165.8847 2 Illinois 310.5967 142.5184 218.0745 3 New York 306.3876 175.5440 222.6205
The second parameter in the function,
quality ~ state is written as a formula. You can think of the variable on the left, quality, as the PivotTable row item, and the right, state, as the PivotTable column item.
This was a simple case when we only had one metric, avg_ppo2. But what if we wanted to show multiple metrics, as we do in this PivotTable:
Here we kept two metrics. Let’s do this in R, in a two-step process:
# first melt the data frame to put all the metrics in a single column basic_summ_t2 = melt(basic_summ, id.vars = c("state","quality"), measure.vars = c("avg_ppo2", "sum_amount")) basic_summ_t2
This is what that intermediate data frame looks like. We have converted the data frame from wide to long using the
melt() function. As you can see, all the metrics are now in a single column, value, identified by the variable column.
state quality variable value 1 California high quality avg_ppo2 226.9336 2 Illinois high quality avg_ppo2 310.5967 3 New York high quality avg_ppo2 306.3876 4 California low quality avg_ppo2 189.7413 5 Illinois low quality avg_ppo2 142.5184 6 New York low quality avg_ppo2 175.5440 7 California medium quality avg_ppo2 165.8847 8 Illinois medium quality avg_ppo2 218.0745 9 New York medium quality avg_ppo2 222.6205 10 California high quality sum_amount 5494.9250 11 Illinois high quality sum_amount 1538.0140 12 New York high quality sum_amount 2251.5400 13 California low quality sum_amount 356.1480 14 Illinois low quality sum_amount 118.9110 15 New York low quality sum_amount 170.5100 16 California medium quality sum_amount 6053.5530 17 Illinois medium quality sum_amount 1351.1760 18 New York medium quality sum_amount 2316.0760
So now we can transpose that data frame as we did before with
dcast(), just adding one more variable:
# then transpose the quality and variable co basic_summ_t2 = dcast(basic_summ_t2, state ~ quality + variable, value.var = "value") basic_summ_t2
Notice the only difference between this dcast() and the one earlier is we added the
+ variable to the formula.
state high quality_avg_ppo2 high quality_sum_amount low quality_avg_ppo2 low quality_sum_amount medium quality_avg_ppo2 medium quality_sum_amount 1 California 226.9336 5494.925 189.7413 356.148 165.8847 6053.553 2 Illinois 310.5967 1538.014 142.5184 118.911 218.0745 1351.176 3 New York 306.3876 2251.540 175.5440 170.510 222.6205 2316.076
There you have it. We have transposed a data frame like a PivotTable.
Advanced by-group processing
There are many operations not supported by PivotTables that are PivotTable-like in nature. For example, taking the first PivotTable above, what if we wanted to rank each state within each quality level, based on average price per ounce? You can do that like this:
# first set up the data frame for by-group processing, by quality basic_summ_rank = group_by(basic_summ, quality) # add new column ranking states based on avg_ppo2 basic_summ_rank = mutate(basic_summ_rank, ppo_rank = rank(avg_ppo2)) basic_summ_rank
quality state sum_amount avg_ppo avg_ppo2 ppo_rank 1 high quality California 5494.925 277.9640 226.9336 1 2 high quality Illinois 1538.014 376.3324 310.5967 3 3 high quality New York 2251.540 375.3289 306.3876 2 4 low quality California 356.148 275.0264 189.7413 3 5 low quality Illinois 118.911 227.0347 142.5184 1 6 low quality New York 170.510 349.5067 175.5440 2 7 medium quality California 6053.553 212.2781 165.8847 1 8 medium quality Illinois 1351.176 306.1785 218.0745 2 9 medium quality New York 2316.076 287.7946 222.6205 3
The ppo_rank ranks the three states for high, medium and low qualities separately. E.g., California has the cheapest high-quality stuff but the most expensive low-quality. The key to this is the
group_by() function sets the data frame up for by-group processing. In this case, we are running a process separately for each value of quality.
Notice all the syntax looks similar to the group_by() + summarise() pair we saw earlier.
summarise() collapses the data set and creates a summary, while mutate maintains the same number of rows and simply creates a new column.
Hopefully you got a sense of how to replicate PivotTable-like functionality in R. As with that last example, there are a lot of things you can do with dplyr that PivotTables cannot do.
Setting up the data
Here is how the data was loaded. We will import the data stored on a web server:
# import data from github. Reading http:// is straightforward # A few extra steps needed for https urls mprices = tempfile() download.file("https://raw.githubusercontent.com/mode/blog/master/2013-10-03%20MMap/m_prices.csv",destfile=mprices, method="curl") mprices = read.csv(mprices, stringsAsFactors = FALSE) head(mprices)
state location price quantity quality date_old amount quality_score ppo date state_code 1 Arizona Phoenix, Arizona 220 an ounce high quality 2012-01-01 1 3 220 2.01201e+11 4 2 California Sacramento, California 280 an ounce high quality 2012-01-01 1 3 280 2.01201e+11 6 3 Colorado Grand Junction, Colorado 300 an ounce high quality 2012-01-01 1 3 300 2.01201e+11 8 4 Michigan West Bloomfield, Michigan 20 an ounce high quality 2012-01-01 1 3 20 2.01201e+11 26 5 Minnesota Saint Cloud, Minnesota 300 an ounce medium quality 2012-01-01 1 2 300 2.01201e+11 27 6 Mississippi Tupelo, Mississippi 120 an ounce high quality 2012-01-01 1 3 120 2.01201e+11 28
This includes data on marijuana prices, scraped from priceofweed.com by the folks at href=”http://blog.modeanalytics.com/mapping-marijuana-prices/” target=”_blank”>Mode Analytics. The data frame has about 130k rows of data, and the columns are generally straightforward. Each row represents a marijuana purchase as recorded anonymously by users.
Great article, thank you! Do you know how to add Grand Totals under the summary in R (just like you have in your excel example)?
That’s a great question and one that I get often. The truth is that R doesn’t have a great facility for Grand Total rows and columns because those tend to be more useful for visual reporting. R is not really built with reporting in mind.
That being said, the usual workaround I have seen is to simply append a row labeled Total and the corresponding summary statistics. In the first example above (Basic PivotTable Summaries), you can create that summary row by re-running the group_by() without the “quality” and “state”, then running summarise() on that. That will give you a one-row data frame, which you can then rbind() to the summary data set.
I generally advise against mixing raw data with summary data in the same data frame, but if you really need a solution, that will work.
Hope that helped … 12 days later!
dcast() has a margins parameter that is supposed to do row and total columns:
margins: vector of variable names (can include “grand\_col” and
“grand\_row”) to compute margins for, or TRUE to compute all
margins . Any variables that can not be margined over will be
Unfortunately, while this works in the example in the dcast documentation, I can’t get it to work in my application — it throws an “index out of bounds” error. If you could provide a working example that would be great.
On a different topic, I’m wondering why you don’t discuss rmarkdown and knitr in the context of reporting. I use knitr all the time (with LaTeX) and find it a fantastic addition to R.
Anyway, thanks for the post!
Thanks for chipping in, Robert. dcast() does have a margins parameter which works great when the Totals rows/columns are simple sums. But I find it gets more complicated with averages and other calculated fields because weighting becomes an issue. So I would use margins in the simple scenarios, but most other times I would take the long route I described in my previous comment.
Regarding knitr, I agree it’s great for reporting. I used it a few years ago, but have since taken up Shiny since it was a better solution for my reporting needs at work. I will add it to my pipeline of topics. (FYI, a guest post is always welcome … do I have a volunteer?)
Hello John, Thanks for the paper !
Im a newbie on this, just started this week playing around with R, i need to it in order to analyse some data. My questions is if it is possible to get data from 2 different csv files to build a pivot table in R ? like to correlate variables…
My comments notification is not very good — so I totally missed this comment. Your question is a bit unclear, but I would say yes — import the files, merge them in some way (using the merge() function), then apply some of the techniques above to slice/dice/summarize your combined data set as needed.
I’m very late to the party, but for those browsing the comments, here’s how I’d do it. First, think of your data, merge() is like a vlookup type thing, so basically adds extra columns to a data frame based on one or more joining columns. What I think you want to do is append one file below the other, provided you have the same columns in each file. Here’s how I’d do the latter:
# Read files
f = list.files(“~/path/to/directory”, full.names=T)
dl = lapply(f, read.csv)
# You may want to add extra options to you read call, but this is the basic for putting each file into a list item
# As a side note, explore your list with
head(dl[]) # which shows head of first list item
# Append all files/list items together
d = do.call(“rbind.data.frame”, dl)
Thanks Mike. I agree on that approach for stacking data tables on top of each other (there’s actually a writeup on this site that does what you just said). https://www.rforexcelusers.com/combine-delimited-files-r/
Instead of do.call(rbind), recently I’ve been using about rbind_all from dplyr, which is amazingly fast and most importantly does not require all the tables to have the same columns (which is sometimes the case as you may know).
Thanks for sharing, really simple approach for excel users.
When I tried doing it on my data set – the variables are being read as character and not numeric. I used the as.numeric function but not getting the desired output – can you please advice.
sales = sum(as.numeric(SALES)))
What does your input data look like? If SALES is being read as character, I suspect something in that column is causing it to be read as such — look out for currency symbols and commas. I recommend you correct that upfront rather than coercing it within the summarise() function.
If you share more of your code, or the str() / head() of subcat_summary, it might give more hint as to where the problem lies.
Hi all, I have an Excel file with three columns: Day Hour Price 1 1 100 … … … 1 24 134 2 1 97 … … … 2 24 111 I have 7 days, so 7×24 = 168 rows What I have to do is to create a pivot table that gives me as a result: Day Hour Price 1 1 100 … 1 … 7 1 97 1 2 56 … 2 … 7 2 111 For all 24 hours. How can i do this in R ? Thank you, Alessandro
Perhaps I’m misunderstanding something, but is that really a Pivot Table you’re requesting? It looks like the same 168-row table, simply re-sorted. I assume I’m missing something here…?
If this is indeed what you want, I suggest looking at the arrange() function in the dplyr library. It’s the best function for sorting data frames. arrange(data_name, Day, Hour, Price).
Hi, I appreciate your kindness to share your knowledge. But, to make everyone’s life easy, why don’t you just use the million example datasets freely included in R ? It’s the same problem with your book, which made me to return back in the store. Please use example datasets included with R
That’s a fair point. Dataset choice won’t make everyone happy.
Frankly, I am so tired of seeing the same old boring built-in datasets being used for everything. Also, the people I teach are beginners. There’s nothing better than seeing the full workflow over and over … of importing the data, looking at the head(), str(), etc. Out in the real world, you’ll be dealing with messy csv files, database queries, etc. You will NOT be pulling a built-in dataset from R and use it for actual analysis, ever.
Hi John, how to add sub total after end of every unique value like we do in excel pivot
I’ve never done this because I don’t use R much as a reporting engine. But if I were to do this, I would just compute the totals row manually (the apply function is good for this, or even these dplyr functions with no group_by()) … and then append that row to the data frame with rbind().
It’s really wonky, and perhaps there are libraries that would handle this for you. Considering the growth of Shiny, maybe R is being used more and more for reporting, so someone may have built a tool for this.