Skip to content

How to Make a PivotTable in R

The Excel PivotTable is plain awesome. In this post I will show you how to make a PivotTable in R (kind of).

We will use two popular libraries, dplyr and reshape2. This exercise is doable with base R (aggregate(), apply() and others), but would leave much to be desired.

One important distinction to keep in mind is that while PivotTables are useful for both analysis and reporting, R is inherently not a great tool for reporting. For example, column and row totals are commonplace in Excel reports, but not in R. Other formatting elements like cascaded columns in PivotTables are also not practical in R. What you do get in R is the ability to summarize and reshape data frames in the same way you do with PivotTables.

Goal

We will replicate these three features:

  1. Basic by-group summaries with filters (sum, count, calculated fields, etc)
  2. Transpose columns and rows
  3. Advanced by-group summaries and manipulation

We will use data on marijuana prices, scraped from priceofweed.com by the folks at Mode Analytics. The data import code can be found at the end of this article.

I know nothing about marijuana, so this could be interesting…

Basic PivotTable Summaries

Here is a basic PivotTable:

PivotTable in R

It’s a three-step process with the dplyr library in R:

# 1: filter to keep three states.  
basic_summ = filter(mprices, state %in% c("California", "New York", "Illinois"))
 
# 2: set up data frame for by-group processing.  
basic_summ = group_by(basic_summ, quality, state)
 
# 3: calculate the three summary metrics
basic_summ = summarise(basic_summ, 
                        sum_amount = sum(amount),
                        avg_ppo = mean(ppo),
                        avg_ppo2 = sum(price) / sum(amount))
 
basic_summ

Result…

         quality      state sum_amount  avg_ppo avg_ppo2
1   high quality California   5494.925 277.9640 226.9336
2   high quality   Illinois   1538.014 376.3324 310.5967
3   high quality   New York   2251.540 375.3289 306.3876
4    low quality California    356.148 275.0264 189.7413
5    low quality   Illinois    118.911 227.0347 142.5184
6    low quality   New York    170.510 349.5067 175.5440
7 medium quality California   6053.553 212.2781 165.8847
8 medium quality   Illinois   1351.176 306.1785 218.0745
9 medium quality   New York   2316.076 287.7946 222.6205

That’s it! The numbers match the Excel table. avg_ppo is the straight average of the ppo column, while avg_ppo2 is like a calculated field in a Pivot Table.

Note also, as I pointed out earlier, R is not a good tool for reporting, per se. You don’t get a beautifully formatted table as you could in Excel, though there may be packages to help with that.

Transposed Summaries

Sometimes we need to restructure the data set for particular analyses. For example, with a PivotTable we might do this:

Transpose in R

I moved the “quality” column to the top of the PivotTable. To transpose the data frame like that, we can use the dcast() function from the reshape2 library. dcast() can be used to turn a long data to wide format, while melt() can be used for turning a wide data set to long:

basic_summ_t = dcast(basic_summ, state ~ quality, value.var = "avg_ppo2")
basic_summ_t
       state high quality low quality medium quality
1 California     226.9336    189.7413       165.8847
2   Illinois     310.5967    142.5184       218.0745
3   New York     306.3876    175.5440       222.6205

The second parameter in the function, quality ~ state is written as a formula. You can think of the variable on the left, quality, as the PivotTable row item, and the right, state, as the PivotTable column item.

This was a simple case when we only had one metric, avg_ppo2. But what if we wanted to show multiple metrics, as we do in this PivotTable:

Transpose data frame in R

Here we kept two metrics. Let’s do this in R, in a two-step process:

# first melt the data frame to put all the metrics in a single column
basic_summ_t2 = melt(basic_summ, id.vars = c("state","quality"), 
                                 measure.vars = c("avg_ppo2", "sum_amount"))
basic_summ_t2

This is what that intermediate data frame looks like. We have converted the data frame from wide to long using the melt() function. As you can see, all the metrics are now in a single column, value, identified by the variable column.

        state        quality   variable     value
1  California   high quality   avg_ppo2  226.9336
2    Illinois   high quality   avg_ppo2  310.5967
3    New York   high quality   avg_ppo2  306.3876
4  California    low quality   avg_ppo2  189.7413
5    Illinois    low quality   avg_ppo2  142.5184
6    New York    low quality   avg_ppo2  175.5440
7  California medium quality   avg_ppo2  165.8847
8    Illinois medium quality   avg_ppo2  218.0745
9    New York medium quality   avg_ppo2  222.6205
10 California   high quality sum_amount 5494.9250
11   Illinois   high quality sum_amount 1538.0140
12   New York   high quality sum_amount 2251.5400
13 California    low quality sum_amount  356.1480
14   Illinois    low quality sum_amount  118.9110
15   New York    low quality sum_amount  170.5100
16 California medium quality sum_amount 6053.5530
17   Illinois medium quality sum_amount 1351.1760
18   New York medium quality sum_amount 2316.0760

So now we can transpose that data frame as we did before with dcast(), just adding one more variable:

# then transpose the quality and variable co
basic_summ_t2 = dcast(basic_summ_t2, state ~ quality + variable, value.var = "value")
basic_summ_t2

Notice the only difference between this dcast() and the one earlier is we added the + variable to the formula.

       state high quality_avg_ppo2 high quality_sum_amount low quality_avg_ppo2 low quality_sum_amount medium quality_avg_ppo2 medium quality_sum_amount
1 California              226.9336                5494.925             189.7413                356.148                165.8847                  6053.553
2   Illinois              310.5967                1538.014             142.5184                118.911                218.0745                  1351.176
3   New York              306.3876                2251.540             175.5440                170.510                222.6205                  2316.076

There you have it. We have transposed a data frame like a PivotTable.

Advanced by-group processing

There are many operations not supported by PivotTables that are PivotTable-like in nature. For example, taking the first PivotTable above, what if we wanted to rank each state within each quality level, based on average price per ounce? You can do that like this:

# first set up the data frame for by-group processing, by quality
basic_summ_rank = group_by(basic_summ, quality)
 
# add new column ranking states based on avg_ppo2
basic_summ_rank = mutate(basic_summ_rank, ppo_rank = rank(avg_ppo2))
basic_summ_rank
         quality      state sum_amount  avg_ppo avg_ppo2 ppo_rank
1   high quality California   5494.925 277.9640 226.9336        1
2   high quality   Illinois   1538.014 376.3324 310.5967        3
3   high quality   New York   2251.540 375.3289 306.3876        2
4    low quality California    356.148 275.0264 189.7413        3
5    low quality   Illinois    118.911 227.0347 142.5184        1
6    low quality   New York    170.510 349.5067 175.5440        2
7 medium quality California   6053.553 212.2781 165.8847        1
8 medium quality   Illinois   1351.176 306.1785 218.0745        2
9 medium quality   New York   2316.076 287.7946 222.6205        3

The ppo_rank ranks the three states for high, medium and low qualities separately. E.g., California has the cheapest high-quality stuff but the most expensive low-quality. The key to this is the group_by() function sets the data frame up for by-group processing. In this case, we are running a process separately for each value of quality.

Notice all the syntax looks similar to the group_by() + summarise() pair we saw earlier. summarise() collapses the data set and creates a summary, while mutate maintains the same number of rows and simply creates a new column.

Conclusion

Hopefully you got a sense of how to replicate PivotTable-like functionality in R. As with that last example, there are a lot of things you can do with dplyr that PivotTables cannot do.

Appendix

Setting up the data

Here is how the data was loaded. We will import the data stored on a web server:

# import data from github.  Reading http:// is straightforward
# A few extra steps needed for https urls
mprices = tempfile()
download.file("https://raw.githubusercontent.com/mode/blog/master/2013-10-03%20MMap/m_prices.csv",destfile=mprices, method="curl")
mprices = read.csv(mprices, stringsAsFactors = FALSE)
head(mprices)
        state                  location price quantity        quality   date_old amount quality_score ppo        date state_code
1     Arizona          Phoenix, Arizona   220 an ounce   high quality 2012-01-01      1             3 220 2.01201e+11          4
2  California    Sacramento, California   280 an ounce   high quality 2012-01-01      1             3 280 2.01201e+11          6
3    Colorado  Grand Junction, Colorado   300 an ounce   high quality 2012-01-01      1             3 300 2.01201e+11          8
4    Michigan West Bloomfield, Michigan    20 an ounce   high quality 2012-01-01      1             3  20 2.01201e+11         26
5   Minnesota    Saint Cloud, Minnesota   300 an ounce medium quality 2012-01-01      1             2 300 2.01201e+11         27
6 Mississippi       Tupelo, Mississippi   120 an ounce   high quality 2012-01-01      1             3 120 2.01201e+11         28

This includes data on marijuana prices, scraped from priceofweed.com by the folks at href=”http://blog.modeanalytics.com/mapping-marijuana-prices/” target=”_blank”>Mode Analytics. The data frame has about 130k rows of data, and the columns are generally straightforward. Each row represents a marijuana purchase as recorded anonymously by users.

Comments

  1. Hi John,

    Great article, thank you! Do you know how to add Grand Totals under the summary in R (just like you have in your excel example)?

    Thanks,
    Viola

    1. Hi Viola,

      That’s a great question and one that I get often. The truth is that R doesn’t have a great facility for Grand Total rows and columns because those tend to be more useful for visual reporting. R is not really built with reporting in mind.

      That being said, the usual workaround I have seen is to simply append a row labeled Total and the corresponding summary statistics. In the first example above (Basic PivotTable Summaries), you can create that summary row by re-running the group_by() without the “quality” and “state”, then running summarise() on that. That will give you a one-row data frame, which you can then rbind() to the summary data set.

      I generally advise against mixing raw data with summary data in the same data frame, but if you really need a solution, that will work.

      Hope that helped … 12 days later!

      John

      1. dcast() has a margins parameter that is supposed to do row and total columns:

        margins: vector of variable names (can include “grand\_col” and
        “grand\_row”) to compute margins for, or TRUE to compute all
        margins . Any variables that can not be margined over will be
        silently dropped.

        Unfortunately, while this works in the example in the dcast documentation, I can’t get it to work in my application — it throws an “index out of bounds” error. If you could provide a working example that would be great.

        On a different topic, I’m wondering why you don’t discuss rmarkdown and knitr in the context of reporting. I use knitr all the time (with LaTeX) and find it a fantastic addition to R.

        Anyway, thanks for the post!

        1. Thanks for chipping in, Robert. dcast() does have a margins parameter which works great when the Totals rows/columns are simple sums. But I find it gets more complicated with averages and other calculated fields because weighting becomes an issue. So I would use margins in the simple scenarios, but most other times I would take the long route I described in my previous comment.

          Regarding knitr, I agree it’s great for reporting. I used it a few years ago, but have since taken up Shiny since it was a better solution for my reporting needs at work. I will add it to my pipeline of topics. (FYI, a guest post is always welcome … do I have a volunteer?)

  2. Hello John, Thanks for the paper !

    Im a newbie on this, just started this week playing around with R, i need to it in order to analyse some data. My questions is if it is possible to get data from 2 different csv files to build a pivot table in R ? like to correlate variables…

    Cheers
    Carlos

    1. My comments notification is not very good — so I totally missed this comment. Your question is a bit unclear, but I would say yes — import the files, merge them in some way (using the merge() function), then apply some of the techniques above to slice/dice/summarize your combined data set as needed.

      1. I’m very late to the party, but for those browsing the comments, here’s how I’d do it. First, think of your data, merge() is like a vlookup type thing, so basically adds extra columns to a data frame based on one or more joining columns. What I think you want to do is append one file below the other, provided you have the same columns in each file. Here’s how I’d do the latter:

        # Read files
        f = list.files(“~/path/to/directory”, full.names=T)
        dl = lapply(f, read.csv)

        # You may want to add extra options to you read call, but this is the basic for putting each file into a list item

        # As a side note, explore your list with
        head(dl[[1]]) # which shows head of first list item

        # Append all files/list items together
        d = do.call(“rbind.data.frame”, dl)

        Job done.

  3. Hi John,
    Thanks for sharing, really simple approach for excel users.

    When I tried doing it on my data set – the variables are being read as character and not numeric. I used the as.numeric function but not getting the desired output – can you please advice.

    summarise(subcat_summary,
    sales = sum(as.numeric(SALES)))

    Thanks
    Nabha

    1. Hi Nabha,

      What does your input data look like? If SALES is being read as character, I suspect something in that column is causing it to be read as such — look out for currency symbols and commas. I recommend you correct that upfront rather than coercing it within the summarise() function.

      If you share more of your code, or the str() / head() of subcat_summary, it might give more hint as to where the problem lies.

      John

      1. Hi all, I have an Excel file with three columns: Day Hour Price 1 1 100 … … … 1 24 134 2 1 97 … … … 2 24 111 I have 7 days, so 7×24 = 168 rows What I have to do is to create a pivot table that gives me as a result: Day Hour Price 1 1 100 … 1 … 7 1 97 1 2 56 … 2 … 7 2 111 For all 24 hours. How can i do this in R ? Thank you, Alessandro

        1. Hi Alessandro,

          Perhaps I’m misunderstanding something, but is that really a Pivot Table you’re requesting? It looks like the same 168-row table, simply re-sorted. I assume I’m missing something here…?

          If this is indeed what you want, I suggest looking at the arrange() function in the dplyr library. It’s the best function for sorting data frames. arrange(data_name, Day, Hour, Price).

  4. Hi, I appreciate your kindness to share your knowledge. But, to make everyone’s life easy, why don’t you just use the million example datasets freely included in R ? It’s the same problem with your book, which made me to return back in the store. Please use example datasets included with R

    1. That’s a fair point. Dataset choice won’t make everyone happy.

      Frankly, I am so tired of seeing the same old boring built-in datasets being used for everything. Also, the people I teach are beginners. There’s nothing better than seeing the full workflow over and over … of importing the data, looking at the head(), str(), etc. Out in the real world, you’ll be dealing with messy csv files, database queries, etc. You will NOT be pulling a built-in dataset from R and use it for actual analysis, ever.

    1. I’ve never done this because I don’t use R much as a reporting engine. But if I were to do this, I would just compute the totals row manually (the apply function is good for this, or even these dplyr functions with no group_by()) … and then append that row to the data frame with rbind().

      It’s really wonky, and perhaps there are libraries that would handle this for you. Considering the growth of Shiny, maybe R is being used more and more for reporting, so someone may have built a tool for this.

Leave a Reply

Your email address will not be published.