How to Make a PivotTable in R

The Excel PivotTable is plain awesome. In this post I will show you how to make a PivotTable in R (kind of).

We will use two popular libraries, dplyr and reshape2. This exercise is doable with base R (aggregate(), apply() and others), but would leave much to be desired.

One important distinction to keep in mind is that while PivotTables are useful for both analysis and reporting, R is inherently not a great tool for reporting. For example, column and row totals are commonplace in Excel reports, but not in R. Other formatting elements like cascaded columns in PivotTables are also not practical in R. What you do get in R is the ability to summarize and reshape data frames in the same way you do with PivotTables.


We will replicate these three features:

  1. Basic by-group summaries with filters (sum, count, calculated fields, etc)
  2. Transpose columns and rows
  3. Advanced by-group summaries and manipulation

We will use data on marijuana prices, scraped from by the folks at Mode Analytics. The data import code can be found at the end of this article.

I know nothing about marijuana, so this could be interesting…

Basic PivotTable Summaries

Here is a basic PivotTable:

PivotTable in R

It’s a three-step process with the dplyr library in R:

# 1: filter to keep three states.  
basic_summ = filter(mprices, state %in% c("California", "New York", "Illinois"))

# 2: set up data frame for by-group processing.  
basic_summ = group_by(basic_summ, quality, state)

# 3: calculate the three summary metrics
basic_summ = summarise(basic_summ, 
                        sum_amount = sum(amount),
                        avg_ppo = mean(ppo),
                        avg_ppo2 = sum(price) / sum(amount))

         quality      state sum_amount  avg_ppo avg_ppo2
1   high quality California   5494.925 277.9640 226.9336
2   high quality   Illinois   1538.014 376.3324 310.5967
3   high quality   New York   2251.540 375.3289 306.3876
4    low quality California    356.148 275.0264 189.7413
5    low quality   Illinois    118.911 227.0347 142.5184
6    low quality   New York    170.510 349.5067 175.5440
7 medium quality California   6053.553 212.2781 165.8847
8 medium quality   Illinois   1351.176 306.1785 218.0745
9 medium quality   New York   2316.076 287.7946 222.6205

That’s it! The numbers match the Excel table. avg_ppo is the straight average of the ppo column, while avg_ppo2 is like a calculated field in a Pivot Table.

Note also, as I pointed out earlier, R is not a good tool for reporting, per se. You don’t get a beautifully formatted table as you could in Excel, though there may be packages to help with that.

Transposed Summaries

Sometimes we need to restructure the data set for particular analyses. For example, with a PivotTable we might do this:

Transpose in R

I moved the “quality” column to the top of the PivotTable. To transpose the data frame like that, we can use the dcast() function from the reshape2 library. dcast() can be used to turn a long data to wide format, while melt() can be used for turning a wide data set to long:

basic_summ_t = dcast(basic_summ, state ~ quality, value.var = "avg_ppo2")
       state high quality low quality medium quality
1 California     226.9336    189.7413       165.8847
2   Illinois     310.5967    142.5184       218.0745
3   New York     306.3876    175.5440       222.6205

The second parameter in the function, quality ~ state is written as a formula. You can think of the variable on the left, quality, as the PivotTable row item, and the right, state, as the PivotTable column item.

This was a simple case when we only had one metric, avg_ppo2. But what if we wanted to show multiple metrics, as we do in this PivotTable:

Transpose data frame in R

Here we kept two metrics. Let’s do this in R, in a two-step process:

# first melt the data frame to put all the metrics in a single column
basic_summ_t2 = melt(basic_summ, id.vars = c("state","quality"), 
                                 measure.vars = c("avg_ppo2", "sum_amount"))

This is what that intermediate data frame looks like. We have converted the data frame from wide to long using the melt() function. As you can see, all the metrics are now in a single column, value, identified by the variable column.

        state        quality   variable     value
1  California   high quality   avg_ppo2  226.9336
2    Illinois   high quality   avg_ppo2  310.5967
3    New York   high quality   avg_ppo2  306.3876
4  California    low quality   avg_ppo2  189.7413
5    Illinois    low quality   avg_ppo2  142.5184
6    New York    low quality   avg_ppo2  175.5440
7  California medium quality   avg_ppo2  165.8847
8    Illinois medium quality   avg_ppo2  218.0745
9    New York medium quality   avg_ppo2  222.6205
10 California   high quality sum_amount 5494.9250
11   Illinois   high quality sum_amount 1538.0140
12   New York   high quality sum_amount 2251.5400
13 California    low quality sum_amount  356.1480
14   Illinois    low quality sum_amount  118.9110
15   New York    low quality sum_amount  170.5100
16 California medium quality sum_amount 6053.5530
17   Illinois medium quality sum_amount 1351.1760
18   New York medium quality sum_amount 2316.0760

So now we can transpose that data frame as we did before with dcast(), just adding one more variable:

# then transpose the quality and variable co
basic_summ_t2 = dcast(basic_summ_t2, state ~ quality + variable, value.var = "value")

Notice the only difference between this dcast() and the one earlier is we added the + variable to the formula.

       state high quality_avg_ppo2 high quality_sum_amount low quality_avg_ppo2 low quality_sum_amount medium quality_avg_ppo2 medium quality_sum_amount
1 California              226.9336                5494.925             189.7413                356.148                165.8847                  6053.553
2   Illinois              310.5967                1538.014             142.5184                118.911                218.0745                  1351.176
3   New York              306.3876                2251.540             175.5440                170.510                222.6205                  2316.076

There you have it. We have transposed a data frame like a PivotTable.

Advanced by-group processing

There are many operations not supported by PivotTables that are PivotTable-like in nature. For example, taking the first PivotTable above, what if we wanted to rank each state within each quality level, based on average price per ounce? You can do that like this:

# first set up the data frame for by-group processing, by quality
basic_summ_rank = group_by(basic_summ, quality)

# add new column ranking states based on avg_ppo2
basic_summ_rank = mutate(basic_summ_rank, ppo_rank = rank(avg_ppo2))
         quality      state sum_amount  avg_ppo avg_ppo2 ppo_rank
1   high quality California   5494.925 277.9640 226.9336        1
2   high quality   Illinois   1538.014 376.3324 310.5967        3
3   high quality   New York   2251.540 375.3289 306.3876        2
4    low quality California    356.148 275.0264 189.7413        3
5    low quality   Illinois    118.911 227.0347 142.5184        1
6    low quality   New York    170.510 349.5067 175.5440        2
7 medium quality California   6053.553 212.2781 165.8847        1
8 medium quality   Illinois   1351.176 306.1785 218.0745        2
9 medium quality   New York   2316.076 287.7946 222.6205        3

The ppo_rank ranks the three states for high, medium and low qualities separately. E.g., California has the cheapest high-quality stuff but the most expensive low-quality. The key to this is the group_by() function sets the data frame up for by-group processing. In this case, we are running a process separately for each value of quality.

Notice all the syntax looks similar to the group_by() + summarise() pair we saw earlier. summarise() collapses the data set and creates a summary, while mutate maintains the same number of rows and simply creates a new column.


Hopefully you got a sense of how to replicate PivotTable-like functionality in R. As with that last example, there are a lot of things you can do with dplyr that PivotTables cannot do.


Setting up the data

Here is how the data was loaded. We will import the data stored on a web server:

# import data from github.  Reading http:// is straightforward
# A few extra steps needed for https urls
mprices = tempfile()
download.file("",destfile=mprices, method="curl")
mprices = read.csv(mprices, stringsAsFactors = FALSE)
        state                  location price quantity        quality   date_old amount quality_score ppo        date state_code
1     Arizona          Phoenix, Arizona   220 an ounce   high quality 2012-01-01      1             3 220 2.01201e+11          4
2  California    Sacramento, California   280 an ounce   high quality 2012-01-01      1             3 280 2.01201e+11          6
3    Colorado  Grand Junction, Colorado   300 an ounce   high quality 2012-01-01      1             3 300 2.01201e+11          8
4    Michigan West Bloomfield, Michigan    20 an ounce   high quality 2012-01-01      1             3  20 2.01201e+11         26
5   Minnesota    Saint Cloud, Minnesota   300 an ounce medium quality 2012-01-01      1             2 300 2.01201e+11         27
6 Mississippi       Tupelo, Mississippi   120 an ounce   high quality 2012-01-01      1             3 120 2.01201e+11         28

This includes data on marijuana prices, scraped from by the folks at href=”” target=”_blank”>Mode Analytics. The data frame has about 130k rows of data, and the columns are generally straightforward. Each row represents a marijuana purchase as recorded anonymously by users.

0 0 vote
Article Rating
Notify of
Inline Feedbacks
View all comments
6 years ago

Hi John,

Great article, thank you! Do you know how to add Grand Totals under the summary in R (just like you have in your excel example)?


Robert McDonald
6 years ago
Reply to  John Taveras

dcast() has a margins parameter that is supposed to do row and total columns:

margins: vector of variable names (can include “grand\_col” and
“grand\_row”) to compute margins for, or TRUE to compute all
margins . Any variables that can not be margined over will be
silently dropped.

Unfortunately, while this works in the example in the dcast documentation, I can’t get it to work in my application — it throws an “index out of bounds” error. If you could provide a working example that would be great.

On a different topic, I’m wondering why you don’t discuss rmarkdown and knitr in the context of reporting. I use knitr all the time (with LaTeX) and find it a fantastic addition to R.

Anyway, thanks for the post!

Carlos Rodrigues
Carlos Rodrigues
6 years ago

Hello John, Thanks for the paper !

Im a newbie on this, just started this week playing around with R, i need to it in order to analyse some data. My questions is if it is possible to get data from 2 different csv files to build a pivot table in R ? like to correlate variables…


Mike Spencer
Mike Spencer
5 years ago
Reply to  John Taveras

I’m very late to the party, but for those browsing the comments, here’s how I’d do it. First, think of your data, merge() is like a vlookup type thing, so basically adds extra columns to a data frame based on one or more joining columns. What I think you want to do is append one file below the other, provided you have the same columns in each file. Here’s how I’d do the latter:

# Read files
f = list.files(“~/path/to/directory”, full.names=T)
dl = lapply(f, read.csv)

# You may want to add extra options to you read call, but this is the basic for putting each file into a list item

# As a side note, explore your list with
head(dl[[1]]) # which shows head of first list item

# Append all files/list items together
d =“”, dl)

Job done.

5 years ago

Hi John,
Thanks for sharing, really simple approach for excel users.

When I tried doing it on my data set – the variables are being read as character and not numeric. I used the as.numeric function but not getting the desired output – can you please advice.

sales = sum(as.numeric(SALES)))


5 years ago
Reply to  John Taveras

Hi all,

I have an Excel file with three columns:

Day Hour Price
1 1 100
… … …
1 24 134
2 1 97
… … …
2 24 111

I have 7 days, so 7×24 = 168 rows

What I have to do is to create a pivot table that gives me as a result:

Day Hour Price
1 1 100
… 1 …
7 1 97

1 2 56
… 2 …
7 2 111

For all 24 hours.

How can i do this in R ?

Thank you,

4 years ago

Hi John,

I have Date to be added to the Columns, can it be done

4 years ago

Heloooo Is anyone here…
I need one help.

I have one data with the different product names with different amount.

i want all the product in columm. can you suggest some coding.

Ravi Sastry
Ravi Sastry
3 years ago

Hi, I appreciate your kindness to share your knowledge. But, to make everyone’s life easy, why don’t you just use the million example datasets freely included in R ? It’s the same problem with your book, which made me to return back in the store. Please use example datasets included with R

3 years ago
Reply to  Ravi Sastry

That’s a fair point. Dataset choice won’t make everyone happy.

Frankly, I am so tired of seeing the same old boring built-in datasets being used for everything. Also, the people I teach are beginners. There’s nothing better than seeing the full workflow over and over … of importing the data, looking at the head(), str(), etc. Out in the real world, you’ll be dealing with messy csv files, database queries, etc. You will NOT be pulling a built-in dataset from R and use it for actual analysis, ever.

3 years ago

Hi John, how to add sub total after end of every unique value like we do in excel pivot

3 years ago
Reply to  Jigar

I’ve never done this because I don’t use R much as a reporting engine. But if I were to do this, I would just compute the totals row manually (the apply function is good for this, or even these dplyr functions with no group_by()) … and then append that row to the data frame with rbind().

It’s really wonky, and perhaps there are libraries that would handle this for you. Considering the growth of Shiny, maybe R is being used more and more for reporting, so someone may have built a tool for this.