Skip to content

How to do VLOOKUP in R

VLOOKUP is usually the first magical formula people learn when learning Excel. The magic never goes away. So let’s see how to do a VLOOKUP in R.

Situation

Let’s say we have two datasets from World Bank — one showing annual average life expectancy by country and the other showing a measure of access to sanitation facilities. To do our analysis (regression, visual, whatever), we need these two sets of values combined.

% OF POPULATION WITH ACCESS TO SANITATION

LIFE EXPECTANCY IN YEARS

The tables have 260 rows and >50 columns (one for each year). World Bank Data Links: Life ExpectancySanitation Access

Basic VLOOKUP in R

Let’s say you already have your two datasets set up. (The code to import and set up the two data sets is at the end of this article).

Say we want to analyze the most recent year of data, 2012. We can use the merge() function to combine two datasets. In the code below, we are merging a subset of life_expectancy and sanitation; the subsets each have just two columns, country name and the values for 2012.

data_2012 = merge(life_expectancy[, c("country.name", "le_2012")], 
                  sanitation[, c("country.name", "san_2012")])
head(data_2012, 10)

Note that I’m only keeping one numeric variable, for 2012, from each data set. If we simply fed the entire life_expectancy and sanitation data frames as is, our result would have 100+ columns. Have you ever VLOOKUP’d 50+ columns from one data set to another? It’s not nearly as pretty as merge().

# Sample - first 10 rows
          country.name  le_2012  san_2012
1          Afghanistan 60.50912  29.00000
2              Albania 77.35046  91.20000
3              Algeria 70.88217  95.20000
4       American Samoa       NA  62.50000
5        Andean Region       NA        NA
6              Andorra       NA 100.00000
7               Angola 51.46400  60.10000
8  Antigua and Barbuda 75.66532        NA
9           Arab World 69.94970  81.64278
10           Argentina 76.01268  97.20000

There we have it — the country name along with the 2012 values for each metric. Notice we did not specify any options in the merge() function; we just told it which two data frames to combine. That’s because the default options of merge() suffice in this case; this is a basic example. Specifically, the lookup columns in both data frames have the same name, and both data frames have the same number of rows (one per column), making this a simple 1-to-one lookup.

INNER / LEFT / RIGHT JOIN in R

What if the two data frames were different sizes? Say you have life expectancy on every country, but only have income disparity stats on a subset of countries? Then you have to decide what you need for your analysis. Here is a cheat sheet assuming you use the following merge() function:

merge(x = data_w_all_countries, y = data_w_some_countries)
SCENARIOAKA IN SQLMERGE PARAMETERS
Keep rows where there’s a match in bothINNER JOINN/A. This is default
Keep all rows from x, regardless of match in yLEFT JOINset all.x = TRUE
Keep all rows from y, regardless of match in xRIGHT JOINset all.y = TRUE
Keep all rows from x AND from yOUTER JOINset all = TRUE
R Merge Cheat Sheet

So if you were building a “master” data set and you need to keep all the countries from the life expectancy table, regardless of finding a match in the income disparity data set, then you would use the following:

merge(life_expectancy, income_disparity, all.x = TRUE)

Merge by multiple columns

One of the weaknesses of VLOOKUP is that it could only merge by one column. The example above works fine in Excel because we are only VLOOKUPing data by country. But what if we had multiple columns to match on — say country and city? That is, if both data sets had stats at the country and city level.

By default, as long as the columns are named the same way in both data frames, R is smart enough to automatically join the two data frames by these columns. In any case, if we want to be explicit in our code (which I recommend), you can specify which columns to merge by, as follows:

merge(life_expectancy, income_disparity, by = c("country", "city"))

Here we used the by parameter to specify which columns to merge by. We provided a vector of column names. (Remember, c() is shorthand for creating a vector of elements).

When “by” column names differ

Often times, the column names across data frames will differ, so we need to explicitly tell R which columns to merge by. In the example above, if life_expectancy’s country column was named “cntr_nm” and sanitation’s column was named “country.name”, we would use the following merge():

merge(life_expectancy, sanitation, by.x = "cntr_nm", by.y = "country.name")

There we instruct R which columns to base the merge on.

Best Practice: be explicit

I always recommend you be explicit when merging, and don’t rely on the defaults. Even if the “by” columns are named the same, I still like to specify which columns to merge by. It will make your code more readable, but also keep you out of trouble.

For example, remember how our original data set had one metric column for each year (2010, 2011, 2012, etc)? If we tried merging the raw life_expectancy and sanitation data frames without renaming the columns, and without setting by parameters, R would have tried merging the two data frames by all common columns — namely country.name, 2010, 2011, 2012, etc. Since the numeric columns (2010-2012) likely won’t match across the two data sets, your merge will yield no results. Plus, if your data sets are big enough, the merge could really take a lot of time to process. So always lean on the side of explicit in the case of merge().

Conclusion

VLOOKUP is awesome, but merge() is even more awesome. It can do a lot of things that VLOOKUP cannot. Even though the default options are really useful, I recommend explicitly telling the function how you want things merged, by specifying the by parameter.

Appendix

Importing and prepping the data

Let’s import the data and rename the metrics columns. I want to rename them because they have the same names in both data sets, and (e.g., 1995 column name). This way we can distinguish the two sets of columns better.

life_expectancy = read.csv("life_expectancy.csv", skip = 2, header = TRUE, stringsAsFactors = FALSE)
sanitation = read.csv("sanitation.csv", skip = 2, header = TRUE, stringsAsFactors = FALSE)
 
## rename all yearly metric columns.  E.g., X1995 becomes le_1995 (for life expectancy)
names(life_expectancy)[grepl("\\bX", names(life_expectancy))] = gsub("X", "le_", names(le)[grepl("\\bX", names(life_expectancy))])
names(sanitation)[grepl("\\bX", names(sanitation))] = gsub("X", "san_", names(sanitation)[grepl("\\bX", names(sanitation))])
 
## lowercase column names
names(life_expectancy) = tolower(names(life_expectancy))
names(sanitation) = tolower(names(sanitation))

I use regular expressions (link coming soon) for the renaming. Here is a sample of the Life Expectancy dataset:

# Sample - 6 rows, 8 columns
   country.name country.code                          indicator.name indicator.code  le_1961  le_1962  le_1963  le_1964
1         Aruba          ABW Life expectancy at birth, total (years) SP.DYN.LE00.IN 65.98802 66.36554 66.71398 67.04429
2       Andorra          AND Life expectancy at birth, total (years) SP.DYN.LE00.IN       NA       NA       NA       NA
3   Afghanistan          AFG Life expectancy at birth, total (years) SP.DYN.LE00.IN 32.09598 32.61188 33.12732 33.64124
4        Angola          AGO Life expectancy at birth, total (years) SP.DYN.LE00.IN 33.38622 33.78759 34.18846 34.59034
5       Albania          ALB Life expectancy at birth, total (years) SP.DYN.LE00.IN 63.27346 64.16285 64.88710 65.43820
6 Andean Region          ANR Life expectancy at birth, total (years) SP.DYN.LE00.IN       NA       NA       NA       NA

You can see the year columns now have the “le_” prefix. Data is prepped and ready for VLOOKUP.

Comments

  1. Really nice explanation of ‘merge’. Thanks.

    As a minor point, Excel can do the equivalent to a VLOOKUP across multiple criteria, by using INDEX and MATCH functions. http://stackoverflow.com/questions/18233144/optimization-of-vlookup-with-multiple-criteria-index-match.

    However, your conclusion is right: merge() is even more awesome. And merge() is much simpler (a concise ‘R’ statement on one line versus a lengthy Excel array formula and copying it across all rows).

    1. Absolutely! In fact, I tend to tell colleagues nothing is impossible with Excel because I love the challenge of cobbling up the most insane logical formulas. But knowing that some things can simply be done easier and more powerfully with R will drastically improve productivity.

  2. Thanks for the post, it’s helped me solve a bunch of different challenges! But I’m having trouble applying merge to another issue I’m facing. I have a data frame with multiple columns (player data from the NBA). Each row is the statistics from a player in a game. I need to get the full name of the opponent in a new column. I have the three digit opposing team abbreviation (stripped from a longer string in another column using substr), the three digit player team abbreviation and the player team full name.

    In excel speak, I need to vlookup the team name from the three digit abbreviation (i.e. =vlookup(three digit opposing, three digit player team:player team full name, 2,0) or =index(player team full name,match(three digit opposing, three digit player,0)))

    Is there a way to do this with merge? I can’t seem to get it to work. Each time I merge, I get a result with millions of rows (vs my dataset of ~9000 rows). There is probably something totally obvious that I’m missing.

    1. Hi John,

      Without seeing your data, it sounds to me like the lookup table is not properly formatted for the merge. This is a very common mistake in SQL as well; joining two small tables can easily blow up to a massive table. Do a google search for “avoid sql cartesian product” … you should find a few good examples.

      Anyway, I would suggest something like this:

      merge(player_table, unique(opponent_table[,c(“team_code”,”team_name”), by.x = “opponent_team_code”, by.y = “team_code”]

      What we’re doing here is vlookup’ing the team name from a tiny table consisting of one row per team_code/team_name combination. This is as it should be anyway; the lookup table is basically a reference table. Secondly, we’re specifying the variable to merge “by” on both sides of the merge. Depending on the column names of both tables, you could end up with other issues if you don’t specify them.

      I hope this helps!

  3. Thanks for that John. Would you by any chance know how to do the equivalent of a HLOOKUP in R? I have a dataset of just under 600k rows and approx. 60 columns that I can reduce to under 20 with a simple HLOOKUP in excel, however it just keeps crashing on me. Thanks in advance.

    1. Hi Deirdre, I assume you’ve figured it out by now. If not, could you clarify a bit? Feel free to email me directly with the example … John at rforexcelusers. I’m more than happy to help you; probably just a matter of transposing a table and merging in some way.

  4. Hi!

    I want to ask how to do ‘Multiple VLOOKUP’ just like we do in Excel.

    For example :

    =VLOOKUP(E2,$A$2:$C$30,VLOOKUP(F2,$J$2:$K$3,2,FALSE),FALSE)

    Thanks 🙂

    1. Hi Iren,

      Yikes, this one is actually a tough one and no obvious simple solution pops to mind right away. I can give you a number of complicated solutions, but don’t think that would be best.

      I would step back for a sec and question why the data is set up this way. Sometimes we set things up in Excel in weird ways to overcome Excel’s shortcomings. For example, what does that lookup table in J:K represent? I know it’s pulling the column number for the main lookup, but why have that separate table in the first place?

      Let’s say your main table (A:C) is student test scores, one row per student, and two columns with test scores. And the second lookup @ J:K shows “which test was the student’s best (either 1 or 2)”. Instead of having this structure with two separate tables, you could instead calculate the max test score on the fly for each student.

      If you want more help with your specific problem, feel free to shoot me a note with more details and I may write a whole blog post about it!

      John

  5. What if we wanted to match/merge data in to a column that already existed in both data sets? Say, we have the following:

    authors <- data.frame(
    surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
    nationality = c("US", "Australia", "US", "UK", "Australia"),
    deceased = c("yes", rep("no", 4))
    )

    books <- data.frame(
    name = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil", "R Core"),
    title = c("Exploratory Data Analysis",
    "Modern Applied Statistics",
    "LISP-STAT",
    "Spatial Statistics",
    "Interactive Data Analysis",
    "An Introduction to R"),
    nationality = c(rep("US", 6))
    )

    books_update = merge(books, authors, by.x = "name", by.y = "surname", all = TRUE)

    Nationality is in both data frames, so it creates a 'nationality.x' and 'nationality.y' respectively for each field. Is there a way to overwrite the data from y (the 'authors' data frame in this instance), or would a different function be needed?

    1. just came across this very nice explanation, you might have solved the problem but still..
      one can try
      books_update = merge(books, authors, by.x = c(“name”,”nationality”), by.y = c(“surname”,”nationality”), all = TRUE)

  6. Thank you for the great explanation – I’ve been dreading doing this in R, but it is actually almost easier than in Excel! Is there an equivalent to the final value in the VLOOKUP command (i.e. TRUE or FALSE)? I have a list of dates and prices. The dates are the dates on which the price changed. April 1 $5.00; April 9 $5.50; April 22 $5.75, etc. Obviously on April 4 the price is still $5, but it isn’t explicitly in the table as such. In Excel, I could sort by date and use TRUE (=VLOOKUP(A8,$A$1:$B$3,2,TRUE) and it would correctly figure out what the price was on ANY day. Is there a way to do that in R?

Leave a Reply

Your email address will not be published.