For many beginners, the word “vector” is one of those scary, geeky things that creep us out. But understanding vectors will speed up your learning process considerably. So in this post I will try my best to break down R vectors for you.
Note: this post is similar to the book’s chapter on vectors.
What are vectors
This is a vector:
55
This is another vector:
55 63 1000 2
And so is this:
"James" "John" "Jane" "Jack"
A vector is simply a one-dimensional collection of things. The things in a vector are all of the same type (i.e., they must all be numeric, or all character, etc.). A vector can be as long as you want, or as short as one element (like the first one).
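If you want to see the same-type rule for yourself, here is a minimal sketch. The variable names mixed and nums are made up for this illustration. The point is that R does not complain when you mix types; it quietly converts everything to one common type.

```r
# Hypothetical example values, not from the post
mixed = c(1, "two", 3)       # numbers and text together
class(mixed)                 # "character": the numbers became text
mixed                        # "1" "two" "3"

nums = c(1, 2.5, TRUE)       # TRUE is converted to the number 1
class(nums)                  # "numeric"
length(nums)                 # 3
```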
How to Create Vectors in R
There are essentially two ways to “create” a vector:
- Make one manually using the c() function or other shortcuts
- Take one from an existing object, like a data frame
Let’s replicate the examples above manually using the c() function:
# two ways to make a vector of length one
v1 = 55
v1 = c(55)
v2 = c(55, 63, 1000, 2)
v3 = c("James", "John", "Jane", "Jack")
Note the subtle power of the first example. The value 55 may look like a simple number, but in R speak it’s actually a vector of length one. The other examples are trivial; c() stands for combine, as in combining things together.
There are sometimes shortcuts for creating vectors:
v = 5:10
There we created a vector v with integer values 5 through 10.
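Here is a small sketch of a few more shortcuts, plus a check that a lone number really is a vector of length one. seq() and rep() are base R functions; the variable names are made up for this example.

```r
length(55)                     # 1: a single number is a vector of length one

v = 5:10                       # 5 6 7 8 9 10
class(v)                       # "integer"
length(v)                      # 6

evens = seq(2, 10, by = 2)     # 2 4 6 8 10
threes = rep(3, times = 4)     # 3 3 3 3
```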
Vectors can also come from existing data objects, like data frames. E.g., df$colname is a reference to a column from the data frame df. That, itself, is a vector. A row, like df[1, ], is a slightly different case: because a row can hold values of different types, R returns the first row as a one-row data frame rather than a plain vector.
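To make that concrete, here is a minimal sketch with a tiny made-up data frame (the column names name and age are assumptions for illustration only). It pulls out a column, confirms it is a vector, and shows what a row looks like.

```r
# A small made-up data frame (names are hypothetical)
df = data.frame(name = c("James", "John", "Jane"),
                age  = c(55, 63, 41),
                stringsAsFactors = FALSE)

ages = df$age          # a column: 55 63 41
is.vector(ages)        # TRUE
ages[2]                # 63

row1 = df[1, ]         # a row comes back as a one-row data frame
class(row1)            # "data.frame"
flat = unlist(row1)    # flattens the row into one character vector
```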
How to Reference Vectors
Here are some ways to reference vectors:
# get second value of vector v2
v2[2]
# result: 63
# get first and third values of vector v3
v3[ c(1, 3) ]
# result: "James" "Jane"
# get the values of the second vector, in reverse order
v2[ 4:1 ]
# result: 2 1000 63 55
# we could also use a vector of TRUE/FALSE to pick specific values
v2[ c(TRUE, TRUE, FALSE, FALSE) ]
# result: 55 63
Basically, in all those examples, we are supplying a vector inside the brackets, either index numbers or TRUE/FALSE values, that tells R which elements to return.
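Since the thing inside the brackets is just a vector, you can build it first and store it in a variable. The sketch below shows that, along with negative indexes, which the examples above did not cover. The names idx and keep are made up.

```r
v2 = c(55, 63, 1000, 2)

idx = c(1, 3)            # an index vector stored in a variable
picked = v2[idx]         # 55 1000

dropped = v2[-1]         # a negative index drops that position: 63 1000 2

keep = c(TRUE, FALSE, TRUE, FALSE)
kept = v2[keep]          # 55 1000
```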
So Why do Vectors Matter?
Ever wonder what’s happening when you run something like this?
dataframe[ order(dataframe$columnname), 1:10 ]
It’s definitely a head-scratcher for newbies. There we are sorting the data frame by columnname and only keeping the first ten columns. It turns out the two things inside the brackets are simply vectors instructing R how to order the rows and which columns to keep.
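One way to demystify a call like dataframe[ order(dataframe$columnname), 1:10 ] is to build the two vectors separately before using them. This is a minimal sketch on a made-up three-column data frame, so it keeps the first two columns instead of ten.

```r
# Made-up data frame (column names are hypothetical)
dataframe = data.frame(id = c(1, 2, 3),
                       columnname = c(30, 10, 20),
                       other = c("a", "b", "c"),
                       stringsAsFactors = FALSE)

row_order = order(dataframe$columnname)   # 2 3 1: which rows, in what order
cols_keep = 1:2                           # 1 2: which columns to keep

sorted_df = dataframe[ row_order, cols_keep ]
# same as: dataframe[ order(dataframe$columnname), 1:2 ]
sorted_df$columnname                      # 10 20 30
```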
You may also see things like this:
names(dataframe)[ grepl("blah_", names(dataframe)) ]
It looks absolutely incomprehensible to newbies, and even some intermediate users. Here we are pulling the column names from the data frame that have “blah_” in the name. But it turns out to be simple vector operations.
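The same trick of splitting things into steps works for names(dataframe)[ grepl("blah_", names(dataframe)) ]. Here is a rough sketch on a made-up data frame; all column names are invented for the example.

```r
# Made-up data frame with two "blah_" columns
dataframe = data.frame(blah_a = 1:2, other = 3:4, blah_b = 5:6)

all_names = names(dataframe)              # "blah_a" "other" "blah_b"
has_blah  = grepl("blah_", all_names)     # TRUE FALSE TRUE
matched   = all_names[ has_blah ]         # "blah_a" "blah_b"
```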
Vector Math
I’m not sure this is technically “vector math” but it sounds smart, so I’ll stick with that. Let’s run some operations on our vectors above:
# tell me which values > 60
v2 > 60
The result looks like this (second row):
55 63 1000 2
FALSE TRUE TRUE FALSE
The R operation returns a vector of TRUE / FALSE values indicating which values are > 60.
Recall in the section above, to reference values within a vector we simply supply it a vector of either index numbers, or TRUE / FALSE values indicating which values to keep. So what if we wanted the actual values of the vector that are > 60?
# tell me the values of vector v2 > 60
v2[ v2 > 60 ]
# that basically runs this operation:
# v2[ c(FALSE, TRUE, TRUE, FALSE) ]
# result:
# 63 1000
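A TRUE/FALSE vector is useful for more than filtering. As a small sketch, which() turns it into index numbers and sum() counts the TRUEs (both are base R functions; the name big is made up).

```r
v2 = c(55, 63, 1000, 2)

big = v2 > 60            # FALSE TRUE TRUE FALSE
which(big)               # positions of the TRUEs: 2 3
sum(big)                 # how many values are > 60: 2
v2[ which(big) ]         # 63 1000, same as v2[ big ]
```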
By the same token, let’s find the names in vector v3 that have the letter “n”:
v3[ grepl("n", v3) ]
# result:
# "John" "Jane"
grepl() is sort of like Excel’s FIND() and SEARCH() functions, but way more powerful. It basically looks for “n” in each element of the vector v3 and returns a vector of TRUE/FALSE values accordingly.
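To get a feel for the “more powerful” part, here is a short sketch of a few grepl() variations. The patterns are regular expressions, and ignore.case is a real grepl() argument; the variable names are made up.

```r
v3 = c("James", "John", "Jane", "Jack")

has_n    = grepl("n", v3)                      # FALSE TRUE TRUE FALSE
ends_n   = grepl("n$", v3)                     # "$" means end of text: FALSE TRUE FALSE FALSE
has_ja   = grepl("ja", v3, ignore.case = TRUE) # TRUE FALSE TRUE TRUE

v3[ ends_n ]                                   # "John"
```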
Let’s look at yet another example to bring this whole thing to full clarity. Let’s sort the values of vector v2, using the order() function. The order function returns the positions of the values in sorted order: the index of the smallest value first, then the next smallest, and so on. That result is, naturally, a vector itself:
order(v2)
# values of v2:
# 55 63 1000 2
# result:
# 4 1 2 3
This is actually slightly confusing, but here’s what the result says:
the first ordered value (i.e., smallest) is the 4th value in v2 (which is 2)
the second ordered value is the 1st value in v2 (55)
…
the largest value is the 3rd value in v2 (1000)
So knowing that, we get a sorted version of v2 by simply putting the order function inside v2’s brackets:
v2[ order(v2) ]
# result:
# 2 55 63 1000
I don’t know about you, but this is magical.
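A couple of variations, as a quick sketch. The decreasing argument and sort() are both base R. The last line uses order() on one vector to rearrange another, which is exactly what happens when you sort a data frame by a column.

```r
v2 = c(55, 63, 1000, 2)
v3 = c("James", "John", "Jane", "Jack")

desc = v2[ order(v2, decreasing = TRUE) ]   # 1000 63 55 2
asc  = sort(v2)                             # 2 55 63 1000, shortcut for v2[ order(v2) ]

# reorder v3 according to the values in v2
by_v2 = v3[ order(v2) ]                     # "Jack" "James" "John" "Jane"
```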
Vectors in Data Frame Operations
Recall a data frame is like an Excel data table: a collection of rows and columns, where each column is of a single type (numeric, character, etc). And you reference a data frame with square brackets, rows first and then columns, as in df[1, ] for the first row.
And most importantly, each column is a vector itself. That is, df$col is a vector (the column named col). A row such as df[1, ] is the first row of the data frame; since a row can mix types, R hands it back as a one-row data frame rather than a plain vector.
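Here is a minimal sketch of the common ways to reference pieces of a data frame, using a tiny made-up df (the column names col and name are assumptions for illustration). The pattern is always df[ rows, columns ], and leaving one side blank means “all of them”.

```r
# Tiny made-up data frame
df = data.frame(col = c(10, 20, 30),
                name = c("a", "b", "c"),
                stringsAsFactors = FALSE)

cell      = df[ 2, 1 ]       # row 2, column 1: 20
first_col = df[ , 1 ]        # all rows of column 1: 10 20 30
some_rows = df[ 2:3, ]       # rows 2 and 3, all columns
by_name   = df[ , "name" ]   # columns can be picked by name: "a" "b" "c"
dollar    = df$col           # same column as df[ , "col" ]
```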
Here is a sampling of data frame operations you will use / see often, involving vectors:
# get first ten rows and first ten columns. Remember 1:10 is just a shortcut for c(1,2,3,4,5,6,7,8,9,10)
df[ 1:10, 1:10 ]
# SORT df by col2, and get all columns. df$col2 references col2, and is actually a vector itself
df[ order(df$col2), ]
# FILTER rows where col2 > 50. Like an Excel filter. df$col2 > 50 returns a TRUE/FALSE vector
df[ df$col2 > 50, ]
# FILTER rows meeting multiple criteria. & is like AND() in Excel. It returns TRUE only where all conditions are met
df[ df$col2 > 50 & df$col4 == "John", ]
# Get columns containing "raw_". Return all rows
df[ , grepl("raw_", names(df)) ]
# Organize columns alphabetically
df[ , order(names(df)) ]
# Get columns whose names have more than five characters
df[ , nchar(names(df)) > 5 ]
# CREATE new column based on conditional ifelse(), which is like Excel's IF()
df$newcol = ifelse(df$oldcol < 5, "lt 5", "gte 5")
The list is practically endless. But now you can see the powerful role that vectors play in data frame manipulation.
Let’s look a bit closer at the last example, ifelse(). ifelse() evaluates the condition, df$oldcol < 5, which results in a vector of TRUE/FALSE. Where the value is TRUE, newcol gets “lt 5”; where it is FALSE, newcol gets “gte 5”. Once again, vectors come into play.
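To watch ifelse() work step by step, here is a small sketch with made-up values in oldcol. The intermediate variable cond is only there to show the TRUE/FALSE vector.

```r
# Made-up values for illustration
df = data.frame(oldcol = c(2, 7, 5, 1))

cond = df$oldcol < 5                        # TRUE FALSE FALSE TRUE
df$newcol = ifelse(cond, "lt 5", "gte 5")
df$newcol                                   # "lt 5" "gte 5" "gte 5" "lt 5"
```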
Conclusion
Every data object in R (including lists, matrices, data frames, etc.) can be broken down to vectors. Vectors are the fundamental data structure of R, and once you realize that and get comfortable with vectors, a lot of things in R will suddenly make a lot of sense.
Just noting a typo in “so why do vectors matter”
“dataframe[ order(dataframe$columnname), 1:10 ]
It’s definitely a head-scratcher for newbies. There we are sorting the dataframe by columname and only keeping the first ten rows.”
I think you may have intended to write “only keeping the first 10 columns”
Ah, yes — good catch. I’ll fix it now. Thank you!
This is fabulous, if only someone can explain easily how to import data into R from excel files it’d be great. I love the ease with which you explained vectors, hope to switch from excel pretty soon!!!
Thanks, Sid! I’m glad it’s been helpful. I have a long list of ideas to write about, just not enough time to do it all. Importing data from Excel is relatively easy. The below should import a sheet named “blahsheet” from the file “blah.xlsx”. You’ll have to check the options and change defaults as needed … but it’s similar to read.csv which is covered in the book.
library(xlsx)
read.xlsx("blah.xlsx", sheetName = "blahsheet")
I believe there’s also a parenthesis missing in “You may also see things like this:
names(dataframe)[grepl(“blah_”, names(dataframe)]”
Also, if v3 = c("James", "John", "Jane", "Jack")
then v3[1,3] returns an error. If we need to return the first and third values of that vector v3, then I believe one way of doing it is:
v3[ c(1, 3) ]
Ah, yes, good catch on the missing parenthesis.
Looks like wordpress has an issue showing c(1,3) … so I just wrote it out in words instead.
Dear all! Congratulations on your web site. It looks very helpful. I’m an advanced Excel user, but I’m taking my first steps in R. I’m trying to create a data set in Excel, but to do that I need to calculate an array formula, and no one has been able to help me with that. As you can see below, the formula is an array, and basically I need to sum something within a minimum distance, subject to some restriction. The first “IF” holds the restriction: in this case column “G” represents the year, and I want to sum only the values from the previous year, “G2-1”. The second “IF” is a formula that calculates the distance in kilometers from a point to all my data and considers only the ones smaller than the value in cell “BD$1”. So my formula sums everything from the cells “$AA$2:$AA$225667” subject to the restrictions I described. As you can see, it’s big data, and processing this formula for all 225667 lines on a Core i7 with 8GB would take more than 50 hours, so if you could help to build on R, I…
Hi Otávio, thank you for your question. I will say, you are picking a really challenging entry point into R. What you’re trying to do is not trivial, but definitely very doable. I created a simplified fake data set to mimic the problem you’re facing and wrote an mapply() call that worked:
mapply(function(y, x)
  sum(t$col_to_sum[t$year == (y - 1) & (t$x - x) < 0.4]),
  y = t$year, x = t$x)
Basically, there are three basic vectors you’re working with:
Column AA … equivalent to my t$col_to_sum
Column G … equivalent to my t$year
Column C … equivalent to my t$x
Let’s look at the second row in my code. Here we are taking the sum of the t$col_to_sum column WHERE values in the t$year column equal the “current” year minus 1 AND the difference between values in the t$x column and the “current” x value is < 0.4. By “current” I mean the value in the row that is being evaluated in every iteration. So just like Excel, this happens to become a manipulation of three vectors. You just need to replace my (t$x - x) part with the distance comparison function. mapply() allows you to apply that…
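To try that mapply() call end to end, here is a minimal sketch on a tiny made-up data set. The values in t are invented purely for illustration, and the simple (t$x - x) < 0.4 comparison stands in for a real distance calculation.

```r
# Fake data standing in for the real sheet (all values are made up)
t = data.frame(year = c(2019, 2019, 2020, 2020),
               x = c(1.0, 1.3, 1.1, 0.8),
               col_to_sum = c(10, 20, 30, 40))

# For each row: sum col_to_sum over rows from the previous year
# whose x value minus the current x is below 0.4
result = mapply(function(y, x)
  sum(t$col_to_sum[t$year == (y - 1) & (t$x - x) < 0.4]),
  y = t$year, x = t$x)

result    # 0 0 30 10
```

The two 2019 rows have no previous year in the data, so they sum to 0. The third row picks up both 2019 rows (10 + 20), and the fourth row only the first one.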