So many friends and family members have had babies in the past few years, and I can’t think of a single baby whose name has been repeated. Are parents getting more creative? Is it a function of the diversity around me (in NYC / CT / Boston)? Is it a statistical anomaly?
In this post I will do some quick data crunching to get a sense of high-level baby-naming trends. Are we indeed seeing more names in the books? Are parents disfavoring the popular names?
I will use data from the Social Security Administration, which we imported in the last post, and will make heavy use of the dplyr library for data manipulation and summarization.
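For reference, here is a minimal setup sketch for the code below (library calls only; the files data frame, with name, year, gender, state and count columns, is assumed to come from the previous post's import):

# Setup sketch -- libraries used throughout this post.
# The `files` data frame (columns: name, year, gender, state, count)
# is assumed to have been built in the previous post.
library(dplyr)      # group_by(), summarise(), mutate(), n_distinct()
library(reshape2)   # melt(), dcast()
library(googleVis)  # gvisLineChart()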
How has the number of distinct names changed over time?
With 5.5M rows, this data set is already too big for Excel, so this analysis is a non-starter. Even if it did fit, this would still not be a straightforward analysis in Excel. That’s because names are repeated across states, and there is no good way of counting unique instances. However, using just one state’s worth of data, this is a simple PivotTable and charting exercise:

This is using Colorado’s data. The number of boys’ and girls’ names has nearly quadrupled over the past 100 years, and female names started outpacing male names in the mid-1940s.
Analysis with R
As usual with R, there are dozens of ways to do the same thing. We could use SQL (see my SQL in R post), we could use built-in functions like aggregate(), by(), or ave(), or we could use the plyr or doBy libraries. We can go on and on.
But my new personal favorite is dplyr: (1) it's fast, and (2) its syntax is consistent and simple.
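For comparison, here is a rough base-R sketch of the same unique-name count using aggregate() (assuming the files data frame described above); dplyr gets us there more cleanly, as shown below:

# Rough base-R equivalent of the dplyr summary in Step 1:
# count distinct names per year and gender with aggregate()
base_counts = aggregate(name ~ year + gender, data = files,
                        FUN = function(x) length(unique(x)))
names(base_counts)[3] = "unique_names"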
Set up data
Let’s first replicate the PivotTable:
# Step 1: summarize the data
names_count = group_by(files, year, gender)
names_count = summarise(names_count,
                        unique_names = n_distinct(name),
                        names_per_1M = unique_names / (sum(count) / 1000000))
To summarize data with dplyr, we first specify the columns we wish to group by. That's what we do in line 2, grouping the files data frame by year and gender. This does not visibly change the data; it just sets it up for by-group processing. In line 3, the summarise() function creates two new columns:
- unique_names: Using the n_distinct() function, we get the number of distinct names. This is the function that's missing in Excel; counting distinct items in Excel is not that easy.
- names_per_1M: Normalizes the name count by the number of births. We might see fewer names way back in the day simply because there were fewer births. Normalizing takes care of that, so comparing years becomes apples-to-apples. This is like a calculated field in a PivotTable. (A toy illustration follows this list.)
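As a quick toy illustration (not part of the analysis itself) of what these two columns compute:

# n_distinct() simply counts unique values, like length(unique(x)) in base R
n_distinct(c("Mary", "Anna", "Mary", "Emma"))   # returns 3

# Normalization example: 1,083 unique girls' names over roughly 352,000
# recorded births in 1910 (figure implied by the first row of the table below)
1083 / (352082 / 1000000)   # ~3076 names per million births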
Here is a snippet of the summary table:
   year gender unique_names names_per_1M
1  1910      F         1083     3075.988
2  1910      M          692     4213.731
3  1911      F         1066     2862.829
4  1911      M          754     3897.849
5  1912      F         1261     2500.654
6  1912      M         1114     2903.302
7  1913      F         1350     2381.200
8  1913      M         1256     2721.053
9  1914      F         1549     2222.812
10 1914      M         1494     2504.913
Now let’s graph it. For some graphics packages, like ggplot2, this data frame would be sufficient. But we will use the googleVis package, which requires that each series be represented by a column, just like Excel charts. So let’s transpose the gender column:
# Step 2: transpose the gender column
uniq_names_count = dcast(names_count, year ~ gender, value.var = "unique_names")
names_per_1M = dcast(names_count, year ~ gender, value.var = "names_per_1M")
I created two new data frames, one holding the number of unique names by gender, and the other holding the number of names per million births. Instead of creating two data frames, I could have also "melted" the data frame with melt() and then transposed it with dcast(). But for simplicity, I'm doing it this way.
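For the curious, a rough sketch of that melt-then-cast alternative might look like this (same reshape2 functions, column names as defined in Step 1):

# Alternative sketch: melt both measures into long form, then cast once
melted = melt(as.data.frame(names_count),
              id.vars = c("year", "gender"),
              measure.vars = c("unique_names", "names_per_1M"))
both_wide = dcast(melted, year ~ gender + variable)
# yields columns like F_unique_names, F_names_per_1M, M_unique_names, ...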
We use the dcast() function to transpose the gender column. This function is part of the reshape2 library. dcast() takes a formula, year ~ gender, as input to indicate how the new table should be structured. If you think of it like a PivotTable, the year column is on the left-hand side of the PivotTable, gender is in the top part, and names_per_1M is in the values part of the PivotTable.
Here is an excerpt of the names_per_1M data frame:
   year        F        M
1  1910 3075.988 4213.731
2  1911 2862.829 3897.849
3  1912 2500.654 2903.302
4  1913 2381.200 2721.053
5  1914 2222.812 2504.913
6  1915 1998.805 2226.530
7  1916 1950.397 2185.154
8  1917 1912.197 2139.653
9  1918 1873.776 1992.210
10 1919 1908.073 2124.031
Now we’re ready to plot it using the googleVis library:
# Step 3: plot the chart with googleVis
plot(gvisLineChart(names_per_1M,
                   xvar = "year",
                   yvar = c("F", "M"),
                   options = list(title = "Number of Names Per Million Births",
                                  hAxis = "{title: 'Year'}",
                                  vAxis = "{title: '# of Names'}")))
googleVis, though limited in features compared to ggplot2, has two great advantages: (1) it is interactive and web-ready, and (2) its syntax is significantly simpler than ggplot2's. In this example, I specified the data frame (names_per_1M), the x-variable, and the y-variables. The options are purely aesthetic.
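For comparison, a rough ggplot2 equivalent (a sketch, using the long-format names_count data frame from Step 1 rather than the transposed one) might look like this:

# Rough ggplot2 equivalent, working off the long-format summary
library(ggplot2)
ggplot(names_count, aes(x = year, y = names_per_1M, colour = gender)) +
  geom_line() +
  labs(title = "Number of Names Per Million Births",
       x = "Year", y = "# of Names")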
[Charts: number of unique names by year (left) and number of names per million births (right)]
I only showed code for the graph on the right, but the graph on the left is essentially the same, only referring to the other data frame. The first chart shows that the number of names has clearly been growing for many decades. In 1911, about 1,800 names covered the US population; today that number is more than 10,000. In the early 1940s, the number of female names started growing at a faster pace than that of male names, and today there are approximately 1,500 more female names than male names.
The second chart shows an interesting twist: the dip from the 1940s to the late 1950s. For every million births, this period had the fewest distinct names. These are the post-WWII baby boomers: so many babies born, with fewer names to go around. This exercise shows why normalizing is so important; it tells a different story from the non-normalized view.
The number of names per million births has nearly quadrupled in the past 60 years. So parents do, indeed, have more names to choose from these days.
Are popular names less popular?
Ok fine, so we see more distinct names showing up. But the lion's share of newborns must still have the core popular names, right? To answer this question, let's look at the percent of the population having the top 50 names. That is, let's take the top 50 names in each state and tally up the percent of people in that state who have those names.
This would be a dreadful analysis in Excel. To figure out the top 50 names in each state, I could imagine a PivotTable counting births in a state-vs-name matrix, flagging the top 50 from each name column, and vlookup'ing that into the raw table. To repeat that for every year? Forget about it; Excel just isn't made for this kind of database analysis.
Let’s do it in R with some dplyr functions:
# flag top 50 names in each state
top_by_state = group_by(files, year, state, gender) %.%
  mutate(group = ifelse(rank(desc(count), ties.method = "first") <= 50,
                        "TopX", "Other"))

# summarise births by top vs other, across the country
top_national = group_by(top_by_state, gender, year, group) %.%
  summarise(births = sum(count))

# transpose the gender and group columns
top_national = dcast(top_national, year ~ gender + group, value.var = "births")

# calculate % of births among top names
top_national$M_topx_pct = top_national$M_TopX / (top_national$M_Other + top_national$M_TopX)
top_national$F_topx_pct = top_national$F_TopX / (top_national$F_Other + top_national$F_TopX)
In the first chunk, I find the top 50 boys' and girls' names in each state every year. Frankly, this is the killer step of the whole analysis. In fact, once we do this, we could easily send the result back to Excel and run a PivotTable. But let's stick with R.
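As a quick sanity check (a sketch, not part of the original workflow), we can verify that no year/state/gender group received more than 50 "TopX" flags:

# Sanity check: each year/state/gender group should have at most 50 TopX names
check = summarise(group_by(top_by_state, year, state, gender),
                  n_top = sum(group == "TopX"))
table(check$n_top)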
As usual, we first use group_by() whenever we want to operate "by" something; in this case we want to create the Top vs Other flag by year, state and gender. Then with mutate() we use rank() to assign a "Top" vs "Other" flag to each name. The rank() function works like the Excel version of RANK(). The rest is standard summarizing and transposing.
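Here is a toy illustration (hypothetical numbers, not from the data) of how desc() and rank() combine to put the biggest counts first:

# Toy example: rank counts from largest to smallest, breaking ties by position
counts = c(500, 120, 340, 340)
rank(desc(counts), ties.method = "first")
# returns 1 4 2 3 -- the largest count gets rank 1, so "<= 50" keeps the top 50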
Here is what the resulting top_national data frame looks like:
  year F_Other F_TopX M_Other M_TopX M_topx_pct F_topx_pct
1 1910  135787 216295   41545 122680  0.7470239  0.6143313
2 1911  142400 229959   49814 143626  0.7424835  0.6175734
3 1912  198754 305514  114399 269302  0.7018538  0.6058564
4 1913  224439 342502  141875 319711  0.6926358  0.6041228
5 1914  278652 418213  189011 407417  0.6830950  0.6001349
6 1915  368441 540102  255586 525899  0.6729483  0.5944705
The last two columns are the ones we care about. The first row shows that about 75% of baby boys had one of the top 50 names in 1910. Wow! The remaining 25% of boys shared the other 1,000+ names. Now let's plot the trend over time:
plot(gvisLineChart(top_national,
                   xvar = "year",
                   yvar = c("M_topx_pct", "F_topx_pct"),
                   options = list(title = "% of newborns having the top 50 names",
                                  vAxis = "{minValue:0}")))
Interesting. Starting in the early 80s, we see a decline in the popularity of popular names. Nowadays, about 36% of baby boys have one of the 50 most popular names in their state; 60 years ago, 60% of boys did. A similar effect is happening with girls' names.
Conclusion
Parents these days have more names to pick from when naming their kids. Have parents gotten more creative, or is this a function of growing ethnic diversity in America? Who knows. We are also seeing that popular names are just not that popular anymore. When taking the top 50 names in each state each year, we are seeing a smaller and smaller percentage of babies having those names.
With just a few lines of code, we did in R what would have taken significantly more effort in Excel. This is a case where I would pick R over Excel, without a doubt. Stay tuned for more interesting analyses in future posts.
Comments
Nice post. One comment, though: I don't think scaling the number of unique names by the population size is the ideal way.
The expected scaling will depend on name frequency, but, for example, if names behaved like words and roughly followed Zipf’s Law, you’d expect to get about log(N) unique names from N people when there was no underlying change in the name distribution. I don’t think the data really support a decrease in name diversity in the early 20th century.
Hi Thomas, thanks for your feedback! How would you suggest scaling? I chose this simple version because it’s straightforward enough. All I wanted was to compare frequencies over time, and I figured a per-capita normalization would do that for me.
I do agree the drop in the early 20th century looks odd. I first assumed that drop was the tail-end of a big immigration wave (babies having more German / Irish names, etc). But it might actually be a structural issue w.r.t. the data and record keeping in those early days.
How would you have done this?