Data Vis Diary: Words, words, words

I discovered Google’s ngrams last week through my data vis class. Basically, the Google Books project digitized millions of books and scanned all of the words, amassing a giant database of billions of words published in the past 200 years. What’s really cool is that Google has built an online viewing tool that you can use to graph the frequency of any word you are interested in, over any time period in the last 200 years. The results look like this:

I made this graph to look at the changing conversation (well, publication) about science ideas from 1800 to the present. The yellow line is “oil”, peaking in the first half of the 20th century. The green line for “computer” rises sharply, out of nowhere, just before 1960. “DNA” rises at about the same time as “computer”, but never reaches the same frequency. The pink and light blue lines for “chemistry” and “physics” remain fairly stable. The highest peak on this graph is “oil” at 0.013%, which means that about 1 out of every 7,700 words published in 1940 was “oil”.
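If you want to turn the viewer's percentages into a "1 out of every N words" figure yourself, the conversion is just 100 divided by the percentage. Here is a minimal sketch; the function name `one_in_n` is my own invention for illustration, not anything from Google's tool.

```python
def one_in_n(percent: float) -> float:
    """Convert a frequency in percent to N, as in '1 out of every N words'."""
    if percent <= 0:
        raise ValueError("percent must be positive")
    # A frequency of p percent means p words per 100, so one word per 100/p.
    return 100.0 / percent


# The peak for "oil" read off the graph above (0.013%).
oil_n = one_in_n(0.013)
print(f"oil: about 1 in {oil_n:,.0f} words")
```

For the “oil” peak this comes out a little under 1 in 8,000.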

It’s fun to play with the viewer, because you can track the popularization of new ideas and new vocabulary, in addition to watching words die out and fade from common usage. More than 15 million books have been included in the database, approximately 12 percent of all the books ever published. The project is called n-grams because the database is searchable by sequences of n words, from single words (1-grams) up to five-word phrases (5-grams), and the viewer accepts any n between 1 and 5. Google has data sets for several languages, but I’ve only considered the English database here.
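To make the idea of an n-gram concrete, here is a small sketch that pulls n-grams out of a sentence and computes a relative frequency. It assumes plain whitespace splitting and lowercasing, which is far simpler than whatever tokenization Google actually uses, and the sample sentence and the `ngrams` helper are made up for illustration.

```python
from collections import Counter


def ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive words in text (naive whitespace split)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# A toy "corpus" standing in for a year's worth of books.
sample = "the cat sat on the mat and the cat slept"

bigram_counts = Counter(ngrams(sample, 2))
total_bigrams = sum(bigram_counts.values())

# Relative frequency of one 2-gram among all 2-grams in the sample.
freq = bigram_counts[("the", "cat")] / total_bigrams
print(f'"the cat" is {freq:.1%} of all 2-grams in the sample')
```

The percentages in the graphs are the same kind of ratio, just computed over the books published in a given year instead of one sentence.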

For this view, I chose controversial science terms. They are mostly 2-grams, except “genetics” in purple, which is the top line, peaking around 2000. This graph shows 1950-2008. “Natural selection”, in blue, remains relatively stable across the middle of the graph. In yellow, “stem cells” gains slowly from about 1965. I liked comparing the stability of “natural selection” to the quick popularization of both “climate change” and “global warming”, which increase rapidly in usage starting in 1985. Then by 1995, “global warming” flat-lines while “climate change” continues to rise. The changing vocabulary reflects the recognition that the impacts of the increased CO2 in our atmosphere are creating more uncertainty in the climate system, not just straight warming. The peak frequency on this graph is “genetics”, around 2000, at 0.00055%, which represents about 1 out of every 180,000 words.

These graphs reveal some interesting trends in ideas, but they are just scratching the surface of what this database can accomplish. I read a cool data paper published in Science last year that ran some broad-scale analyses of the data. Their first move was to use the database to figure out how many words actually make up the English language. They totaled all of the 1-grams, then eliminated misspellings and foreign words.

“Using this technique, we estimated the number of words in the English lexicon as 544,000 in 1900, 597,000 in 1950, and 1,022,000 in 2000. The lexicon is enjoying a period of enormous growth: The addition of ~8500 words/year has increased the size of the language by over 70% during the past 50 years.”

The researchers also used the data set to watch English grammar evolve. The analysis focused on verbs with both regular and irregular past-tense forms, like “burned” and “burnt”. The data show some irregular forms disappearing and others holding strong. They analyzed cultural significance by studying the names of famous people, charting how long they continued to be written about after their peak of popularity. They even mapped how long it takes, on average, to become famous after birth in different fields: actors and actresses (about 30 years), politicians (50 years), and mathematicians (hardly ever).

It’s an incredibly versatile data set. I think that no matter your interests, you can find language trends that relate to your research questions. I’m going to tackle a more specific question about the changing vocabulary of healthy eating on Eating Science. If you’ve run any cool n-gram searches, please share your findings in the comments. Thanks!


