### 1. Introduction.

The Google Ngram dataset is a freely accessible database that represents years of human discourse in multiple languages by counting ngrams (word sequences) scanned from the Google Books corpus. With the Google Ngram Viewer, you can easily track the occurrences of 1grams from 1800 to 2008 in eight languages. These datasets were generated in 2012 (version 2) [Lin et al. 2012] and in 2009 (version 1). Large portions of the dataset can be downloaded directly from this link, Google Ngram Datasets V2.

Purpose. The Python scripts in this repository provide an easy way to iteratively download, filter, and normalize these large datasets for researchers interested in data science, mathematical modeling, computational linguistics, historical linguistics, and/or content analysis. This kind of dataset has been used to study culture and cultural trends, a field also known as culturomics [Michel 2010]. Although the dataset is impressive, it is not immune to sampling bias: researchers have shown that word frequency trends taken from the Google Books corpus are dominated by scientific literature [Pechenick 2015].

The provided script downloadAndFilter.ngram.sh lets you easily download and filter the entire dataset of a chosen language with specific parameters. The Python script normalize.ngram.py takes the filtered data and organizes the yearly occurrences of each unique ngram into a matrix, where each row represents an ngram and each column represents a year. Together, downloadAndFilter.ngram.sh and normalize.ngram.py create three directories corresponding to the raw, filtered, and normalized data. The raw directory only contains information about the files that were downloaded and filtered, because the scripts delete these large files as soon as normalization is complete.

The commands below use Bash to download and process the Google Ngram data for Simplified Chinese (chi-sim) as a demonstration. See Section 3 for language codes.

#### 2.2. Normalizing the Filtered Data (with specified parameters).

We follow the normalization process of Sindi and Dale [Sindi and Dale 2016]. Given a set of words $V = \{w_1, w_2, \cdots, w_c\}$ and years $Y = \{t_0, t_1, t_2, \cdots, t_{T-1} \}$, the frequency of a word $w_i$ in a corpus at time $t$, $freq(r_{i,t}) = x_{i,t}$, is the number of occurrences of that word in the corpus in that year. The total number of words $c$ is fixed, as is the number of years $T$, so we represent the word frequencies as a matrix $\mathbf{R} \in \mathbb{R}^{c \times T}$ where

$$\mathbf{R}_{i,t} = freq(r_{i,t}) = x_{i,t}, \hspace{15px} x_{i,t} \ge 1$$

In our normalization process, we first convert the frequency matrix $\mathbf{R}$ into a proportion matrix $\mathbf{P}$ by normalizing the columns of $\mathbf{R}$ which normalizes word frequencies by year:

$$\mathbf{P}_{i,t} = p_{i,t}, \hspace{15px} p_{i,t} = \frac{x_{i,t}}{\sum_{j=1}^{c} x_{j,t}}.$$

Finally, we normalize the proportions for each unigram by converting the rows of $\mathbf{P}$ into z-scores:

$$\mathbf{Z}_{i,t} = z_{i,t}, \hspace{15px} z_{i,t} = \frac{p_{i,t} - \overline{p_{i}}}{\sigma_{p_{i}}}$$

where $\overline{p_{i}}$ is the mean and $(\sigma_{p_{i}})^2$ is the variance of the $i$th row of $\mathbf{P}$;

$$\overline{p_{i}} = \frac{1}{T} \sum_{t=0}^{T-1} p_{i,t}$$

and

$$(\sigma_{p_{i}})^2 = \frac{1}{T-1} \sum_{t=0}^{T-1} (p_{i,t} - \overline{p_{i}})^2.$$

The matrix $\mathbf{P}$ has the following properties.

1. $\sum_{i=1}^c p_{i,t} = 1$
2. $\sum_{i=1}^c \overline{p_{i}} = 1$
3. $\sum_{i=1}^c \sum_{j=1}^c \overline{p_{i}}\overline{p_{j}} = 1$
4. $\sum_{i=1}^c \sum_{t=0}^{T-1} p_{i,t} = \sum_{t=0}^{T-1} \sum_{i=1}^c p_{i,t} = T$
5. $\sum_{t=0}^{T-1} \frac{1}{c} \sum_{i=1}^{c} p_{i,t} = \frac{T}{c}$

The matrix $\mathbf{Z}$ has the following properties.

1. $\frac{1}{T} \sum_{t=0}^{T-1} z_{i,t} = 0$
2. $\frac{1}{T-1} \sum_{t=0}^{T-1} \left(z_{i,t} - 0 \right)^2 = 1$
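As a quick numerical sketch (a toy frequency matrix with NumPy; the repository's scripts do this at scale on the downloaded data), the two normalization steps and several of the properties above can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frequency matrix R: c words (rows) by T years (columns), counts >= 1.
c, T = 5, 10
R = rng.integers(1, 100, size=(c, T)).astype(float)

# Column-normalize R into the proportion matrix P (each year's column sums to 1).
P = R / R.sum(axis=0, keepdims=True)

# Row-wise z-scores: subtract each word's mean proportion and divide by its
# sample standard deviation (ddof=1 matches the 1/(T-1) variance above).
Z = (P - P.mean(axis=1, keepdims=True)) / P.std(axis=1, ddof=1, keepdims=True)

# Check the stated properties numerically.
assert np.allclose(P.sum(axis=0), 1.0)          # property 1 of P
assert np.allclose(P.mean(axis=1).sum(), 1.0)   # property 2 of P
assert np.allclose(Z.mean(axis=1), 0.0)         # property 1 of Z
assert np.allclose(Z.var(axis=1, ddof=1), 1.0)  # property 2 of Z
```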

#### 2.3. Downloading for $n > 1$.

Downloading ngrams for $n > 1$ requires a large amount of memory and disk space (over 1 terabyte). To process ngrams with $n > 1$, set the optional parameter specific_fileName='1gram-list-special-chi-sim'. This option tells the machine to search the '1gram-list' directory for a file named '1gram-list-special-chi-sim' that contains a list of 1grams; only ngrams containing these 1grams are kept. For example, the 1gram '特' results in the 2grams '一 特' and '特 一', etc. Setting this parameter to specific_fileName='all' while $n > 1$ will process all ngrams.

You can create your own list of 1grams and save it in the '1gram-list' directory. You can also try the included stop-word lists (e.g. '1gram-list-stop-word-chi-sim').
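For illustration, a custom list can be written as a plain-text file in the '1gram-list' directory. The one-1gram-per-line format and the filename here are assumptions; compare against the included lists before relying on them:

```python
import os

# Create the '1gram-list' directory if it does not exist yet.
os.makedirs('1gram-list', exist_ok=True)

# Hypothetical custom list: one 1gram per line (assumed format).
my_1grams = ['特', '一', '中']
with open('1gram-list/1gram-list-custom-chi-sim', 'w', encoding='utf-8') as f:
    f.write('\n'.join(my_1grams) + '\n')
```

The file name (without the directory) would then be passed as specific_fileName='1gram-list-custom-chi-sim'.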

import googleNgram as gn
n = '1'        # the n of the ngram, as a string
l = 'chi-sim'  # language code (see the list below)
D = gn.read(n, l, ignore_case=True, restriction=True, annotation=False)

1. 'n': (string) The n of the ngram must be a string (e.g. n='1'; only 1grams are available for the preprocessed data).
2. 'l': (string) The available languages for the preprocessed data are the following.
• 'eng' (English)
• 'eng-us' (American English)
• 'eng-gb' (British English)
• 'eng-fiction' (English Fiction)
• 'chi-sim' (Simplified Chinese)
• 'fre' (French)
• 'ger' (German)
• 'heb' (Hebrew)
• 'ita' (Italian)
• 'rus' (Russian)
• 'spa' (Spanish)
3. 'ignore_case': (boolean, optional, default = True) The ngrams are lowercased and the raw counts of any two identical ngrams are consolidated.
4. 'restriction': (boolean, optional, default = True) Ngrams with a raw count of zero in any year are excluded.
5. 'annotation': (boolean, optional, default = False) The part-of-speech annotation (the '_NOUN' part of the ngram string) is removed from each ngram and the raw counts of any two identical ngrams are consolidated.
6. 'specific_fileName': (string, optional, default = 'all' for $n = 1$ only) The filename of a list of 1grams to normalize when $n > 1$ (see Section 2.3).

The parameters ignore_case, restriction, and annotation must match the parameters used during normalization in Section 2.

The output is a dictionary with 'rscore', 'kscore', 'pscore', 'zscore', and 'pos' as keys. Each key maps to a DataFrame whose rows are the ngrams and whose columns are the years.

1. 'rscore': The raw counts.
2. 'pscore': The probability scores.
3. 'zscore': The z-scores.
4. 'pos': The part-of-speech annotation vector associated with a given ngram when annotations are consolidated. Consolidation means that the counts of a word with multiple annotations are summed. For example, the raw ngram strings one_NUM and one_NOUN result in the single entry one in the vocabulary and [' NUM NOUN'] in the part-of-speech vector, meaning the 1gram one is annotated as a number NUM and/or as a noun NOUN.
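Given a dictionary of this shape, a single ngram's time series can be pulled out with standard pandas indexing. The toy DataFrame below stands in for the output of gn.read (the real values come from the normalized data):

```python
import pandas as pd

# Toy stand-in for the dictionary returned by gn.read:
# rows are ngrams, columns are years.
years = list(range(1800, 1805))
D = {'zscore': pd.DataFrame([[0.1, -0.2, 0.0, 0.3, -0.2],
                             [1.0, -1.0, 0.5, -0.5, 0.0]],
                            index=['one', 'two'], columns=years)}

# Extract the z-score time series of one 1gram as a pandas Series.
series = D['zscore'].loc['one']
print(series.idxmax())  # → 1803, the year with the highest z-score
```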

The part-of-speech annotations.

Version 2 of the Google Ngram raw data includes part-of-speech annotations on each ngram. Refer to the table below for the list of annotations.

| part of speech | annotation |
| --- | --- |
| noun | NOUN |
| verb | VERB |
| pronoun | PRON |
| determiner and article | DET |
| numeral | NUM |
| conjunction | CONJ |
| particle | PRT |
| other | O |
| none | (unannotated) |

The Python script below loads the preprocessed datasets and produces basic time-series plots for a given list of 1gram strings.

Note. The vocabulary is limited: words that did not occur in the raw data, or that were filtered out during data processing and normalization, will not appear in the vocabulary variable. The script below may raise an error if a given ngram string does not exist in the vocabulary.
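One way to avoid such errors is to drop the requested words that are missing from the vocabulary before indexing. This is a minimal sketch with a toy DataFrame standing in for one of the score DataFrames:

```python
import pandas as pd

# Toy z-score DataFrame standing in for D['zscore'].
df = pd.DataFrame([[0.5, -0.5], [1.2, -1.2]],
                  index=['one', 'two'], columns=[1800, 1801])

# Keep only the requested 1grams that actually exist in the vocabulary,
# so missing words do not raise a KeyError.
requested = ['one', 'three']
present = [w for w in requested if w in df.index]
subset = df.loc[present]
print(list(subset.index))  # → ['one']
```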

#### 3.2. Subsetting the Data According to a Word List from a File.

Suppose you want to select only specific rows of the DataFrame. You can do this with a list of words (stopwords, for example) and use that list to choose the rows corresponding to those words. Below is an example of how to subset the DataFrame using a list of words from a file.
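A minimal sketch of this pattern, with a toy DataFrame and a hypothetical word-list file (one word per line, the same format assumed for the included stop-word lists):

```python
import pandas as pd

# Toy DataFrame standing in for one of the score DataFrames.
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                  index=['the', 'of', 'cat'], columns=[1800, 1801])

# Hypothetical word-list file: one word per line.
with open('word-list.txt', 'w', encoding='utf-8') as f:
    f.write('the\nof\nmissing\n')

with open('word-list.txt', encoding='utf-8') as f:
    words = [line.strip() for line in f if line.strip()]

# Use a membership mask so words absent from the vocabulary are skipped.
subset = df.loc[df.index.isin(words)]
print(list(subset.index))  # → ['the', 'of']
```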

# References

1. Sindi, Suzanne S., and Rick Dale. "Culturomics as a data playground for tests of selection: Mathematical approaches to detecting selection in word use." Journal of Theoretical Biology 405 (2016): 140-149.

2. Pechenick, Eitan Adam, Christopher M. Danforth, and Peter Sheridan Dodds. "Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution." PLoS ONE 10.10 (2015): e0137041.

3. Lin, Yuri, et al. "Syntactic annotations for the Google Books Ngram corpus." Proceedings of the ACL 2012 System Demonstrations, Association for Computational Linguistics, 2012, pp. 169-174.

4. Michel, Jean-Baptiste, et al. "Quantitative analysis of culture using millions of digitized books." Science (2010): 1199644.