The data files in this directory are derived from the [Google Web Trillion Word Corpus][corpus], as described by Thorsten Brants and Alex Franz, and [distributed][distributed] by the Linguistic Data Consortium. Note that this data **"may only be used for linguistic education and research"**, so for any other usage you should acquire a different data set.

[corpus]: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
[distributed]: https://catalog.ldc.upenn.edu/LDC2006T13