instant-segment/data/README.md

The data files in this directory are derived from the [Google Web Trillion Word
Corpus][corpus], as described by Thorsten Brants and Alex Franz, and [distributed][distributed] by the
Linguistic Data Consortium. Note that this data **"may only be used for linguistic
education and research"**, so for any other usage you should acquire a different data set.

[corpus]: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
[distributed]: https://catalog.ldc.upenn.edu/LDC2006T13