Example data files
The data files in this directory are derived from the Google Books Ngram data (version 3, 2020-02-17) and the SCOWL word list.
Recreating these files
The code used to build en-unigrams.txt and en-bigrams.txt is included
in this repository. To recreate the files, first download the input
data. Note that this downloads about 300 GB of data and unpacks it to
about 1.4 TB.
instant-segment $ cd data
data $ python3 grab.py
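Because the unpacked data is so large, it may be worth checking free disk space before starting the download. The following is a minimal sketch and not part of this repository: the helper name `has_enough_space` is hypothetical, and the threshold simply reuses the approximate 1.4 TB figure quoted above. Whether the compressed downloads are kept alongside the unpacked files is not stated here, so treat the threshold as a lower bound.

```python
import shutil

# Approximate unpacked size quoted in this README (1.4 TB), in bytes.
# This is an assumption taken from the text, not a measured value.
REQUIRED_BYTES = 1_400 * 1000**3


def has_enough_space(path=".", required=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has at least `required` bytes free."""
    return shutil.disk_usage(path).free >= required


if __name__ == "__main__":
    free = shutil.disk_usage(".").free
    print(f"free: {free / 1000**4:.2f} TB, needed: about {REQUIRED_BYTES / 1000**4:.2f} TB")
    if not has_enough_space("."):
        print("warning: probably not enough free space for the unpacked data")
```

Run it from the data directory (or pass the path of the target filesystem) so that the check applies to the disk that will actually hold the files.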
After the data has been downloaded, run the merge tool to create the
word lists:
instant-segment $ cargo run --release --example merge
License
The SCOWL word list is covered by several licenses, detailed in LICENSE-SCOWL; the SCOWL website describes them as MIT-like.
The Ngram data provided by Google is available under the Creative Commons Attribution 3.0 Unported license:
This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.