# Example data files
The data files in this directory are derived from the Google Books Ngram data
([version 3, 2020-02-17][ngrams]) and the [SCOWL][scowl] word list.
## Recreating these files
The code used to build `en-unigrams.txt` and `en-bigrams.txt` is included
in this repository. To recreate the files, first download the input data.
Note that this downloads about 300 GB of data, which unpacks to about
1.4 TB.
```
instant-segment $ cd data
data $ python3 grab.py
```
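Given the sizes involved, it may be worth checking free disk space before starting the download. The following is a minimal sketch and not part of this repository. It assumes decimal units for the figures quoted above, and conservatively assumes that the downloaded archives and the unpacked data occupy the disk at the same time. The name `has_enough_space` is hypothetical.

```python
import shutil

# Sizes quoted in this README. Decimal units (1 GB = 10**9 bytes) are an assumption.
DOWNLOAD_BYTES = 300 * 10**9   # about 300 GB of downloaded data
UNPACKED_BYTES = 1400 * 10**9  # about 1.4 TB once unpacked


def has_enough_space(path=".", required=DOWNLOAD_BYTES + UNPACKED_BYTES):
    """Return True if the filesystem holding `path` has at least `required` bytes free.

    The default assumes both the downloaded and the unpacked data must fit
    at once, which may overestimate what is actually needed.
    """
    free = shutil.disk_usage(path).free
    return free >= required


if __name__ == "__main__":
    if has_enough_space():
        print("Enough free space for the download and the unpacked data.")
    else:
        print("Not enough free space; consider pointing the download at a larger disk.")
```

Run it from the `data` directory (or pass another path) so that the check applies to the filesystem that will hold the data.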
After the data has been downloaded, run the `merge` tool to create the word lists:
```
instant-segment $ cargo run --release --example merge
```
## License
The SCOWL word list is distributed under several licenses, detailed in
[LICENSE-SCOWL](./LICENSE-SCOWL), which the SCOWL website describes as MIT-like.
The Ngram data provided by Google is available under the Creative Commons
Attribution 3.0 Unported license:
> This work is licensed under the Creative Commons Attribution 3.0 Unported
> License. To view a copy of this license, visit
> http://creativecommons.org/licenses/by/3.0/ or send a letter to Creative
> Commons, PO Box 1866, Mountain View, CA 94042, USA.
[ngrams]: https://storage.googleapis.com/books/ngrams/books/datasetsv3.html
[scowl]: http://wordlist.aspell.net/