Update README.md

Beau Hartshorne 2021-08-18 13:18:56 -07:00 committed by GitHub
parent fdd743478e
commit 8230ac6ed5
GPG Key ID: 4AEE18F83AFDEB23
1 changed file with 1 addition and 9 deletions

@@ -14,14 +14,8 @@ which is in turn based on code from Peter Norvig's chapter [Natural Language
Corpus Data][chapter] from the book [Beautiful Data][book] (Segaran and
Hammerbacher, 2009).
-The data files in this repository are derived from the [Google Web Trillion Word
-Corpus][corpus], as described by Thorsten Brants and Alex Franz, and
-[distributed][distributed] by the Linguistic Data Consortium. Note that this
-data **"may only be used for linguistic education and research"**, so for any
-other usage you should acquire a different data set.
For the microbenchmark included in this repository, Instant Segment is ~100x
-faster than the Python implementation. The API has been carefully constructed
+faster than the Python implementation. The API was carefully constructed
so that multiple segmentations can share the underlying state to allow parallel
usage.
@@ -107,7 +101,5 @@ make test-python
[python]: https://github.com/grantjenks/python-wordsegment
[chapter]: http://norvig.com/ngrams/
[book]: http://oreilly.com/catalog/9780596157111/
-[corpus]: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
-[distributed]: https://catalog.ldc.upenn.edu/LDC2006T13
[issues]: https://github.com/InstantDomainSearch/instant-segment/issues