Update README.md
parent fdd743478e
commit 8230ac6ed5
@@ -14,14 +14,8 @@ which is in turn based on code from Peter Norvig's chapter [Natural Language
 Corpus Data][chapter] from the book [Beautiful Data][book] (Segaran and
 Hammerbacher, 2009).
 
 The data files in this repository are derived from the [Google Web Trillion Word
 Corpus][corpus], as described by Thorsten Brants and Alex Franz, and
 [distributed][distributed] by the Linguistic Data Consortium. Note that this
 data **"may only be used for linguistic education and research"**, so for any
 other usage you should acquire a different data set.
 
 For the microbenchmark included in this repository, Instant Segment is ~100x
-faster than the Python implementation. The API has been carefully constructed
+faster than the Python implementation. The API was carefully constructed
 so that multiple segmentations can share the underlying state to allow parallel
 usage.
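The sharing the hunk above describes — multiple segmentations reusing one set of underlying state in parallel — can be sketched as the usual Rust pattern: an immutable model shared read-only via `Arc`, with each thread holding its own mutable search scratch state. This is an illustrative sketch only, not the crate's actual API; the names `Model`, `Search`, and `segment`, and the toy greedy longest-prefix matcher, are hypothetical stand-ins for the real probabilistic search.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

// Immutable language-model state: built once, then shared read-only.
struct Model {
    unigrams: HashMap<String, f64>,
}

// Per-thread mutable scratch state, so threads never contend on the model.
struct Search {
    best: Vec<String>,
}

// Toy segmentation: greedily take the longest prefix found in the model.
// (Hypothetical stand-in for a real scoring search.)
fn segment(model: &Model, search: &mut Search, input: &str) {
    search.best.clear();
    let mut rest = input;
    while !rest.is_empty() {
        let mut len = rest.len();
        while len > 1 && !model.unigrams.contains_key(&rest[..len]) {
            len -= 1;
        }
        search.best.push(rest[..len].to_string());
        rest = &rest[len..];
    }
}

fn main() {
    let mut unigrams = HashMap::new();
    for w in ["instant", "segment", "choose", "spain"] {
        unigrams.insert(w.to_string(), 1.0);
    }
    // One shared model; each thread gets its own Search.
    let model = Arc::new(Model { unigrams });

    let handles: Vec<_> = ["instantsegment", "choosespain"]
        .into_iter()
        .map(|input| {
            let model = Arc::clone(&model);
            thread::spawn(move || {
                let mut search = Search { best: Vec::new() };
                segment(&model, &mut search, input);
                search.best
            })
        })
        .collect();

    for h in handles {
        println!("{:?}", h.join().unwrap()); // e.g. ["instant", "segment"]
    }
}
```

The point of the split is that the expensive part (the model) is built once and never mutated, while the cheap per-query state lives in `Search`, so parallel callers need no locking at all.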
@@ -107,7 +101,5 @@ make test-python
 [python]: https://github.com/grantjenks/python-wordsegment
 [chapter]: http://norvig.com/ngrams/
 [book]: http://oreilly.com/catalog/9780596157111/
 [corpus]:
   http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
 [distributed]: https://catalog.ldc.upenn.edu/LDC2006T13
 [issues]: https://github.com/InstantDomainSearch/instant-segment/issues