From 8230ac6ed594870460290bc1d7e7c07752303897 Mon Sep 17 00:00:00 2001
From: Beau Hartshorne
Date: Wed, 18 Aug 2021 13:18:56 -0700
Subject: [PATCH] Update README.md

---
 README.md | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/README.md b/README.md
index 8dcb7f3..37a6d39 100644
--- a/README.md
+++ b/README.md
@@ -14,14 +14,8 @@ which is in turn based on code from Peter Norvig's chapter
 [Natural Language Corpus Data][chapter] from the book [Beautiful
 Data][book] (Segaran and Hammerbacher, 2009).
 
-The data files in this repository are derived from the [Google Web Trillion Word
-Corpus][corpus], as described by Thorsten Brants and Alex Franz, and
-[distributed][distributed] by the Linguistic Data Consortium. Note that this
-data **"may only be used for linguistic education and research"**, so for any
-other usage you should acquire a different data set.
-
 For the microbenchmark included in this repository, Instant Segment is ~100x
-faster than the Python implementation. The API has been carefully constructed
+faster than the Python implementation. The API was carefully constructed
 so that multiple segmentations can share the underlying state to allow
 parallel usage.
 
@@ -107,7 +101,5 @@ make test-python
 [python]: https://github.com/grantjenks/python-wordsegment
 [chapter]: http://norvig.com/ngrams/
 [book]: http://oreilly.com/catalog/9780596157111/
-[corpus]:
-  http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
 [distributed]: https://catalog.ldc.upenn.edu/LDC2006T13
 [issues]: https://github.com/InstantDomainSearch/instant-segment/issues