Add testing

2025-02-16 13:02:10 +00:00 · 2021-04-23 10:10:08 -07:00 · 5e2b1fd054
commit 5e2b1fd054
parent 2b6862e54f
2 changed files with 22 additions and 1 deletion
--- a/README.md
+++ b/README.md
@@ -60,7 +60,7 @@ instant-segment = "*"

 ## Using

-Instant Segment works by segmenting a string into words by selecting the splits with the highest probability given a vocabulary of words and their occurrences.
+Instant Segment works by segmenting a string into words by selecting the splits with the highest probability given a corpus of words and their occurrences.

 For instance, provided that `choose` and `spain` occur more frequently than `chooses` and `pain`, Instant Segment can help you split the string `choosespain.com` into [`ChooseSpain.com`](https://instantdomainsearch.com/search/sale?q=choosespain) which more likely matches user intent.

@@ -125,6 +125,20 @@ Play with the examples above to see that different numbers of occurances will in

 The example above is succinct but, in practice, you will want to load these words and occurrences from a corpus of data like the ones we provide [here](./data). Check out [the](./instant-segment/instant-segment-py/test/test.py) [tests](./instant-segment/instant-segment/src/test_data.rs) to see examples of how you might do that.

+## Testing
+
+To run the tests, run the following:
+
+```
+cargo t -p instant-segment --all-features
+```
+
+You can also test the Python bindings with:
+
+```
+make test-python
+```
+
 [python]: https://github.com/grantjenks/python-wordsegment
 [chapter]: http://norvig.com/ngrams/
 [book]: http://oreilly.com/catalog/9780596157111/
--- a/data/README.md
+++ b/data/README.md
@@ -0,0 +1,7 @@
+The data files in this directory are derived from the [Google Web Trillion Word
+Corpus][corpus], as described by Thorsten Brants and Alex Franz, and [distributed][distributed] by the
+Linguistic Data Consortium. Note that this data **"may only be used for linguistic
+education and research"**, so for any other usage you should acquire a different data set.
+
+[corpus]: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
+[distributed]: https://catalog.ldc.upenn.edu/LDC2006T13