![Cover logo](./cover.svg)

# Instant Segment: fast English word segmentation in Rust

[![Documentation](https://docs.rs/instant-segment/badge.svg)](https://docs.rs/instant-segment/)
[![Crates.io](https://img.shields.io/crates/v/instant-segment.svg)](https://crates.io/crates/instant-segment)
[![Build status](https://github.com/InstantDomainSearch/instant-segment/workflows/CI/badge.svg)](https://github.com/InstantDomainSearch/instant-segment/actions?query=workflow%3ACI)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE-APACHE)

## Partial examples

### Python

```python
import instant_segment

# unigrams() and bigrams() are placeholders for the frequency data;
# this partial example does not define them.
segmenter = instant_segment.Segmenter(unigrams(), bigrams())
search = instant_segment.Search()
# segment() stores its result in `search`, which can then be iterated.
segmenter.segment("instantdomainsearch", search)
print([word for word in search])
# ['instant', 'domain', 'search']
```

### Rust

```rust
use instant_segment::{Search, Segmenter};

// `unigrams` and `bigrams` are assumed to be loaded elsewhere;
// this partial example does not define them.
let segmenter = Segmenter::from_maps(unigrams, bigrams);
let mut search = Search::default();
let words = segmenter
    .segment("instantdomainsearch", &mut search)
    .unwrap();
println!("{:?}", words.collect::<Vec<&str>>());
```
Instant Segment is a fast, Apache-2.0 licensed library for English word segmentation.
It is based on the Python [wordsegment][python] project written by Grant Jenks,
which is in turn based on code from Peter Norvig's chapter [Natural Language
Corpus Data][chapter] from the book [Beautiful Data][book] (Segaran and Hammerbacher, 2009).

The data files in this repository are derived from the [Google Web Trillion Word
Corpus][corpus], as described by Thorsten Brants and Alex Franz, and [distributed][distributed] by the
Linguistic Data Consortium. Note that this data **"may only be used for linguistic
education and research"**, so for any other usage you should acquire a different data set.

For the microbenchmark included in this repository, Instant Segment is ~17x faster than
the Python implementation. Further optimizations are planned; see the [issues][issues].
The API has been carefully constructed so that multiple segmentations can share
the underlying state to allow parallel usage.

[python]: https://github.com/grantjenks/python-wordsegment
[chapter]: http://norvig.com/ngrams/
[book]: http://oreilly.com/catalog/9780596157111/
[corpus]: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
[distributed]: https://catalog.ldc.upenn.edu/LDC2006T13
[issues]: https://github.com/InstantDomainSearch/instant-segment/issues
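
The idea of sharing state between parallel segmentations can be illustrated with a small standard-library sketch. It is not the crate's implementation: `ToyModel`, `Scratch`, and `segment_in_parallel` are hypothetical names, the scoring is a simplified unigram-only scheme in the spirit of Norvig's chapter, and the input is assumed to be ASCII. The point is only the shape of the API: the model is read through `&self`, while each caller owns its mutable scratch buffers.

```rust
use std::collections::HashMap;
use std::thread;

/// Read-only toy language model (hypothetical stand-in, not the crate's type).
struct ToyModel {
    counts: HashMap<String, f64>, // toy unigram counts
    total: f64,                   // sum of all counts
}

/// Caller-owned scratch buffers, reusable across calls.
#[derive(Default)]
struct Scratch {
    best: Vec<f64>,   // best log-score for each prefix length
    back: Vec<usize>, // start index of the last word in that best split
}

impl ToyModel {
    fn new(words: &[(&str, f64)]) -> Self {
        let counts: HashMap<String, f64> =
            words.iter().map(|(w, c)| (w.to_string(), *c)).collect();
        let total = counts.values().sum();
        ToyModel { counts, total }
    }

    /// Log-probability of one word; unknown words get a length-based penalty.
    fn score(&self, word: &str) -> f64 {
        match self.counts.get(word) {
            Some(count) => (count / self.total).ln(),
            None => (1.0 / self.total).ln() - 10.0 * word.len() as f64,
        }
    }

    /// Takes `&self` plus caller-owned scratch, so threads can call it concurrently.
    fn segment(&self, input: &str, scratch: &mut Scratch) -> Vec<String> {
        let n = input.len(); // assumes ASCII, so byte offsets are char offsets
        scratch.best.clear();
        scratch.best.resize(n + 1, f64::NEG_INFINITY);
        scratch.back.clear();
        scratch.back.resize(n + 1, 0);
        scratch.best[0] = 0.0;
        // Dynamic programming over every possible end position of a word.
        for end in 1..=n {
            for start in 0..end {
                let candidate = scratch.best[start] + self.score(&input[start..end]);
                if candidate > scratch.best[end] {
                    scratch.best[end] = candidate;
                    scratch.back[end] = start;
                }
            }
        }
        // Walk the back-pointers from the end to recover the words.
        let mut words = Vec::new();
        let mut end = n;
        while end > 0 {
            let start = scratch.back[end];
            words.push(input[start..end].to_string());
            end = start;
        }
        words.reverse();
        words
    }
}

/// Segments each input on its own thread while sharing one model by reference.
fn segment_in_parallel(model: &ToyModel, inputs: &[&str]) -> Vec<Vec<String>> {
    thread::scope(|scope| {
        let handles: Vec<_> = inputs
            .iter()
            .map(|&input| {
                scope.spawn(move || {
                    let mut scratch = Scratch::default(); // one scratch per thread
                    model.segment(input, &mut scratch)
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Calling `segment_in_parallel(&model, &["instantdomainsearch", "domainsearch"])` with a model built from a handful of toy counts runs both segmentations at the same time against the same `ToyModel`. Because nothing in the model is mutated, no locking is needed.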