![Cover logo](./cover.svg)

# Instant Segment: fast English word segmentation in Rust

[![Documentation](https://docs.rs/instant-segment/badge.svg)](https://docs.rs/instant-segment/)
[![Crates.io](https://img.shields.io/crates/v/instant-segment.svg)](https://crates.io/crates/instant-segment)
[![Build status](https://github.com/InstantDomainSearch/instant-segment/workflows/CI/badge.svg)](https://github.com/InstantDomainSearch/instant-segment/actions?query=workflow%3ACI)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE-APACHE)

## Partial examples

### Python

```python
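import instant_segment

# `unigrams()` and `bigrams()` are placeholders that supply the frequency data;
# they are not defined in this partial example.
segmenter = instant_segment.Segmenter(unigrams(), bigrams())
search = instant_segment.Search()
segmenter.segment("instantdomainsearch", search)
print([word for word in search])
# prints: ['instant', 'domain', 'search']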
```

### Rust

```rust
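use instant_segment::{Search, Segmenter};

// `unigrams` and `bigrams` are the frequency maps, loaded elsewhere (not shown).
let segmenter = Segmenter::from_maps(unigrams, bigrams);
// `Search` holds the mutable state for a segmentation run.
let mut search = Search::default();
let words = segmenter
    .segment("instantdomainsearch", &mut search)
    .unwrap();
println!("{:?}", words.collect::<Vec<&str>>());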
```

Instant Segment is a fast Apache-2.0 library for English word segmentation.
It is based on the Python [wordsegment][python] project written by Grant Jenks,
which is in turn based on code from Peter Norvig's chapter [Natural Language
Corpus Data][chapter] from the book [Beautiful Data][book] (Segaran and Hammerbacher, 2009).

The data files in this repository are derived from the [Google Web Trillion Word
Corpus][corpus], as described by Thorsten Brants and Alex Franz, and [distributed][distributed] by the
Linguistic Data Consortium. Note that this data **"may only be used for linguistic
education and research"**, so for any other usage you should acquire a different data set.

In the microbenchmark included in this repository, Instant Segment is ~17x faster than
the Python implementation. Further optimizations are planned; see the [issues][issues].
The API is designed so that multiple segmentations can share the underlying
segmenter state, which allows them to run in parallel.
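
To illustrate the state-sharing idea, here is a minimal, self-contained sketch that uses only the standard library. It does not use this crate's implementation: `ToySegmenter`, `ToySearch`, `toy_model` and `segment_in_parallel` are hypothetical names, the frequencies are made up, and the scoring is a simplified unigram-only dynamic program that rejects unknown words. The point is the shape of the API: the model is read through `&self`, and all mutation happens in a per-caller scratch value.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

/// Hypothetical read-only model: word -> relative frequency.
struct ToySegmenter {
    unigrams: HashMap<String, f64>,
    max_word_len: usize,
}

/// Hypothetical per-caller scratch space, reusable between calls.
#[derive(Default)]
struct ToySearch {
    /// best[i] = (log score of the best split of input[..i], start of its last word)
    best: Vec<(f64, usize)>,
}

impl ToySegmenter {
    fn new(words: &[(&str, f64)]) -> Self {
        let unigrams: HashMap<String, f64> =
            words.iter().map(|&(w, p)| (w.to_string(), p)).collect();
        let max_word_len = unigrams.keys().map(|w| w.len()).max().unwrap_or(0);
        Self { unigrams, max_word_len }
    }

    /// Takes `&self`, so one model can serve many callers; only `search` is mutated.
    /// Returns `None` for non-ASCII input or input that cannot be split into known words.
    fn segment<'a>(&self, input: &'a str, search: &mut ToySearch) -> Option<Vec<&'a str>> {
        if !input.is_ascii() {
            return None; // keeps byte-index slicing valid in this sketch
        }
        let n = input.len();
        search.best.clear();
        search.best.resize(n + 1, (f64::NEG_INFINITY, 0));
        search.best[0] = (0.0, 0);
        for end in 1..=n {
            for start in end.saturating_sub(self.max_word_len)..end {
                let prev = search.best[start].0;
                if prev == f64::NEG_INFINITY {
                    continue; // the prefix input[..start] has no valid split
                }
                if let Some(p) = self.unigrams.get(&input[start..end]) {
                    let score = prev + p.ln();
                    if score > search.best[end].0 {
                        search.best[end] = (score, start);
                    }
                }
            }
        }
        if search.best[n].0 == f64::NEG_INFINITY {
            return None;
        }
        // Walk the back-pointers from the end of the input to recover the words.
        let mut words = Vec::new();
        let mut end = n;
        while end > 0 {
            let start = search.best[end].1;
            words.push(&input[start..end]);
            end = start;
        }
        words.reverse();
        Some(words)
    }
}

/// Made-up frequencies, for illustration only.
fn toy_model() -> ToySegmenter {
    ToySegmenter::new(&[
        ("instant", 0.2), ("domain", 0.2), ("search", 0.2),
        ("in", 0.1), ("do", 0.1), ("main", 0.1),
    ])
}

/// One shared model, one scratch value per thread.
fn segment_in_parallel(
    segmenter: Arc<ToySegmenter>,
    inputs: Vec<&'static str>,
) -> Vec<Option<Vec<&'static str>>> {
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|input| {
            let segmenter = Arc::clone(&segmenter);
            thread::spawn(move || {
                let mut search = ToySearch::default();
                segmenter.segment(input, &mut search)
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling `segment_in_parallel(Arc::new(toy_model()), vec!["instantdomainsearch", "domainsearch"])` segments both inputs on separate threads against the same model. Keeping the scratch buffer outside the model is what lets the model stay immutable and be shared without locking.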

[python]: https://github.com/grantjenks/python-wordsegment
[chapter]: http://norvig.com/ngrams/
[book]: http://oreilly.com/catalog/9780596157111/
[corpus]: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
[distributed]: https://catalog.ldc.upenn.edu/LDC2006T13
[issues]: https://github.com/InstantDomainSearch/instant-segment/issues