Dirkjan Ochtman
14b00e6417
Refactor to make test segmenter more accessible
2021-06-02 22:43:42 +02:00
Dirkjan Ochtman
f329176ed9
Remove old data files
2021-06-02 22:43:42 +02:00
Dirkjan Ochtman
73d891114c
Update data README
2021-06-02 22:43:42 +02:00
Dirkjan Ochtman
c0e2ddbf46
Update test cases to deal with new data
2021-06-02 22:43:42 +02:00
Dirkjan Ochtman
e89606235b
Add new data files
2021-06-02 22:43:42 +02:00
Dirkjan Ochtman
59045f645d
Add Rust code to process ngram data
2021-06-02 22:43:42 +02:00
Dirkjan Ochtman
05360e6769
Add script to download word list input data
2021-06-02 15:14:59 +02:00
Dirkjan Ochtman
3b1e90bd35
Improve test framework to show all failures
2021-06-02 10:41:17 +02:00
Dirkjan Ochtman
e4e773c896
py: bump version to 0.1.3
2021-05-28 15:21:30 +02:00
Dirkjan Ochtman
89c232e3af
py: update crate metadata
2021-05-28 15:20:28 +02:00
Dirkjan Ochtman
7214ffc126
Remove note about planned further optimizations
2021-05-28 14:44:44 +02:00
Dirkjan Ochtman
f081d4b171
py: bump version to 0.1.1
2021-05-28 14:34:04 +02:00
Dirkjan Ochtman
9edd1bc8b7
Bump version number to 0.8.2
2021-05-28 14:31:59 +02:00
Dirkjan Ochtman
85f4f94b53
Use more efficient segmentation strategy
Based on the triangular matrix approach as explained here:
https://towardsdatascience.com/fast-word-segmentation-for-noisy-text-2c2c41f9e8da
Use iteration rather than recursion, segment the input forwards
rather than backwards, and use a `Vec`-based memoization strategy
instead of relying on a `HashMap` of words. This version is about
4.8x faster, about 100 lines shorter, and should use much less memory.
2021-05-28 14:30:27 +02:00
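The strategy described in the commit above (forward iteration with a `Vec` indexed by position instead of recursion with a `HashMap`) can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: the `score` and `segment` functions, the unigram-only scoring, the unknown-word penalty, and the `max_len` parameter are all assumptions made for the example, and the input is assumed to be ASCII so that byte offsets fall on character boundaries.

```rust
use std::collections::HashMap;

/// Hypothetical unigram scorer: log10 probability of `word`.
/// Unknown words get a penalty that grows with their length (an assumption).
fn score(counts: &HashMap<&str, f64>, total: f64, word: &str) -> f64 {
    match counts.get(word) {
        Some(count) => (count / total).log10(),
        None => (10.0 / (total * 10f64.powi(word.len() as i32))).log10(),
    }
}

/// Segment ASCII `input` iteratively, front to back.
/// `best[end]` memoizes the best score for `input[..end]` together with
/// the start offset of the last word in that segmentation.
fn segment(
    counts: &HashMap<&str, f64>,
    total: f64,
    input: &str,
    max_len: usize,
) -> Vec<String> {
    let n = input.len();
    let mut best = vec![(f64::NEG_INFINITY, 0usize); n + 1];
    best[0] = (0.0, 0);

    // Fill the table forwards: every prefix ending at `end` is extended
    // from an already-solved shorter prefix ending at `start`.
    for end in 1..=n {
        for start in end.saturating_sub(max_len)..end {
            let candidate = best[start].0 + score(counts, total, &input[start..end]);
            if candidate > best[end].0 {
                best[end] = (candidate, start);
            }
        }
    }

    // Recover the words by following the stored start offsets backwards.
    let mut words = Vec::new();
    let mut end = n;
    while end > 0 {
        let start = best[end].1;
        words.push(input[start..end].to_string());
        end = start;
    }
    words.reverse();
    words
}
```

Because the table is a flat `Vec` keyed by position, each lookup is an index operation rather than a hash of a substring, which is one plausible reason for the reported speed and memory improvements.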
dependabot-preview[bot]
541644a329
Upgrade to GitHub-native Dependabot
2021-04-30 09:51:09 +02:00
Dirkjan Ochtman
0ebae2923c
Add license files (fixes #15)
2021-04-29 15:34:02 +02:00
Nick Rempel
9bbb633f1d
Flesh out README (#14)
2021-04-29 11:12:42 +02:00
Dirkjan Ochtman
eca12c572f
Bump version number to 0.8.1
2021-04-22 15:08:23 +02:00
Dirkjan Ochtman
bba1de7543
Simplify loop
2021-04-22 15:07:54 +02:00
Dirkjan Ochtman
c21b66ab83
Rename sentence_score() to score_sentence()
2021-04-22 15:04:48 +02:00
Dirkjan Ochtman
62f5b79d6d
py: add Segmenter::sentence_score() method
2021-04-22 15:04:06 +02:00
Dirkjan Ochtman
85035a9b34
Add Segmenter::sentence_score() method
2021-04-22 14:58:06 +02:00
Dirkjan Ochtman
bd014dcc5c
Move logarithm conversion into score()
2021-04-22 14:54:54 +02:00
Beau Hartshorne
85038d1f6f
Add files via upload
2021-04-20 11:06:47 -07:00
Dirkjan Ochtman
507e8da5ef
Bump version to 0.8.0
2021-04-01 11:04:42 +02:00
Dirkjan Ochtman
754b0d5692
Revert version number for testing
2021-04-01 10:07:22 +02:00
Dirkjan Ochtman
2d942bbfc9
Box up the BitVec array
The `Search::best` field will take about 8000 bytes. In some of our usage
with rayon, this appeared to cause stack overflows. Boxing it up makes the
code slower by about 1-2%, but should hopefully avoid stack overflows.
2021-04-01 10:03:48 +02:00
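The effect of boxing a large inline array, as described in the commit above, can be illustrated with a small sketch. The types here are hypothetical stand-ins, not the crate's real `Search` definition: the element type and the row count are assumptions chosen only so that the array occupies 8000 bytes.

```rust
/// Assumed row count, picked so that the array below is 8000 bytes.
const ROWS: usize = 250;

/// Hypothetical layout with the array stored inline: the whole array
/// lives wherever the struct lives, including on the stack.
#[allow(dead_code)]
struct InlineSearch {
    best: [[u64; 4]; ROWS],
}

/// Hypothetical layout with the array boxed: the struct only holds a
/// pointer, and the 8000 bytes live on the heap.
#[allow(dead_code)]
struct BoxedSearch {
    best: Box<[[u64; 4]; ROWS]>,
}

impl BoxedSearch {
    fn new() -> Self {
        Self {
            best: Box::new([[0; 4]; ROWS]),
        }
    }
}

/// Returns the in-place sizes of the inline and boxed variants, in bytes.
fn sizes() -> (usize, usize) {
    (
        std::mem::size_of::<InlineSearch>(),
        std::mem::size_of::<BoxedSearch>(),
    )
}
```

With this layout the boxed struct is one pointer wide, so many of them can sit on worker-thread stacks without each consuming 8000 bytes. The cost is one heap allocation and an extra pointer indirection on access, consistent with the small slowdown the commit message mentions.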
Dirkjan Ochtman
a4fe0e4039
Bump version to 0.7.2
Now that the core crate is in a directory, we no longer needlessly publish
data files on crates.io.
2021-03-24 13:16:49 +01:00
Dirkjan Ochtman
55fb3c664f
Tweak CI to avoid testing bindings for now
2021-03-24 11:57:29 +01:00
Dirkjan Ochtman
987220c586
py: add some comments
2021-03-24 11:57:29 +01:00
Dirkjan Ochtman
5d8f1b2fb0
py: add load() and dump() methods
2021-03-24 11:57:29 +01:00
Dirkjan Ochtman
8fe1b2ab46
Optimize test data reader
2021-03-24 11:57:29 +01:00
Dirkjan Ochtman
fd774ad465
py: initial version of Python bindings
2021-03-24 11:57:29 +01:00
Dirkjan Ochtman
f6061044fc
Add helper method for Python bindings
2021-03-24 11:57:29 +01:00
Dirkjan Ochtman
0ce148db1e
Consistent ordering of impl blocks
2021-03-24 11:57:29 +01:00
Dirkjan Ochtman
11a7e88b95
Start a workspace
2021-03-24 11:57:29 +01:00
Dirkjan Ochtman
a146790e17
Bump version number to 0.7.1
2021-03-24 11:53:20 +01:00
Dirkjan Ochtman
8f7959eeed
Use separate total value for bigrams
2021-02-11 12:08:19 +01:00
Dirkjan Ochtman
9addd3810b
Use more compact cache key
2021-02-11 12:05:39 +01:00
Dirkjan Ochtman
83aa46593a
Use bit vectors to improve performance
2021-02-11 11:56:33 +01:00
Dirkjan Ochtman
9dd1cf089d
Simplify the API some more
2021-02-10 13:23:37 +01:00
Dirkjan Ochtman
4338ff2c0c
Fix formatting
2021-02-10 13:10:38 +01:00
Dirkjan Ochtman
cd06fbecc8
Guarantee known size of the output iterator
2021-02-10 13:07:05 +01:00
Dirkjan Ochtman
d190aa5240
Simplify API by moving result data into Search
2021-02-10 13:03:06 +01:00
Dirkjan Ochtman
9735e64ee4
Bump version to 0.5.1
2021-02-10 12:51:36 +01:00
Dirkjan Ochtman
95804e9672
Derive Clone for Search
2021-02-10 12:51:21 +01:00
Dirkjan Ochtman
da26dedfc8
Apply clippy suggestion
2021-02-10 11:53:15 +01:00
Dirkjan Ochtman
a862ec97a5
Version bump to 0.5.0
2021-02-10 11:49:09 +01:00
Dirkjan Ochtman
13b29d183e
Take an explicit search parameter
2021-02-10 11:48:24 +01:00
Dirkjan Ochtman
be0f8c0ed7
Don't normalize input strings implicitly
2021-02-08 15:53:24 +01:00