165 Commits

Author SHA1 Message Date
Beau Hartshorne e2f6f5c4a5 Update README.md 2021-06-05 14:30:22 -07:00
Dirkjan Ochtman bc59c6cf6f Refactor to make test segmenter more accessible 2021-06-03 16:09:24 +02:00
Dirkjan Ochtman 65b85d9806 Remove old data files 2021-06-03 16:09:24 +02:00
Dirkjan Ochtman 99ddbf7366 Update data README 2021-06-03 16:09:24 +02:00
Dirkjan Ochtman 3c52201fa0 Update test cases to deal with new data 2021-06-03 16:09:24 +02:00
Dirkjan Ochtman fee2adb995 Add new data files 2021-06-03 16:09:24 +02:00
Dirkjan Ochtman fcf24c7543 Add Rust code to process ngram data 2021-06-03 16:09:24 +02:00
Dirkjan Ochtman cc95d39063 Add script to download word list input data 2021-06-03 16:09:24 +02:00
Dirkjan Ochtman 57221b1dd5 Improve test framework to show all failures 2021-06-03 16:09:24 +02:00
Dirkjan Ochtman e4e773c896 py: bump version to 0.1.3 2021-05-28 15:21:30 +02:00
Dirkjan Ochtman 89c232e3af py: update crate metadata 2021-05-28 15:20:28 +02:00
Dirkjan Ochtman 7214ffc126 Remove note about planned further optimizations 2021-05-28 14:44:44 +02:00
Dirkjan Ochtman f081d4b171 py: bump version to 0.1.1 2021-05-28 14:34:04 +02:00
Dirkjan Ochtman 9edd1bc8b7 Bump version number to 0.8.2 2021-05-28 14:31:59 +02:00
Dirkjan Ochtman 85f4f94b53 Use more efficient segmentation strategy
Based on the triangular matrix approach as explained here:

https://towardsdatascience.com/fast-word-segmentation-for-noisy-text-2c2c41f9e8da

Use iteration rather than recursion, segment the input forwards
rather than backwards, and use a `Vec`-based memoization strategy
instead of relying on a `HashMap` of words. This version is about
4.8x faster, 100 lines shorter, and should use much less memory.
2021-05-28 14:30:27 +02:00
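
The strategy described in commit 85f4f94b53 can be illustrated with a minimal sketch. This is not the crate's actual code: the `segment` function, the `MAX_WORD_LEN` cap, the unigram `scores` table, and the unknown-word penalty are all assumptions made for illustration. It shows only the general shape: an iterative forward pass over end positions, with a `Vec` indexed by position serving as the memo table.

```rust
use std::collections::HashMap;

/// Hypothetical cap on word length (an assumption, not taken from the log).
const MAX_WORD_LEN: usize = 24;

/// Split `input` into the highest-scoring sequence of words.
///
/// `scores` maps a known word to a log-probability. Unknown words get an
/// assumed penalty of -10.0 per byte. Input must be ASCII so that byte
/// offsets are valid slice boundaries.
pub fn segment<'a>(input: &'a str, scores: &HashMap<&str, f64>) -> Vec<&'a str> {
    assert!(input.is_ascii(), "this sketch only handles ASCII input");
    let n = input.len();

    // Memo table indexed by end position:
    // best[end] = (best score for input[..end], start of the last word).
    let mut best: Vec<(f64, usize)> = vec![(f64::NEG_INFINITY, 0); n + 1];
    best[0] = (0.0, 0);

    // Forward, iterative pass: every prefix is solved before it is reused.
    for end in 1..=n {
        let first = end.saturating_sub(MAX_WORD_LEN);
        for start in first..end {
            let word = &input[start..end];
            let word_score = match scores.get(word) {
                Some(score) => *score,
                None => -10.0 * word.len() as f64,
            };
            let total = best[start].0 + word_score;
            if total > best[end].0 {
                best[end] = (total, start);
            }
        }
    }

    // Follow the stored start positions back from the end to recover words.
    let mut words = Vec::new();
    let mut end = n;
    while end > 0 {
        let start = best[end].1;
        words.push(&input[start..end]);
        end = start;
    }
    words.reverse();
    words
}
```

Because each entry of `best` depends only on entries at smaller indices, no recursion or hashing of partial results is needed; the memo is a flat vector with one slot per input position.
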
dependabot-preview[bot] 541644a329 Upgrade to GitHub-native Dependabot 2021-04-30 09:51:09 +02:00
Dirkjan Ochtman 0ebae2923c Add license files (fixes #15) 2021-04-29 15:34:02 +02:00
Nick Rempel 9bbb633f1d Flesh out README (#14) 2021-04-29 11:12:42 +02:00
Dirkjan Ochtman eca12c572f Bump version number to 0.8.1 2021-04-22 15:08:23 +02:00
Dirkjan Ochtman bba1de7543 Simplify loop 2021-04-22 15:07:54 +02:00
Dirkjan Ochtman c21b66ab83 Rename sentence_score() to score_sentence() 2021-04-22 15:04:48 +02:00
Dirkjan Ochtman 62f5b79d6d py: add Segmenter::sentence_score() method 2021-04-22 15:04:06 +02:00
Dirkjan Ochtman 85035a9b34 Add Segmenter::sentence_score() method 2021-04-22 14:58:06 +02:00
Dirkjan Ochtman bd014dcc5c Move logarithm conversion into score() 2021-04-22 14:54:54 +02:00
Beau Hartshorne 85038d1f6f Add files via upload 2021-04-20 11:06:47 -07:00
Dirkjan Ochtman 507e8da5ef Bump version to 0.8.0 2021-04-01 11:04:42 +02:00
Dirkjan Ochtman 754b0d5692 Revert version number for testing 2021-04-01 10:07:22 +02:00
Dirkjan Ochtman 2d942bbfc9 Box up the BitVec array
The `Search::best` field will take about 8000 bytes. In some of our usage
with rayon, this appeared to cause stack overflows. Boxing it up makes the
code slower by about 1-2%, but should hopefully avoid stack overflows.
2021-04-01 10:03:48 +02:00
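
The trade-off in commit 2d942bbfc9 can be sketched as follows. The types here are hypothetical stand-ins, not the crate's real `Search` definition: a plain `[u64; 1000]` array is assumed in place of the BitVec array to get a field of roughly 8000 bytes. Boxing moves that data to the heap, so the struct itself shrinks to a single pointer wherever it is stored or passed by value.

```rust
/// Stand-in with the large field stored inline (8000 bytes in the struct).
pub struct InlineSearch {
    pub best: [u64; 1000],
}

/// Stand-in with the large field boxed (one pointer in the struct).
pub struct BoxedSearch {
    pub best: Box<[u64; 1000]>,
}

impl BoxedSearch {
    /// Allocate the array on the heap, zero-initialised.
    pub fn new() -> Self {
        BoxedSearch {
            best: Box::new([0; 1000]),
        }
    }
}

/// Sizes in bytes of the inline and boxed variants, in that order.
pub fn struct_sizes() -> (usize, usize) {
    (
        std::mem::size_of::<InlineSearch>(),
        std::mem::size_of::<BoxedSearch>(),
    )
}
```

The extra pointer indirection on each access is a plausible source of the small slowdown the commit message mentions, in exchange for a much smaller stack footprint per worker thread.
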
Dirkjan Ochtman a4fe0e4039 Bump version to 0.7.2
Now that the core crate is in a directory, we no longer needlessly publish
data files on crates.io.
2021-03-24 13:16:49 +01:00
Dirkjan Ochtman 55fb3c664f Tweak CI to avoid testing bindings for now 2021-03-24 11:57:29 +01:00
Dirkjan Ochtman 987220c586 py: add some comments 2021-03-24 11:57:29 +01:00
Dirkjan Ochtman 5d8f1b2fb0 py: add load() and dump() methods 2021-03-24 11:57:29 +01:00
Dirkjan Ochtman 8fe1b2ab46 Optimize test data reader 2021-03-24 11:57:29 +01:00
Dirkjan Ochtman fd774ad465 py: initial version of Python bindings 2021-03-24 11:57:29 +01:00
Dirkjan Ochtman f6061044fc Add helper method for Python bindings 2021-03-24 11:57:29 +01:00
Dirkjan Ochtman 0ce148db1e Consistent ordering of impl blocks 2021-03-24 11:57:29 +01:00
Dirkjan Ochtman 11a7e88b95 Start a workspace 2021-03-24 11:57:29 +01:00
Dirkjan Ochtman a146790e17 Bump version number to 0.7.1 2021-03-24 11:53:20 +01:00
Dirkjan Ochtman 8f7959eeed Use separate total value for bigrams 2021-02-11 12:08:19 +01:00
Dirkjan Ochtman 9addd3810b Use more compact cache key 2021-02-11 12:05:39 +01:00
Dirkjan Ochtman 83aa46593a Use bit vectors to improve performance 2021-02-11 11:56:33 +01:00
Dirkjan Ochtman 9dd1cf089d Simplify the API some more 2021-02-10 13:23:37 +01:00
Dirkjan Ochtman 4338ff2c0c Fix formatting 2021-02-10 13:10:38 +01:00
Dirkjan Ochtman cd06fbecc8 Guarantee known size of the output iterator 2021-02-10 13:07:05 +01:00
Dirkjan Ochtman d190aa5240 Simplify API by moving result data into Search 2021-02-10 13:03:06 +01:00
Dirkjan Ochtman 9735e64ee4 Bump version to 0.5.1 2021-02-10 12:51:36 +01:00
Dirkjan Ochtman 95804e9672 Derive Clone for Search 2021-02-10 12:51:21 +01:00
Dirkjan Ochtman da26dedfc8 Apply clippy suggestion 2021-02-10 11:53:15 +01:00
Dirkjan Ochtman a862ec97a5 Version bump to 0.5.0 2021-02-10 11:49:09 +01:00
Dirkjan Ochtman 13b29d183e Take an explicit search parameter 2021-02-10 11:48:24 +01:00