| google/sentencepiece | 8,851 | | 120 | 787 | about 2 years ago | 34 | May 02, 2023 | 32 | apache-2.0 | C++ |
| Unsupervised text tokenizer for Neural Network-based text generation. |
| lancopku/pkuseg-python | 6,001 | | 4 | 8 | over 3 years ago | 22 | June 19, 2020 | 119 | mit | Python |
| pkuseg: a toolkit for multi-domain Chinese word segmentation |
| rsennrich/subword-nmt | 1,937 | | 18 | 18 | over 3 years ago | 8 | December 08, 2021 | 2 | mit | Python |
| Unsupervised Word Segmentation for Neural Machine Translation and Text Generation |
| PyThaiNLP/pythainlp | 902 | | 24 | 51 | about 2 years ago | 101 | November 26, 2023 | 35 | apache-2.0 | Python |
| Thai Natural Language Processing in Python. |
| messense/jieba-rs | 585 | | 5 | 15 | over 2 years ago | 40 | July 16, 2023 | 9 | mit | Rust |
| The Jieba Chinese Word Segmentation Implemented in Rust |
| cbaziotis/ekphrasis | 583 | | 7 | 0 | over 3 years ago | 54 | May 17, 2022 | 18 | mit | Python |
| Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and 330 million English tweets from Twitter). |
| vncorenlp/VnCoreNLP | 472 | | 0 | 0 | about 3 years ago | 0 | | 0 | other | Java |
| A Vietnamese natural language processing toolkit (NAACL 2018) |
| taishi-i/nagisa | 365 | | 1 | 7 | about 2 years ago | 22 | July 30, 2023 | 4 | mit | Python |
| A Japanese tokenizer based on recurrent neural networks |
| jacksonllee/pycantonese | 290 | | 0 | 0 | almost 3 years ago | 24 | December 28, 2021 | 5 | mit | Python |
| Cantonese Linguistics and NLP |
| grantjenks/python-wordsegment | 268 | | 0 | 0 | about 6 years ago | 0 | | 8 | other | Python |
| English word segmentation, written in pure Python and based on a trillion-word corpus. |