| google/sentencepiece | 8,851 | | 120 | 787 | about 2 years ago | 34 | May 02, 2023 | 32 | apache-2.0 | C++ |
| Unsupervised text tokenizer for Neural Network-based text generation. |
| huggingface/tokenizers | 8,056 | | 0 | 362 | about 2 years ago | 85 | November 14, 2023 | 233 | apache-2.0 | Rust |
| 💥 Fast State-of-the-Art Tokenizers optimized for Research and Production |
| Morizeyao/GPT2-Chinese | 7,249 | | 0 | 0 | over 2 years ago | 0 | | 105 | mit | Python |
| Chinese version of GPT2 training code, using BERT tokenizer. |
| sebastianbergmann/php-token-stream | 6,457 | | 104,288 | 188 | over 4 years ago | 36 | November 30, 2020 | 0 | other | PHP |
| Wrapper around PHP's tokenizer extension. |
| theseer/tokenizer | 5,084 | | 42,659 | 11 | over 2 years ago | 8 | November 20, 2023 | 0 | other | PHP |
| A small library for converting tokenized PHP source code into XML (and potentially other formats) |
| sindresorhus/file-type | 3,366 | | 64,893 | 1,894 | over 2 years ago | 141 | November 11, 2023 | 16 | mit | JavaScript |
| Detect the file type of a Buffer/Uint8Array/ArrayBuffer |
| teamtnt/tntsearch | 3,004 | | 113 | 27 | over 2 years ago | 63 | July 19, 2023 | 78 | mit | PHP |
| A fully featured full text search engine written in PHP |
| Chevrotain/chevrotain | 2,350 | | 706 | 272 | about 2 years ago | 170 | August 14, 2023 | 50 | apache-2.0 | TypeScript |
| Parser Building Toolkit for JavaScript |
| roshan-research/hazm | 1,381 | | 17 | 13 | 4 months ago | 20 | October 01, 2023 | 12 | mit | Python |
| Persian NLP Toolkit |
| natasha/natasha | 1,085 | | 3 | 9 | over 2 years ago | 19 | July 24, 2023 | 24 | mit | Python |
| Solves basic Russian NLP tasks, API for lower level Natasha projects |