| miso-belica/sumy |
3,669 |
|
96 |
14 |
15 days ago |
16 |
October 23, 2022 |
18 |
apache-2.0 |
Python |
| Module for automatic summarization of text documents and HTML pages. |
| adbar/trafilatura |
2,447 |
|
0 |
66 |
about 2 years ago |
39 |
November 29, 2023 |
66 |
gpl-3.0 |
Python |
| Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments |
| unidoc/unipdf |
2,231 |
|
0 |
45 |
about 2 years ago |
72 |
November 11, 2023 |
66 |
other |
Go |
| Golang PDF library for creating and processing PDF files (pure Go)
| chrismattmann/tika-python |
1,316 |
|
83 |
54 |
over 2 years ago |
35 |
January 02, 2023 |
4 |
apache-2.0 |
Python |
| Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community. |
| whitelok/image-text-localization-recognition |
928 |
|
0 |
0 |
over 2 years ago |
0 |
|
0 |
|
|
| A general list of resources for image text localization and recognition (a collection of papers, resources, and implementations for scene text localization and recognition)
| miso-belica/jusText |
811 |
|
30 |
12 |
about 1 year ago |
6 |
October 21, 2021 |
8 |
bsd-2-clause |
Python |
| Heuristic-based boilerplate removal tool
| unidoc/unidoc |
691 |
|
4 |
6 |
almost 7 years ago |
16 |
May 23, 2019 |
0 |
other |
Go |
| This repository has moved! https://github.com/unidoc/unipdf |
| MaLeLabTs/RegexGenerator |
656 |
|
0 |
0 |
about 7 years ago |
0 |
|
0 |
gpl-3.0 |
Java |
| This project contains the source code of a tool for generating regular expressions for text extraction: 1. automatically, 2. based only on examples of the desired behavior, 3. without any external hint about what the target regex should look like
| ICIJ/datashare |
519 |
|
0 |
0 |
about 2 years ago |
135 |
November 21, 2023 |
17 |
agpl-3.0 |
Java |
| A self-hosted search engine for documents. |
| ropensci/pdftools |
480 |
|
51 |
55 |
over 2 years ago |
30 |
September 25, 2023 |
52 |
other |
C++ |
| Text Extraction, Rendering and Converting of PDF Documents |