| laurilehmijoki/s3_website | 2,259 | 606 | 0 | about 3 years ago | 109 | October 11, 2017 | 76 | other | Scala | Manage an S3 website: sync, deliver via CloudFront, benefit from advanced S3 website features. |
| apache/tika | 2,007 | 1,687 | 570 | about 2 years ago | 66 | October 17, 2023 | 49 | apache-2.0 | Java | The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). |
| chrismattmann/tika-python | 1,316 | 83 | 54 | over 2 years ago | 35 | January 02, 2023 | 4 | apache-2.0 | Python | Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community. |
| dadoonet/fscrawler | 1,279 | 0 | 1 | about 2 years ago | 5 | January 10, 2022 | 145 | apache-2.0 | Java | Elasticsearch File System Crawler (FS Crawler) |
| pemistahl/lingua | 622 | 0 | 3 | over 2 years ago | 17 | August 02, 2022 | 15 | apache-2.0 | Kotlin | The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike |
| ICIJ/datashare | 519 | 0 | 0 | about 2 years ago | 135 | November 21, 2023 | 17 | agpl-3.0 | Java | A self-hosted search engine for documents. |
| USCDataScience/sparkler | 401 | 0 | 0 | about 3 years ago | 0 | | 55 | apache-2.0 | Java | Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. |
| pcbje/gransk | 237 | 0 | 0 | over 9 years ago | 0 | | 3 | apache-2.0 | Python | Document processing for investigations |
| ICIJ/extract | 229 | 0 | 1 | about 2 years ago | 58 | November 13, 2023 | 10 | mit | Java | A cross-platform command line tool for parallelised content extraction and analysis. |
| michaelklishin/pantomime | 171 | 27 | 0 | about 7 years ago | 27 | January 19, 2018 | 3 | | JavaScript | A tiny Clojure library that deals with MIME types (Internet media types) |