| internetarchive/heritrix3 | 2,579 | | 0 | 2 | over 2 years ago | 9 | July 27, 2022 | 48 | other | Java | Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. |
| iipc/awesome-web-archiving | 1,669 | | 0 | 0 | about 2 years ago | 0 | | 3 | cc0-1.0 | | An Awesome List for getting started with web archiving |
| ArchiveTeam/grab-site | 1,121 | | 0 | 0 | over 2 years ago | 0 | | 92 | other | Python | The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns |
| simon987/awesome-datahoarding | 892 | | 0 | 0 | over 2 years ago | 0 | | 4 | | | List of data-hoarding related tools |
| internetarchive/brozzler | 613 | | 2 | 0 | about 2 years ago | 23 | January 02, 2020 | 40 | apache-2.0 | Python | brozzler - distributed browser-based web crawler |
| ArchiveTeam/ArchiveBot | 328 | | 0 | 0 | over 2 years ago | 0 | | 169 | mit | Python | ArchiveBot, an IRC bot for archiving websites |
| sparrow629/Tumblr_Crawler | 258 | | 0 | 0 | over 7 years ago | 0 | | 2 | gpl-3.0 | Python | This is a Multi-thread crawler for Tumblr. |
| icy/google-group-crawler | 213 | | 0 | 0 | about 4 years ago | 0 | | 6 | | Shell | [Deprecated] Get (almost) original messages from google group archives. Your data is yours. |
| commoncrawl/cc-crawl-statistics | 97 | | 0 | 0 | over 2 years ago | 0 | | 0 | apache-2.0 | Python | Statistics of Common Crawl monthly archives mined from URL index files |
| ArchiveTeam/wget-lua | 72 | | 0 | 0 | over 2 years ago | 0 | | 10 | gpl-3.0 | C | Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication. |