HACKER Q&A
📣 Apocryphon

What’s the best way to archive web sites?


What’s the preferred format for archiving static web pages composed mostly of text and graphics? I had been exporting some as PDFs from Safari, but now I’m wondering whether I should save them in a more complete archive format via Chrome or a different browser.


👤 rcarmo Accepted Answer ✓
I personally use an instance of https://archivebox.io/ on my home NAS. It does all the saving and conversion for you (runs Chrome internally and saves to multiple formats) and it’s just a “docker run” away.

The one thing I wish it did much better is search. Right now it doesn’t do a good job of indexing full-text content, but I’m hopeful that will change.
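
For anyone wondering what the “docker run” part might look like, here is a minimal sketch that only builds and prints the commands rather than running them. It assumes the public `archivebox/archivebox` image with the archive stored in a volume mounted at `/data`; the helper name `archivebox_commands` and the host path are made up for illustration, so check the ArchiveBox docs before relying on any of it.

```python
import shlex
from typing import List


def archivebox_commands(data_dir: str, url: str) -> List[List[str]]:
    """Return the docker invocations to initialise an archive and add one URL.

    Assumption: the archivebox/archivebox image keeps its data under /data.
    `data_dir` is a hypothetical host directory chosen by the caller.
    """
    base = ["docker", "run", "-v", f"{data_dir}:/data", "-it", "archivebox/archivebox"]
    return [
        base + ["init"],      # create the archive layout inside data_dir
        base + ["add", url],  # snapshot a single URL into the archive
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for cmd in archivebox_commands("/srv/archivebox", "https://example.com"):
        print(shlex.join(cmd))
```

Printing instead of executing is deliberate: the first run pulls a fairly large image, so it is worth reading the commands before pasting them into a shell.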


👤 trilinearnz
This might not be what you want, but you can ask the Internet Archive’s Wayback Machine to save specific URLs on demand. More info here: https://help.archive.org/help/save-pages-in-the-wayback-mach...
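
A rough sketch of doing this from a script, assuming the Save Page Now endpoint at `https://web.archive.org/save/` followed by the target URL. The function names and the User-Agent string are hypothetical, and anonymous requests may be rate limited, so treat this as a starting point rather than a robust client.

```python
import urllib.request
from urllib.parse import quote

# Assumption: Save Page Now accepts the target URL appended to this prefix.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_page_now_url(target_url: str) -> str:
    """Build the Save Page Now URL for `target_url`."""
    # Leave URL punctuation alone; percent-encode spaces and other unsafe characters.
    return SAVE_ENDPOINT + quote(target_url, safe=":/?&=%")


def request_snapshot(target_url: str, timeout: float = 60.0) -> int:
    """Ask the Wayback Machine to capture `target_url`; return the HTTP status code."""
    req = urllib.request.Request(
        save_page_now_url(target_url),
        headers={"User-Agent": "archive-sketch/0.1"},  # hypothetical identifier
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Only prints the URL; call request_snapshot() to actually trigger a capture.
    print(save_page_now_url("https://example.com/"))
```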

👤 prachisingh123
There are several ways to archive a website. You can save a single page to your hard drive, rely on a CMS backup, or use free archiving tools such as HTTrack and the Wayback Machine. To capture a site as it changes, though, an automated archiving system that records every update is the best approach.

👤 phendrenad2
HTTrack was the gold standard for backing up sites a while ago, and I'm guessing that it's still a very good option.

👤 sp332
You can direct wget to spider your site and save all the requests and responses to a single Web ARChive (WARC) file. https://wiki.archiveteam.org/index.php/Wget
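
A minimal sketch of such an invocation, wrapped in Python so the flags are easy to tweak. It assumes a wget build with WARC support (the `--warc-file` and `--warc-cdx` options); the helper name `wget_warc_command`, the one-second wait, and the output name are illustrative choices, not recommendations from the wiki.

```python
import shlex
from typing import List


def wget_warc_command(site_url: str, warc_name: str, wait_seconds: int = 1) -> List[str]:
    """Build a wget command that mirrors `site_url` and records the crawl as a WARC."""
    return [
        "wget",
        "--mirror",                   # recursive crawl with timestamping
        "--page-requisites",          # also fetch images, CSS and scripts
        f"--wait={wait_seconds}",     # pause between requests to go easy on the server
        f"--warc-file={warc_name}",   # wget appends the .warc.gz extension itself
        "--warc-cdx",                 # also write a CDX index of the captured records
        site_url,
    ]


if __name__ == "__main__":
    # Print the command for review; pass the list to subprocess.run() to execute it.
    print(shlex.join(wget_warc_command("https://example.com/", "example-site")))
```

With these flags wget still writes the mirrored files to disk as usual; the WARC is produced alongside them and holds the raw requests and responses.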