Show HN: CLI tool for saving web pages as a single file

flatroze | 640 points

One thing I always wonder when I see native software posted here:

How do you guys handle the security aspect of executing stuff like this on your machines?

Skimming the repo it has about a thousand lines of code and a bunch of dependencies with hundreds of sub-dependencies. Do you read all that code and evaluate the reputation of all dependencies?

Do you execute it in a sandboxed environment?

Do you just hope for the best like in the good old times of the C64?
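(For reference, the kind of sandboxing I have in mind is a throwaway container; a rough sketch only, where the image choice, the flags, and monolith writing its HTML to stdout are all assumptions:)

    docker run --rm -v "$PWD:/out" rust:latest \
      sh -c 'cargo install monolith && monolith "https://example.com" > /out/example.html'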

FreeHugs | 5 years ago

The main problem with your code is that you only handle simple Web 1.0 sites.

What about JavaScript execution? If you replay your capture, you have no idea what you will see on a typical Web 2.0 website.

The only way I know to capture a web page properly is to "execute" it in a browser.

Gildas, the guy behind SingleFile (https://github.com/gildas-lormeau/SingleFile), is well aware of that, and his approach really works every time.

Try it on a Facebook post, a tweet, ... It just works.

mikaelmorvan | 5 years ago

MHTML is already pretty good for this, btw (not to take away from this neat project though :)). It similarly stores assets as base64'd data URIs and saves the page as a single file. It can be enabled in Blink-based browsers via a settings flag, and previously in Firefox via add-ons (and, in the past, natively in Opera and IE).
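To illustrate, a hypothetical sketch of what inlining an asset as a base64 data URI looks like (the file name and the encoded output are made up):

    base64 logo.png | tr -d '\n'    # -> iVBORw0KGgoAAAANSUhEUg... (the image bytes, base64-encoded)
    # so  <img src="logo.png">  becomes  <img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUg...">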

Springtime | 5 years ago

I think it would be way better to explain in the repository:

- how do you handle images?

- does it handle embedded videos?

- does it handle JS? to what extent?

- does it handle lazily loaded assets (e.g. images that load only when you scroll down, or JS that loads a few seconds after the page has loaded)?

In general, how does this work? The current readme doesn't do a decent job of explaining what exactly the tool is. For all I can tell, it might just take a screenshot of the page, encode it as base64 into the HTML, and display that.

alpb | 5 years ago

If you only want a portion of a webpage, I made a tool for that called SnipCSS:

https://www.snipcss.com

The desktop version saves an HTML file, a stylesheet, and images/fonts locally; the output contains only the HTML of the snippet, with the CSS rules that apply to the DOM subtree of the element you select.

I'm still working out bugs but it would be great if people try it out and let me know how it goes.

mrieck | 5 years ago

I really like this concept, and I've been using an npm package called inliner which does this too: https://www.npmjs.com/package/inliner

I'm glad there are more people taking a look at this use case, and I'd be interested to see a list of similar solutions.

If you combine this with Chrome's headless mode, you can prerender many pages that use JavaScript to perform the initial render, and then once you're done send it to one of these tools that inlines all the resources as data URLs.

  /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome ./site/index.md.html --headless --dump-dom --virtual-time-budget=400

The result is that you get pages that load very fast and are a single HTML file with all resources embedded. Allowing the page to prerender before inlining also makes it easier, in many cases, to strip all the JavaScript from pages that aren't highly interactive once rendered.
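A rough sketch of that full pipeline, assuming `chrome` stands in for your Chrome binary and that the inliner CLI accepts a local HTML file (check its README for the exact invocation):

    # 1) let headless Chrome execute the JS and dump the rendered DOM
    chrome --headless --dump-dom --virtual-time-budget=4000 "https://example.com" > prerendered.html
    # 2) inline all referenced resources as data URLs into one file
    npx inliner prerendered.html > bundled.html
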
jordwalke | 5 years ago

This is awesome. One question though: how does it handle the same resource (e.g. an image) appearing multiple times? Does it store multiple copies, potentially blowing up the file size? If not, how does it link to them within a single HTML file? And if it does store copies, is there any way to get around that without using MHTML (or have you considered using MHTML in that case)?

Also, side-question about Rust: how do I get rid of absolute file paths in the executable to avoid information leakage? I feel like I partially figured this out at some point, but I forget.
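(I think it was something along the lines of rustc's `--remap-path-prefix`, which rewrites the absolute paths that end up baked into panic messages and debug info; a hedged sketch, with illustrative mappings:)

    RUSTFLAGS="--remap-path-prefix=$HOME=~ --remap-path-prefix=$PWD=." cargo build --release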

mehrdadn | 5 years ago

I've been printing to PDF for two decades now, and nothing comes close to the ease of use and versatility of that archive: two decades' worth of interesting web pages. I have pretty much every interesting article, including many from HN, saved through this habit.

Need to find all articles relating to 'widget'?

    $ ls -l ~/PDFArchive/ | grep -i widget

This has proven valuable time and again. There is a great joy in not having to maintain bookmarks, and in being able to copy the whole directory to other machines for processing/reference. And then there's the whole PDF-to-text situation, which truly has its thorns (some website content is buried in masses of ad noise), but also a huge advantage: there's a lot of data to be mined from 50,000 PDF files.
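For the mining side, a rough sketch of the sort of thing that works over the archive, assuming poppler's pdftotext is installed:

    for f in ~/PDFArchive/*.pdf; do
      pdftotext "$f" - 2>/dev/null | grep -qi widget && echo "$f"
    done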

Therefore, I'd quite like to know: what does monolith offer over this method? I can imagine that it's useful to have all the scripting content packaged up and bundled into a single .html file, but does it still work/run? (That could be either a pro or a con, in my opinion.)

fit2rule | 5 years ago

This would be a perfect fit for IPFS. I love the idea of having just one file in a permanent link.

leshokunin | 5 years ago

I am using "Save Page WE" Firefox extension for this. Better at saving JS content and less clutter than saving all the images and stuff.

js8 | 5 years ago

Good, but it won't work with heavy JS pages that use Ajax to load every single piece of content.

This Firefox extension seems to handle that:

https://addons.mozilla.org/fr/firefox/addon/single-file/

sametmax | 5 years ago

Note that SingleFile can easily be run from the command line too, cf. https://github.com/gildas-lormeau/SingleFile/tree/master/cli.

gildas | 5 years ago

Nice. I can see some automated uses for this. For ordinary browsing, I'm currently using a Firefox add-on called SingleFile, which works surprisingly well. It stuffs everything into (surprise, surprise) one huge single file: HTML with embedded data, so it's compatible everywhere.

interfixus | 5 years ago

With respect to the Unlicense, does anybody know how well it holds up in countries that don't allow you to intentionally place works into the public domain (most countries other than the US)? How does it compare to CC0 in that respect?

mikekchar | 5 years ago

I imagined that https://www.w3.org/TR/widgets/ would be the open container format for saving a Web app to a single file.

hendry | 5 years ago

This is interesting - I think any of us who save things off the internet have made something like this. (I usually save entire sites or large chunks, so I have a different toolset; still, I also save single pages, so I might try out this tool.)

One thing I would propose adding, either behind a flag or by default: have it parse the URL path and derive the output filename from it, so that you can just run "monolith {url}" and not have to worry about it.
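Roughly along these lines (a sketch only; it assumes monolith writes the HTML to stdout, and the name sanitisation is illustrative):

    url="https://example.com/blog/some-article/"
    name="$(basename "${url%/}" | sed 's/[^A-Za-z0-9._-]/_/g')"
    monolith "$url" > "${name:-index}.html"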

I am also curious as to how it handles advertisements and google tracking and such; some way to strip out just those scripts (and elements) could be handy.

cr0sh | 5 years ago

Ahh, to me it looks like it creates an amalgamation of the web page plus its contents.

How does this work on never-ending webpages/infinite scroll? How will it behave if you need to authenticate before browsing the page?

makach | 5 years ago

Ah, I've been thinking about making something like this. You beat me to it. I've been using the SingleFile add-on until now. I'll definitely give this a try.

jplayer01 | 5 years ago

Super project! I've been pretty baffled by how difficult it is to save a webpage in a proper format. I've tried PDF converters, the getPolaroid app, and of course Firefox's screenshot feature for capturing the full scroll. I'll try this for saving purposes.

I am also interested in cloning/forking sites for modification purposes; I'll give you feedback on the results from my consulting gigs.

lucasverra | 5 years ago

This is pretty useful. It would be great to also have the option of converting the HTML page to a PDF.
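In the meantime, headless Chrome can handle that step; a sketch, where `chrome` stands in for your Chrome binary and the saved page is a local file:

    chrome --headless --print-to-pdf=page.pdf page.html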

sankalp210691 | 5 years ago

This sounds great, but the first thing I thought of was how this would be a perfect tool for automated mass phishing scams.

If the output is realistic: take a massive list of sites, make a snapshot of each page, replace the login POST URLs with the phisher's own, deploy these individual HTML files, and spread the links through email.

I wonder how this project handles forms.

sergioisidoro | 5 years ago

Sweet idea! I would especially like to be able to capture videos and pictures too.

I suspect that for saving videos, a good approach would be some sort of proxy + headless browser combination, where the proxy is responsible for saving a copy of all the data the browser requests.
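Roughly what I have in mind, as a sketch (assuming mitmproxy is installed; the flags are illustrative):

    # mitmdump records every response the browser fetches (video segments included) into a flow file
    mitmdump -w capture.flows &
    chromium --proxy-server="http://127.0.0.1:8080" "https://example.com/video-page"
    # for HTTPS traffic, the browser also needs to trust mitmproxy's CA certificate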

Thoughts?

fouc | 5 years ago

`cargo install` installs 237 packages for this?! I don't think that's acceptable.

personjerry | 5 years ago

Very cool. Have you considered incorporating an option for following links within the same domain to a certain depth? I remember using tools such as this in the past to save all the content from certain websites.
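For reference, a sketch of how existing mirroring tools express that kind of depth limit (wget shown purely as an illustration, not as a monolith feature):

    wget --recursive --level=2 --page-requisites --convert-links --domains=example.com "https://example.com/"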

ajxs | 5 years ago

Well, you could do that for a long time with MHTML, WARC, etc. downloaders, including those available in browsers via "Save Page As", though CSS imports aren't covered by older tools (are they by yours?).

Anyway, congrats on completing this as a first-time Rust project, which certainly speaks to the quality of the Rust ecosystem.

For using this approach as an offline browser, of course, the problem is that Ajax-heavy pages that use JavaScript to load content won't work, including every React and Vue site created in the last five years (but you could make the point that those aren't worth your attention as a reader anyway).

tannhaeuser | 5 years ago

Thank you for this. I’ve been looking for something that does this exact thing. I don’t like any of the other HTML archiving formats.

dtjohnnymonkey | 5 years ago

If the output were a tar file, couldn’t we also say it was saving web pages as a single file? Wouldn’t that also be easier?

dfee | 5 years ago

I noticed there is a `-j` argument to remove JavaScript. A `-i` argument for removing images would be great too.

ahub | 5 years ago

It doesn't compile for me; it fails with some byzantine message about `let` in const functions being unstable.

ur-whale | 5 years ago

Nice work! I am wondering if Puppeteer can also be used to accomplish the same thing.

sbmthakur | 5 years ago

I'm not so experienced but how does this compare to .webarchive?

dvcrn | 5 years ago

Saving this for later

Exuma | 5 years ago

Call me old-fashioned, but I still use Ctrl+S.

nessunodoro | 5 years ago

Very cool idea - thank you for this!

One question: how does it handle those cookie pop-ups, GDPR warnings, etc.?

VvR-Ox | 5 years ago