The main problem with your code is that it only handles simple Web 1.0 sites.
What about JavaScript execution? If you replay your capture, you have no idea what you will see on a typical Web 2.0 website.
The only way I know to capture a web page properly is to "execute" it on a browser.
Gildas, the guy behind SingleFile (https://github.com/gildas-lormeau/SingleFile), is well aware of that, and his approach really works every time.
Try it on a Facebook post, a tweet, ... It just works.
MHTML is pretty good for this already, btw (not to take away from this neat project though :)). It similarly stores assets base64-encoded and saves everything as a single file. It can be enabled in Blink-based browsers using a settings flag, and previously in Firefox using add-ons (also, in the past, natively in Opera and IE).
I think it would be way better to explain in the repository:
- how do you handle images?
- does it handle embedded videos?
- does it handle JS? to what extent?
- does it handle lazily loaded assets (i.e. images that only load when you scroll down, or JS that loads 3 seconds after the page has loaded)?
In general, how does this work? The current readme doesn't do a decent job of explaining what exactly the tool is. For all I can tell, it might just take a screenshot of the page, encode it as base64 into the HTML, and show that.
If you only want a portion of a webpage, I made a tool called SnipCSS for that:
The desktop version saves an HTML file, stylesheet and images/fonts locally, and it only contains the HTML of the snippet with the CSS rules that apply to the DOM subtree of the element you select.
I'm still working out bugs but it would be great if people try it out and let me know how it goes.
I really like this concept, and I've been using an npm package called inliner which does this too: https://www.npmjs.com/package/inliner
I'm glad there's more people taking a look at the use case, and I'd be interested to see a list of similar solutions.
If you combine this with Chrome's headless mode, you can prerender many pages that use JavaScript to perform the initial render, and then once you're done send it to one of these tools that inlines all the resources as data URLs.
$ /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome ./site/index.md.html --headless --dump-dom --virtual-time-budget=400
The result is that you get pages that load very fast and are a single HTML file with all resources embedded. Allowing the page to prerender before inlining also makes it easier to strip out all the JavaScript, in many cases, for pages that aren't highly interactive once rendered.

This is awesome. One question though: how does it handle the same resource (e.g. an image) appearing multiple times? Does it store multiple copies, potentially blowing up the file size? If not, how does it link to them in a single HTML file? And if it does, is there any way to get around that without using MHTML (or have you considered using MHTML in that case)?
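The duplication concern is easy to demonstrate: a data URI is just the resource's bytes inlined at the point of use, so referencing the same image twice embeds it twice. A toy Python sketch (the image bytes and page snippets here are made up for illustration, not anything monolith produces):

```python
import base64

# Pretend these are the bytes of an image fetched once from the network.
image_bytes = b"\x89PNG" + b"\x00" * 1000
data_uri = "data:image/png;base64," + base64.b64encode(image_bytes).decode()

# Inlining: every <img> that references the resource carries a full copy.
page_once = '<img src="{0}">'.format(data_uri)
page_twice = '<img src="{0}"><img src="{0}">'.format(data_uri)

# The two-reference page is roughly double the size; plain data URIs have
# no way to share bytes, which is what MHTML's cid: references avoid.
print(len(page_once), len(page_twice))
```

So without something like MHTML's Content-ID scheme (or post-gzip serving, where the duplicated base64 compresses away), the file size does grow with each repeated reference.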
Also, a side question about Rust: how do I get rid of absolute file paths in the executable to avoid information leakage? I feel like I partially figured this out at some point, but I forget.
I've been printing to PDF for decades now, and nothing comes close to the ease of use and versatility of two decades' worth of interesting web pages. I have pretty much every interesting article, including many from HN, from decades of this habit.
Need to find all articles relating to 'widget'?
$ ls -l ~/PDFArchive/ | grep -i widget
This has proven so valuable, time and again. There is a great joy in not having to maintain bookmarks, and in being able to copy the whole directory to other machines for processing/reference. And then there's the whole pdf->text situation, which truly has its thorns (some website content is buried in masses of ad noise), but also a huge advantage: there's a lot of data to be mined from 50,000 PDF files.

Therefore, I'd quite like to know: what does monolith have to offer over this method? I can imagine that it's useful to have all the scripting content packaged up and bundled into a single .html file, but does it still work/run? (That can be either a pro or a con, in my opinion.)
This would be a perfect fit for IPFS. I love the idea of having just one file in a permanent link.
I am using the "Save Page WE" Firefox extension for this. It's better at saving JS content, with less clutter than saving all the images and stuff.
Good, but it won't work with heavy JS pages that use Ajax to load their content.
The Firefox extension seems to do that:
Note that SingleFile can easily run on command line too, cf. https://github.com/gildas-lormeau/SingleFile/tree/master/cli.
Nice. I can see some automated uses for this. For ordinary browsing, I'm currently using a Firefox add-on called SingleFile, which works surprisingly well. It stuffs everything into (surprise, surprise) one huge single file - HTML with embedded data, so it's compatible everywhere.
With respect to the Unlicense, does anybody have any knowledge about how good it is in countries which don't allow you to intentionally pass things into the public domain (most countries that aren't the US)? How does it compare to CC0 in that respect?
I imagined that https://www.w3.org/TR/widgets/ would be the open container format for saving a Web app to a single file.
This is interesting - I think any of us who save things off the internet have made something like this (I usually save entire sites or large chunks, though - so I have a different toolset - still, I also do single pages, so I might try out this tool).
One thing I would propose adding - either behind a flag, or by default - is to have it parse the URL path and derive the filename from it. That way you can just run "monolith {url}" and not have to worry about it.
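Deriving the output name from the URL is a small amount of work; here's a rough Python sketch of the kind of mapping I mean (the function name and fallback rules are mine, not monolith's):

```python
from urllib.parse import urlparse

def filename_for(url):
    """Map a URL to a reasonable local HTML filename."""
    parts = urlparse(url)
    # Use the last path segment, falling back to the host for bare domains.
    segment = parts.path.rstrip("/").rsplit("/", 1)[-1] or parts.netloc
    # Replace characters that are awkward in filenames.
    safe = "".join(c if c.isalnum() or c in "-._" else "_" for c in segment)
    if not safe.endswith(".html"):
        safe += ".html"
    return safe

print(filename_for("https://example.com/blog/some-article"))  # some-article.html
print(filename_for("https://example.com/"))                   # example.com.html
```

A real implementation would also want to handle collisions, e.g. by appending a counter or a timestamp.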
I am also curious as to how it handles advertisements and google tracking and such; some way to strip out just those scripts (and elements) could be handy.
Ahh, to me it looks like it creates an amalgamation of the web page + its contents.
How does this work on neverending webpages/forever scroll? How will it behave if you need to authenticate before browsing the page?
Ah, I've been thinking about making something like this. You beat me to it. I've been using the SingleFile add-on until now. I'll definitely give this a try.
Super project! I've been pretty baffled by how difficult it is to save a webpage in a proper format. I've tried a PDF converter, the getPolaroid app, and of course Firefox's screenshot feature for the entire scroll. Will try this for saving purposes.
I am also interested in cloning/forking sites for modification purposes; I will send you feedback on the results from my consulting gigs.
This is pretty useful. It would be great to have the ability to convert the HTML page to a PDF as well.
This sounds great, but the first thing I thought was how this would be a perfect tool to make automated mass phishing scams.
If the output is realistic enough: take a massive list of sites, make a snapshot of each page, replace the POST login URLs with the phisher's, deploy the individual HTML files, and spread the links through email.
I wonder how does this project handle forms.
Styling breaks on this site: https://www.scientificamerican.com/article/the-hunt-is-on-fo...
Sweet idea! I would especially like to be able to capture videos and pictures too.
I suspect that for saving videos, a good approach would be some sort of proxy + headless browser combination, where the proxy is responsible for saving a copy of all the data the browser requests.
Thoughts?
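The bookkeeping for such a proxy is mostly about mapping each requested URL to a file on disk. A toy Python sketch of just that part (the proxy plumbing itself, e.g. a mitmproxy addon, is left out, and the names here are mine):

```python
import hashlib
import os
from urllib.parse import urlparse

def cache_path(root, url):
    """Place a captured response under root, keyed by host plus a URL hash."""
    parts = urlparse(url)
    # Keep the host readable; hash the rest so query strings etc. are safe.
    digest = hashlib.sha256(url.encode()).hexdigest()[:16]
    return os.path.join(root, parts.netloc, digest)

def save_response(root, url, body):
    """Write one captured response body to its cache location."""
    path = cache_path(root, url)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    return path
```

A proxy hook would call `save_response` for every completed response; the headless browser drives the video player, so the media segments pass through the proxy and get captured as a side effect.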
`cargo install` installs 237 packages for this?! I don't think that's acceptable.
Very cool. Have you considered incorporating an option for following links within the same domain to a certain depth? I remember using tools such as this in the past to save all the content from certain websites.
How is this different from https://en.m.wikipedia.org/wiki/Web_ARChive ?
Well, you could do that for a long time with MHTML, WARC, etc. downloaders, including those available in browsers via "Save Page As", though CSS imports aren't covered by older tools (are they by yours?). Anyway, congrats on completing this as a first-time Rust project, which certainly speaks to the quality of the Rust ecosystem. For using this approach as an offline browser, of course, the problem is that Ajax-heavy pages using JavaScript to load content won't work, including every React and Vue site created in the last five years (though you could make the point that those aren't worth your attention as a reader anyway).
Thank you for this. I’ve been looking for something that does this exact thing. I don’t like any of the other HTML archiving formats.
If the output were a tar file, couldn’t we also say it was saving web pages as a single file? Wouldn’t that also be easier?
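For comparison, the tar route really is only a few lines of Python, though the result needs unpacking before a browser can render it, which is arguably the whole point of inlining. A sketch (the page and asset here are made up):

```python
import io
import tarfile

# A toy page plus one asset, standing in for a real saved-page directory.
files = {
    "index.html": b'<html><img src="logo.png"></html>',
    "logo.png": b"\x89PNG...",
}

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

archive = buf.getvalue()  # one file, but a browser can't open it directly
```

So yes, a tarball is "a single file" too, just not one you can double-click and read.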
I noticed there is a `-j` argument to remove javascript. A `-i` argument for removing images would be great too.
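Stripping images outside the tool isn't hard either; here's a crude stdlib-only Python sketch of the idea (my own code, not monolith's; a real `-i` flag would also need to drop CSS background images, `srcset`, and image data URIs, and this parser ignores comments and doctypes):

```python
from html.parser import HTMLParser

class ImageStripper(HTMLParser):
    """Re-emit HTML verbatim, skipping <img> tags (void elements, no body)."""
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
    def handle_starttag(self, tag, attrs):
        if tag != "img":
            self.out.append(self.get_starttag_text())
    def handle_startendtag(self, tag, attrs):
        if tag != "img":
            self.out.append(self.get_starttag_text())
    def handle_endtag(self, tag):
        if tag != "img":
            self.out.append("</%s>" % tag)
    def handle_data(self, data):
        self.out.append(data)

def strip_images(html):
    parser = ImageStripper()
    parser.feed(html)
    return "".join(parser.out)

print(strip_images('<p>hi <img src="a.png"> there</p>'))
# <p>hi  there</p>
```

Still, having it built in as `-i` would be much nicer than post-processing.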
Does not compile; fails with some byzantine message about `let` in `const` functions being unstable.
Nice work! I am wondering if Puppeteer can also be used to accomplish the same thing.
I'm not so experienced but how does this compare to .webarchive?
Saving this for later
call me old fashioned, but I still use Ctrl+S
Very cool idea - thank you for this!
One question: how does it handle those cookie pop-ups, GDPR warnings, etc.?
One thing I always wonder when I see native software posted here:
How do you guys handle the security aspect of executing stuff like this on your machines?
Skimming the repo, it has about a thousand lines of code and a bunch of dependencies with hundreds of sub-dependencies. Do you read all that code and evaluate the reputation of all the dependencies?
Do you execute it in a sandboxed environment?
Do you just hope for the best like in the good old times of the C64?