1

I'm trying to save a Reddit page for OFFLINE viewing as a single HTML file, EXACTLY as it's displayed in the browser and after having already manually expanded some comment threads. This issue is a subset of the general question of how one can save the entire web DOM in its current state while preserving the CSS effects and layout. Many posts across the Stack Exchange platform ask this general question.


Almost all answers are of one of the following forms:

  • Right click and select Save as... and then save as either Web Page, Complete (*.htm;*.html) or Web page, Single File (*.mhtml).

  • Open Chrome DevTools and copy the entire HTML (Copy outerHTML) from the Elements tab.

  • You'll never be able to save a file that looks exactly as the live website version due to many links being "relative" links, and many links to external scripts can be contained inside CSS and JS files.

  • Use a tool such as HTTrack. (As far as I know, however, HTTrack doesn't support saving everything in a single HTML file.)

  • Saving a webpage as a single HTML file exactly as it appears to the user during a live render is simply impossible for many websites.

  • Use a browser extension, such as "Single File" (the developer's GitHub page is here), "Save Page WE", or "WebScrapBook".

  • Try the "WebRecorder" Chrome extension.

Several of these answers do actually achieve some level of saving the webpage's layout as a single HTML file exactly as it appears when rendered live, but there is a HUGE downside: they do not save the HTML file in such a way that makes it possible for the user to view the page OFFLINE. The offline viewing part is essentially what I'm after, and is the crux of my issue.

For example, opening Chrome DevTools and saving the entire outerHTML from the Elements tab does actually allow the user to save the page exactly as it looks when rendered live, but as soon as the user tries opening the HTML file in offline mode, none of the external scripts are able to load, and thus the entire comment section of the Reddit page doesn't display at all. I did some manual inspection of the HTML file itself, and I found that the comments themselves are actually present in the HTML file, but they don't render when the user loads the file, since they depend on external scripts to dictate how they are displayed to the user.
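To see which parts of a saved file still depend on the network, one option is to list every external script, stylesheet, image, and iframe it references. The following is only a rough sketch using Python's standard library; the names `ExternalResourceFinder` and `find_external_resources` are made up for illustration, and the tag/attribute table is an assumption that covers common cases rather than everything a Reddit page may load.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalResourceFinder(HTMLParser):
    """Collect URLs that a saved page would still need to fetch from the network."""

    # Assumed mapping of tag name -> attribute that points at a fetched resource.
    URL_ATTRS = {"script": "src", "link": "href", "img": "src", "iframe": "src"}

    def __init__(self):
        super().__init__()
        self.external = []  # list of (tag, url) pairs, in document order

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                scheme = urlparse(value).scheme
                # Absolute http(s) URLs and protocol-relative URLs need the network;
                # relative paths and data: URLs do not.
                if scheme in ("http", "https") or value.startswith("//"):
                    self.external.append((tag, value))


def find_external_resources(html_text):
    """Return (tag, url) pairs for resources that will not load offline."""
    finder = ExternalResourceFinder()
    finder.feed(html_text)
    finder.close()
    return finder.external
```

Reading the saved file into a string and passing it to `find_external_resources` gives a quick list of what would break offline. It does not catch URLs referenced from inside CSS or JS files.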

A solution (almost...)

In my experience, the SingleFile Chrome extension does (almost) exactly the task that I'm after, and it does it best. It's able to save the page precisely as it looks to the user during a live render (even when viewed offline), and I've found that it's better than both the "Save Page WE" and the "WebScrapBook" extensions. SingleFile handles many sites flawlessly, but it fails miserably when attempting to save a Reddit page that has a huge comment thread. In such cases, the extension consumes too much memory and simply crashes the tab (an Out of Memory error occurs). The sad part is that the extension works well on Reddit posts that have a very small comment section, but most of the time when I want to save a Reddit post, it has a very large comment section, and thus the SingleFile extension can't handle it.

The SingleFile developer has a command-line variant of the tool on his GitHub page, but that simply launches a headless browser and downloads the requested URL. This approach is useless in my case since I want to save the Reddit page with the modifications that I've personally and manually made (i.e., with the desired comment threads manually expanded). Moreover, I've had the same Out of Memory issue with this approach.

Dirty workaround

I've found that a super dirty workaround to my issue is to simply save the page in PDF format, but I don't want a PDF format. I want an HTML format.

Any ideas on how to save a Reddit page for offline viewing, even in instances wherein the comment section is rather large?

2
  • What's the URL? Seeing it would really help to give specific advice.
    – JayCravens
    Commented Jul 7 at 3:58
  • @JayCravens, the URL can be any Reddit post that has a "large" comment section, really. By "large", it's usually one that the user can scroll all the way to the bottom of the page and click on "View more comments" several times. Here's an example URL: reddit.com/r/politics/comments/1dsupsh/… Commented Jul 8 at 2:59

1 Answer

0

They are using your typical "lazy loader".
So, you have to load it in order to save it.
Scroll and load, until you have no more to load. Don't scroll back up.

Then, you can:

  • Ctrl+A > Right-Click on the selection (on the blue highlight)
  • "View selection source". That'll take a while; go for a coffee.
  • Ctrl+A > Copy > Paste (in a notepad)

Save as my-saved-post.html.

Open with your browser.
How broken is the layout without loading all the external components?
Usually not too bad. You'll now have every post.

Clean up the HTML as much as you'd like. Now you have it in .html format.


[Image: web page complete]

You'll have everything but the loader content.


I was looking at that image and noticed it's a 2.2 MB .html file?! You might have the lazy loader's content. You just don't have any server-side functionality.

You should try running the page with Five-Server. Once you have it installed, rename data.html to index.html. Then, open a terminal in that directory and type: five-server.
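If installing Five-Server isn't an option, a rough stand-in is Python's built-in static file server. This is not Five-Server and has none of its live-reload features; it is only a minimal sketch, with `make_server` being a made-up helper name and port 8000 an arbitrary choice. It serves local files only and does not fetch any missing external resources.

```python
import functools
import http.server


def make_server(directory, port=8000):
    """Create (but do not start) a static file server rooted at `directory`.

    Binds to 127.0.0.1 only, so the saved page is not exposed to the network.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


def serve(directory, port=8000):
    """Serve `directory` until interrupted with Ctrl+C."""
    server = make_server(directory, port)
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass
    finally:
        server.server_close()
```

Calling `serve(".")` from the directory that holds index.html and then opening http://127.0.0.1:8000/ in the browser should show the saved page.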


I may have an idea for your dirty-workaround PDF file: pdftohtml, from Poppler. I think Ubuntu's repository has it. The link below is for the Fedora and FreeBSD versions. You can also get the source from Poppler, if preferred.

pdftohtml version 24.02.0
Copyright 2005-2024 The Poppler Developers - http://poppler.freedesktop.org

pdftohtml 'input.pdf' 'output.html' -s -nomerge -dataurls -noframes

It does a reasonable job. I tested it on a textual PDF file. Here is the output: [Image: pdf vs html]
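If several PDFs need converting, the same command can be scripted. The sketch below only wraps the exact pdftohtml invocation shown above; `build_pdftohtml_command` and `convert` are made-up helper names, and it assumes pdftohtml is installed and on the PATH.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, html_path):
    """Return the argument list for the pdftohtml call used in this answer."""
    return [
        "pdftohtml", pdf_path, html_path,
        "-s", "-nomerge", "-dataurls", "-noframes",
    ]


def convert(pdf_path, html_path):
    """Run pdftohtml, raising if the tool is missing or exits with an error."""
    if shutil.which("pdftohtml") is None:
        raise FileNotFoundError("pdftohtml was not found on PATH")
    subprocess.run(build_pdftohtml_command(pdf_path, html_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.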

4
  • I don't see any "View selection source" option in the right-click menu. Are you talking about the "Inspect" button, which simply opens up the Chrome DevTools to the Elements tab? Commented Jul 9 at 17:31
  • @CuriosityCalls No, you have to right click on the blue selected part of the text. Not just anywhere on the page after it's selected. You must right click on the actual highlighted part for the "View Selection Source" to display.
    – JayCravens
    Commented Jul 9 at 18:45
  • okay, I see you're on Firefox. I'm on Chrome. Regardless, your answer still does not do as intended; the file cannot be viewed properly offline. Have you tried your own method yourself? Download it, then clear your browser's cache/cookies and try loading the file... And the broken layout + HTML cleanup that you mention in your answer adds no value since that is precisely what I am trying to avoid. The end goal is to do what the SingleFile extension accomplishes WITHOUT running into an "Out of Memory" error. Commented Jul 10 at 20:01
  • @CuriosityCalls Yeah, Chrome isn't the greatest, in general. I'd try it with Firefox. I do this regularly. If the post is that ridiculously long, view the selection source in two or three chunks instead of selecting all. I updated the answer with a way to help with external dependencies when offline.
    – JayCravens
    Commented Jul 11 at 0:48
