Handle lazy loading images in url2epub, once and for all

2023-05-07

url2epub is a Go library I wrote to generate readable HTML (by stripping out unnecessary html elements), and also epub files, from URLs. I also wrote a Telegram bot using the library that can send the generated epub file into my reMarkable cloud account directly. This is my own alternative to Pocket/Instapaper/ReadItLater to send links to my reMarkable 2 to read later.

The lazy loading images problem

Since an epub file is basically a zip archive containing the html and resource files (mainly images), it’s essential to figure out which image files need to be packed into the epub file. Also, since the reMarkable 2 only has a monochrome e-ink display, there’s no reason to keep the original colors, so url2epub can also convert the images to grayscale before packing (fun fact: I learnt how huge the visual differences are between 8-bit and 16-bit grayscale colors on reMarkable 2).

Here comes the problem that makes this hard: quite a few websites lazy load their images.

Lazy loading images are images that only have a placeholder in the src attribute of the img node, and then use javascript to load the real image later and replace the placeholder, to make the page seem to load faster.

Here is an example from a blog post regarding GarminOS’s security vulnerabilities I saw on Hacker News the other day:

<img
  alt="Compromising Garmin&#8217;s Sport Watches: A Deep Dive into GarminOS and its MonkeyC Virtual Machine"
  data-src="https://www.anvilsecure.com/wp-content/uploads/2023/04/Compromising-Garmin-Sport-Watches.jpg"
  class="lazyload"
  src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="
>

As you can see, in this img node the src attribute is just an embedded minimal placeholder gif, while the real image url is in the data-src attribute (a custom data-* attribute that browsers do not act on by themselves) and is loaded by javascript later.

Here is another example from The Verge:

<img
  alt=""
  src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
  decoding="async"
  data-nimg="fill"
  style="position:absolute;top:0;left:0;bottom:0;right:0;box-sizing:border-box;padding:0;border:none;margin:auto;display:block;width:0;height:0;min-width:100%;max-width:100%;min-height:100%;max-height:100%;object-fit:cover"
/>

Here the real url is not in the img node at all.

The hack

I first noticed this issue on a local news website, which used what seems to be a somewhat popular third-party javascript library called nitro to do the image lazy loading. It stores the real image url in a nitro-lazy-src attribute on the img node. At the time I just added a hack to also try the nitro-lazy-src attribute if the src attribute of the img node is not a real image URL.

That worked for a while, until The Verge did an overhaul of their UI and switched to a new system that also does lazy loading, with the real image url not even in the img node (see the example above). I didn’t have the time to dig into it and just read fewer articles from The Verge 🤷.

Then came the blog post regarding GarminOS, which does not use nitro, but the real image url does exist inside the img node (as the data-src attribute), so I just added data-src to the same list as nitro-lazy-src, to be tried after src.
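A minimal sketch of what such a fallback list could look like. This is not the actual url2epub code: pickImageSrc, fallbackAttrs and isRealImageURL are made-up names, the attributes of an img node are simplified to a map, and "real image URL" is assumed to mean an absolute http(s) URL.

```go
package main

import "net/url"

// fallbackAttrs lists the attributes to try, in order.
// Hypothetical: every new lazy loading library needs a new entry here.
var fallbackAttrs = []string{"src", "nitro-lazy-src", "data-src"}

// isRealImageURL reports whether raw is an absolute http(s) URL.
// Simplification for this sketch: data: placeholders are rejected,
// but so are relative URLs.
func isRealImageURL(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	return (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

// pickImageSrc returns the first attribute value from fallbackAttrs that
// looks like a real image URL, or "" if none does.
func pickImageSrc(attrs map[string]string) string {
	for _, name := range fallbackAttrs {
		if v := attrs[name]; isRealImageURL(v) {
			return v
		}
	}
	return ""
}
```

With the GarminOS style img node this would return the data-src value, while for The Verge style img node it returns an empty string, because none of the attributes in the list carries the real url.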

But this obviously won’t scale.

The real fix

When digging into the blog post regarding GarminOS further, I noticed something: the lazy loading img node is always immediately followed by a noscript node, which contains an img node with the real image url in its src attribute:

<img
  alt="Compromising Garmin&#8217;s Sport Watches: A Deep Dive into GarminOS and its MonkeyC Virtual Machine"
  data-src="https://www.anvilsecure.com/wp-content/uploads/2023/04/Compromising-Garmin-Sport-Watches.jpg"
  class="lazyload"
  src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="
>
<noscript>
  <img
    src="https://www.anvilsecure.com/wp-content/uploads/2023/04/Compromising-Garmin-Sport-Watches.jpg"
    alt="Compromising Garmin&#8217;s Sport Watches: A Deep Dive into GarminOS and its MonkeyC Virtual Machine"
  >
</noscript>

And looking into The Verge articles, they have the same thing:

<img
  alt=""
  src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
  decoding="async"
  data-nimg="fill"
  style="position:absolute;top:0;left:0;bottom:0;right:0;box-sizing:border-box;padding:0;border:none;margin:auto;display:block;width:0;height:0;min-width:100%;max-width:100%;min-height:100%;max-height:100%;object-fit:cover"
/>
<noscript>
  <img
    alt=""
    loading="lazy"
    decoding="async"
    data-nimg="fill"
    style="position:absolute;top:0;left:0;bottom:0;right:0;box-sizing:border-box;padding:0;border:none;margin:auto;display:block;width:0;height:0;min-width:100%;max-width:100%;min-height:100%;max-height:100%;object-fit:cover"
    sizes="(max-width: 768px) 100vw, 300px"
    srcSet="https://cdn0.vox-cdn.com/hermano/verge/product/image/10104/236616_Dyson_Zone_AKrales_0556.jpg 376w, https://cdn0.vox-cdn.com/hermano/verge/product/image/10104/236616_Dyson_Zone_AKrales_0556.jpg 384w, https://cdn0.vox-cdn.com/hermano/verge/product/image/10104/236616_Dyson_Zone_AKrales_0556.jpg 415w, https://cdn0.vox-cdn.com/hermano/verge/product/image/10104/236616_Dyson_Zone_AKrales_0556.jpg 480w, https://cdn0.vox-cdn.com/hermano/verge/product/image/10104/236616_Dyson_Zone_AKrales_0556.jpg 540w, https://cdn0.vox-cdn.com/hermano/verge/product/image/10104/236616_Dyson_Zone_AKrales_0556.jpg 640w, https://cdn0.vox-cdn.com/hermano/verge/product/image/10104/236616_Dyson_Zone_AKrales_0556.jpg 750w, https://cdn0.vox-cdn.com/hermano/verge/product/image/10104/236616_Dyson_Zone_AKrales_0556.jpg 828w, https://cdn0.vox-cdn.com/hermano/verge/product/image/10104/236616_Dyson_Zone_AKrales_0556.jpg 1080w, https://cdn0.vox-cdn.com/hermano/verge/product/image/10104/236616_Dyson_Zone_AKrales_0556.jpg 1200w, https://cdn0.vox-cdn.com/hermano/verge/product/image/10104/236616_Dyson_Zone_AKrales_0556.jpg 1440w, https://cdn0.vox-cdn.com/hermano/verge/product/image/10104/236616_Dyson_Zone_AKrales_0556.jpg 1920w, https://cdn0.vox-cdn.com/hermano/verge/product/image/10104/236616_Dyson_Zone_AKrales_0556.jpg 2048w, https://cdn0.vox-cdn.com/hermano/verge/product/image/10104/236616_Dyson_Zone_AKrales_0556.jpg 2400w" src="https://cdn0.vox-cdn.com/hermano/verge/product/image/10104/236616_Dyson_Zone_AKrales_0556.jpg"
  />
</noscript>

This actually makes a lot of sense. There are still accessibility and other reasons (converting to epub, for example 😄) for people to use browsers/readers with no/limited javascript support, so having the noscript fallback to the good ol’ days of img nodes without lazy loading is the reasonable thing to do.

And as a result, instead of using a hack to add every known “real” src attribute to a list to try, url2epub can just find the img node inside the noscript node. Since url2epub automatically drops img nodes without a valid src, I don’t even need any special handling for those lazy loading img nodes, as they will be dropped automatically!

The only minor problem is that Go’s html parser library treats the content of a noscript node as text instead of html nodes. This is what the HTML parsing spec prescribes for a parser with scripting enabled, which is the library’s default. So I need to parse that text as html again to find the img nodes.
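To illustrate the idea of only taking img nodes from inside noscript nodes, here is a standard-library-only sketch. It is not url2epub's code, and it swaps in a different parser: it assumes the lenient mode of encoding/xml is good enough for the page (real-world html often is not). That decoder has no special noscript behavior, so a single pass is enough here, whereas url2epub uses Go’s html parser library and parses the noscript text a second time. noscriptImageSrcs and isRealImageURL are made-up names, and only absolute http(s) URLs are accepted.

```go
package main

import (
	"encoding/xml"
	"io"
	"net/url"
	"strings"
)

// isRealImageURL reports whether raw is an absolute http(s) URL.
// Simplification for this sketch: relative URLs are rejected too.
func isRealImageURL(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	return (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

// noscriptImageSrcs returns the src of every img element nested inside a
// noscript element. img elements outside noscript are ignored here, which
// is where the lazy loading placeholders live.
func noscriptImageSrcs(doc string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	// Lenient settings documented by encoding/xml for html-like input.
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	var srcs []string
	depth := 0 // > 0 while inside a noscript element
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return srcs, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			switch strings.ToLower(t.Name.Local) {
			case "noscript":
				depth++
			case "img":
				if depth == 0 {
					continue // not inside noscript, skip
				}
				for _, a := range t.Attr {
					if strings.EqualFold(a.Name.Local, "src") && isRealImageURL(a.Value) {
						srcs = append(srcs, a.Value)
					}
				}
			}
		case xml.EndElement:
			if strings.EqualFold(t.Name.Local, "noscript") && depth > 0 {
				depth--
			}
		}
	}
}
```

A full version would also keep ordinary img nodes that already have a valid src. The sketch leaves that out to focus on the noscript part.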

The result is this commit to add img inside noscript support. With it, I was also able to totally drop the old hack of “alternative srcs”.

#English #go #url2epub #tech