
Advanced HTML sanitizing using custom scrubbers

Published on Mar 7, 2021

Sven Pachnit

You have probably used Rails' sanitize method before. A common use case I have seen is simply restricting tags and attributes to a "safe" whitelist.

sanitize(content,
  tags: %w(strong strike em a img),
  attributes: %w(src type href target)
)

This can go a long way, but we wanted to allow iframes in our Markdown-powered posts and comments so that certain content (e.g. YouTube) can be embedded. You might also want to allow images (but only over https, or proxied), or autolink only certain protocols and/or domains.

Custom scrubbers

An easy way to get powerful scrubbing capabilities is to create a custom scrubber class and pass it to the sanitize method.

sanitize(content, scrubber: EntryScrubber.new)

Let's take a look at the scrubber class. We will subclass the default PermitScrubber, which already handles the aforementioned tag/attribute whitelist. We just have to define our whitelist again: we could pass the lists in as options, or offer higher-level options like allow_embed and build the lists inside the class. If you don't define any tags (i.e. @tags is nil), Loofah's default whitelist will be used; the same goes for @attributes.

class EntryScrubber < Rails::Html::PermitScrubber
  def initialize opts = {}
    super() # explicit parentheses so our custom `opts' argument is not forwarded to the parent
    self.tags = %w(strong strike em a img)
    self.attributes = %w(src href target type)

    if opts[:allow_embed]
      self.tags << "iframe"
      self.attributes += %w(allowfullscreen frameborder)
    end
  end
end

Scrubbers iterate over all nodes in the document using Nokogiri (refer to the Nokogiri docs for how to work with node objects). For our needs we can skip text nodes. That is also the default behavior of PermitScrubber, so you can omit this method; I mention it because you might want to skip more or fewer nodes.

class EntryScrubber < Rails::Html::PermitScrubber
  # [..]
  def skip_node? node
    node.text?
  end
end

The meat: sweet scrubbing bacon

Additionally, we can override certain methods for all of our scrubbing needs. I will focus on keep_node? / allowed_node?, but if you are into attribute scrubbing, also take a look at scrub_attribute? and scrub_attributes (there's also something for CSS declarations in style attributes).

It's simple: you get a node and return true to keep it or false to strip it. Stripping removes only the element itself; its subtree is preserved.

I personally prefer to think in terms of keeping nodes. The documentation suggests overriding allowed_node? rather than keep_node?, but since we call super anyway it doesn't really matter.

class EntryScrubber < Rails::Html::PermitScrubber
  # [..]

  EMBED_WHITELIST = %w(
    youtube.com
    player.vimeo.com
    player.twitch.tv
  )

  def keep_node? node
    # if super returns false we already want to scrub based on tag whitelist
    return false unless super

    # let's restrict link protocols (only allow http/https/mailto)
    if href = node.attributes["href"]&.value
      return false unless href =~ /\A(https?:\/\/|mailto:)/i
    end

    # check sources (img/iframe)
    if src = node.attributes["src"]&.value
      # only allow https (embedding http wouldn't fly on a secure site anyway)
      return false unless src =~ /\Ahttps:\/\//i

      # additionally whitelist domains for iframes
      if node.name == "iframe"
        begin
          uri = URI.parse(src)

          return false unless EMBED_WHITELIST.include?(uri.host.to_s.downcase)
        rescue URI::InvalidURIError
          # invalid URI => scrub
          return false
        end
      end
    end

    # seems good, don't scrub
    true
  end
end
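
One thing to watch with an exact-match list: uri.host is the full hostname, so an embed served from www.youtube.com does not match an entry of youtube.com. Below is a minimal sketch of a more forgiving check that uses only Ruby's standard URI library. The helper name embed_host_allowed?, the EMBED_HOSTS constant, and the decision to accept subdomains are my assumptions, not part of the scrubber API.

```ruby
require "uri"

EMBED_HOSTS = %w(youtube.com player.vimeo.com player.twitch.tv).freeze

# Hypothetical helper: true if the URL is https and its host equals a listed
# domain or is a subdomain of one.
def embed_host_allowed?(src, hosts = EMBED_HOSTS)
  uri = URI.parse(src)
  return false unless uri.is_a?(URI::HTTPS)

  host = uri.host.to_s.downcase
  hosts.any? { |allowed| host == allowed || host.end_with?(".#{allowed}") }
rescue URI::InvalidURIError
  # invalid URI => treat as not allowed
  false
end

p embed_host_allowed?("https://www.youtube.com/embed/abc")      # => true
p embed_host_allowed?("https://youtube.com.evil.example/embed") # => false
p embed_host_allowed?("https://exa mple.com/")                  # => false
```

Matching on a leading dot before the listed domain is what keeps lookalike hosts such as notyoutube.com out. If accepting every subdomain is too broad for your site, listing the exact embed hosts is the stricter alternative.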

You could even go as far as allowing certain script tags (seriously, Twitter, a script embed can't be the ultimate solution).

Using scrubbers to modify the document

Since scrubbing requires write access to the tree anyway, we can also use a scrubber to modify the document. As an example, let's rewrite image sources to proxy them through imageproxy.

class ImageproxyScrubber < Rails::Html::PermitScrubber
  def keep_node? node
    if node.name == "img" && src = node.attributes["src"]&.value
      node.attributes["src"].value = "https://imageproxy.local/0x0/#{src}"
    end

    true # keep everything as we just rewrite and scrub somewhere else
  end

  # PermitScrubber passes the attribute name here, not the node
  def scrub_attribute? name
    false # keep everything as we just rewrite and scrub somewhere else
  end
end

h = ApplicationController.helpers
h.sanitize(
  h.image_tag("https://i.imgur.com/yed5Zfk.gif"),
  scrubber: ImageproxyScrubber.new
)

# => "<img src=\"https://imageproxy.local/0x0/https://i.imgur.com/yed5Zfk.gif\">"
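
One caveat: running this scrubber twice over the same content would prefix the proxy URL twice. Below is a small sketch of a guard against that, assuming the same https://imageproxy.local/0x0/ base as above. PROXY_BASE and proxied_src are hypothetical names.

```ruby
PROXY_BASE = "https://imageproxy.local/0x0/"

# Hypothetical helper: prefix the proxy base unless the URL is already proxied.
def proxied_src(src, base = PROXY_BASE)
  src.start_with?(base) ? src : "#{base}#{src}"
end

once  = proxied_src("https://i.imgur.com/yed5Zfk.gif")
twice = proxied_src(once)

puts once          # the proxied URL
puts twice == once # a second pass leaves it unchanged
```

Inside keep_node? the assignment would then read node["src"] = proxied_src(src), which makes the rewrite safe to apply to content that was already sanitized.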