You have probably used Rails' sanitize method before. A common use case is to restrict tags and attributes to a "safe" whitelist:
sanitize(content,
  tags: %w(strong strike em a img),
  attributes: %w(src type href target)
)
This can go a long way, but we wanted to allow iframes in our Markdown-powered posts and comments so that certain content (e.g. YouTube videos) can be embedded. You might also want to allow images (but make sure they are HTTPS-only, or proxy them), or autolink only certain protocols and/or domains.
Custom scrubbers
An easy way to get powerful scrubbing capabilities is to create a custom scrubber class and pass it to the sanitize method.
sanitize(content, scrubber: EntryScrubber.new)
Let's take a look at the scrubber class. We will subclass the default PermitScrubber, which already handles the aforementioned tag/attribute whitelist. We just have to define our whitelist again: we could pass the lists in as options, or we can offer higher-level options like allow_embed and build the lists inside the class. If you don't define any tags (i.e. @tags is nil), Loofah's default whitelist will be used; the same goes for @attributes.
class EntryScrubber < Rails::Html::PermitScrubber
  def initialize(opts = {})
    super() # explicit parentheses so our custom `opts` is not forwarded to super
    self.tags = %w(strong strike em a img)
    self.attributes = %w(src href target type)
    if opts[:allow_embed]
      self.tags += %w(iframe)
      self.attributes += %w(allowfullscreen frameborder)
    end
  end
end
Scrubbers iterate over all nodes in the document using Nokogiri (refer to the Nokogiri docs for more information on how to work with node elements). For our needs we can skip text nodes, which is also the default behavior of PermitScrubber, so you can omit this method. I mention it only because you might want to skip more or fewer node types.
class EntryScrubber < Rails::Html::PermitScrubber
  # [..]
  def skip_node?(node)
    node.text?
  end
end
The meat: sweet scrubbing bacon
Additionally, we can override certain methods for all of our scrubbing needs. I will focus on keep_node? / allowed_node?, but if you are into attribute scrubbing, also take a look at scrub_attribute? and scrub_attributes (there's also something for CSS declarations in style attributes).
It's simple: you get a node and return true or false to either keep or strip it (when a node is stripped, its subtree is preserved). I personally prefer to think in terms of keeping nodes, but the documentation suggests overriding allowed_node? instead of keep_node?. Since we are calling super anyway, it doesn't really matter.
class EntryScrubber < Rails::Html::PermitScrubber
  # [..]
  # hosts are compared exactly, so list every host you want to embed from
  EMBED_WHITELIST = %w(
    youtube.com
    player.vimeo.com
    player.twitch.tv
  )

  def keep_node?(node)
    # if super returns false, the tag whitelist already wants this node scrubbed
    return false unless super
    # let's restrict link protocols (only allow http/https/mailto)
    if href = node.attributes["href"]&.value
      return false unless href =~ %r{\A(https?://|mailto:)}i
    end
    # check sources (img/iframe)
    if src = node.attributes["src"]&.value
      # only allow https (embedding http wouldn't fly on a secure site anyway)
      return false unless src =~ %r{\Ahttps://}i
      # additionally whitelist domains for iframes
      if node.name == "iframe"
        begin
          uri = URI.parse(src)
          return false unless EMBED_WHITELIST.include?(uri.host.to_s.downcase)
        rescue URI::InvalidURIError
          # invalid URI => scrub
          return false
        end
      end
    end
    # seems good, don't scrub
    true
  end
end
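If you want to poke at the URL rules outside of Rails, here is a minimal sketch of the same checks as plain Ruby methods, using only URI from the standard library. The names EMBED_HOSTS, allowed_link? and allowed_embed_src? are made up for this illustration. Unlike the exact-match check in the scrubber, this variant also accepts subdomains of whitelisted hosts, which is an assumption about what you may want (e.g. for www.youtube.com).

```ruby
require "uri"

# Hypothetical names for illustration; not part of Rails or Loofah.
EMBED_HOSTS = %w(youtube.com player.vimeo.com player.twitch.tv).freeze

# Links: only absolute http/https URLs and mailto: are acceptable.
def allowed_link?(href)
  href.to_s.match?(%r{\A(?:https?://|mailto:)}i)
end

# Embeds: https only, and the host must be a whitelisted host
# or a subdomain of one (assumption: subdomains are trusted too).
def allowed_embed_src?(src)
  uri = URI.parse(src.to_s)
  return false unless uri.scheme&.downcase == "https"

  host = uri.host.to_s.downcase
  EMBED_HOSTS.any? { |allowed| host == allowed || host.end_with?(".#{allowed}") }
rescue URI::InvalidURIError
  # unparsable URI => treat as not allowed
  false
end
```

Matching on ".#{allowed}" rather than a bare suffix keeps lookalike hosts such as notyoutube.com out, while still letting www.youtube.com through.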
You could even go as far as allowing certain script tags (seriously, Twitter, a script embed can't be the ultimate solution).
Using scrubbers to modify the document
Since scrubbing requires write access to the tree, we can also use a scrubber to modify the document. As an example, let's rewrite image sources to proxy them through imageproxy:
class ImageproxyScrubber < Rails::Html::PermitScrubber
  def keep_node?(node)
    if node.name == "img" && (src = node.attributes["src"]&.value)
      node.attributes["src"].value = "https://imageproxy.local/0x0/#{src}"
    end
    true # keep everything as we just rewrite and scrub somewhere else
  end

  # receives the attribute name, not the node
  def scrub_attribute?(name)
    false # keep everything as we just rewrite and scrub somewhere else
  end
end

h = ApplicationController.helpers
h.sanitize(
  h.image_tag("https://i.imgur.com/yed5Zfk.gif"),
  scrubber: ImageproxyScrubber.new
)
# => "<img src=\"https://imageproxy.local/0x0/https://i.imgur.com/yed5Zfk.gif\">"
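The rewrite itself is just string building, so it can be tried without Rails. This is a minimal sketch, assuming the same https://imageproxy.local/0x0/ prefix as in the example; PROXY_PREFIX and proxied_src are hypothetical names. The guard against proxying an already proxied URL is an added assumption, on the idea that content might pass through the scrubber more than once.

```ruby
# Hypothetical helper; the prefix is the example proxy endpoint from this post.
PROXY_PREFIX = "https://imageproxy.local/0x0/".freeze

# Returns the proxied URL for an image source.
# Sources that already point at the proxy are returned unchanged,
# so running the rewrite twice does not nest proxy URLs.
def proxied_src(src)
  return src if src.start_with?(PROXY_PREFIX)

  "#{PROXY_PREFIX}#{src}"
end

puts proxied_src("https://i.imgur.com/yed5Zfk.gif")
# prints https://imageproxy.local/0x0/https://i.imgur.com/yed5Zfk.gif
```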