HTML Sanitizing

November 17th, 2009

A rant about parsing HTML with Regular Expressions has been making the rounds recently, and I figured I’d throw in my two cents. By ranting about something that is only sort of related! Specifically, about HTML sanitizers:

Every single¹ HTML sanitizer does the wrong thing².

Here’s what an HTML sanitizer does. Let’s say I have instructed it to not allow any HTML. It takes a string like this:

This is my <b>interesting</b> text!

And turns it into a string like this:

This is my interesting text!

It removed my HTML! What the crap!?

Now, from a security standpoint this is completely satisfactory. There are no longer any potentially dangerous tags in that bit of text. From a layout standpoint this is also completely satisfactory; there is no risk of HTML changing the styling of the text.

But from a usability standpoint this is confusing. What happened to the text I put in there? Now, in this case it isn’t too big of a deal; at least the meaning of my text is intact. But what if my text is instead this:

This is my <del>interesting</del> boring text!

This would get rendered as:

This is my ~~interesting~~ boring text!

But a sanitizer turns it into a string like this:

This is my interesting boring text!

Which is rendered as:

This is my interesting boring text!

That doesn’t make any sense! HTML sanitizers should not remove anything. I obviously went to the effort to type that HTML – I don’t want it to disappear. A sanitizer should only neutralize unwanted tags and attributes by escaping them, so they show up as literal text instead of being interpreted as markup.

Applying this system to the last example, you get this:

This is my &lt;del&gt;interesting&lt;/del&gt; boring text!

Which is rendered as:

This is my <del>interesting</del> boring text!

And while this won’t necessarily make sense to most people, it will make sense to the person who made that bit of text. And more importantly it won’t change the meaning of the text.

I want an HTML sanitizer that allows me to specify which tags and attributes are allowed, but escapes everything else.
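
For what it's worth, here is a rough sketch of what I mean, in Python, built on the standard library's `html.parser`. The names `EscapingSanitizer`, `sanitize`, `allowed_tags` and `allowed_attrs` are made up for this example. It is an illustration of the idea under those assumptions, not a vetted security tool: for instance, it does not check URL schemes in attributes like `href`, and it normalizes end tags to lowercase.

```python
from html import escape
from html.parser import HTMLParser


class EscapingSanitizer(HTMLParser):
    """Hypothetical sketch: keep allowlisted tags/attributes, escape the rest."""

    def __init__(self, allowed_tags, allowed_attrs=None):
        # convert_charrefs=True means text arrives with entities decoded,
        # so handle_data can re-escape it uniformly.
        super().__init__(convert_charrefs=True)
        self.allowed_tags = set(allowed_tags)
        self.allowed_attrs = allowed_attrs or {}  # e.g. {"a": {"href"}}
        self.out = []

    def _rebuild(self, tag, attrs):
        # Re-emit an allowed tag with only its allowed attributes.
        kept = self.allowed_attrs.get(tag, ())
        parts = [tag]
        for name, value in attrs:
            if name in kept:
                parts.append(f'{name}="{escape(value or "")}"')
        return "<" + " ".join(parts) + ">"

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed_tags:
            self.out.append(self._rebuild(tag, attrs))
        else:
            # Not allowed: show exactly what the user typed, as text.
            self.out.append(escape(self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Treat <br/> style tags like a start tag (no separate end tag).
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        text = f"</{tag}>"
        self.out.append(text if tag in self.allowed_tags else escape(text))

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def handle_comment(self, data):
        self.out.append(escape(f"<!--{data}-->"))

    def handle_decl(self, decl):
        self.out.append(escape(f"<!{decl}>"))

    def handle_pi(self, data):
        self.out.append(escape(f"<?{data}>"))


def sanitize(text, allowed_tags=(), allowed_attrs=None):
    parser = EscapingSanitizer(allowed_tags, allowed_attrs)
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


print(sanitize("This is my <del>interesting</del> boring text!"))
# This is my &lt;del&gt;interesting&lt;/del&gt; boring text!
```

With no tags allowed, this reproduces the escaped example from above. Passing something like `allowed_tags={"b"}` would let `<b>` through untouched, drop any attributes on it that are not listed in `allowed_attrs`, and still escape every other tag rather than deleting it.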

¹ At least all the ones I have used.
² From the point of view of sanitizing input by users from a form on a website.³
³ If you can’t tell, I just figured out how to add footnotes to my posts!