HTML Sanitizing
A rant about parsing HTML with Regular Expressions has been making the rounds recently, and I figured I’d throw in my two cents. By ranting about something that is only sort of related! Specifically, about HTML sanitizers:
Every single HTML sanitizer does the wrong thing.
Here’s what an HTML sanitizer does. Let’s say I have instructed it to not allow any HTML. It takes a string like this:
And turns it into a string like this:
It removed my HTML! What the crap!?
Now, from a security standpoint this is completely satisfactory. There are no longer any potentially dangerous tags in that bit of text. From a layout standpoint this is also completely satisfactory; there is no risk of HTML changing the styling of the text.
But from a usability standpoint this is confusing. What happened to the text I put in there? Now, in this case it isn’t too big of a deal; at least the meaning of my text is intact. But what if my text is instead this:
This is my <del>interesting</del> boring text!
This would get rendered as:
This is my interesting boring text! (with "interesting" struck through)
But a sanitizer turns it into a string like this:
This is my interesting boring text!
Which is rendered as:
This is my interesting boring text!
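To make the stripping behavior concrete, here is a minimal Python sketch of a "no HTML allowed" sanitizer of the kind described above. It is an illustration built on the standard library's `html.parser`, not the code of any particular sanitizer; the names `TagStripper` and `strip_tags` are made up for this example.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Keeps only the text content and silently discards every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Only text between tags arrives here; the tags themselves are dropped.
        self.parts.append(data)


def strip_tags(text):
    parser = TagStripper()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


print(strip_tags("This is my <del>interesting</del> boring text!"))
# This is my interesting boring text!
```

The `<del>` is gone without a trace, and with it the only hint that "interesting" was meant to be crossed out.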
That doesn’t make any sense! HTML sanitizers should not remove anything. I obviously went to the effort of typing that HTML, and I don’t want it to disappear. A sanitizer should only neutralize unwanted tags and attributes by escaping them, so that they show up as literal text instead of taking effect.
Applying this system to the last example, you get this:
This is my &lt;del&gt;interesting&lt;/del&gt; boring text!
Which is rendered as:
This is my <del>interesting</del> boring text!
And while this won’t necessarily make sense to most people, it will make sense to the person who wrote that bit of text. More importantly, it won’t change the meaning of the text.
I want an HTML sanitizer that allows me to specify which tags and attributes are allowed, but escapes everything else.
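As a rough idea of what such a sanitizer could look like, here is a small Python sketch built on the standard library's `html.parser` and `html.escape`. Everything about its shape is an assumption made for illustration: the names `EscapingSanitizer` and `sanitize`, the `allowed` mapping from tag name to permitted attribute names, and the choice to escape a whole tag when it carries any attribute that is not permitted. It does not validate attribute values (a `javascript:` URL in an allowed `href` would pass through), so treat it as a demonstration of the idea rather than something to deploy.

```python
import html
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked as "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class EscapingSanitizer(HTMLParser):
    """Passes allowed tags through and escapes everything else."""

    def __init__(self, allowed):
        # allowed (hypothetical shape): {tag name: set of permitted attribute names}
        super().__init__(convert_charrefs=True)
        self.allowed = allowed
        self.out = []
        self.open_tags = []  # allowed tags emitted as real markup, innermost last

    def _is_allowed(self, tag, attrs):
        permitted = self.allowed.get(tag)
        return permitted is not None and all(name in permitted for name, _ in attrs)

    def _render(self, tag, attrs, closer=">"):
        parts = [f"<{tag}"]
        for name, value in attrs:
            if value is None:  # attribute written without a value
                parts.append(f" {name}")
            else:
                parts.append(f' {name}="{html.escape(value, quote=True)}"')
        parts.append(closer)
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        if self._is_allowed(tag, attrs):
            self.out.append(self._render(tag, attrs))
            if tag not in VOID_TAGS:
                self.open_tags.append(tag)
        else:
            # Show the tag exactly as typed, but as inert text.
            self.out.append(html.escape(self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax is only passed through for void elements.
        if tag in VOID_TAGS and self._is_allowed(tag, attrs):
            self.out.append(self._render(tag, attrs, closer=" />"))
        else:
            self.out.append(html.escape(self.get_starttag_text()))

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")
        else:
            # The parser does not keep the raw end tag, so it is rebuilt here.
            self.out.append(html.escape(f"</{tag}>"))

    def handle_data(self, data):
        self.out.append(html.escape(data, quote=False))

    def handle_comment(self, data):
        self.out.append(html.escape(f"<!--{data}-->"))

    def handle_decl(self, decl):
        self.out.append(html.escape(f"<!{decl}>"))

    def handle_pi(self, data):
        self.out.append(html.escape(f"<?{data}>"))


def sanitize(text, allowed):
    parser = EscapingSanitizer(allowed)
    parser.feed(text)
    parser.close()
    # Close any allowed tags left open so they cannot leak styling.
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


sample = "This is my <del>interesting</del> boring text!"
print(sanitize(sample, allowed={}))
# This is my &lt;del&gt;interesting&lt;/del&gt; boring text!
print(sanitize(sample, allowed={"del": set()}))
# This is my <del>interesting</del> boring text!
```

With an empty allow-list, nothing the author typed disappears; it just stops being markup. Allowing `del` lets the strikethrough render while any other tag would still come out escaped.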