My problem with sanitization is that it puts representational logic in
the model.
And embedding HTML in the data doesn't? If it's representational to remove it, then it's representational to allow it, no?
Should the model really care that its data might one day
appear on an HTML page? Or should the HTML page take care of its own
needs?
IMO, both.
At the goes-inta stage, sanitization is nothing more than a particular type of validation. As a parallel, I don't want phone numbers to be formatted, yet I allow users to enter formatting, and then strip it out before I store the data. I'm not going to keep the myriad formats of entered values, and then deal with removing them over & over again as the data is used or displayed. The model should take care of itself here.
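To make the phone number parallel concrete, here is a minimal sketch in plain Ruby. The name normalize_phone is made up for illustration (it is not a Rails API), and it assumes that "stripping formatting" simply means keeping the digits and nothing else:

```ruby
# Hypothetical helper, not part of Rails: accept whatever formatting the
# user typed and keep only the digits before the value is stored.
def normalize_phone(input)
  input.to_s.gsub(/\D/, "")
end

normalize_phone("(555) 123-4567")  # => "5551234567"
normalize_phone("555.123.4567")    # => "5551234567"
```

In a Rails model, something like this would presumably run before validation, so only the stripped form ever reaches the database.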
I don't buy the argument at all that users "may" need a certain HTML tag. Setting aside the simplistic and narrow view of the world from the perspective of the ubiquitous blog, HTML has no business in the fields that make up a real, data-centric application. Removing all traces of it from such fields is an input validation issue that should be taken care of before the data even gets into the model, IMO (not after it is loaded into the model, the way Rails currently works).
Not stripping code before it reaches the model, based on an academic or philosophical point, ignores the real-world danger of that stuff, and the greater responsibility to protect those who use my application by taking every chance I can get to ensure the data is not infected.
Personally, for fields that may require stylizing, I prefer an alternative form of markup that provides greater control over what is allowed. If there simply is no other option but raw HTML, then such cases can be handled as exceptions, not as rules which endanger the greater balance of the data.
Having done all that, one can end up with a false sense of, err, security, by assuming that data coming from the database can be trusted. Remember your X-Files lessons, and Trust No One. You never know who, or what, might have direct access to the database. Data imports, mergers, restored backups from before data was cleaned. Even that trusted admin with direct db access that you've paid to go to all those security conferences may decide there really is easy money to be made by sneaking some code into the data.
So, for these and similar reasons (alternative uncontrolled data sources like RSS vs DB for news stories), of course the HTML page should take care of itself too with proper filters applied at the goes-outta stage.
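For the goes-outta side, a rough sketch of what "the page takes care of itself" could look like, using CGI.escapeHTML from Ruby's standard library. The render_comment helper is hypothetical and only stands in for whatever your view layer does; the point is that the value gets escaped on the way out regardless of whether it came from the DB, an RSS feed, or anywhere else:

```ruby
require "cgi"

# Hypothetical view helper: escape a value right before it is written
# into an HTML page, no matter which source it came from.
def render_comment(text)
  "<p>#{CGI.escapeHTML(text.to_s)}</p>"
end

puts render_comment("<script>alert(1)</script>")
# prints: <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>
```

This does not replace cleaning the data on the way in; it just means a bad row that slipped in through an import or a restored backup still can't run as markup.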
So, yeah, I say sanitization needs to be done at both ends, and debating whether it should be done at one end or the other is like debating whether we should vaccinate for tuberculosis and ignore the disease if it shows up vs. ignoring the opportunity to vaccinate because we can just treat someone if they get it. We need to do both: vaccinate for prevention, and treat for containment.