Strip & Sanitize BEFORE saving data

So I've googled my brains out, and I see a lot of talk about TextHelper for views, but next to no discussion about cleaning text _before_ it is saved.

I figured this had to be asked 4 zillion times, but I'm not finding anything concrete/obvious.

Using h is fine as a safety catch, but that alone is not acceptable to me as the means of diffusing the impact of HTML or JS in text data. It needs to be removed / tested for in validations for rejection (and of course it should be wise to all the entity/unicode/null and other obfuscation tricks)

I did see refc to some tools like whitelist, but again, they're all focused on the display side of things.

Obviously I can use validates_format_of for the vast majority of small fields that can be restricted to known characters which would preclude an HTML/XSS injection. But there are plenty of larger, free-text fields where that's not practical.

I'm surprised there's no basic validation for this (that I can see), so I'm hoping that's because there's a common technique which combines some other tool with validations to do this?

What I'm thinking of is something like strip_tags except that it is usable on the model side of things.

-- gw (www.railsdev.ws)

The common technique was to mix the helpers directly in to your model and call them directly. However, I recently just refactored a lot of that code into the html tokenizer library. You can now access the classes directly as HTML::Sanitizer, HTML::LinkSanitizer, and HTML::WhiteListSanitizer.

Greg: 1. I am moved to ask a meta-question: How did you come upon this solution?      - Reading Rails docs?      - Reading Rails source code?      - Regular Ruby reflection programming?

    Ah, I see, regular Ruby reflection. I went looking for the Cleaner module in the docs and source.

2. A future version of rails could use the name 'Cleaner' and create a naming conflict, but that sounds manageable.

3. Also, the Pickaxe Ruby book deprecates 'append_features' in favor of 'Module#included'.

4. Why would I name a column 'html' if I wanted to remove all html tags? I'm more likely to use this on a column named 'description', 'name' or even 'email address'. I guess it's just an example.

5. Anyway, I like the code.

fredistic

So I've googled my brains out, and I see a lot of talk about TextHelper for views, but next to no discussion about cleaning text _before_ it is saved.

[SNIP]

What I'm thinking of is something like strip_tags except that it is usable on the model side of things.

The common technique was to mix the helpers directly in to your model and call them directly.

Yeah, I'm lost on trying to do that. I can't seem to ever figure out what the correct syntax is for doing this at any given point. include? require? quoted? not-quoted? camel case? underscored? path names needed? name spaces needed? Far too many ways to accomplish what all looks like the same thing, and I still can't figure out what part of it is straight Ruby and where (if) Rails sticks its finger in the pie trying to make things "easier."

Leveraging strip_tags inside a custom validator seems the right way to go, but I'm lost in trying to get that into a model.

Need some spoon feeding to get started.

However, I recently just refactored a lot of that code into the html tokenizer library. You can now access the classes directly as HTML::Sanitizer, HTML::LinkSanitizer, and HTML::WhiteListSanitizer.

This is a Rails 2.0 enhancement?

There have also been some new plugins that have come out in the last few weeks:

http://code.al3x.net/svn/acts_as_sanitized/ http://code.google.com/p/sanitizeparams/

Ha. About an hour ago I came across Tony's blog entry, and I started
to write something based on it (I was going to try some simple stuff
without whitelist to start with though).

I guess I'll try this. Hopefully this works with 1.2.5 ?

-- gw

This is just a shot in the dark, but couldn't you add a global "before_save" callback that would go through the self.attributes hash and escape all strings? Then you could have a function call in the model to determine the fields that should *not* be sanitized.

Ah yes, sorry. I recently added them to the HTML Tokenizer library that powers assert_tag and assert_select. This way you don't have to mess with including the helpers. The 3 sanitizers all respond to #sanitize. The WhiteListSanitizer responds to #sanitize_css too.

Also, #append_features isn't really 'deprecated' as far as I know, but it is one less line of code:

module FooBar   def self.append_features(base)     super     base....   end

  def self.included(base)     base... # no super required   end end

Agreed all around, except that I find for 98% of the inputs in my apps, sanitizing in the model works for me for which I just finished a series of new validates_as_ class methods: WEBSITE.WS - Your Internet Address For Life™ that helps reduce the bulk and improve readability IMO for many of the common things I do.

However, for those times where I have large open-editing fields, I'd still like to try your setup, but can't get white_list and your plugin to load on 1.2.6.

Installed using:

script/plugin install http://svn.techno-weenie.net/projects/plugins/white_list/

script/plugin install http://sanitizeparams.googlecode.com/svn/trunk/sanitize_params/

# environment.rb config.plugins = %W( white_list, sanitize_params )

restart the site, and I get this error:

/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.6/lib/initializer.rb:195:in `load_plugins': Cannot find the plugin 'white_list,'! (LoadError)

If it's worth your time pursuing a 1.2.x working version, I'm all ears. Not sure yet when I'll move to 2.0.

Actually looking closer at your code, I see it's more likely a problem with getting white_list to work on 1.2.6, so nevermind... it's not your worry :slight_smile:

-- gw

Tony Stubblebine wrote:

The default Rails method of sanitizing in templates has two problems. One, you sanitize too late to give feedback to the person who entered the data. Two, you're opening yourself to programmer error because you're asking programmers to add the sanitization call on every single display of data. I'm struggling to come up with a benefit to this approach. Maybe if you were intentionally keeping a database of unsafe data but wanted to put a safe web front end on it, like a spam database.

You would only want to give users feedback on sanitation effects when HTML text is expected to be entered. Early sanitation can be confusing to users if plain text input is instead required.

For example, if pre-sanitation is used and a user enters the following in a text field:

    Java < Rails

they'll see

   Java &lt; Rails

when they next go to edit the field.

My problem with sanitization is that it puts representational logic in the model. Should the model really care that its data might one day appear on an HTML page? Or should the HTML page take care of its own needs?

///ark

My problem with sanitization is that it puts representational logic in the model.

And embedding HTML in the data doesn't? If it's representational to remove it, then it's representational to allow it, no?

Should the model really care that its data might one day appear on an HTML page? Or should the HTML page take care of its own needs?

IMO, both.

At the goes-inta stage, sanitization is nothing more than a particular type of validation. As a parallel, I don't want phone numbers to be formatted, yet I allow users to enter formatting, and then strip it out before I store the data. I'm not going to keep the myriad formats of entered values, and the deal with removing it over & over again as the data is used or displayed. The model should take care of itself here.

I don't buy the argument at all that users "may" need a certain HTML tag. Setting aside the simplistic and narrow view of the world from the perspective of the ubiquitous blog, HTML has no business in the fields that make up a real, data-centric, application. Removing all traces of it from such fields is an input validation issue that the model should be taking care of before the data even get into the model IMO (not after it is loaded into the model the way Rails currently works).

Not stripping code before it reaches the model based on an academic or philosophical point ignores the real-world danger of that stuff, and the greater responsibility to take every opportunity to protect those who use my aplication by taking every chance I can get to ensure the data is not infected.

Personally, for fields that may require stylizing, I prefer an alternative form of markup that provides greater control over what is allowed. If there simply is no other option but raw HTML, then such cases can be handled as exceptions, not as rules which endanger the greater balance of the data.

Having done all that, one can end up with a false sense of, err, security, by assuming that data coming from the database can be trusted. Remember your X-Files lessons, and Trust No One. You never know who, or what might have direct access to the database. Data imports, mergers, restored backups from before data was cleaned. Even that trusted admin with direct db access that you've paid to go to all those security conferences may decide there really is easy money to made by sneaking some code into the data.

So, for these and similar reasons (alternative uncontrolled data sources like RSS vs DB for news stories), of course the HTML page should take care of itself too with proper filters applied at the goes-outta stage.

So, yeah, I say sanitization needs to be done at both ends, and debating whether it should be done at one end or the other is like debating whether we should vaccinate for tuberculosis and ignore the disease if it shows up vs. ignoring the opportunity to vaccinate because we can just treat someone if they get it. We need to do both: vaccinate for prevention, and treat for containment.

If there is a business rule that states "HTML shall not be stored in the model," then of course you should validate it out in the model. If, however, the business rule is "Our HTML pages should not be vulnerable," then the model is the wrong place to make that happen, in my opinion. The more presentational logic you put in the model, the worse, in my opinion. You need to handle this fully and properly in the place that knows what dangers exist.

As for the model "allowing" HTML, that's missing the point. The model shouldn't even know that HTML exists.

I know we're not going to agree on this, of course. I'm not exactly a belt-and-suspenders kind of guy. :slight_smile:

///ark

Based on your business rule descriptions, a hashed out live conversation would likely reveal me agreeing to the principle you're positing. However, for practical purposes, I see these two descriptions ending up as the same thing. I know they're not the same, but where the rubber hits the road, I think the second rule *usually* (not always of course) turns into the first rule because being "not vulnerable" usually turns into "don't even allow it" when you realize the app don't need it, which pushes me to say that by default, sanitizing should happen on the goes_inta side *except* where html is valid content. I don't see it as suspenders, but rather as an aggressive way to acheive both the first and second rule at the same time.

So, even if you beat me into agreeing with you fully, I'm quite sure I would leave the room and still sanitize HTML at the model because I think the first rule is the one that should apply everywhere except where it's obvious that it cannot.

Good discussion to understand the distinctions though. Thx.

Marston A. wrote:

There have also been some new plugins that have come out in the last few weeks:

http://code.al3x.net/svn/acts_as_sanitized/ Google Code Archive - Long-term storage for Google Code Project Hosting.

Looks to me like acts_as_sanitized has moved to http://svn.devjavu.com/actsassanitized with a project page at http://actsassanitized.devjavu.com/.

It didn't work with Rails 2.0.2 out of the box, so I tried a little tinkering by replacing this:

include ActionView::Helpers::TextHelper

with

include ActionView::Helpers::SanitizeHelper

I get a "undefined method `white_list_sanitizer' for #<Class:0x248a8bc>" error.

I've tried a number of other, but no luck. How would you update this for Rails 2.0?

cheers, Walter

Walter Mcginnis wrote:

It didn't work with Rails 2.0.2 out of the box, so I tried a little tinkering by replacing this:

include ActionView::Helpers::TextHelper

with

include ActionView::Helpers::SanitizeHelper

I get a "undefined method `white_list_sanitizer' for #<Class:0x248a8bc>" error.

Nevermind, got it working. Will submit to plugin maintainer...

Cheers, Walter

Walter Mcginnis wrote:

Walter Mcginnis wrote:

It didn't work with Rails 2.0.2 out of the box, so I tried a little tinkering by replacing this:

include ActionView::Helpers::TextHelper

with

include ActionView::Helpers::SanitizeHelper

I get a "undefined method `white_list_sanitizer' for #<Class:0x248a8bc>" error.

Nevermind, got it working. Will submit to plugin maintainer...

Cheers, Walter

I run into this similar error: ActionView::TemplateError (undefined method `white_list_sanitizer' for TextHelperSingleton:Class) How did you fix that? I used this: http://www.csummers.org/2006/08/07/extend-string-to-use-actionviews-text-helpers/

Thanks, Isaiah p.

Isaiah Peng wrote:

Walter Mcginnis wrote:

Walter Mcginnis wrote:

Nevermind, got it working. Will submit to plugin maintainer...

Cheers, Walter

I run into this similar error: ActionView::TemplateError (undefined method `white_list_sanitizer' for TextHelperSingleton:Class) How did you fix that? I used this: http://www.csummers.org/2006/08/07/extend-string-to-use-actionviews-text-helpers/

Thanks, Isaiah p.

I never heard anything back from the plugin maintainer, so I left my customizations "on top" of the plugin in my use of it. They can be seen at http://github.com/kete/kete/tree/master/vendor/plugins/acts_as_sanitized

Cheers, Walter

Fernando Perez wrote:

What are all the plugins available for sanitizing user input and that work with Rails 2.2.2?

You may want to checkout the new Sanitize gem. I just read about it here:

http://www.rubyinside.com/how-to-sanitize-html-and-css-in-ruby-1429.html

As far as Rails I haven't updated to the latest version yet, so no idea if my version of the plugin still works.

Cheers, Walter