Strip & Sanitize BEFORE saving data

So I've googled my brains out, and I see a lot of talk about TextHelper for views, but next to no discussion about cleaning text _before_ it is saved.

I figured this had to be asked 4 zillion times, but I'm not finding anything concrete/obvious.

Using h is fine as a safety catch, but that alone is not acceptable to me as the means of diffusing the impact of HTML or JS in text data. It needs to be removed / tested for in validations for rejection (and of course it should be wise to all the entity/unicode/null and other obfuscation tricks)

I did see refc to some tools like whitelist, but again, they're all focused on the display side of things.

Obviously I can use validates_format_of for the vast majority of small fields that can be restricted to known characters which would preclude an HTML/XSS injection. But there are plenty of larger, free-text fields where that's not practical.

I'm surprised there's no basic validation for this (that I can see), so I'm hoping that's because there's a common technique which combines some other tool with validations to do this?

What I'm thinking of is something like strip_tags except that it is usable on the model side of things.

-- gw (www.railsdev.ws)

You can make an ActiveRecord do exactly that. The open nature of Ruby
makes it simple:

app/model/cleaner.rb:

I use this same design pattern to make my blog titles into 'path'
fields I use in my blog urls:

http://destiney.com/blog/clean-your-ruby-rails-activerecords-automatically

A route is the real magic:

  map.connect "blog/:id",
              :controller => 'blog',
              :action => 'entry',
              :requirements => { :id => /[\w-]+/ },
              :id => nil

The common technique was to mix the helpers directly in to your model
and call them directly. However, I recently just refactored a lot of
that code into the html tokenizer library. You can now access the
classes directly as HTML::Sanitizer, HTML::LinkSanitizer, and
HTML::WhiteListSanitizer.

Greg:
1. I am moved to ask a meta-question: How did you come upon this
solution?
     - Reading Rails docs?
     - Reading Rails source code?
     - Regular Ruby reflection programming?

    Ah, I see, regular Ruby reflection. I went looking for the
Cleaner module in the docs and source.

2. A future version of rails could use the name 'Cleaner' and create
a naming conflict, but that sounds manageable.

3. Also, the Pickaxe Ruby book deprecates 'append_features' in favor
of 'Module#included'.

4. Why would I name a column 'html' if I wanted to remove all html
tags? I'm more likely to use this on a column named 'description',
'name' or even 'email address'. I guess it's just an example.

5. Anyway, I like the code.

fredistic

1. I am moved to ask a meta-question: How did you come upon this
solution?

Honestly I don't recall. If I had to guess I'd say from one of my Ruby books:

http://static.destiney.com/ror_vs_c_asm.jpg

2. A future version of rails could use the name 'Cleaner' and create
a naming conflict, but that sounds manageable.

You can always name your version Cleanerfoobar87, I'm pretty sure that
won't cause any future problems.

3. Also, the Pickaxe Ruby book deprecates 'append_features' in favor
of 'Module#included'.

I wouldn't be surprised a bit.

4. Why would I name a column 'html' if I wanted to remove all html
tags? I'm more likely to use this on a column named 'description',
'name' or even 'email address'. I guess it's just an example.

Obviously you're free to rename the variable to something else. Like
I was saying in that other post I use this sort of thing to make
fields into other fields, it's good for many other uses I'm sure.

So I've googled my brains out, and I see a lot of talk about
TextHelper for views, but next to no discussion about cleaning text
_before_ it is saved.

[SNIP]

What I'm thinking of is something like strip_tags except that it is
usable on the model side of things.

The common technique was to mix the helpers directly in to your model
and call them directly.

Yeah, I'm lost on trying to do that. I can't seem to ever figure out what the correct syntax is for doing this at any given point. include? require? quoted? not-quoted? camel case? underscored? path names needed? name spaces needed? Far too many ways to accomplish what all looks like the same thing, and I still can't figure out what part of it is straight Ruby and where (if) Rails sticks its finger in the pie trying to make things "easier."

Leveraging strip_tags inside a custom validator seems the right way to go, but I'm lost in trying to get that into a model.

Need some spoon feeding to get started.

However, I recently just refactored a lot of
that code into the html tokenizer library. You can now access the
classes directly as HTML::Sanitizer, HTML::LinkSanitizer, and
HTML::WhiteListSanitizer.

This is a Rails 2.0 enhancement?

There have also been some new plugins that have come out in the last
few weeks:

http://code.al3x.net/svn/acts_as_sanitized/
http://code.google.com/p/sanitizeparams/

Ha. About an hour ago I came across Tony's blog entry, and I started
to write something based on it (I was going to try some simple stuff
without whitelist to start with though).

I guess I'll try this. Hopefully this works with 1.2.5 ?

-- gw

This is just a shot in the dark, but couldn't you add a global
"before_save" callback that would go through the self.attributes hash
and escape all strings? Then you could have a function call in the
model to determine the fields that should *not* be sanitized.

Ah yes, sorry. I recently added them to the HTML Tokenizer library
that powers assert_tag and assert_select. This way you don't have to
mess with including the helpers. The 3 sanitizers all respond to
#sanitize. The WhiteListSanitizer responds to #sanitize_css too.

Also, #append_features isn't really 'deprecated' as far as I know, but
it is one less line of code:

module FooBar
  def self.append_features(base)
    super
    base....
  end

  def self.included(base)
    base... # no super required
  end
end

Agreed all around, except that I find for 98% of the inputs in my apps, sanitizing in the model works for me for which I just finished a series of new validates_as_ class methods: http://www.railsdev.ws/blog/11/custom-validations-in-rails/ that helps reduce the bulk and improve readability IMO for many of the common things I do.

However, for those times where I have large open-editing fields, I'd still like to try your setup, but can't get white_list and your plugin to load on 1.2.6.

Installed using:

script/plugin install http://svn.techno-weenie.net/projects/plugins/white_list/

script/plugin install http://sanitizeparams.googlecode.com/svn/trunk/sanitize_params/

# environment.rb
config.plugins = %W( white_list, sanitize_params )

restart the site, and I get this error:

/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.6/lib/initializer.rb:195:in `load_plugins': Cannot find the plugin 'white_list,'! (LoadError)

If it's worth your time pursuing a 1.2.x working version, I'm all ears. Not sure yet when I'll move to 2.0.

Actually looking closer at your code, I see it's more likely a problem with getting white_list to work on 1.2.6, so nevermind... it's not your worry :slight_smile:

-- gw

Tony Stubblebine wrote:

The default Rails method of sanitizing in templates has two problems. One, you sanitize too late to give feedback to the person who entered the data. Two, you're opening yourself to programmer error because you're asking programmers to add the sanitization call on every single display of data. I'm struggling to come up with a benefit to this approach. Maybe if you were intentionally keeping a database of unsafe data but wanted to put a safe web front end on it, like a spam database.

You would only want to give users feedback on sanitation effects
when HTML text is expected to be entered. Early sanitation can be
confusing to users if plain text input is instead required.

For example, if pre-sanitation is used and a user enters the following
in a text field:

    Java < Rails

they'll see

   Java &lt; Rails

when they next go to edit the field.

My problem with sanitization is that it puts representational logic in
the model. Should the model really care that its data might one day
appear on an HTML page? Or should the HTML page take care of its own
needs?

///ark

My problem with sanitization is that it puts representational logic in
the model.

And embedding HTML in the data doesn't? If it's representational to remove it, then it's representational to allow it, no?

Should the model really care that its data might one day
appear on an HTML page? Or should the HTML page take care of its own
needs?

IMO, both.

At the goes-inta stage, sanitization is nothing more than a particular type of validation. As a parallel, I don't want phone numbers to be formatted, yet I allow users to enter formatting, and then strip it out before I store the data. I'm not going to keep the myriad formats of entered values, and the deal with removing it over & over again as the data is used or displayed. The model should take care of itself here.

I don't buy the argument at all that users "may" need a certain HTML tag. Setting aside the simplistic and narrow view of the world from the perspective of the ubiquitous blog, HTML has no business in the fields that make up a real, data-centric, application. Removing all traces of it from such fields is an input validation issue that the model should be taking care of before the data even get into the model IMO (not after it is loaded into the model the way Rails currently works).

Not stripping code before it reaches the model based on an academic or philosophical point ignores the real-world danger of that stuff, and the greater responsibility to take every opportunity to protect those who use my aplication by taking every chance I can get to ensure the data is not infected.

Personally, for fields that may require stylizing, I prefer an alternative form of markup that provides greater control over what is allowed. If there simply is no other option but raw HTML, then such cases can be handled as exceptions, not as rules which endanger the greater balance of the data.

Having done all that, one can end up with a false sense of, err, security, by assuming that data coming from the database can be trusted. Remember your X-Files lessons, and Trust No One. You never know who, or what might have direct access to the database. Data imports, mergers, restored backups from before data was cleaned. Even that trusted admin with direct db access that you've paid to go to all those security conferences may decide there really is easy money to made by sneaking some code into the data.

So, for these and similar reasons (alternative uncontrolled data sources like RSS vs DB for news stories), of course the HTML page should take care of itself too with proper filters applied at the goes-outta stage.

So, yeah, I say sanitization needs to be done at both ends, and debating whether it should be done at one end or the other is like debating whether we should vaccinate for tuberculosis and ignore the disease if it shows up vs. ignoring the opportunity to vaccinate because we can just treat someone if they get it. We need to do both: vaccinate for prevention, and treat for containment.

If there is a business rule that states "HTML shall not be stored in
the model," then of course you should validate it out in the model.
If, however, the business rule is "Our HTML pages should not be
vulnerable," then the model is the wrong place to make that happen, in
my opinion. The more presentational logic you put in the model, the
worse, in my opinion. You need to handle this fully and properly in
the place that knows what dangers exist.

As for the model "allowing" HTML, that's missing the point. The model
shouldn't even know that HTML exists.

I know we're not going to agree on this, of course. I'm not exactly a
belt-and-suspenders kind of guy. :slight_smile:

///ark

Based on your business rule descriptions, a hashed out live conversation would likely reveal me agreeing to the principle you're positing. However, for practical purposes, I see these two descriptions ending up as the same thing. I know they're not the same, but where the rubber hits the road, I think the second rule *usually* (not always of course) turns into the first rule because being "not vulnerable" usually turns into "don't even allow it" when you realize the app don't need it, which pushes me to say that by default, sanitizing should happen on the goes_inta side *except* where html is valid content. I don't see it as suspenders, but rather as an aggressive way to acheive both the first and second rule at the same time.

So, even if you beat me into agreeing with you fully, I'm quite sure I would leave the room and still sanitize HTML at the model because I think the first rule is the one that should apply everywhere except where it's obvious that it cannot.

Good discussion to understand the distinctions though. Thx.

Marston A. wrote:

There have also been some new plugins that have come out in the last
few weeks:

http://code.al3x.net/svn/acts_as_sanitized/
http://code.google.com/p/sanitizeparams/

Looks to me like acts_as_sanitized has moved to
http://svn.devjavu.com/actsassanitized with a project page at
http://actsassanitized.devjavu.com/.

It didn't work with Rails 2.0.2 out of the box, so I tried a little
tinkering by replacing this:

include ActionView::Helpers::TextHelper

with

include ActionView::Helpers::SanitizeHelper

I get a "undefined method `white_list_sanitizer' for #<Class:0x248a8bc>"
error.

I've tried a number of other, but no luck. How would you update this
for Rails 2.0?

cheers,
Walter

Walter Mcginnis wrote:

It didn't work with Rails 2.0.2 out of the box, so I tried a little
tinkering by replacing this:

include ActionView::Helpers::TextHelper

with

include ActionView::Helpers::SanitizeHelper

I get a "undefined method `white_list_sanitizer' for #<Class:0x248a8bc>"
error.

Nevermind, got it working. Will submit to plugin maintainer...

Cheers,
Walter