escaping/stripping all user HTML input

(Some of you may be reading this twice, as I accidentally posted it to ruby-talk as well.)

I am writing an application where I know that I do not need to allow any HTML input from a user.

I am considering using before_filter at the controller level to call a method that essentially performs the following on the appropriate members of the params hash:

- call strip_tags()
- escape any remaining characters with h()

The reason I am doing this is that it seems repetitive and error-prone to have to call the above method every time user input is displayed in a view. Ultimately, I would prefer to store the data in as "non-malicious" a format as possible and not have to worry about escaping it later at the presentation level.

Is there a better way to do this? Is there existing code that does this already? Some googling yielded nothing specific other than postings to the effect of "in your view, make sure to use h()".

Luis wrote:

I am writing an application where I know that I do not need to allow any HTML input from a user.

I am considering using before_filter at the controller level to call a method that essentially performs the following on the appropriate members of the params hash:

- call strip_tags()
- escape any remaining characters with h()

I will answer, but I don't understand the problem with storing raw HTML and escaping it. If the user typed <yo> into a text area, they should see <yo> in the view, with the angle brackets, and we should not strip the tags. You can call the equivalent of h() when you save their change, for example.
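To make "the equivalent of h()" concrete, here is a minimal sketch in plain Ruby, outside Rails. It assumes ERB::Util.html_escape (the standard-library method behind h()) is an acceptable stand-in, and escape_params is a hypothetical helper name, not anything Rails provides. It walks a nested params-like hash and escapes every string it finds:

```ruby
require 'erb'

# Hypothetical helper (not a Rails API): return a copy of a nested
# params-like structure with every String HTML-escaped.
def escape_params(value)
  case value
  when Hash   then value.each_with_object({}) { |(k, v), out| out[k] = escape_params(v) }
  when Array  then value.map { |v| escape_params(v) }
  when String then ERB::Util.html_escape(value)
  else value  # numbers, nil, etc. pass through untouched
  end
end

params  = { 'comment' => { 'body' => '<yo>', 'tags' => ['a & b'] } }
escaped = escape_params(params)
puts escaped['comment']['body']   # prints "&lt;yo&gt;"
```

One trade-off to keep in mind: if data is escaped on the way in and a view later calls h() on it again, the user will see the entities doubled up (&amp;lt;yo&amp;gt;), so you would have to pick one place to escape and stick to it.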

The reason I am doing this is that it seems repetitive and error-prone to have to call the above method every time user input is displayed in a view. Ultimately, I would prefer to store the data in as "non-malicious" a format as possible and not have to worry about escaping it later at the presentation level.

Is there a better way to do this? Is there existing code that does this already? Some googling yielded nothing specific other than postings to the effect of "in your view, make sure to use h()".

I can think of a way. It's sick, excessive, and bullet-proof. Here is assert_tidy:

  def assert_tidy(messy = @response.body, verbosity = :noisy)
    scratch_html = RAILS_ROOT + '/../scratch.html' # TODO tune me!
    File.open(scratch_html, 'w'){|f| f.write(messy) }

    # -e reports only errors and warnings; -q suppresses the summary chatter
    gripes = `tidy -eq #{scratch_html} 2>&1`
    gripes = gripes.split("\n")  # keep the result; a bare split is discarded

    # separate the complaints we tolerate from the ones worth reporting
    excluded, included = gripes.partition do |g|
      g =~ / - Info\: / or
      g =~ /Warning\: missing \<\!DOCTYPE\> declaration/ or
      g =~ /proprietary attribute/ or
      g =~ /lacks "(summary|alt)" attribute/
    end

    puts included if verbosity == :noisy
    # included.map{|i| puts Regexp.escape(i) }
    assert_xml `tidy -wrap 1001 -asxhtml #{scratch_html} 2>/dev/null`
      # CONSIDER that should report serious HTML deformities
  end

You can take out the assert_xml if you don't have yar_wiki (whence that comes). That code uses the command-line tidy. Don't worry about the Ruby-oriented tidy.so project.

Migrate that test-side code to production code, and you have a function that turns sloppy HTML into pristine XHTML. Now that your input is XML, you can strip out all the tags like this:

  class REXML::Element
    def inner_text
      self.each_element( './/text()' ){}.join( '' )
    end
  end
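For anyone who wants to try the text-extraction step on its own, here is a small sketch. It assumes the input is already well-formed XHTML (which is what the tidy pass is for), and it uses REXML::XPath.match instead of the each_element trick above. text_only is a hypothetical helper name for illustration, not part of REXML:

```ruby
require 'rexml/document'

# Hypothetical helper (not from REXML itself): parse well-formed XHTML
# and return only its text nodes, concatenated.
def text_only(xhtml)
  doc = REXML::Document.new(xhtml)
  REXML::XPath.match(doc.root, './/text()').join
end

puts text_only('<p>Hello <b>world</b></p>')
```

REXML raises a parse error on malformed markup, so raw user input would still need to go through tidy (or some other cleanup) first.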

This obscenely over-the-top solution will hopefully inspire someone to post a solution in the usual 4 lines!