Strip out ALL javascript from HTML source.

Hi.

I’ve got a bit of an issue: my app accepts HTML source that anyone can submit, and I need to strip out all javascript from it. Attributes, links, tags, etc.

At this stage I’m thinking Hpricot is the go. I guess I’m hoping there is someone out there that has done this and is willing to share.

Cheers Daniel

One of Rick Olson's many plugins can do what you want:

http://agilewebdevelopment.com/plugins/whitelist

Which tags are handled is controllable by your code

Thanx for the pointer. But I think I need a bit more than that. I need to be able to leave tags alone for the most part, except <script> tags, but attributes need a little more control. What I’ve come up with so far:

  • all on*** attributes have to go
  • any attribute that has “javascript:” in it has to go
  • any attribute with “.js” has to go
  • Also, according to the exploit on myspace by sam, it seems that I need to remove javascript: in attributes even when newlines appear anywhere in the word.

I hope I’ve got them all. It doesn’t seem that the whitelist plugin will do this, although I will be very happy if it does.
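The four rules listed above can be sketched as a single predicate over attribute name/value pairs. This is only a minimal sketch in plain Ruby: `dangerous_attribute?` is a hypothetical helper, not part of the whitelist plugin or Hpricot, and it does not decode HTML entities, so it is a starting point rather than a complete filter.

```ruby
# Hypothetical helper (an assumption for illustration, not an existing API).
# Returns true when an attribute should be dropped under the rules above.
def dangerous_attribute?(name, value)
  # Rule 1: every on*** event handler (onclick, onload, ...).
  return true if name.to_s.match?(/\Aon/i)

  # Remove whitespace and control characters first, so a value such as
  # "java\nscript:" is still recognised (rule 4).
  squashed = value.to_s.gsub(/[\s\x00-\x1f]+/, '').downcase

  # Rule 2: anything containing "javascript:".
  return true if squashed.include?('javascript:')

  # Rule 3: anything containing ".js".
  return true if squashed.include?('.js')

  false
end
```

The `.js` rule is deliberately blunt: it also rejects values such as `data.json` or `page.jsp`, which may or may not be acceptable for a given app.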

Cheers

Daniel

Sorry if this got through and is a double post. It got sent back to me.

> • all on*** attributes have to go
> • any attribute that has "javascript:" in it has to go
> • any attribute with "*.js*" has to go
> • Also according to the exploit on myspace by sam It seems that I need to remove javascript: in attributes with newlines anywhere in the word.
>
> I hope I've got them all. It doesn't seem that the whitelist plugin will do this, although I will be very happy if it does.

You can of course contribute back to the plugin. However, I believe it'll do everything you listed short of removing any attribute with *.js*. Not sure what the point of that is though.

Another option is to just yank the code and make it bend to your specific whims. It's not a very large one.

I’ll certainly look at contributing if I can find a way to extend the functionality. Perhaps a strip_all_javascript method or something like that.

I think the point is to make it as difficult as possible to upload javascript. I want to accept arbitrary html source and display it on my page, but at the same time minimise the risk of having my page hijacked. The above list covers the ways I have thought of to include javascript in a submission. I don’t think it’s possible to completely remove the risk of submitted javascript, since a url like http://example.com/stuff could set its headers to javascript and return whatever script it wants, but I want to minimise that risk.

I hope I’ve considered most of the ways that people could hijack my page. I want to keep as many tags intact as possible.

Cheers Daniel

I dunno how secure you want this to be, but to be truly safe from XSS you'll need to handle more cases than Rick's plugin does - here is one stab at it:

http://golem.ph.utexas.edu/~distler/blog/archives/001181.html

If you want to get even more depressed about securing a web app today, go here to get an idea of the insane amount of XSS vectors.

http://ha.ckers.org/xss.html

- Rob

> Thanx for the pointer. But I think I need a bit more than that. I need to be able to leave tags alone for the most part, except <script> tags, but attributes need a little more control. what I've come up with so far:

Pipe your HTML thru tidy -asxhtml. Then use REXML and XPath to strip out anything you don't need (such as the header block that -asxhtml will install). And strip out the <script> tags, and anything that looks like a <script> tag, such as the <object> tags.
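The REXML and XPath step described above might look roughly like this. It is a sketch under assumptions: the input is taken to be well-formed XHTML already (for example the output of tidy -asxhtml), `strip_script_like` is a made-up name, and the list of "script-like" tags is illustrative rather than exhaustive.

```ruby
require 'rexml/document'

# Tags treated as script-like in this sketch (an assumption, not a full list).
SCRIPT_LIKE_TAGS = %w[script object embed applet iframe].freeze

# Parse well-formed XHTML and remove every script-like element,
# together with its content. Returns the remaining markup as a string.
def strip_script_like(xhtml)
  doc = REXML::Document.new(xhtml)
  SCRIPT_LIKE_TAGS.each do |tag|
    # XPath.match collects the nodes first, so removing them
    # inside the loop does not disturb the traversal.
    REXML::XPath.match(doc, "//#{tag}").each(&:remove)
  end
  doc.to_s
end
```

This only removes whole elements; event-handler attributes and javascript: URLs on the remaining tags still need their own pass.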

The absolute safest, of course, is to strip anything not appearing on a whitelist, such as <i>, <b>, <em>, etc.

> I dunno how secure you want this to be, but to be truly safe from XSS you'll need to handle more cases than Rick's plugin does - here is one stab at it:
>
> XSS | Musings

It uses most of the same tests I wrote, adds a lot more allowed svg/mathml tags, and style attribute sanitizing. I just prefer to leave it out, but textile uses it. Those tests were written from that hackers article. You could just port the style stuff to white_list, and then you don't have to bother maintaining a plugin.

That is a bit depressing. It seems that no matter how hard I try I won’t be able to completely remove the js in submitted source.

Is the object tag really that bad? I mean, I think I need to support it, since YouTube widgets and I guess others are based on object tags, and I need to support YouTube at least.

> Is the object tag really that bad? I mean, I think I need to support it, since YouTube widgets and I guess others are based on object tags, and I need to support YouTube at least.

One idea is to allow a custom format. Perhaps just look for youtube urls, and convert them to videos? Obviously this should be done after sanitizing...

I’m not really sure what you mean by custom format. Does that mean like dom selection in the whitelist plugin? eg. Allow tag x if it’s a child of tag Y and has attribute z=‘value’ or z!=‘javascript’

I really want to be as broad ranging as possible and include as many tags as possible and also in their original form. It’s important for this app that the tags, as much as possible be left as they’re inputted, I just don’t want the result to hijack my page.

> I really want to be as broad ranging as possible and include as many tags as possible and also in their original form. It's important for this app that the tags, as much as possible be left as they're inputted, I just don't want the result to hijack my page.

Well, I originally meant something very custom like <video:http://youtubeurl…>. Though since most normal folks can't grok this, and web power users have enough formats to figure out, perhaps you could just seek out youtube urls sitting on a single line or something.

For instance, Tumblr lets me add the raw embed code or just a youtube video URL if I want to post a video.
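The "youtube url sitting on a single line" idea could be sketched as below, run after sanitizing so that the only embed markup on the page is markup the app generated itself. Everything here is an assumption for illustration: the `expand_video_lines` name, the URL pattern, and the embed markup (including its width and height) are not taken from any plugin in this thread.

```ruby
require 'cgi'

# Matches a line that consists of nothing but a YouTube watch URL.
YOUTUBE_LINE = %r{\Ahttps?://(?:www\.)?youtube\.com/watch\?v=([\w-]+)\z}

# Replace each line that is only a YouTube URL with embed markup generated
# here; every other line passes through unchanged.
def expand_video_lines(sanitized_text)
  sanitized_text.each_line.map do |line|
    if (m = YOUTUBE_LINE.match(line.strip))
      id = CGI.escapeHTML(m[1])
      %(<embed src="http://www.youtube.com/v/#{id}" type="application/x-shockwave-flash" width="425" height="350"></embed>\n)
    else
      line
    end
  end.join
end
```

Because the video id is restricted to word characters and hyphens, nothing user-controlled other than that id ends up inside the generated tag.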

I could not change the input to that level. <video:...> but I’ve had a look at the youtube and also odeo widgets and they both boil down to an embed tag with a type of shockwave flash.

Do you think it would be a bad idea to enable support for embed tags with that type with src from youtube.com or odeo.com / (a list of known) domains? If I did this I could remove the object tag from around the embed tag and I don’t think it would have much of an impact.
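A check along those lines might look like the following sketch. It only answers "is this embed's type Flash and is its src on a known domain?"; whether that is enough to make embed tags safe is exactly the open question in this thread. `trusted_embed?` and `TRUSTED_EMBED_HOSTS` are hypothetical names.

```ruby
require 'uri'

# Known domains from the post above; extend as needed (illustrative list).
TRUSTED_EMBED_HOSTS = %w[youtube.com odeo.com].freeze

# True only for Flash embeds whose src is an http(s) URL on a trusted
# domain or one of its subdomains.
def trusted_embed?(src, type)
  return false unless type.to_s.strip.downcase == 'application/x-shockwave-flash'

  uri = URI.parse(src.to_s.strip)
  return false unless %w[http https].include?(uri.scheme.to_s.downcase)

  host = uri.host.to_s.downcase
  # Compare whole host labels so "youtube.com.evil.example" is rejected.
  TRUSTED_EMBED_HOSTS.any? { |d| host == d || host.end_with?(".#{d}") }
rescue URI::InvalidURIError
  false
end
```

Matching on the parsed host rather than on a substring of the src avoids being fooled by URLs that merely contain "youtube.com" somewhere in the path or in a longer hostname.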

> I could not change the input to that level. <video:...> but I've had a look at the youtube and also odeo widgets and they both boil down to an embed tag with a type of shockwave flash.

You're really not getting the point of what I'm trying to say. I'm saying, strip all object tags, and use something custom that gets replaced w/ an object tag that you generate afterwards. If you're generating insecure JS, you have issues :slight_smile:

> Do you think it would be a bad idea to enable support for embed tags with that type with src from youtube.com or odeo.com / (a list of known) domains? If I did this I could remove the object tag from around the embed tag and I don't think it would have much of an impact.

I don't really know, I haven't thought about this stuff much. I just strip all object/embed tags by default. You may have to do some digging for any attack vectors on object/embed tags. I don't think it'd be that different from image tags though.

> You’re really not getting the point of what I’m trying to say. I’m saying, strip all object tags, and use something custom that gets replaced w/ an object tag that you generate afterwards. If you’re generating insecure JS, you have issues :slight_smile:

Ok that makes more sense to me.

> I don’t really know, I haven’t thought about this stuff much. I just strip all object/embed tags by default. You may have to do some digging for any attack vectors on object/embed tags. I don’t think it’d be that different from image tags though.

K thanx for your help. Looks like I’ve got some digging to do :slight_smile:

Cheers

Daniel

Daniel N wrote:

> Is the object tag really that bad? I mean, I think I need to support it, since YouTube widgets and I guess others are based on object tags, and I need to support YouTube at least.

I didn't read the original post. If the question is "how do I do safe markup and transclusions, in a public blog?", then naturally get either a Wiki markup (or YAML), or permit a subset of HTML. To transclude Object tags, invent a new tag called <video>. That way you prevent shenanigans, right?

Daniel N wrote:

> That is a bit depressing. It seems that no matter how hard I try I won't be able to completely remove the js in submitted source.

(Use the XPath system I suggested, then) remove all tags except those on a short white-list, and then remove all their attributes.
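As a rough illustration of that stricter approach, the sketch below keeps only whitelisted tags and removes every attribute from the ones that remain. It assumes the input is already well-formed XHTML with a single wrapper element; the whitelist contents and the names `scrub_children!` and `whitelist_xhtml` are made up for the example, and non-whitelisted elements are dropped together with their content, which is one possible reading of "remove all tags".

```ruby
require 'rexml/document'

# Illustrative whitelist only; the thread does not specify one.
ALLOWED_TAGS = %w[p b i em strong ul ol li blockquote br].freeze

# Recursively drop every child element that is not whitelisted (content
# included) and strip all attributes from the elements that are kept.
def scrub_children!(element)
  element.elements.to_a.each do |child|
    if ALLOWED_TAGS.include?(child.name.downcase)
      child.attributes.keys.each { |name| child.delete_attribute(name) }
      scrub_children!(child)
    else
      child.remove
    end
  end
end

# The root element is treated as a trusted wrapper and is not itself checked.
def whitelist_xhtml(xhtml)
  doc = REXML::Document.new(xhtml)
  scrub_children!(doc.root)
  doc.root.to_s
end
```

With no attributes left at all, the on*** and javascript: cases discussed earlier cannot occur in the output, at the cost of losing links and images unless specific attributes are re-admitted.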

We <http://www.jobscore.com> use SafeHtml <http://pixel-apes.com/safehtml/>. It's really good about leaving the tags alone while removing potentially dangerous XSS-type stuff.

It's PHP, but I wrapped it in a class that shells out to the php interpreter.
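A wrapper of that kind could be as small as the sketch below, which pipes the HTML to an external command on stdin and reads the cleaned result from stdout. This is not Alex's actual class: `ExternalSanitizer` is a hypothetical name, and the PHP script in the usage comment is an assumed filter script that reads stdin and prints sanitized HTML.

```ruby
require 'open3'

# Hypothetical wrapper: sends HTML to an external filter process and
# returns whatever that process writes to stdout.
class ExternalSanitizer
  # command is an argv-style array, so no shell interpolation takes place.
  def initialize(command)
    @command = command
  end

  def sanitize(html)
    output, status = Open3.capture2(*@command, stdin_data: html)
    raise "sanitizer exited with status #{status.exitstatus}" unless status.success?

    output
  end
end

# Usage (script name is an assumption, not a file shipped with SafeHTML):
#   sanitizer = ExternalSanitizer.new(['php', 'safehtml_filter.php'])
#   clean = sanitizer.sanitize(params[:body])
```

Passing the command as an array and the HTML on stdin keeps user input out of the command line entirely, which matters when the input is untrusted by definition.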

Alex