Strip out ALL javascript from HTML source.

Hi.

I’ve got a bit of an issue: I accept HTML source as input from anyone, and I need to strip out all the JavaScript. That means attributes, links, tags, etc.

At this stage I’m thinking Hpricot is the go. I guess I’m hoping
there is someone out there that has done this and is willing to share.

Cheers
Daniel

One of Rick Olson's many plugins can do what you want:

http://agilewebdevelopment.com/plugins/whitelist

Which tags are handled is controllable from your code.

Thanx for the pointer. But I think I need a bit more than
that. I need to be able to leave tags alone for the most part,
except <script> tags, but attributes need a little more
control. What I’ve come up with so far:

  • all on*** attributes have to go
  • any attribute that has “javascript:” in it has to go
  • any attribute with “.js” has to go
  • Also, according to the exploit on MySpace by Sam, it seems that I need to remove javascript: from attributes even when newlines appear anywhere in the word.

I hope I’ve got them all. It doesn’t seem that the whitelist
plugin will do this, although I will be very happy if it does.
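
For illustration, the four rules above could be sketched in plain Ruby along these lines. This is only a rough sketch, not what Hpricot or the whitelist plugin does: `javascript_attribute?` and `strip_javascript_attributes` are made-up names, and the attributes are assumed to arrive as a name => value hash from whatever parser is in use. It does not decode HTML entities, and the ".js" rule is deliberately blunt (it would also catch "page.jsp").

```ruby
# Hypothetical helper: decides whether one attribute should be dropped
# under the four rules listed above. Not part of Hpricot or white_list.
def javascript_attribute?(name, value)
  # Rule 1: event handlers such as onclick, onload, onmouseover.
  return true if name.to_s.downcase.start_with?("on")

  # Rule 4: drop spaces and control characters (tab, newline, ...) first,
  # so that "java\nscript:" is treated the same as "javascript:".
  squashed = value.to_s.downcase.gsub(/[\x00-\x20]/, "")

  # Rules 2 and 3: a javascript: URL or a reference to a .js file.
  squashed.include?("javascript:") || squashed.include?(".js")
end

# Returns a copy of the attribute hash with the flagged entries removed.
def strip_javascript_attributes(attributes)
  attributes.reject { |name, value| javascript_attribute?(name, value) }
end
```

Calling `strip_javascript_attributes` on the attributes of each element would leave things like title and class alone while dropping onclick handlers and javascript: links.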

Cheers

Daniel

You can of course contribute back to the plugin. However, I believe
it'll do everything you listed short of removing any attribute with
*.js*. Not sure what the point of that is though.

Another option is to just yank the code and make it bend to your
specific whims. It's not a very large one.

I’ll certainly look at contributing if I can find a way to extend the functionality. Perhaps a strip_all_javascript method or something like that.

I think the point is to make it as difficult as possible to upload JavaScript. I want to accept arbitrary HTML source and display it on my page while minimising the risk of having my page hijacked. The list above covers the ways I have thought of to include JavaScript in a submission. I don’t think it’s possible to remove the risk completely, since a URL like http://example.com/stuff could set its headers to JavaScript and return whatever script it wants, but I want to minimise that risk.

I hope I’ve considered most of the ways that people could hijack my page. I want to keep as many tags intact as possible.

Cheers
Daniel

I dunno how secure you want this to be, but to be truly safe from XSS
you'll need to handle more cases than Rick's plugin does - here is one
stab at it:

http://golem.ph.utexas.edu/~distler/blog/archives/001181.html

If you want to get even more depressed about securing a web app today,
go here to get an idea of the insane amount of XSS vectors.

http://ha.ckers.org/xss.html

- Rob

http://robsanheim.com
http://seekingalpha.com

> Thanx for the pointer. But I think I need a bit more than that. I need to
> be able to leave tags alone for the most part, except <script> tags, but
> attributes need a little more control. what I've come up with so far:

Pipe your HTML thru tidy -asxhtml. Then use REXML and XPath to strip
out anything you don't need (such as the header block that -asxhtml
will install). And strip out the <script> tags, and anything that
looks like a <script> tag, such as the <object> tags.

The absolute safest, of course, is to strip anything not appearing on
a whitelist, such as <i>, <b>, <em>, etc.
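
As a rough illustration of the REXML and XPath step, something like the following might do. It assumes the input is already well-formed XHTML without a default namespace (so the header that tidy -asxhtml adds would need handling separately), and the tag list and the name `strip_script_like_tags` are just examples, not a complete or official set.

```ruby
require "rexml/document"

# Tag names to drop outright; the list is illustrative, not exhaustive.
SCRIPT_LIKE_TAGS = %w[script object embed applet iframe].freeze

# Hypothetical helper: expects well-formed XHTML and returns the markup
# with the script-like elements (and everything inside them) removed.
def strip_script_like_tags(xhtml)
  doc = REXML::Document.new(xhtml)
  SCRIPT_LIKE_TAGS.each do |tag|
    # match returns a plain array, so removing nodes while looping is safe.
    REXML::XPath.match(doc, "//#{tag}").each(&:remove)
  end
  doc.root.to_s
end
```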

> I dunno how secure you want this to be, but to be truly safe from XSS
> you'll need to handle more cases than Rick's plugin does - here is one
> stab at it:
>
> http://golem.ph.utexas.edu/~distler/blog/archives/001181.html

It uses most of the same tests I wrote, adds a lot more allowed
svg/mathml tags, and adds style attribute sanitizing. I just prefer to
leave style attributes out, but Textile uses them. Those tests were
written from that ha.ckers.org article. You could just port the style
stuff to white_list, and then you don't have to bother maintaining a plugin.

That is a bit depressing. It seems that no matter how hard I try
I won’t be able to completely remove the JS in submitted source.

Is the object tag really that bad? I mean I think I need to
support it since YouTube widgets and I guess others are based on
object tags and I need to support YouTube at least.

> Is the object tag really that bad? I mean I think I need to support it
> since YouTube widgets and I guess others are based on object tags and I
> need to support YouTube at least.

One idea is to allow a custom format. Perhaps just look for youtube
urls, and convert them to videos? Obviously this should be done after
sanitizing...

I’m not really sure what you mean by custom format. Does
that mean like dom selection in the whitelist plugin? eg. Allow
tag x if it’s a child of tag Y and has attribute z=‘value’ or
z!=‘javascript’

I really want to be as broad ranging as possible and include as many
tags as possible and also in their original form. It’s important
for this app that the tags, as much as possible be left as they’re
inputted, I just don’t want the result to hijack my page.

> I really want to be as broad ranging as possible and include as many tags
> as possible and also in their original form. It's important for this app
> that the tags, as much as possible be left as they're inputted, I just don't
> want the result to hijack my page.

Well, I originally meant something very custom like
<video:http://youtubeurl…>. Though since most normal folks can't
grok this, and web power users have enough formats to figure out,
perhaps you could just seek out youtube urls sitting on a single line
or something.

For instance, Tumblr lets me add the raw embed code or just a YouTube
video URL if I want to post a video.
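
A minimal sketch of the "URL on a single line" idea, to be run after sanitizing. The URL pattern, the generated embed markup, and the name `expand_video_lines` are all assumptions made for illustration; the point is only that the server builds the tag itself and nothing but the video ID comes from the user.

```ruby
# Assumed shape of a video page URL; adjust to whatever links you accept.
YOUTUBE_LINE = %r{\Ahttps?://(?:www\.)?youtube\.com/watch\?v=([A-Za-z0-9_-]+)\z}

# Hypothetical helper: lines that consist only of a video URL become markup
# built by the server; every other line passes through unchanged.
def expand_video_lines(text)
  text.each_line.map do |line|
    match = YOUTUBE_LINE.match(line.strip)
    next line unless match

    # Only the captured ID (letters, digits, "_" and "-") reaches the output.
    %(<embed src="http://www.youtube.com/v/#{match[1]}" type="application/x-shockwave-flash"></embed>\n)
  end.join
end
```

A URL that appears in the middle of a sentence, or one with anything extra tacked on, is left as plain text.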

I could not change the input to that level (<video:...>), but
I’ve had a look at the YouTube and Odeo widgets and they both boil
down to an embed tag with a type of Shockwave Flash.

Do you think it would be a bad idea to enable support for embed tags
with that type with src from youtube.com or odeo.com / (a list of
known) domains? If I did this I could remove the object tag
from around the embed tag and I don’t think it would have much of an
impact.
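
A rough sketch of that check, assuming the embed tag's attributes are available as a name => value hash. `trusted_embed?` is a made-up name and the host list only contains the two sites mentioned here. Whether allowing embeds at all is safe is a separate question; this only shows the host matching.

```ruby
require "uri"

# Hosts taken from the post above; extend the list as needed.
TRUSTED_EMBED_HOSTS = %w[youtube.com odeo.com].freeze
FLASH_TYPE = "application/x-shockwave-flash"

# Hypothetical check: true only for a Flash embed whose src is an http(s)
# URL on a trusted host or one of its subdomains.
def trusted_embed?(attributes)
  return false unless attributes["type"].to_s.downcase == FLASH_TYPE

  uri = URI.parse(attributes["src"].to_s.strip)
  return false unless uri.is_a?(URI::HTTP) && uri.host

  host = uri.host.downcase
  TRUSTED_EMBED_HOSTS.any? { |trusted| host == trusted || host.end_with?(".#{trusted}") }
rescue URI::InvalidURIError
  false
end
```

Comparing the parsed host, rather than searching the src string for "youtube.com", avoids being fooled by something like youtube.com.evil.example.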

> I could not change the input to that level. <video:...> but I've had a
> look at the youtube and also odeo widgets and they both boil down to an
> embed tag with a type of shockwave flash.

You're really not getting the point of what I'm trying to say. I'm
saying, strip all object tags, and use something custom that gets
replaced w/ an object tag that you generate afterwards. If you're
generating insecure JS, you have issues :)

> Do you think it would be a bad idea to enable support for embed tags with
> that type with src from youtube.com or odeo.com / (a list of known)
> domains? If I did this I could remove the object tag from around the embed
> tag and I don't think it would have much of an impact.

I don't really know, I haven't thought about this stuff much. I just
strip all object/embed tags by default. You may have to do some
digging for any attack vectors on object/embed tags. I don't think
it'd be that different from image tags though.

> You’re really not getting the point of what I’m trying to say. I’m
> saying, strip all object tags, and use something custom that gets
> replaced w/ an object tag that you generate afterwards. If you’re
> generating insecure JS, you have issues :)

Ok that makes more sense to me.

> > Do you think it would be a bad idea to enable support for embed tags with
> > that type with src from youtube.com or odeo.com / (a list of known)
> > domains? If I did this I could remove the object tag from around the embed
> > tag and I don’t think it would have much of an impact.
>
> I don’t really know, I haven’t thought about this stuff much. I just
> strip all object/embed tags by default. You may have to do some
> digging for any attack vectors on object/embed tags. I don’t think
> it’d be that different from image tags though.

K thanx for your help. Looks like I’ve got some digging to do :)

Cheers

Daniel

Daniel N wrote:

> Is the object tag really that bad? I mean I think I need to support it
> since YouTube widgets and I guess others are based on object tags and I
> need to support YouTube at least.

I didn't read the original post. If the question is "how do I do safe
markup and transclusions, in a public blog?", then naturally get
either a Wiki markup (or YAML), or permit a subset of HTML. To
transclude Object tags, invent a new tag called <video>. That way you
prevent shenanigans, right?

Daniel N wrote:

> That is a bit depressing. It seems that no matter how hard I try I won't
> be able to completely remove the JS in submitted source.

(Use the XPath system I suggested, then) remove all tags except those
on a short white-list, and then remove all their attributes.
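
For what it's worth, that could look roughly like the sketch below. It assumes the input is a well-formed XHTML fragment, drops non-whitelisted elements together with their contents (keeping their text instead would be another reasonable choice), and the tag list and the name `whitelist_markup` are only examples.

```ruby
require "rexml/document"

# Illustrative whitelist, in the spirit of <i>, <b>, <em>, etc.
ALLOWED_TAGS = %w[p br i b em strong ul ol li blockquote].freeze

# Hypothetical helper: removes every element that is not whitelisted and
# strips all attributes from the elements that remain.
def whitelist_markup(fragment)
  # Wrap the fragment so it has a single root element for the parser.
  doc = REXML::Document.new("<root>#{fragment}</root>")
  REXML::XPath.match(doc, "//*").each do |element|
    next if element.equal?(doc.root) # skip our own wrapper

    if ALLOWED_TAGS.include?(element.name)
      element.attributes.keys.each { |name| element.delete_attribute(name) }
    else
      element.remove
    end
  end
  doc.root.children.map(&:to_s).join
end
```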

We <http://www.jobscore.com> use SafeHtml <http://pixel-apes.com/safehtml/>.
It's really good about leaving the tags alone while removing
potentially dangerous XSS-type stuff.

It's PHP, but I wrapped it in a class that shells out to the php
interpreter.
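
The wrapper is along these general lines (a simplified sketch, not our actual code). The class name `ExternalSanitizer` is made up, and the command is whatever invokes your own PHP filter script, assumed here to read HTML on stdin and print the cleaned result on stdout.

```ruby
require "open3"

# Hypothetical wrapper: feeds HTML to an external filter on stdin and
# returns whatever the filter prints. Nothing here is part of SafeHTML.
class ExternalSanitizer
  def initialize(command)
    # e.g. ["php", "/path/to/your_filter.php"] (placeholder path)
    @command = command
  end

  def sanitize(html)
    # Passing the command as separate arguments avoids going through a shell.
    output, status = Open3.capture2(*@command, stdin_data: html)
    raise "sanitizer exited with status #{status.exitstatus}" unless status.success?

    output
  end
end
```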

Alex