Duplicating strange bot error

I’m getting a 500 error on my website that obviously comes from a bot. I’d like to duplicate that error so that I can try to suppress the email message that gets sent to me.

The error contains:

(ArgumentError) "invalid %-encoding

It’s in a “show” action, so it’s a GET request. I can see the URL, and it doesn’t contain any strange characters. When I put that URL in a browser, everything works.

I notice, in the error message I receive, there is a bunch of non-ascii text, and embedded in it is “Network Solutions Certificate Authority”.

There is no indication that I can see of how that info is being sent. Is it in a cookie? Is there any other mechanism by which a client can send info to the server?

NOTE: This is NOT an https site.

As a last resort, I could suppress all “invalid %-encoding” errors, but I would like to see that error if it really came from a real person.

I guess another approach would be to suppress all errors from non-humans, but I’m not sure how to do that.

And ultimately, I’m curious about exactly what is being sent to the server. I want to understand that.

I've been seeing a lot of these lately, all from this user-agent: Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html), from the following IP: 183.60.214.126 (China Telecom block)

The problem is that it's a GET request with a content body. The RFCs don't strictly prohibit that, but they don't define any meaning for a GET body either, so servers and frameworks aren't obliged to handle it sensibly.

If your exception notifier provides it, look at the value of 'rack.request.form_vars' where you'll see what appears to be a binary cert file's contents.
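To illustrate why binary data in the body would produce this particular error, here is a minimal sketch using only Ruby's standard library. It assumes the body gets decoded as URL-encoded form data (as far as I know, Rack's unescaping ends up in `URI.decode_www_form_component`, but treat that as an assumption). The helper name `decodable?` and the sample bytes are made up for the example; they are not what the bot actually sent.

```ruby
require "uri"

# Hypothetical helper: returns true if the string can be decoded as a
# URL-encoded form component, false if decoding raises ArgumentError.
def decodable?(raw)
  URI.decode_www_form_component(raw)
  true
rescue ArgumentError => e
  # For a stray '%' the message starts with "invalid %-encoding".
  warn "decode failed: #{e.message}"
  false
end

# A normal form value: '%3D' is a valid escape for '='.
decodable?("name%3Dvalue")

# Made-up binary-looking bytes: a '%' that is NOT followed by two hex digits,
# which is easy to hit by chance inside a binary file such as a certificate.
decodable?("%\xFF\x00Network Solutions".b)
```

So any binary blob that happens to contain a `%` followed by something other than two hex digits would be enough to trigger the exception once the body is parsed as form data.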

Regardless, it seems like this spider is either seriously broken or actively hostile. I'm thinking about a Rack filter to drop any GET request with a content-length header or a non-empty body, but the quickest fix is to use iptables to block this thing altogether :)
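A rough sketch of what such a Rack filter might look like. The class name `RejectGetWithBody` is hypothetical, and choosing to answer with a 400 (rather than, say, silently dropping the connection) is my assumption. It relies only on the standard Rack `env` keys `REQUEST_METHOD`, `CONTENT_LENGTH` and `HTTP_TRANSFER_ENCODING`, so it does not need to read the body at all.

```ruby
# Hypothetical middleware: rejects GET/HEAD requests that announce a body,
# before the framework gets a chance to parse it.
class RejectGetWithBody
  BODYLESS_METHODS = %w[GET HEAD].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    if BODYLESS_METHODS.include?(env["REQUEST_METHOD"]) && announces_body?(env)
      # Short-circuit with a plain 400; the app (and the exception
      # notifier) never sees the request.
      return [400, { "content-type" => "text/plain" }, ["Bad Request\n"]]
    end

    @app.call(env)
  end

  private

  # A body is announced either by a positive Content-Length or by a
  # Transfer-Encoding header (e.g. chunked).
  def announces_body?(env)
    env["CONTENT_LENGTH"].to_i > 0 || !env["HTTP_TRANSFER_ENCODING"].to_s.empty?
  end
end
```

In a Rails app this could presumably be registered with something like `config.middleware.insert_before 0, RejectGetWithBody` so that it runs ahead of parameter parsing; I haven't verified the exact position it needs in the stack.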

HTH,

Thanks. I see that the sender’s IP always starts with 183.60.x.x with the third number between 213 and 216.

I could just block those addresses and kick the can down the road.

If I could duplicate what the bot is sending then I could take a stab at the rack filter. It seems like I should be able to do that with curl. I’ll post if my experiments look useful, but if anyone has already figured it out, please post.
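In case it helps anyone experimenting, here is a small Ruby sketch for sending a GET with a body over a raw socket, as an alternative to curl. Everything specific in it is an assumption: I don't know which headers the bot really sends, so the `Content-Type` of `application/x-www-form-urlencoded` is a guess (my understanding is that Rack only parses a GET body as form data when the content type says so). The function names, the host, the path and the payload are all placeholders.

```ruby
require "socket"

# Hypothetical helper: builds the raw bytes of a GET request that carries a body.
# The default content type is a guess at what makes Rack parse the body.
def build_get_with_body(host, path, body,
                        content_type: "application/x-www-form-urlencoded")
  body = body.b # treat the payload as raw bytes
  head = [
    "GET #{path} HTTP/1.1",
    "Host: #{host}",
    "User-Agent: repro-script",
    "Content-Type: #{content_type}",
    "Content-Length: #{body.bytesize}",
    "Connection: close",
    "", "" # blank line terminating the header section
  ].join("\r\n")
  head.b + body
end

# Hypothetical helper: writes the raw request to the server and returns
# whatever comes back until the server closes the connection.
def send_raw(host, port, request)
  TCPSocket.open(host, port) do |sock|
    sock.write(request)
    sock.read
  end
end

# Example (placeholders; point it at a local development server):
#   req = build_get_with_body("localhost", "/items/1", "%\xFF\x00binary junk")
#   puts send_raw("localhost", 3000, req)
```

If the server responds with a 500 and the same "invalid %-encoding" error, that would at least confirm the body is the culprit; if not, the bot is probably sending different headers than I guessed.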