ed costello: articles and essays: Blocking Referer Spam

Blocking Referer Spam

May 03, 2004

Read Blocking Referer Spam (shorter version) if you want just the quick & dirty take on referrer spam and possible solutions.

Note

This article is getting a lot of traffic (well, for me at least) and while the advice is still valid, it turns out that what I was seeing was not necessarily referer spam, but something a bit weirder which I need to write up in detail when I have time. Basically, my .htaccess information below is correct for what I was seeing, but may not work at all to solve strict referer spam problems. The problem I was seeing (and still do, though much less so) was that somehow my site ended up listed as an open proxy on some idiot's list. It isn't one now, nor has it ever been (I suspect the listing was due to an early PHP mistake on my part). So, the traffic I was calling "modified referer spam" was actually someone's attempt to fake traffic through affiliate sites by routing through my "open" proxy (and I imagine many others).

Anyway, read the rest for entertainment, perhaps enlightenment, but it's not necessarily correct. Check out Proposal for a solution to referrer spam: Using MT-Blacklist and other blacklists to filter spamming URLs for a better example than I provide here.

On a monthly, sometimes bi-weekly basis, I scan through my traffic logs and reports looking for things. Nothing in particular, oddities, things that stand out. I used to do this on a much larger scale at ibm.com and it helped me in tuning the site as well as debugging problems within IBM's web space.

Late last year I noticed an increase in odd referrers. A referrer is (in theory) the URI of the page which contains a link to a given site. The user agent (e.g., Microsoft Internet Explorer, Mozilla Firefox) passes it in the block of information sent to the target site when a resource is requested from that site.

Basically, that means that if you’re viewing http://example.com/ and click on the link ed costello: articles and essays, a request like the following gets sent to my server:

GET /articles/ HTTP/1.1
Host: epcostello.net
Referer: http://example.com/
...other stuff irrelevant to this post...

So, that is what a referrer is (referer is a tragic misspelling which occurred some time in the early days of the web and we're stuck with today in the CGI spec and other places).

Referrer spam is a recent innovation by the lower life forms which populate the web. I'd seen it in the ibm.com logs, but infrequently and not on any regular scale. Now there are tools and web sites you can use to try to drive traffic to your site by spamming the referrer field on the target sites.

Referrer spam has become popular due to the "am I cool or what" need of various bloggers to show who's linking to their blogs. The easy way to do this has been to scrape the referrer field from the access logs or to capture referrers in real time using PHP or some other server-side scripting language. Whatever way they're captured, they're then reposted to the site, sometimes ranked, usually linked.

The spammers rely on the fact that people will click on anything on a web site, even something that clearly says in bright letters DON'T CLICK HERE. Referrer spamming may also help increase a site's PageRank, though I doubt it is very effective.

Whatever the cause, I'm now getting referrer spam. Of course, this is silly since I don't post referrers anywhere on any of my sites. Nowhere. In a country where you can get arrested, tried, and convicted simply for linking to content which someone has deemed illegal, reposting referrers just seemed like an easy invite for trouble.

§

Silly me, I thought that if I don't post referrers, I wouldn't get referrer spam.

Not only am I getting referrer spam, I'm getting what I now call modified referer spam: this consists of malformed proxy requests like the following:

GET http://example.com/images/searchbox_2_1.gif HTTP/1.0
Referer: http://www.hanyhost.com
—and—
GET http://example.com/s.php?uid=20694&keywords=monitor6&submit=Go%21 HTTP/1.0
Referer: http://www.gofortraffic.com

Now, I don't run my site as an open proxy either, so this is just stupid, irritating, and a complete waste of my resources.

This referrer spam traffic provides no value to me at all, and if it grows could negatively impact whatever real traffic I do want to accept and respond to.

So, I'm fighting back.

My site uses the Apache web server; the following code bits are relevant only to Apache.

My first step was to block the IP addresses of the systems running whatever client application is available to generate referrer spam thusly:

deny from 218.11.90.83
deny from 218.11.93.1
deny from 218.11.93.101
deny from 218.11.93.148
deny from 218.11.94.112
deny from 218.11.94.168
deny from 218.11.94.215
deny from 218.11.94.251

The problem here is that this gets to be a pain to maintain: eventually the spammer gets a new IP address, or gets smart and uses AOL or some other large ISP for a run. Who’s going to block an entire ISP?
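For context, here is a minimal sketch of how deny lines like those above sit in an .htaccess file under the Apache 1.3/2.x access-control syntax. The partial-address and CIDR forms shorten the list at the cost of blocking the neighbours too. The ranges chosen here are illustrative assumptions, not ranges any spammer is known to own.

```apache
# Evaluate Allow rules first, then Deny rules; a matching Deny wins.
Order Allow,Deny
Allow from all

# A single address, as in the list above
Deny from 218.11.90.83
# Partial address: every host in 218.11.94.x (assumed range)
Deny from 218.11.94
# CIDR form: 218.11.93.0 through 218.11.93.255 (assumed range)
Deny from 218.11.93.0/24
```

Apache 2.4 replaced these directives with Require; the older forms shown here need mod_access_compat on that version.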

I realized that it would be easier to block by the patterns that the spammers use, as well as by the referrers being spammed. Since one pattern is to request a resource which isn’t on my server at all, I check to see if the hostname in the request matches my hostname. If the hostname doesn’t match, then I bounce the request. I’ve tried a couple different methods of bouncing requests: you can fail them entirely, serve up a nasty comment or two, or redirect the request.
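Blocking by the referrers being spammed could look something like the following sketch. The two domains come from the log samples earlier in this post; answering 403 rather than redirecting is an assumption made only to keep the example small.

```apache
RewriteEngine On
# Refuse any request whose Referer points at a listed spam domain.
# hanyhost.com and gofortraffic.com are taken from the log samples above.
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?(hanyhost|gofortraffic)\.com [NC]
# "-" leaves the URL untouched; F answers 403 Forbidden.
RewriteRule ^ - [F]
```

A list like this has the same maintenance problem as the IP list: it only catches domains already seen in the logs.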

A neat thing I discovered while I was running www.ibm.com was this: when you redirect a request, the Referer does not get updated to reflect your site as the redirecting site. I found this was true for every web browser in popular use in the 1996-1998 timeframe, and I believe it to be true today.

Case in point: in the 1996-1997 timeframe someone wrote a really stupid web crawler whose sole purpose for existence was to scrape email addresses from web pages. One night I watched the site monitors for www.ibm.com and realized we were being attacked: something was driving a high volume of traffic to the site, and worse was causing a high volume of errors.

Doing some digging and tailing some of the logs, I realized that it was this stupid crawler. It had become trapped in the site, not handling a URL correctly and just generating ever more erroneous requests to the site. I did the only logical thing I could think of: since I wanted to get rid of the traffic (and the crawler was not stopping in response to 403, 404 or 500 errors), I added the URI in error to our redirect file and targeted the redirect at the web site of the crawler in question.

The traffic immediately disappeared.

Taking this a step further, since we had all sorts of code patched into www.ibm.com (the homepage itself was a CGI for a long time, probably far too long), I redirected all requests from the crawler (which happily supplied a user agent identifying itself and the company responsible for developing it) to the developer’s web site.
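The redirect file and patched-in code at www.ibm.com are not reproduced here, so the following is only a rough Apache equivalent of the two steps, not what actually ran on that site. The path, the user-agent string and the target host are all hypothetical placeholders.

```apache
# Step 1 (mod_alias): send one broken URI elsewhere.
# /broken/path and crawler-vendor.example are hypothetical placeholders.
Redirect permanent /broken/path http://crawler-vendor.example/

# Step 2 (mod_rewrite): send everything from one user agent elsewhere.
# "ExampleCrawler" is a hypothetical user-agent substring.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ExampleCrawler [NC]
RewriteRule ^ http://crawler-vendor.example/ [R=301,L]
```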

Anyway, based on that bit of history, that’s how I’ve responded on my own sites: redirect the traffic back to the spammers in question. I don’t want the traffic, I derive no financial benefit from receiving the traffic, I have no contractual obligation to accept the traffic. And I am breaking no laws that I know of in redirecting the traffic back to the originators.

So, without further ado, here are the .htaccess directives to do so (note that I’ve changed references to my site to example.com):

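RewriteEngine On
# Apply only when the Host header is something other than example.com
RewriteCond %{HTTP_HOST} !^example\.com$ [NC]
# Capture the whole Referer as %1
RewriteCond %{HTTP_REFERER} ^(.*)$ [NC]
# Permanently redirect the request to that Referer
RewriteRule ^(.*)$ %1 [R=301,L]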

There are multiple variations on this, of course. You could do:

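RewriteEngine On
# Apply only when the Host header is something other than example.com
RewriteCond %{HTTP_HOST} !^example\.com$ [NC]
# Capture the client's IP address as %1
RewriteCond %{REMOTE_ADDR} ^(.*)$ [NC]
# Permanently redirect the request to http://<client address>
RewriteRule ^(.*)$ http://%1 [R=301,L]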

Which tells the client to redirect to itself or at least the IP address it is spoofing.
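A further variation, sketched here under assumptions rather than tested against this traffic, keys on the request line itself instead of the Host header. Proxy-style requests carry a full URL after the method, which the THE_REQUEST variable exposes. The host other-site.example in the comment is a hypothetical placeholder, and example.com again stands in for my own site.

```apache
RewriteEngine On
# THE_REQUEST holds the raw request line, for example
# "GET http://other-site.example/s.php HTTP/1.0".
# Match requests whose target is an absolute http:// or https:// URL...
RewriteCond %{THE_REQUEST} ^[A-Z]+\ https?:// [NC]
# ...unless that URL names this site.
RewriteCond %{THE_REQUEST} !^[A-Z]+\ https?://example\.com/ [NC]
# "-" leaves the URL untouched; F answers 403 Forbidden.
RewriteRule ^ - [F]
```

The second condition is there because a well-behaved HTTP/1.1 client is allowed to send an absolute URL for the site it is actually talking to.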

I could just fail the request, using RewriteRule ^(.*)$ $1 [F,L], but that seems self-defeating: if my server has to put up with the crap traffic to begin with, I want someone else, preferably the bozo initiating it or paying for it, to feel some pain as well.

I strongly believe that the primary reason email spam and referer spam are so successful is that they’re so easy to do and carry so few penalties. If more sites reacted with strong defensive measures instead of just sucking up the additional traffic, there would be less value to the spammers in doing this sort of thing.

Note: This article was modified on 7 August 2004 to edit the URLs in the modified referer spam example to 'example.com'.

2004-12-09T20:03Z: I've turned comments back on...see if the spam bots attack again.

Comments

From: Me (https://epcostello.net/)
Date: Saturday, May 8, 2004 05:32 PM -05:00
So far, 8 people have felt a need to click a link entitled "Don't Click Here".

From: Shaun
Date: Tuesday, May 11, 2004 10:16 PM -05:00
honestly, i clicked on the link because i figured that the response would be: "see? i knew you'd click on it."

From: me (https://epcostello.net/ego/)
Date: Sunday, May 16, 2004 05:36 PM -05:00
A footnote to this post: the same bozo at 218.11.90.174 tried massively spamming the site yesterday (8000 requests), but the .htaccess rules appeared to work, the catch being that the client in use by this bozo apparently doesn't react to 301s or 404s.

From: David Singer (http://dss.editthispage.com)
Date: Monday, May 17, 2004 11:55 AM -05:00
I hit it because I was sure you'd have something amusing behind the link. And because I trust you.

From: Ed Costello (https://epcostello.net/ego/)
Date: Tuesday, June 8, 2004 11:44 PM -05:00
Up to 23 people now. Interestingly, most spiders manage not to hit the link, those that do do not honor the Refresh: header sent with the HTTP headers.

From: Tom
Date: Thursday, May 19, 2005 12:24 AM -05:00
I clicked the do not click here button, because I wanted to see what was on the other side.

From: Ed Yang (http://www.thewritingpot.com/)
Date: Friday, May 27, 2005 08:05 PM -05:00
I clicked here because I wanted to see exactly what they were going to do to me. Were they going to: 1. Reprimand me on a relatively normal page 2. Give me a goody 3. Use some evil Javascript trick that would have caused IE to bounce across the page (and firefox be unaffected) 4. Use some sort of evil hack that doesn't work on Firefox 5. Give me a 404 6. Launch a DDOS attack on my ISP 7. ... 8. Profit!

Posted in Best Practices, Referer Spam, Webmastery on Monday 3-May-2004 at 07:52 PM
