Checking through my web stats I noticed a large number of requests for the feed for this site.
Checking around, I discovered two IPs ([24.16.107.1xx] and [129.33.1.3x]) which were slamming the feeds, requesting over and over again, and not using modified-GET requests.
What? You don't know what a modified-GET request is?
Ok, briefly: a modified-GET request (more formally, a conditional GET) occurs when a web user agent (basically: a browser) has already requested some resource (a page, a graphic, any resource) and has cached it locally (on disk, in memory, doesn't matter).
The user agent sends a request for the resource (i.e., GET /resource) but adds a bit of information, the If-Modified-Since header, that says "but only send me this if it's changed since this timestamp."
If nothing has changed, the server answers with a short 304 Not Modified response instead of re-sending the whole resource.
The timestamp is derived from the Last-Modified header sent by the server the first time the resource was requested.
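For anyone writing a feed reader, here is a minimal sketch of the client side of a modified-GET in Python, using only the standard library. The function names (`conditional_headers`, `fetch_feed`) are made up for illustration, and a real aggregator would also need to store the timestamp between runs.

```python
import urllib.error
import urllib.request


def conditional_headers(last_modified=None):
    """Build the extra request headers for a modified-GET."""
    if last_modified:
        # Echo back the Last-Modified value the server sent last time.
        return {"If-Modified-Since": last_modified}
    return {}


def fetch_feed(url, last_modified=None):
    """Return (body, last_modified); body is None if the cached copy is still current."""
    request = urllib.request.Request(url, headers=conditional_headers(last_modified))
    try:
        with urllib.request.urlopen(request) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # 304 Not Modified: nothing new, keep using the cached feed.
            return None, last_modified
        raise
```

On the first call you pass no timestamp and keep whatever Last-Modified value comes back; on every later call you hand that value back in, and a `None` body means the cached feed is still good.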
This is something that was invented as recently as 1993.
Maybe 1994, though possibly as early as 1990 when TBL invented HTTP and the web.
It's not new, and anyone writing a browser or other agent for the web should really, really, really implement it.
Especially if you're writing something which has the potential to request a resource repeatedly, like a feed reader or news aggregator requesting an RSS feed.
In the ancient, pre-broadband days, the value was that if you already had an image cached locally, why wait for the entire image to be re-sent from the server?
Actually, there's still value: it reduces bandwidth utilization on the server side, and helps make sites appear to respond faster.
It's really critical for files which could be requested by automated agents.
So, in digging through, and checking out those IPs (the 24.x.y.z is a Comcast subscriber in either Washington or California, it literally reverse resolves to two host names; the 129.x.y.z is one of the IBM gateways in Southbury, CT), I discovered the guilty user-agents are Magpie RSS, a PHP-based feed slurper, and, surprisingly, Google Desktop.
I'm not going to do anything, for now, since it's clearly only two people (and, if you're reading this feed, you'll know who you are ;-), but it's something I'll have to keep a watch for on the other sites I work on which genuinely might have more than a dozen people reading various feeds.
My standard solution is pretty drastic: block the user-agent from requesting the feed.
A sub-concern: Google Desktop doesn't accept compressed feeds.
See, I have a hack which compresses the RSS and Atom feeds using gzip, and then uses content negotiation to serve the feed.
My stats show that about 75% of the feed slurpers use compression, which helps minimize bandwidth utilization.
Magpie RSS uses compression.
But Google Desktop?
Not only does it not use conditional, modified-GET requests, it doesn't (apparently) support compression.
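To make the compression side concrete, here is a rough Python sketch of the kind of content negotiation described above. It is not the actual hack running on this site; the function names are invented for illustration, and the Accept-Encoding parsing is deliberately simplified (it ignores q-values).

```python
import gzip


def accepts_gzip(accept_encoding):
    """True if the Accept-Encoding header lists gzip (q-values are ignored here)."""
    codings = [part.split(";")[0].strip().lower() for part in accept_encoding.split(",")]
    return "gzip" in codings


def negotiate_feed(feed_bytes, accept_encoding=""):
    """Return (headers, payload) for one feed response."""
    # Vary tells caches the response differs by Accept-Encoding.
    headers = {"Vary": "Accept-Encoding"}
    if accepts_gzip(accept_encoding):
        headers["Content-Encoding"] = "gzip"
        return headers, gzip.compress(feed_bytes)
    # Clients that did not ask for gzip get the feed uncompressed.
    return headers, feed_bytes
```

A client that sends "Accept-Encoding: gzip" gets the small version; one that sends nothing gets the full-size feed every time.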
Why is this important?
Extrapolating a bit from the data on my site, about 5000 users of Google Desktop (actually a bit fewer than that) would put me at the bandwidth limit on my site and degrade performance for anyone who actually wants to read it.
For Magpie RSS (which on further investigation appears to be installed on someone's site and requests my RSS feed on demand? If you're reading this and are using Magpie RSS in this way you might want to see if it has a caching option for the feeds it requests), the numbers are better since it uses compression, so it would take almost three times as many Magpie RSS "users" to push my bandwidth utilization over the limit.
Neither situation is likely with this site, but a commercial site like any of the Weblogs, Inc. blogs, the *ist network, DailyKos, etc., could easily hit these subscriber numbers.
And while it'd be nice to have 5000 people reading the feed, the bandwidth used by Google Desktop ends up being wasted, since it's unlikely that people drop everything and read the feed twice an hour, the entire day.
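The arithmetic behind that is simple enough to sketch in Python. The 5000 readers and twice-an-hour polling come from the discussion above; the 40,000-byte feed size is purely hypothetical (it is not a measurement from this site), and the one-third compression ratio is only an assumption based on the "almost three times" figure.

```python
def daily_feed_bytes(subscribers, polls_per_hour, feed_bytes):
    """Bytes served per day when every poll downloads the whole feed."""
    return subscribers * polls_per_hour * 24 * feed_bytes


FEED_BYTES = 40_000  # hypothetical feed size, not a figure from this site

uncompressed = daily_feed_bytes(5000, 2, FEED_BYTES)
# Assume gzip shrinks the feed to about a third of its size.
compressed = daily_feed_bytes(5000, 2, FEED_BYTES // 3)

print(f"uncompressed: {uncompressed / 1e9:.1f} GB/day")
print(f"compressed:   {compressed / 1e9:.1f} GB/day")
```

Under those assumptions that is about 9.6 GB a day uncompressed versus roughly 3.2 GB compressed, and with modified-GET requests most of those polls would shrink to a bodiless 304 response.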
So, if you use Google Desktop, drop a note to Google asking that they implement compression and modified-GET requests.
If you use Magpie RSS, it needs to cache feeds, and use modified-GET requests.
e.p.c. posted this at 10:09 GMT on 20-Sep-2005.