Trying to spread some of my PageRank Google juice here:
Search Analytics for your Site
is a book by Lou Rosenfeld and Richard Wiggins to be published in January 2007
by Rosenfeld Media.
It is not being published by O'Reilly and Associates.
Due to the leak of some bad data from the O'Reilly publishing database,
many web sites continue to list the book as being published by ORA or, worse, as being cancelled entirely.
In July 2006, Wiggins and Rosenfeld discovered that the bad data had leaked like a meme, worm, or virus out from the O'Reilly publishing system into the wider web, as various companies picked up that O'Reilly was publishing the book, but did not pick up the cancellation bit. Wiggins and Rosenfeld have been playing whack-a-mole since, trying to get the bad ISBN removed from various online book selling systems.
§
We faced this problem on a small scale at IBM, mostly with press releases.
Various groups within IBM wrote their own hacks to draw press releases out of the IBMLink system, to republish them on their own product or division web sites.
IBMLink was this horrid mainframe-based system that was actually easy to use once you got the hang of it (I write that as a guy who eventually got the hang of TSO, so there). But when the organization behind IBMLink created a web version, the web version repeated many of the same horrid user interface mistakes, plus it was run on an under-powered MITS Altair (I jest, I'm sure it was on the much faster IBM PC XT).
So, it was natural for groups to want to publish their press releases on their own web sites.
Typically this meant sucking in the content from IBMLink, transmogrifying it into HTML and then slapping it into a standard template.
But what this also meant is that each copy of the press release was different, both because the templates differed and because of minor changes in the body copy that would creep in, since everyone had their own special toolkit to perform the transmogrification.
The first time I was called at midnight to pull a press release from the web I found it rather easy: there was a copy on www.ibm.com which I was responsible for, a copy at IBMLink (web) which (in theory) would auto-magically be removed once the item was purged from IBMLink (mainframe), and a copy on the related division's web site.
As the years wore on though this became a much more difficult task.
I couldn't rely on things like the IBM.com search engine, because it sucked.
I can write that, I was responsible for running it most of the time (though not at all responsible for its suckage most of the time, well, perhaps a little).
I couldn't rely on using hashes, since each templated document ended up with a different MD5 hash (and also because I was excoriated internally for even thinking of relying on MD5 hashes since I could not prove sufficiently that there would never, ever, ever, be any collisions).
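To make the hash problem concrete, here is a minimal sketch in Python (not anything we had at the time) of why hashing the raw pages fails and what hashing only the visible text would look like. The `body_fingerprint` helper and the two sample pages are made up for illustration, and it swaps in SHA-256 for MD5. It also assumes a template contributes only markup; real templates add their own navigation and footer text, and the minor body-copy changes described above would still defeat it.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def body_fingerprint(html: str) -> str:
    """Hash only the visible text: tags dropped, whitespace collapsed, lowercased."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = re.sub(r"\s+", " ", " ".join(parser.chunks)).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two hypothetical copies of one release, wrapped in different templates.
division_copy = "<html><body><h1>IBM Announces Widget</h1><p>Available  today.</p></body></html>"
corporate_copy = "<div class='pr'><b>IBM Announces Widget</b>\n<p>Available today.</p></div>"

# Hashing the raw pages gives two unrelated digests.
raw_match = (hashlib.sha256(division_copy.encode("utf-8")).hexdigest()
             == hashlib.sha256(corporate_copy.encode("utf-8")).hexdigest())

# Hashing the normalized text gives the same digest for both copies.
text_match = body_fingerprint(division_copy) == body_fingerprint(corporate_copy)

print(raw_match, text_match)
```

Even under those generous assumptions, this only tells you that two pages you already have in hand are the same release; it does nothing to find the copies you don't know about.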
In the end we relied on simple brute force: search and destroy, using the IBM search engine, Altavista, Yahoo! (what is this Google you speak of?
This was circa 1995-1999.); we would divert our very small staff from the minor task of keeping www.ibm.com online into tracking down where the hell the copy had been copied to; we'd spam the internal webmaster list with a "The world will end if you do not remove this document about IBM Mouse Balls" email.
But I found no way to truly eliminate the problem.
Once the data was online, it was free.
Most of the time, the "information wants to be free" mantra is laudable, but every so often there's truly information that we want to nail down and destroy.
We don't need to save every last bit of information, our minds can't handle it, and it's difficult to organize and manage.
PROFS had a feature called GETBACK
(ok, this may not have been PROFS, but it was prevalent on the IBM internal
systems and I've known non-IBMers who relied on it externally).
GETBACK would let you retrieve emails you had sent. I don't recall if it worked across
systems or if it only worked on your own VM system.
But everyone knew about it and relied on it.
You could write a nasty email, get the satisfaction of hitting SEND, and then, on immediately regretting it, type GETBACK to attempt to retrieve it.
Sometimes it worked, sometimes it didn't.
I don't know that I ever used it, I was sort of in weird email land within IBM because I avoided using PROFS as much as possible, using IPERNOTE, then LaMail then eventually Lotus Notes (after we figured out how to hack a forwarding gateway to the internet thanks to my sendmail fu).
GETBACK did not work with alternate email systems.
It did not work with Internet email, much to several employees' distress.
It did not work with Lotus Notes.
I found that Notes had a feature, sort of, which helped temper regrettable email outbursts.
You could set up your email to only send every N minutes, or with every replication, or immediately.
I believe the default was immediate send, based on some exchanges I was on the receiving end of.
I changed my setting to build in a delay; I could always force the email to be sent if it really needed to go out.
This had the pleasant side effect of preventing email receipts or acknowledgements from being sent immediately.
I wrote a bit of LotusScript to purge these traces of my attention from the outbound mail queue before they could be transmitted.
Lotus Notes also has (possibly had, I haven't used Notes since leaving IBM in 2001) a feature where you could lock the text of a document, in theory preventing it from being printed or forwarded.
I found that in the last couple of years I was IBM's alleged Corporate Webmaster, fellow employees would take to writing me veiled and not-so-veiled threats in email, and would tick off the checkbox to prevent the note from being forwarded or printed.
They would be quite upset to discover that I could print and forward the email (typically forwarding it to their next immediate manager with a simple FYI, I found it pointless to engage in any sort of discussion around such emails).
Even encrypting email, which is trivial to do with Notes (trivial as in: my grandmother could use it, not as in once you download this package, or import an x509 certificate, etc.), even encrypting email does not guarantee control over what happens to your words after you hit send.
When you forward an encrypted email, it tends to lose its signature (it's theoretically possible to forward it as a signed MIME message but this hasn't been my experience).
You can copy, cut, and paste encrypted emails (made a little difficult if someone has set the do-not-forward flag, but not impossible).
Once the bits are out there, they're out there.
Neither hacks, DRM, encryption, nor character encoding will keep these bits of data from becoming public.
It's the old literary theory problem of authorial intent writ digital: once these words, these bytes and bits leave the author's system of control, whether via email, Word document, web page, Adobe Acrobat PDF, once they leave
our control, we lose control over how they are read, interpreted, distributed, consumed.
Back to Lou Rosenfeld's problem: I have no idea how to solve the propagation of bad data.
With the way search engines do rankings, the more you call attention to bad data, the more likely (perversely) it will rank higher in search results.
Inevitably, some conspiracy theorists will copy and spread the bad data around in an effort to prevent anyone
from purging it off the web.
I suppose in some future you could write a countervirus or counterworm, set off a counter-meme to try to eradicate the bad data, but then you have the problem of the counter-meme being in the wild, how to kill that
off once it's done its duty.
I know, we'll set a bit in the counter-meme to self-destruct once it's removed all evidence of the bad data, of its anti-meme.
Yeah, that will work.
e.p.c. posted this at 19:18 GMT on 30-Oct-2006 from Brooklyn, NY.