Spammers have discovered blogs. Several of the most popular blogging tools have come under attack in the last few weeks; spammers are taking advantage of interactive features to cause their ads to be displayed in the blog's HTML.
Is the Semantic Web doomed? Links and discussion inside . . .
First, observational data from the field. Wired News has reported that spammers have started leaving ad URLs as the "referring page" in their HTTP headers. By reloading a given blog a few hundred times, they can force their chosen URL to appear to be the most common link to that blog.
The twist is that the spammer's web page doesn't link to the blog at all. The idea is that bloggers, being egotistical, read their logs to see who links to them. When whoever reads the referrer logs goes over and checks out the web page, that's a page view and a couple pennies in the bank. As a second-order trick, some blogging tools have HTTP-log analysis tools built in; next to each post they show the pages that link to that post. These tools can't tell spam from legitimate hits; they throw the link to the ad up on the page itself.
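To make the "doesn't link to the blog at all" point concrete, here is a minimal sketch of the check a log-analysis tool could run before displaying a referrer: does the referring page actually contain a link back? The function name links_back and its parameters are hypothetical, not taken from any real blogging tool, and fetching the referring page is left to the caller. A spammer who bothers to add a real link would pass this check, so treat it as an illustration, not a fix.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_back(referrer_url, referrer_html, blog_host):
    """Return True if the referring page contains a link to blog_host.

    referrer_url  -- URL taken from the Referer header (hypothetical input)
    referrer_html -- HTML of that page, already fetched by the caller
    blog_host     -- hostname of the blog, e.g. "blog.example"
    """
    collector = _LinkCollector()
    collector.feed(referrer_html)
    for href in collector.hrefs:
        # Resolve relative links against the referring page's own URL.
        absolute = urljoin(referrer_url, href)
        if urlparse(absolute).hostname == blog_host:
            return True
    return False
```

A tool that only displayed referrers for which links_back returns True would drop the pure header-forging case described above, at the cost of one extra page fetch per new referrer.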
But this isn't the only blog-spam technique out there. A few bloggers are now reporting seeing comments spam. That's right, comments spam. Someone clicks on a "comment on this post" link and leaves their thoughts, but the putative "comment" is just another Make Money Fast. The blog spambots out there right now seem pretty primitive. They don't parse a blog's HTML or do anything sophisticated; they just go through a site guessing URLs for comments forms (sequentially-numbered in Movable Type) and moving on. Nor have there been any large-scale attacks hitting many blogs at roughly the same time. But it seems reasonable to expect that the pros will move in soon.
And, to a first approximation, when the pros arrive, comment forms will be useless. The current default is typically completely unsecured; anyone can leave any text they want without any authentication, moderation, or throttling. If comments on blogs (including many sites, such as LawMeme, in which the comments are meant to be at least as important as the blog itself) are going to survive, something is going to have to be done.
There's an obvious first place to look, both for solutions and for further problems: email spam. There's been extensive discussion of the email analogy. Email spam-busters have been using blacklists, whitelists, client-side heuristics, and other automated filters. In addition, various bloggers have suggested Yahoo-style "type in this word" anti-bot checks, hidden fields or other HTML modifications, throttling, human review of all comments, registration (whether automatic or human-reviewed isn't certain), and a few other purely technical solutions.
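Two of the suggestions in that list, throttling and hidden fields, are simple enough to sketch together. The following is an illustration under stated assumptions, not any blogging tool's actual code: the class name CommentGate, the hidden field name website_confirm, and the default limits are all made up for the example. The hidden field is one a human never sees (it would be hidden by the page's styling), so anything that fills it in is presumed to be a bot; the throttle caps how many comments one address can post in a time window.

```python
import time
from collections import defaultdict, deque


class CommentGate:
    """Rejects comments that trip a hidden-field trap or a per-IP throttle."""

    def __init__(self, max_per_window=3, window_seconds=60.0, clock=time.monotonic):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the throttle can be tested
        self._recent = defaultdict(deque)  # ip -> timestamps of accepted comments

    def allow(self, ip, form):
        """Return True if the comment submitted from ip should be accepted."""
        # Hidden field (hypothetical name): humans leave it empty,
        # naive bots fill in every field they find.
        if form.get("website_confirm", ""):
            return False

        now = self.clock()
        stamps = self._recent[ip]
        # Forget accepted comments that have aged out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_per_window:
            return False

        stamps.append(now)
        return True
```

Both checks stop only the primitive bots described earlier. A spammer who reads the form's HTML or rotates addresses walks straight past them.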
It has also been observed, in these discussions, that none of these technical solutions will work. That is, none of them are going to be any more effective, in the medium term, than the technical responses to email spam -- and even if that war hasn't been lost, it hasn't been won either, nor does victory over the spammers seem imminent. And, from a technical point of view, these problems are very close to equivalent. A message comes in from somewhere, carrying almost no authentication information, and you have to decide whether it's legitimate or spam. Quite often, you've never seen a message from this source before; good spammers send one email to a million people, not a million emails to one person.
(To be fair, there is one critical difference. The blogiverse is, to a striking degree, less constrained by history than the world of email. HTTP is end-to-end in a way that sendmail is not, blogs interact with users more than they interact with each other, and blog tools are on a much faster upgrade cycle than email tools are. That is to say, if Blogger added a global authentication system tomorrow, Blogger users (and users of other tools that decided to interop with Blogger) would be able to start posting authenticated comments the same day. Try doing that with Look Out!, mutt, and your local ISP's tweaked version of IMP.)
People also seem to be dimly aware that this problem extends beyond comments. Movable Type's Trackback -- the very feature which is making this discussion possible to follow across dozens of weblogs -- is also based on an unverified CGI script. So too, in principle, is every non-trivial blog feature, every feature which lets a blog reach out beyond its own four corners to truly communicate with other blogs. In every blog-to-blog communication, there are two people involved. As soon as you take one out of the loop, the other can be replaced by a 'bot and you're wide open to spam again.
The problem here is that being able to take one of the humans out of the loop is the whole point of a true peer-to-peer distributed semantic web. That's what lets you make an appointment online without needing to talk or email the receptionist to confirm the time. That's what lets you have two-way hyperlinks. That's what lets you suck down aggregated RSS news feeds without you (or the person whose feed you're sucking down) having to manually approve every story. That's what lets you string together individual web services into something new and wonderful.
You can't end-run social realities. In some sense, you could say that email stalled out because of the spam problem. Most of the programming effort that's gone into email in the last few years has gone towards combatting viruses and spam. With email stagnating, other tools have come to look better and better for achieving the many-to-many communication for which the Internet is supposed to exist. But it turns out that as soon as these tools become truly interactive, they run into exactly the same issues that plague email.
When you add a comment form to your blog, you become interactive -- and you also become vulnerable to spam. Where people use SMS or IM to talk with each other, the spammers will start sending messages, too. Or look over at the RIAA, which is doing its best to flood peer-to-peer networks with spam. Sure, it's spam that says "don't copy music," but that doesn't make it not spam. The seven-layer network model should perhaps be modified to include an eighth layer. Spam will run atop any protocol it can find. Spam is architecture-independent. Write once, spam anywhere.
Now, there are network architectures -- in the Larry Lessig sense -- which are more resistant to this generalized notion of spam. These are the architectures with strong authentication, closely tied to real-world identities. Leave your Social Security number to get an account; and, oh, by the way, there's a three-day waiting period and a background check before you get your email address. In these cases, it's not that the network recognizes spam and filters it out; I feel reasonably confident in saying that this problem is AI-complete.
No, what happens here is that when you get a message, the network can reliably establish who sent it. Combine this traceability with firm legal rules against spam and you have a technique that makes it possible to hunt down spammers and punish them, whether civilly or criminally. (You could do something similar, perhaps, by making it a crime to benefit from spam, even if you didn't send it; this approach might dodge some of the need for a strong network sense of personal identity, although it has other problems. But that's a discussion for another day.) And that would be nice, I suppose. But I don't think that giving up all possibility of online anonymity is really a good price to pay for eliminating spam. There are too many other reasons, both political and social, to prefer the existence of online grey zones.
So, conceding for the moment that we're not going to TCP-PSP (that's TCP running on top of the Police State Protocol) quite yet, we're not getting away from generalized spam. And that fact bodes ill for online interaction in every form.
Spam. It can happen here.