Monday, October 29, 2012

The Internet needs weeding

In library jargon, weeding is the process of going through a collection of works (books, magazines, etc.) and removing the ones that are no longer worth having. These works may be:

  • Out of date to the point of uselessness (like Windows 95 for Dummies);
  • Damaged and worn out;
  • Discredited;
  • Superseded by new revisions;
  • Surplus to requirements, where they're potentially still useful but space is needed for other, more important things; etc.

Why do you care? Because the Internet needs weeding, too. Right now individual website operators must take responsibility for that themselves. Some of them aren't; either they can't manage their large libraries of ageing content or they just don't want to.

This TechRepublic article was dubious when it was written, and it's now amazingly out of date and plain wrong, yet there's a steady stream of comments suggesting that people still refer to it. This article, from 2002, doesn't bother to mention little details like version numbers that might help place it in context. It claims, among other things, that MySQL doesn't support subqueries, views, or foreign keys. It also simply says that MySQL is "faster" and PostgreSQL is "slower". It's never been that simple, and it sure isn't now.

I discovered it because someone linked to it on Stack Overflow as if it were current information. This was someone who usually does a fairly decent job of writing informative answers; they just didn't bother to look at this particular article and check whether it was any good before citing it.

In print, at least you can look at a book and go "Ugh, that's old". When an article has been carried over to a site's nice shiny new template and is surrounded by auto-included content with recent dates and context, how's a newbie to know it's complete garbage?

By the way, I don't claim it's easy to manage a library of tens or hundreds of thousands of ageing articles. Periodic review simply isn't practical. Websites that host large content libraries need to provide ways for users to flag content as obsolete, misleading, discredited or otherwise problematic. They also need to make an effort to ensure that their articles will age well by including prominent versions, dates, "as of ..." statements, etc. at the time of writing. This article would've been OK if it'd simply said "PostgreSQL 7.2" and "MySQL 3.3" (for example) instead of just "MySQL" and "PostgreSQL". It's easy to forget to do this, but being responsive to feedback means you can correct problems and remain a reasonably reputable source.

One of the things you and I - the community - can do is flag these articles when we see them linked to, and try to contact site owners to ask them to take the articles down, add warnings indicating their versions and age, or otherwise fix them.

Time for me to try to have a chat with TechRepublic.


  1. I like the "7:00 AM PDT" bit! At least we know it's more relevant than something posted at 6:00 AM PDT...

    1. At least TechRepublic shows July 2002, so even a newbie (with a little bit of common sense) can see it's 10 years old. OTOH, what I dislike about Blogger is that unless the blog owner is careful enough to include the post date in the blog template as Craig did, the only "date" is in the "Posted by" line and as you can see above, it only says "3:34 PM", and a newbie may not be aware that the only way to find out when the article was created is by hovering over that 3:34 to see the full date/time. I've seen *many* blogs where that's the only way of telling how old a post is.

