Monday, October 29, 2012

The Internet needs weeding

In librarian terminology, "weeding" is jargon for the process of going through a collection of works (books, magazines, etc.) and removing ones that are no longer worth having. These works may be:

  • Out of date to the point of uselessness (like Windows 95 for Dummies);
  • Damaged and worn out;
  • Discredited;
  • Superseded by new revisions;
  • Surplus to requirements, where they're potentially still useful but space is needed for other more important things; etc

Why do you care? Because the Internet needs weeding, too. Right now, individual website operators must take responsibility for that themselves, and some of them aren't: either they can't manage their large libraries of ageing content, or they just don't want to.

This TechRepublic article was dubious when it was written, and it's now amazingly out of date and plain wrong, yet there's a steady stream of comments suggesting that people still refer to it. This article, from 2002, doesn't bother to mention little details like version numbers that might help place it in context. It claims, among other things, that MySQL doesn't support subqueries, views, or foreign keys. It also simply says that MySQL is "faster" and PostgreSQL is "slower". It's never been that simple, and it sure isn't now.

I discovered it because someone linked to it on Stack Overflow as if it were current information. It was cited by someone who usually does a fairly decent job of writing informative answers; they just didn't look at this particular article to see whether it was any good before citing it.

In print, at least you can look at a book and go "Ugh, that's old". When an article has been carried over to a site's nice shiny new template and is surrounded by auto-included content with recent dates and context, how's a newbie to know it's complete garbage?

By the way, I don't claim it's easy to manage a library of tens or hundreds of thousands of ageing articles. Periodic review simply isn't practical. Websites that host large content libraries need to provide ways for users to flag content as obsolete, misleading, discredited or otherwise problematic. They also need to make an effort to ensure that their articles will age well by including prominent versions, dates, "as of ..." statements, etc at time of writing. This article would've been OK if it'd simply said "PostgreSQL 7.2" and "MySQL 3.3" (for example) instead of just "MySQL" and "PostgreSQL". It's easy to forget to do this, but being responsive to feedback means you can correct problems and remain a reasonably reputable source.

One of the things you and I - the community - can do is to flag these articles when we see them linked to, and try to contact site owners to ask that they be taken down, annotated with warnings about their versions and age, or otherwise fixed.

Time for me to try to have a chat with TechRepublic.

Wednesday, October 24, 2012

More uses for PostgreSQL arrays

Arrays and the Pg extensions to them are very useful for solving SQL problems that are otherwise tricky to deal with without procedural functions or tortured SQL. There are some good tricks with arrays that're worth knowing about, but aren't always immediately obvious from the documentation. I want to show you a few involving ANY and ALL, intarray, and array indexing.
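As a taste of the first of those, `ANY` and `ALL` let you match against array contents without unnesting, and 1-based indexing and slicing are built in. A minimal sketch (the table and data here are made up purely for illustration):

```sql
-- Hypothetical table for illustration
CREATE TABLE posts (id serial PRIMARY KEY, tags text[]);
INSERT INTO posts (tags) VALUES
  (ARRAY['postgres','arrays']),
  (ARRAY['mysql']);

-- ANY: rows whose array contains a given element
SELECT id FROM posts WHERE 'postgres' = ANY (tags);

-- ALL: rows where the condition holds for every element
SELECT id FROM posts WHERE 'mysql' = ALL (tags);

-- Indexing is 1-based; slices use lower:upper bounds
SELECT tags[1], tags[1:2] FROM posts;
```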

Friday, October 19, 2012

Natural sorting: An example of the utility of Pg's composite types and arrays

While looking at a recent Stack Overflow question I found myself wondering if it was possible to write a natural sort for strings containing numbers interleaved with non-number text using only PostgreSQL's core functionality.

Natural sorts are an important usability feature, as Jeff points out in his post on natural sorting.

So I asked for ideas, and it turns out that yes, you can, though it's a bit long-winded. Props to Erwin Brandstetter for persistently refining the approach. The general idea is to create a composite type of `(text,integer)` then sort on an array of that type. See the linked question for details.
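A rough sketch of that idea (this is illustrative, not Erwin's exact solution; the type name and regex are mine): split each string into alternating text/number chunks, cast each chunk to the composite type, and sort on the resulting array.

```sql
-- Composite type pairing a text chunk with a numeric chunk
CREATE TYPE ai AS (a text, i integer);

-- Sort sample values naturally by decomposing each string
-- into (text, integer) pairs and ordering on the array
SELECT s
FROM (VALUES ('file2'), ('file10'), ('file1')) v(s)
ORDER BY ARRAY(
    SELECT ROW(m[1], NULLIF(m[2], '')::integer)::ai
    FROM regexp_matches(s, '(\D*)(\d*)', 'g') m
);
-- sorts as: file1, file2, file10
```

The `NULLIF` guards against empty numeric chunks, which would otherwise fail the cast to integer.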

This illustrates how powerful Pg's composite types and arrays are, though I'm not sure you should consider any of the proposed solutions for real world production use.

It also helps to show how nice it'd be to have access to native, OS-independent Unicode collation in PostgreSQL using the International Components for Unicode (ICU) project, which would not only solve those nasty Windows-vs-Linux locale name issues when restoring dumps, but would also allow the use of advanced collation flags like UCOL_NUMERIC_COLLATION.

I'd really love to be able to use a custom collation function in Pg, either via an ORDER BY extension or by creating a collation that uses a user-defined collation function then using that collation in the COLLATE clause. Then I could write a C function to use ICU to do the special collation required for a particular job. This doesn't appear to be possible at the moment.

I recommend reading Jeff's post on natural sorting and why it's important; as usual, it's excellent.

Thursday, October 18, 2012

Generating random bytea values in PostgreSQL

While playing around with answering an interesting question on Stack Overflow I wrote a simple C extension to PostgreSQL that generates random bytea values of a user-specified size. Fast.

In case anyone else is looking for a good way to dummy up random binary data in PostgreSQL, you can find the code in my scrapcode repository on GitHub.

See the extension's README for details.

There's a pure SQL version that's fast enough for generating a few hundred kilobytes of data, too, or a couple of megabytes if you're patient.
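For a sense of what a pure SQL approach can look like (a sketch of the general technique, not necessarily the code in the repository): aggregate random hex octets and decode them into a bytea.

```sql
-- Build n random bytes by aggregating random two-digit hex
-- octets, then decoding the hex string into a bytea.
-- random() is volatile, so each row gets a fresh value.
CREATE OR REPLACE FUNCTION random_bytea(n integer) RETURNS bytea AS $$
  SELECT decode(
           string_agg(lpad(to_hex((random() * 255)::integer), 2, '0'), ''),
           'hex')
  FROM generate_series(1, $1);
$$ LANGUAGE sql;

SELECT octet_length(random_bytea(1024));  -- 1024
```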

Sunday, October 14, 2012

Avoiding PostgreSQL database corruption

TL;DR: Don't ever set fsync=off, don't kill -9 the postmaster and then delete postmaster.pid, and don't run PostgreSQL on network file systems.

Reports of database corruption on the PostgreSQL mailing lists are not exactly common, but they're hardly rare either. While a few data corruption bugs have been found in PostgreSQL itself over the years, the vast majority of reported issues are caused by:

  • Administrator action;
  • Misconfiguration; or
  • Bad hardware

A recent mailing list post asked what can be done to reduce the chance of corruption and keep data safe.

If you think you have a corrupt PostgreSQL database, stop the database server and take a complete copy of the data directory now. See the Corruption page on the wiki. Then ask for help on the pgsql-general mailing list, or for critical/urgent issues contact a professional support provider.

Do not attempt to fix the problem before taking a complete copy of the entire data directory. You might make the problem much worse, turning a recoverable problem into unrecoverable data loss.
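As a reference point before the advice below, these are the durability-related settings in postgresql.conf that should normally be left at their safe defaults:

```
# postgresql.conf: durability-related settings
fsync = on                # never disable for data you care about
full_page_writes = on     # protects against torn pages after a crash
synchronous_commit = on   # 'off' risks losing the most recent commits
                          # on a crash but, unlike fsync=off, it will
                          # not corrupt the database
```

Note the asymmetry in that last comment: synchronous_commit=off is a legitimate performance trade-off, while fsync=off is not.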

Here's my advice for avoiding DB corruption issues, with some general PostgreSQL administration advice thrown in for good measure: