Friday, October 19, 2012

Natural sorting: An example of the utility of Pg's composite types and arrays

While looking at a recent stack overflow question I found myself wondering if it was possible to write a natural sort for strings containing numbers interleaved with non-number text using only PostgreSQL's core functionality.

Natural sorts are an important usability feature, as Jeff points out in his post on natural sorts above.

So I asked for ideas, and it turns out that yes, you can, though it's a bit long-winded. Props to Erwin Brandstetter for persistently refining the approach. The general idea is to create a composite type of `(text,integer)` then sort on an array of that type. See the linked question for details.

This illustrates how powerful Pg's composite types and arrays are, though I'm not sure you should consider any of the proposed solutions for real world production use.

It it also helps to show how nice it'd be to have access to native OS-independent Unicode collation in PostgreSQL using the International Components for Unicode (ICU) project, which would not only solve those nasty Windows-vs-Linux locate name issues when restoring dumps, but would also allow the use of advanced collation flags like UCOL_NUMERIC_COLLATION.

I'd really love to be able to use a custom collation function in Pg, either via an ORDER BY extension or by creating a collation that uses a user-defined collation function then using that collation in the COLLATE clause. Then I could write a C function to use ICU to do the special collation required for a particular job. This doesn't appear to be possible at the moment.

I recommend reading Jeff's post on natural sorting and why it's important; as usual, it's excellent.

2 comments:

  1. Craig,

    IIRC, the reason we didn't add ICU support years ago was sheer size; the ICU library is huge and would have added a significant amount to the overall PostgreSQL code base. Not sure if that's still a real concern.

    BTW, can you please remove the Captcha on your blog comments? It's really annoying ... this will be my last comment if it's still there in the future.

    ReplyDelete
    Replies
    1. Josh,

      I've removed the CAPTCHA, thanks for pointing that out.

      As for ICU, I'm rather surprised that size was a concern - it seems to be packaged on most Linux distros and there's a BSD port. It'd increase the windows installer size somewhat, but that already includes things that most users will never use like the .pdb debugging files, so size is clearly not a huge concern there.

      I'd be more worried about getting inconsistent collations between the OS and Pg, or between a utf-8 database powered by ICU and a non-utf-8 database using an OS collation.

      The ICU wiki page suggests that a major concern is that ICU natively uses UTF-16, while Pg natively uses 8-bit encodings and Unicode.

      Delete

Captchas suck. Bots suck more. Sorry.