Wednesday, December 7, 2011

PostgreSQL: Great even when you can see the warts

I’m a very happy PostgreSQL user, and was interested to read a recent post by an MS SQL Server DBA advocating PostgreSQL.

This post is a response, enumerating areas where Pg may not be so hot after all. You might say it's my tendency to always see the negative, or perhaps it's just my desire to inform others of the downsides of something so the loud advocacy doesn't leave them feeling let-down when the reality hits. I'd like to think it's the latter. Telling someone how it is from the start is much better than showing them the rosy-goggles view until they're committed. So, from an enthusiastic PostgreSQL user who spends a lot of time helping out on the mailing lists, here's a list of some of the warts and quirks I'm aware of, so you don't have to find out about them the hard way:

This post is based on PostgreSQL 9.2. Each version improves a great deal, so don't be surprised if some of these issues are gone soon.

  • Limited CPU concurrency. Pg can run many queries on many different CPUs, but one query can only run on one CPU at a time. If you have one really big, CPU-expensive query, Pg can’t complete it faster by using multiple CPUs. There are some limitations on concurrent disk use, too, but they’re much less significant. None of this affects workloads with larger numbers of smaller queries, where Pg easily maxes out the machine.
  • No query queueing or admission control (though connection pools help a lot). Pg can’t natively be told to “run no more than 10 queries concurrently, queuing connections in submission order”; every connection can be actively executing a query at the same time. Too many actively working connections lead to contention for resources, excess context switches, etc and slow overall throughput. Every connection has its own query executor, so even high (many hundreds) idle connection counts can be expensive in synchronization overhead and in RAM, plus they tend to lead to "stampeding herd" problems when the DB or app is restarted. Pg has a “sweet spot” for number of actively working connections that varies depending on the hardware, and it’s usually best to use a connection pool to queue queries and stop it exceeding that sweet spot. Good connection pools like PgPool-II and PgBouncer are available, so this is more of a wart than a real problem.
  • PostgreSQL has no true stored procedures and no autonomous transactions. PL/PgSQL functions always run within a transaction; they cannot commit that transaction or create others. (They *can* use savepoints via EXCEPTION blocks, though). dblink can be used as a workaround, but has its own issues. It looks like MS SQL Server may have the same limitation, but I’m not up to date with MS SQL Server and wanted to point this out in case it doesn’t.
  • All-or-nothing built-in replication. 3rd party replication options are not affected by this. You can’t currently control which databases or tables get replicated when you’re using the built-in wal-shipping or streaming replication, it’s all or nothing. This can be a pain if you have a busy but unimportant database and a less-busy but very important database, where replication requirements differ. You can work around it by using 3rd party replication tools. Alternately, you can run two PostgreSQL instances, but they can’t use the same shared memory segment so you’ll pay a RAM cost for it, plus they have to be on different ports or IPs.
  • Limited XML functionality. Support for in-db XSLT in particular is currently in an add-on module with known issues.
  • Limited performance monitoring and somewhat primitive performance monitoring tools.
  • work_mem is per-sort (etc) not per-connection, so it’s hard to give a query or set of queries the maximum possible RAM to work with w/o risking exceeding available resources or pushing too much cached disk data out. In practice this isn’t too big an issue, but it’s a bit ugly.
  • Incompatible upgrades. Until PostgreSQL 8.4, you used to have to do a full dump and reload whenever you wanted to do a major release upgrade, which was a frequently raised argument against PostgreSQL for larger sites. pg_upgrade fixes this for later releases, making upgrades much less painful, but major release upgrades still often contain gotchas or backward compatibility breaks that may require code changes to some applications. On the upside, changes are generally to make common mistakes more obvious or harder to make, fix bugs or design issues, and generally improve the system; they're not made casually. Backward-incompatible changes are also always documented in the release notes.

In other words: Pg isn’t perfect. It has quirks, and you should be aware of them. Despite that, it’s *REALLY* good for lots of workloads, and the price is hard to argue with. The generally really helpful and friendly community doesn’t hurt, either.

The very incomplete list of things I love about using Pg is:

  • Continuous improvement. Every major release adds something that's really handy for a project you're working on or fixes some frustrating limitation or quirk.
  • The community is friendly, open and very helpful. There are lots of people on the mailing lists and on Stack Overflow who're happy to help anyone who's made at least a minimal effort to explain their problem and their environment. Responses are usually detailed and helpful.
  • Devs are part of the community. Some of the core dev team are active on the user mailing lists - it's not uncommon to see a question posted, followed half an hour later by something like: "Yup, that's a bug; fixed in c6e3ac11b60ac4a8942ab964252d51c1c0bd8845 in head and backported to 9.0 and 8.4 for the next minor release". PostgreSQL is one of the only places where I've seen a bug found, reported, and fixed within half an hour.
  • User contribution, involvement and influence. Got a bug that really annoys you or a feature you really need? You can quite possibly fix it, interest someone else in fixing it, or pay someone else to fix it. I've written fixes for issues that annoy me and had them accepted, so I speak from experience. If you're not a coder, you can learn, or you can pay someone who already knows the codebase to work on an issue for you. Many major PostgreSQL features are developed through commercially sponsored work by consultants and the model appears to work well.
  • Support options. You have a choice in which organisations you use for support, ranging from "none, I'll do it myself" to a variety of consultants and companies. This means you're not stuck with a single complacent support provider who knows you can't go anywhere else if you don't like the price or the service. If you're not paying for support you can't demand anyone else fix an issue for you - but then when was the last time you had any luck getting Oracle or MS to fix a bug in software you were paying support for? With Pg, you can (via consultants or direct participation) get an important fix into the next patch release, and include it in your own builds until then so you have it immediately.
  • Extensibility. The ability to add your own data types, procedural languages, etc comes in very handy when you're doing more than writing accounting software.
  • Stability and a focus on correctness. PostgreSQL doesn't eat your data, and if it's not sure if something's safe it'll throw an error rather than going ahead and making it up as it goes along. This is really nice when you care about your data.
  • Great documentation makes a big difference to how nice Pg is to use day-to-day, as does the excellent psql command line client with its good error messages, good UI and built-in help.

I previously had on the above warts list:

  • No automatic savepoints. If one statement in a transaction fails, the whole transaction is automatically rolled back. You can’t pick it up again from the last successful statement unless you’ve been issuing SAVEPOINT statements before each statement that might fail. Savepoints have a performance cost, too, so this can slow things down as well as being a pain. This behaviour is actually the safest choice and is more feature than wart; see this mailing list discussion, which points out that Bad Things Happen when you paste a script into an auto-savepointing session, the INSERT ... SELECT line fails, then the DELETE line succeeds. Ouch. In the psql shell you can actually get automatic savepoints during interactive operation; see ON_ERROR_ROLLBACK in the manual.

3 comments:

  1. On SQL Server 2005 and 2008, I have no trouble invoking begin transaction, commit and rollback inside of a stored procedure.

    ReplyDelete
  2. Thank you for this post. I think should explain "PostgreSQL has no true stored procedures and no autonomous transactions" more in depth. According to an oracle page: "Autonomous transactions allow you to leave the context of the calling transaction, perform an independant transaction, and return to the calling transaction without affecting it's state".

    ReplyDelete
  3. guettli: The Oracle page about sums it up, really. An autonomous transaction is work that effectively happens in a different session. The surrounding transaction only ever sees its complete effects (if it commits) or no effects (if it rolls back). It can commit even if the surrounding transaction rolls back.

    Pg supports subtransactions. These have the first property, in that they either happen entirely or not at all as far as the surrounding transaction is concerned. However, a subtransaction cannot commit independently of its parent transaction; if the parent rolls back, all subtransactions also roll back. Subtransactions are not supported within PostgreSQL's PL/PgSQL stored procedures, though exception blocks that use the same underlying mechanism are.

    Autonomous transactions are used in some audit work as well as complex stored procedures and logic. They're a handy, but not generally a vital, feature. Subtransactions are generally much more important, because they let you recover an uncommitted transaction to a known good point after a failure.

    ReplyDelete