Friday, October 15, 2010

On SSDs

I suspect that some of the ideas behind the design of SSDs as currently sold are rather flawed. Update: see the end of the article for an alternative viewpoint, though.

These drives are embedded computers, with their own CPUs, RAM, firmware, etc. They are much more complicated than they should be for a simple storage device, and are more like mini RAID controllers than they are like hard disks.

Things are done this way because of the dire state of direct flash memory support in most OSes. NTFS, HFS+, ext3/4, etc. are all focused on rotating media: they try to minimize seeks and fragmentation, don't care about grouping writes and overwrites together, and so on. For solid-state flash-based storage to work well with these file systems and the OSes that use them, the drives need to do a lot of work behind the scenes to cope with the file system's access patterns.

In an attempt to work around the mismatch between flash storage and rotating-media-file systems, current SSDs also sometimes cheat and do things like claim to have flushed their buffers to persistent storage when they really haven't, or ignore write-ordering barriers, both of which can cause horrifying data corruption for database workloads if power is lost suddenly. (Most SSDs don't suffer from this problem - high-end enterprise drives have big capacitors that give them time to write out their caches before losing power, and most consumer drives respect the OS's request to flush their write cache. Test carefully before trusting one, though.)
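
One rough way to "test carefully" is to write sequence-numbered records, fsync after each, note the last one the OS acknowledged, pull the power, and then check whether everything acknowledged actually survived. Here's a minimal sketch of the two halves. The record layout and the function names are my own invention for illustration, and a real test needs the acknowledgements logged on a *different* machine, since the one under test is about to lose power.

```python
import os
import struct
import zlib

# Hypothetical record layout: 8-byte sequence number + CRC32 of that number.
RECORD = struct.Struct("<QI")


def write_acknowledged(path, count):
    """Append `count` records, calling fsync after each one.

    Returns the last sequence number the OS reported as durable.
    """
    last_acked = -1
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for seq in range(count):
            payload = struct.pack("<Q", seq)
            os.write(fd, RECORD.pack(seq, zlib.crc32(payload)))
            os.fsync(fd)  # ask the OS (and, one hopes, the drive) to flush
            last_acked = seq  # a real test would report this over the network
    finally:
        os.close(fd)
    return last_acked


def last_intact_record(path):
    """Return the highest sequence number in an unbroken, checksummed run."""
    last = -1
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file or a torn final record
            seq, crc = RECORD.unpack(chunk)
            if seq != last + 1 or crc != zlib.crc32(struct.pack("<Q", seq)):
                break  # gap or corruption: stop trusting the file here
            last = seq
    return last
```

If `last_intact_record()` after the power cut comes back lower than the last acknowledged sequence number, something between the OS and the flash claimed a flush it hadn't completed.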

Existing flash-oriented file systems like JFFS2 are (a) not widely available on major platforms, and (b) severely limited in the maximum size of media they can handle, because of the requirement to scan the media to build in-memory metadata before becoming usable. This has led to the release of SSDs oriented toward use with existing file systems - which *can't* be accessed directly for use with intelligent, flash-aware file systems.
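
To make the scanning problem concrete, here's a toy model of a log-structured flash file system at mount time. It is emphatically not JFFS2's real on-media format; the `Node` fields and `mount_scan` are made up for illustration. The point is only that the index lives in RAM and has to be rebuilt by visiting every node, so mount time grows with the size of the media.

```python
from dataclasses import dataclass


@dataclass
class Node:
    inode: int    # which file this node belongs to
    version: int  # a higher version supersedes a lower one
    data: bytes


def mount_scan(erase_blocks):
    """Rebuild the in-memory index by visiting every node on the media.

    `erase_blocks` is a list of erase blocks, each a list of Node objects.
    Returns ({inode: (block_index, node_index)}, nodes_scanned), where the
    index points at the newest version of each inode.
    """
    index = {}
    newest = {}
    scanned = 0
    for b, block in enumerate(erase_blocks):
        for n, node in enumerate(block):
            scanned += 1
            if node.version > newest.get(node.inode, -1):
                newest[node.inode] = node.version
                index[node.inode] = (b, n)
    return index, scanned


# Two erase blocks; inode 1 was rewritten, so its newest copy is in block 1.
media = [
    [Node(1, 0, b"old"), Node(2, 0, b"other")],
    [Node(1, 1, b"new")],
]
index, scanned = mount_scan(media)
```

Nothing is usable until the whole scan finishes, which is tolerable for a few megabytes of embedded NOR flash and hopeless for hundreds of gigabytes.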

That's the bit that drives me nuts. None of the consumer and few of the enterprise SSDs can be lobotomized to present a dumb pass-through direct-access interface for OSes, hardware RAID controllers, etc. that know how to use the flash directly. Because of this, we're stuck with the frustrating situation where file systems focused on rotating media are being optimized to work better on flash storage that's trying to pretend to be rotating media. This makes it hard to build and test, let alone adopt, file systems that're actually designed for flash!

With 64-bit (OK, physically 48-bit) machines, we should really just be able to map the flash storage directly into host address space and let the OS play with it that way. With an IOMMU (universal these days) it can be done safely and securely, and you'd benefit from the host's memory barrier support etc. Alternately, a simple direct PCI-E interface to access the flash wouldn't be too much to ask. This strange way of doing it by pretending to be a SATA hard disk appears to be wasteful, expensive, and SLOW. It also makes it nigh-impossible to do interesting things with flash SSDs, like use them as write cache for software RAID.

It's not showing any signs of getting better, either. You'd think that one or more SSD manufacturers would be working with Microsoft, at least, to provide an NTFS port optimized for direct use on flash, so they could avoid including all that *expensive* non-flash-chip hardware on their SSDs.


Update: Arguably the complexity and abstraction of SSDs is an extension of the existing trend in hard drives. First, HDDs went from electronics-offboard MFM and RLL drives to electronics-onboard IDE drives. Drives later gained logical block addressing (LBA) to hide the drive's internal layout from the OS, so the old C/H/S mappings became meaningless and the OS lost all visibility into the drive's internal layout, having to trust the drive to use a sensible block mapping. Shortly thereafter drives began to transparently remap bad sectors in firmware, so the OS didn't have to be aware of failed sectors or map them at the file system level. Command queuing and reordering arrived a while thereafter, letting the drive make intelligent decisions about the order in which it executes requests. Now modern HDDs even have extensive self-reporting and analysis tools like S.M.A.R.T. A spinning-disk HDD is already a small computer, and provides a useful abstraction of storage away from the OS so the OS doesn't have to understand the fine details of every drive.
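
For anyone who never had to deal with it, the conventional translation between C/H/S and LBA is simple arithmetic over the geometry the drive *claims* to have. A quick sketch (the function and parameter names are mine; sectors are numbered from 1, cylinders and heads from 0):

```python
def chs_to_lba(cylinder, head, sector, heads_per_cylinder, sectors_per_track):
    """Translate a C/H/S address into a logical block address."""
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


def lba_to_chs(lba, heads_per_cylinder, sectors_per_track):
    """Translate a logical block address back into (cylinder, head, sector)."""
    cylinder, rest = divmod(lba, heads_per_cylinder * sectors_per_track)
    head, sector_index = divmod(rest, sectors_per_track)
    return cylinder, head, sector_index + 1
```

With the classic fake geometry of 16 heads and 63 sectors per track, cylinder 1, head 0, sector 1 is simply block 1008. None of it says anything about where the data physically sits on the platters, which is exactly the loss of visibility described above.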

Viewing things this way, perhaps the complexity of SSDs is a natural evolution. Perhaps the SSD firmware and PLC will simply do a better job of wear-leveling and write-ordering than the host file system ever can. Maybe on-board supercapacitor-protected write-back cache offers features that can't be matched at the host level. Perhaps it's a case of "right tool for the job" and sensible levels of abstraction. Keeping write-back cache close to the storage it caches is a particularly compelling advantage, and one that'd be particularly hard to match with host-controlled flash.
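
For a feel of what wear-leveling means at its very simplest, here's a toy translation layer that redirects every write of a logical block to the least-worn free physical block. This is a sketch under heavy assumptions, not how any real SSD firmware works: `ToyFTL` is a made-up name, one logical block maps to one whole erase block, and there's no garbage collection, no static wear-leveling, and no power-loss handling.

```python
class ToyFTL:
    """Toy flash translation layer with naive dynamic wear-leveling."""

    def __init__(self, physical_blocks):
        self.erase_counts = [0] * physical_blocks
        self.free = set(range(physical_blocks))
        self.mapping = {}  # logical block -> physical block
        self.store = {}    # physical block -> data

    def write(self, logical, data):
        if not self.free:
            raise RuntimeError("no free physical blocks left")
        # Pick the least-worn free block (lowest index breaks ties).
        target = min(self.free, key=lambda p: (self.erase_counts[p], p))
        self.free.remove(target)
        self.store[target] = data
        old = self.mapping.get(logical)
        self.mapping[logical] = target
        if old is not None:
            # The stale copy is erased and returned to the free pool.
            del self.store[old]
            self.erase_counts[old] += 1
            self.free.add(old)

    def read(self, logical):
        return self.store[self.mapping[logical]]


ftl = ToyFTL(physical_blocks=4)
for i in range(9):
    ftl.write(0, b"version %d" % i)  # hammer one logical block
```

Even though the host rewrote the same logical block nine times, the eight erases end up spread evenly across all four physical blocks instead of wearing out one of them. A host file system with direct access to the flash could do the same job, which is really the argument of the whole post; the open question is who does it better.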

Personally, I think it's probably partway between the two. SSDs will have to expose more and more SATA/SAS extensions to allow host OSes and file systems better visibility into their innards and to gain better control. Host file systems will have to become more aware of SSDs and more able to adapt their usage patterns. Perhaps SSDs that plug straight into PCI-E and present AHCI to the host without any real SATA bus in the middle will become common. Much as RAID controllers, modems, and many other things have slowly moved from hardware to running on the host CPU, much of the SSD workload is likely to move to the OS driver level over time. On the other hand, a few things like safe write-back caching are probably always going to be better done in the individual SSD rather than the host, much as current ATA drives map logical addresses to C/H/S internally. It's just going to take a while for the division of responsibilities to settle out. The difficulty of plugging new file systems into existing OSes makes it initially more appealing to do everything in hardware, but over time that'll shift as flash gets cheaper and the control hardware becomes a bigger proportion of overall cost, pushing SSD manufacturers to move more and more control logic from expensive hardware into cheap drivers on the host OS.

3 comments:

  1. Sure, it's bad that we don't have dumb flash drives with smart software filesystems. But it seems clear to me why we don't. Even in their current form, SSDs are orders of magnitude faster than spinning disk and have 100% backwards compatibility with existing systems.

    Nobody was going to write a fancy new filesystem for drive tech that didn't even exist in theory.

    Now that we have this first step, the path to using smarter tech is open. This whole process reminds me of AHCI. First there were SATA drives in "IDE compatibility mode". Then AHCI was added as an option, and nowadays it's the default.

    PS, prediction: log structured filesystems will be the microkernel of the next few years.

  2. Nice posting, I couldn't agree more.

    Speaking of flash filesystems, UBIFS looks great - in both design & Linux implementation. What do you think of it?

  3. UBIFS (http://www.linux-mtd.infradead.org/doc/ubifs.html) has looked interesting for a while, but is bottlenecked by the UBI for larger/faster flash storage.
