A few months ago, I stumbled upon the existence of a project called RethinkDB that rewrites the MySQL storage engine to be optimized for SSDs.
Really cool, really really cool. I’m surprised it’s not causing more of a fuss.
If you’ve ever used an SSD, you’ll understand the difference they make to your PC. Somehow, it’s like a weight has been lifted – there is no more crunching sound when you try to load stuff, it just appears quickly and silently. Removing the hard drive has made my day-to-day much nicer: when I start Eclipse, it just loads, and when I do a search, within about 5 silent seconds it has found what I’m looking for. (With no noisy background indexing.) It basically improves your PC’s performance for file-intensive applications by an order of magnitude, and databases are no exception.
But Rethink’s argument is that, aside from the order-of-magnitude gain from the raw performance of SSDs, there is a second order-of-magnitude gain to be had by rewriting the database algorithms themselves. And the more I read into it, the more convinced I am that they’re right.
Think about it. Think of all the hacks we do to make DBs faster for disks that just aren’t relevant any more.
For example: SSDs can append almost instantly. So there is no need for a separate database transaction log: just append your writes, and the data file becomes its own log. And you don’t need to worry about consistency, because you’re only ever appending; if you want to roll back, just go back to an earlier point in the file. Once you remove this, you also relieve a huge concurrency constraint, meaning you can actually use all 4 of those cores, all of the time. Not only does it improve the performance of the disk, but by removing a rotational element it also improves the consistency of that performance.
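To make the append-as-log idea concrete, here’s a toy sketch (my own illustration, not RethinkDB’s actual on-disk format): every write is an append, so the data file doubles as the log, and rollback is just truncating back to a saved offset.

```python
import os

class AppendLog:
    """Toy append-only store: the data file *is* the log.
    Rollback = truncate back to a remembered offset.
    (A sketch of the idea only; names and format are made up.)"""

    def __init__(self, path):
        self.f = open(path, "ab+")

    def append(self, record: bytes) -> int:
        """Append a length-prefixed record; return the offset it starts at."""
        offset = self.f.seek(0, os.SEEK_END)
        self.f.write(len(record).to_bytes(4, "big") + record)
        self.f.flush()
        return offset

    def rollback(self, offset: int) -> None:
        """'Just go back': discard everything written after `offset`."""
        self.f.truncate(offset)

    def read_all(self):
        """Replay the file from the start -- no separate log to consult."""
        self.f.seek(0)
        records = []
        while True:
            header = self.f.read(4)
            if len(header) < 4:
                break
            records.append(self.f.read(int.from_bytes(header, "big")))
        return records
```

Because nothing is ever overwritten in place, readers and the single appender don’t fight over the same bytes – which is the concurrency win hinted at above.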
But really, Rethink is just the beginning. Rethink is building on the existing paradigms of databases to make things faster, but the real benefit will come from embracing new paradigms. I think there’s a third order of magnitude in this one that will really change the way we use databases. Systems like MapReduce are a great way to perform massively parallel data processing, but they are sequential in terms of data access: at some point the data has to be put in an index (an HBase or BigTable), which sits on a disk cluster. With SSDs, though, we get incredibly fast random access, and that adds an interesting new element: truly non-sequential access to input data. Research like that of Logothetis and Yocum gives us a hint at the future: SSDs bring the power of indexing to the MapReduce party for free, just as a side-effect of their fast random reads. Of course, then you have to store all your data on SSDs, which for now is a bit pricey…
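A toy sketch of what “non-sequential input” could look like (all names here are my own invention, with plain dicts standing in for an on-SSD index and record store): the map phase jumps straight to the records an index points at, instead of scanning everything sequentially – an access pattern that’s ruinous on spinning disks but cheap on SSDs.

```python
def indexed_map(keys, index, store, map_fn):
    """Apply map_fn only to the records the index points at,
    skipping the full sequential scan a classic MapReduce does.
    Each iteration costs two random reads -- fine on an SSD."""
    for key in keys:
        offset = index[key]      # random read 1: index lookup
        record = store[offset]   # random read 2: fetch the record
        yield from map_fn(key, record)

def reduce_by_key(pairs):
    """Minimal shuffle + reduce: group values by key and sum them."""
    out = {}
    for k, v in pairs:
        out[k] = out.get(k, 0) + v
    return out

# Word count, but only over the documents the index selects:
store = {0: "a a b", 1: "b c", 2: "a"}
index = {"doc0": 0, "doc2": 2}
pairs = indexed_map(["doc0", "doc2"], index, store,
                    lambda k, rec: ((w, 1) for w in rec.split()))
counts = reduce_by_key(pairs)   # doc1 never gets touched
```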
Anyway, the fun is only just beginning. Personally, I’m looking forward to Document DBs like Apache CouchDB suddenly becoming a lot more viable as replacements for those tedious relational mappings we were all forced to go through in the past.
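For a sense of what’s so appealing there, consider this made-up example (plain JSON in Python; CouchDB’s actual API speaks JSON over HTTP): relationally, the nested order below would take three tables plus an O/R mapping layer, but as a document it’s one self-contained record.

```python
# A hypothetical order document -- the kind of thing that would
# otherwise be split across orders, customers, and line-item tables.
order = {
    "_id": "order-1001",
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "items": [
        {"sku": "widget", "qty": 2, "price": 9.99},
        {"sku": "gadget", "qty": 1, "price": 24.50},
    ],
}

def order_total(doc):
    """Total an order straight off the document -- no joins needed."""
    return sum(item["qty"] * item["price"] for item in doc["items"])
```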