Thursday, September 17, 2009

Hierarchy of caches for high performance and high capacity memcached

This is an idea I've been kicking around for a while and wanted some feedback. Memcached does an amazing job as it is but there's always room for improvement.
Two areas that memcached could be improved are local acceleration and large capacity support (terabyte range). I believe this could be done through a 'hierarchy of caches' with a local in-process cache used to buffer the normal memcached and a disk-based memcached backed by berkeley DB providing large capacity.

The infrastructure would look like this:
in-process memcached -> normal memcached -> disk memcached

The in-process memcached would not be configured to access a larger memcached cluster. Clients would not use the network to get() objects and it would only take up a small amount of memory on the local machine. Objects would not serialize themselves before they're stored so this should act like an ultrafast LRU cache as if it were a native caching system within the VM.
Since it's a local it should be MUCH faster than the current memcached.
Here are some benchmarks of APC-cache vs memcached.
http://www.mysqlperformanceblog.com/2006/09/27/apc-or-memcached/
http://www.mysqlperformanceblog.com/2006/08/09/cache-performance-comparison/

Long story short a local cache can be 4-8x faster than normal memcached. The local in-process cache would be available on every node within this cluster and act as a L1 cache for ultra fast access to a small number of objects. I'm not sure all languages would support this type of cache because it would require access to and storage of object pointers. I believe you can do this with Java by hacking JNI pointers directly but I'm not 100% certain. This cache would be configured to buffer a normal memcached cluster. We're all familiar with this type of behavior so I won't explain this any further. The third component in the is a distributed memcached daemon which uses Berkeley DB (or another type of persistent hashtable) for storage instead of the normal slab allocator. While this might seem like blasphemy to a number of you I think it could be useful to a number of people with ultra large caching requirements (hundreds of gigs) which can't afford the required memory.

There's already a prototype implementation of this in Tugela Cache: http://meta.wikimedia.org/wiki/Tugelacache

For optimal performance the memcached driver would have to do parallel and concurrent getMulti requests so that each disk in the system can seek at the same time. There are a number of memcached implementations (including the Java impl which I've contributed to) which fetch in serial. Since memcached is amazingly fast this really hasn't shown up in any of my benchmarks but this would really hinder a disk-backed memcached.

This would provide ultra high capacity and since the disk seeks are distributed over a large number of disks you can just add spindles to the equation to get higher throughput. This system would also NOT suffer from disk hotspots since memcached and the local in-memory memcached would buffer the disk backend.

From a non-theoretical perspective the local cache could be skipped or replaced with a native LRU cache. These are problematic though due to memory fragmentation and garbage collection issues. I use a local LRU cache for Java but if I configure it to store too many objects I can run out of memory. It also won't be able to reache the 85% capacity we're seeing with the new memcached patches.

I might also note that since it's now backed by a persistent disk backend one could use Memcached as a large distributed hashtable similar to Bigtable.

Some people have commented on how a disk backed memcached would basically be MySQL.The only thing it would have in common is the use of a disk for storage. A disk-backed memcached would scale much better than MySQL and berkeley DB simply due to the fact that you can just keep adding more servers and your read/writes will scale to use the new capacity of the cluster. One disk vs N disks. MySQL replication doesn't help because you can't scale out the writes. You can read my post on MySQL replication vs NDB if you'd like a longer explanation.
This would be closer to Bigtable, S3, or MySQL cluster (different than normal MySQL) but be here today and much simpler. It wouldn't support SQL of course because it would be a dictionary/map interface but this model is working very well for Bigtable and S3 users. To make it practical it would have to support functionality similar to Bigtable including runtime repartitioning.

The core theory of MySQL replication scaling is bankrupt. The idea works in practice because people are able to cheat (cluster partitioning) and make somewhat large installs by selecting the right hardware and tuning their application.The theory behind a clustered DB is that most of the complexity behind writing a scalable application can be removed if you don't have to out-think your backend database (Google goes into this in their GFS and Bigtable papers).

The scalability of MySQL replication is essentially:
N * Tps * ((Tps * wf) - Tps)
where:
N is the number of machines in your cluster
Tps is the raw number of disk transactions per second per machine
wf is the 'write factor' or percentage of your transactions are writes (INSERT/DELETE/UPDATE) from 0-1.0.
You'll note that in the above equation if wf = 1.0 then you've hit your scalability wall and can't execute any more transactions. If wf = 0.0 you're performing all SELECT operation and actually scales fairly well.
The reason MySQL replication has worked to date is that most real world write factors are about .25 (25%) so most people are able to scale out their reads.

If you're running a cluster DB your scalability order is:
N * Tps * qf + ((N * Tps * wf)/2)

where qf is your query factor or the number of queries you need to run per second.
This is much more scalable. Basically this means that you can have as many transactions as you have boxes but writes perform on only 1/2 the boxes due to fault tolerant requirements. In this situation you can even scale your writes (though at 50% rate). MySQL cluster solves this problem as does Bigtable and essentially S3. Bigtable suffer's from lack of a solid Open Source implementation. S3 has a high latency problem since it's hosted on Amazon servers (though you can bypass this by running on EC2). My distributed Memcached backed by a persistent hashtable approach would have the advantage of scaling like the big boys (Bigtable, MySQL cluster, S3) and be simple to maintain and support caching internally. MySQL cluster doesn't do a very good job of caching at present and uses a page cache with is highly inefficient.

No comments:

Post a Comment