storageous

All things Storageous in Storage!


43 Comments

Twitter’s Blob store and Libcrunch – how it works

TwitterYou may have read one of my previous posts “Arming the cloud” where I talked about why and how large cloud providers are using commodity hardware with intelligent API’s to separate the dumb data and intelligent data to give us a better service. Well in a world of distributed computing and networking you will probably not find larger than Twitter.

To me and you when we upload a photo to the cloud its in the “cloud” we do not care much for what goes on in the background all we care about is how long is takes to upload or download. And this has been Twitter’s challenge, how do they  keep all this data synchronized around the world to meet our immediate demands? It is a common problem of how do large-scale web and cloud environment’s allow users from anywhere in the world to use the photo sharing service overcoming latency which ultimately boils down to me and you waiting for the service to work.

So Twitter announced a new photo sharing platform, but what I am going to look at is how the company manage software and infrastructure to enable this service. Here is what Twitter released yesterday;

“When a user tweets a photo, we send the photo off to one of a set of Blobstore front-end servers. The front-end understands where a given photo needs to be written, and forwards it on to the servers responsible for actually storing the data. These storage servers, which we call storage nodes, write the photo to a disk and then inform a Metadata store that the image has been written and instruct it to record the information required to retrieve the photo. This Metadata store, which is a non-relational key-value store cluster with automatic multi-DC synchronization capabilities, spans across all of Twitter’s data centers providing a consistent view of the data that is in Blob store.”

Sound familiar to what I was discussing in my previous posts? Of course it is, this is a classic example of commoditizing storage\compute\network hardware and having the software API intelligently manage this data.

So what you have to consider with a platform like Twitter is speed and cost, they want users to be able to see the tweet with the picture as soon as possible but they have to be conscious of cost to deliver this service. Twitter has many data centers with many resources but the trade off is always going to be cost.

The next element of this is reliability, how do Twitter ensure that your photos exist in multiple locations on file but not too many to cost too much to Twitter, it also has to think about how and where it stores information on servers which indicate where the actual file exists (meta data). If we took the servers for example, and then thought about how many photos are uploaded to Twitter each day, that’s a lot of meta data to store, what if one of those servers then fails? Then you would lose all meta data and the service would be unavailable. To remedy this the original way of thinking is to replicate this data, but that is costly and time-consuming to keep synchronized and lets not forget will be using some serious space.

So Twitter introduced a library called “libcrunch” and here is what they had to say about it;

“Libcrunch understands the various data placement rules such as rack-awareness, understands how to replicate the data in way that minimizes risk of data loss while also maximizing the throughput of data recovery, and attempts to minimize the amount of data that needs to be moved upon any change in the cluster topology (such as when nodes are added or removed).”

Does that sound familiar again? This is the Atmos play from EMC which is using intelligent API’s to manage all aspects of an element of data, I referred to this last time as an “Object Store”, and the point of this that the API itself understands what to do with a particular piece of data in terms of replication, security, encryption and protection. So we are no longer administering pools of storage but the API is self managing itself, and in the case of Twitter you have to admit that this would be the only way of doing this.

So what does the infrastructure look like, well they use cheap hard drives to store the actual file and the meta data is served from EFD drives for increased speeds. Think of meta data as a search engine it allows you to find articles related to a query very quickly rather than looking at the entire web.

So to sum this up as we place more and more information in to the cloud which is a blend of distributed compute and network, locating information across them is becoming more difficult and slow. Thinking like this with API’s controlling the data according to policies is the right direction to take when using large cloud services.

If you are interested in looking at a cloud solution platform delivering intelligence like this go to EMC Atmos

 

 

Advertisements