The growing importance of timing in data centers

Like a bad episode of Hoarders, people love to store all things digital, most of which will never be accessed again. And, like a bad episode of Storage Wars, our love of storing crap means we need more places to store it. Today’s content has outgrown even the hydroelectric-dam-powered mega data centers built just yesteryear. Increasingly, operators are distributing their information across multiple geographically dispersed data centers. As the number, size, and distances between data centers have steadily grown, timing distribution and accuracy have likewise grown in importance in keeping those data centers in sync.
In a previous article I discussed new standards being developed to increase the accuracy of timing for the internet and other IP-based networks. Current systems and protocols offer milliseconds of accuracy, but that just isn’t enough as we depend more on real-time information and as compute, storage, and communications networks become more distributed. While the importance of timing in mobile backhaul for next-generation LTE-Advanced networks is often cited, there has been less publicity around the need for these new timing technologies in the continued growth of data centers.

The rise of Hadoop in an age of digital garbage

Dinosaur image courtesy of Flickr user Denise Chen.

Massive storage of data appears to occur in periods, very analogous to dinosaur evolution. A database architecture rises to the forefront on the strength of its advantages, until it scales to the breaking point and is completely superseded by a new architecture. At first, databases were simply serial lists of values in row/column arrangements. Database technology leapt forward and became a self-sufficient business with the advent of relational databases. For a while it appeared relational databases would be the last word in information storage, but then came Web 2.0, social media, and the cloud. Enter Hadoop.
A centralized database works, as the name suggests, by having all the data located in a single indexed repository with massive computational power to run operations on it. But a centralized database cannot hope to scale to the size needed by today’s cloud apps. Even if it could, the time needed to perform a single lookup would be unbearable to an end user at a browser window.
Hadoop decentralizes the storage and lookup, as well as the computational power. There is no index, per se. Content is distributed across a wide array of servers, each with its own storage and CPUs, and the location and relation of each piece of data is recorded in a map. When a lookup occurs, the map is read, and all the pieces of information are fetched and pieced together again. The main benefit of Hadoop is scalability: to grow a database (and its computational power), you simply keep adding servers and growing your map.
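The map-then-fetch idea above can be sketched in a few lines. This is a toy illustration, not Hadoop’s actual API (in real HDFS the map lives in a NameNode and the blocks on DataNodes): content is split into blocks, blocks are spread across servers, and a central map records where every block lives so a lookup can reassemble them.

```python
# Toy sketch of map-based distributed storage (illustrative, not HDFS).
class MiniDistributedStore:
    def __init__(self, num_servers):
        # Each "server" is just a dict here; real servers would be remote.
        self.servers = [dict() for _ in range(num_servers)]
        self.block_map = {}  # key -> list of (server_index, block_id)

    def put(self, key, content, block_size=4):
        blocks = [content[i:i + block_size]
                  for i in range(0, len(content), block_size)]
        locations = []
        for block_id, block in enumerate(blocks):
            server = hash((key, block_id)) % len(self.servers)  # spread blocks out
            self.servers[server][(key, block_id)] = block
            locations.append((server, block_id))
        self.block_map[key] = locations

    def get(self, key):
        # Read the map, fetch every block, and piece the content together.
        pieces = [self.servers[s][(key, b)] for s, b in self.block_map[key]]
        return "".join(pieces)

store = MiniDistributedStore(num_servers=3)
store.put("article", "timing matters in data centers")
assert store.get("article") == "timing matters in data centers"
```

Growing the system really is just adding entries to `self.servers` and letting new blocks land on them, which is the scalability win described above.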

Even Hadoop is buried under mounds of digital debris

It looked like Hadoop would reign supreme for generations to come, with extensions continuously breathing new life into the platform. Yet, after only a decade, Hadoop-based databases such as Facebook’s are at the breaking point. Global traffic is growing faster than exponentially, and most of it is trash. Today’s databases look more like landfills than the great Jedi Archives. And recently hyped trends such as lifelogging suggest the problem will get much worse long before it gets better.
The main limitation of Hadoop is that, while it works great within the walls of a single massive data center, it is less than stellar once a database outgrows those walls and has to run across geographically separated sites. It turns out the main strength of Hadoop is also its Achilles heel: with no index to search, every piece of data must be sorted through, a difficult proposition once databases stretch across the globe. A piece of retrieved data might be stale by the time it reaches the requester, or mirrored copies of data might conflict with one another.
Enter an idea to keep widely dispersed data centers in sync: Google True Time. To grossly oversimplify the concept, the True Time API adds time attributes to data being stored, not just for expiration dating, but also so that the content of all the geographically disparate data centers can be time aligned. For database aficionados, this is sacrilegious, as leading database protocols are specifically designed to ignore time to prevent conflicts and confusion. Google True Time turns the concept of data storage completely inside out.

Introducing Spanner

In True Time, knowing the accurate “age” of each piece of information, in other words where it falls on the timeline of data, allows data centers that may be 100ms apart to synchronize not just the values stored in memory locations, but the timeline of values in memory locations. In order for this to work, Google maintains an accurate “global wall-clock time” across their entire global Spanner network.
Transactions that write are timestamped and use strict two-phase locking (S2PL) to manage access. The commit order is always the timestamp order, and both respect global wall-clock time. This simple set of rules maintains coordination between databases all over the world.
However, this introduces an element of uncertainty into each data field, the very reason time has been shunned in database protocols since the dawn of the database itself.

Google calls this “network-induced uncertainty”, denotes it with an epsilon, and actively monitors and tracks the metric. As of summer 2012, this value was running around 10ms at 99.9 percent (three nines) certainty. Google’s long-term goal is to reduce it below 1ms. Accomplishing this will require a state-of-the-art timing distribution network, leveraging the same technologies being developed and deployed for 4G LTE backhaul networks.

A modest proposal

While True Time was most likely developed to improve geographic load balancing, now that accurate time stamping of data exists, the possibilities are profound. The problems associated with large databases go beyond simply managing the data: the growth rate itself is unsustainable. Data storage providers must do more than grow their storage; they must also find ways to improve efficiency and stem the tsunami of waste that is common in the age of nearly free storage.
It’s a dangerous notion, but one simply must challenge the basic tenet that all data is forever. Our minds don’t work that way; why should computers? We hold on only to key memories, and the further we get from an event, the fewer details we retain. Perhaps data storage could work similarly. Rather than deleting a picture that hasn’t been accessed in a while, a search is performed for similar photos and only one is kept. And as time passes, rather than simple deletion, a photo is continuously compressed, with less information kept each time, until the photo memory fades into oblivion, like that old Polaroid hung on the refrigerator door.
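That fading-memory idea can be sketched as an age-based sweep. Everything here is my own illustration of the proposal above, not an existing system, and simple truncation stands in for what a real system would do with progressively lossier recompression:

```python
# Toy sketch of "fading" storage: each sweep keeps a shrinking fraction
# of an old item's bytes until nothing remains. (Illustrative only; a
# real system would re-encode the photo at ever-lower quality instead.)
def fade(content: bytes, age_in_sweeps: int, half_life: int = 2) -> bytes:
    """Keep a fraction of the bytes that halves every `half_life` sweeps."""
    keep = len(content) // (2 ** (age_in_sweeps // half_life))
    return content[:keep]

photo = b"\xff" * 1024               # stand-in for a 1 KB photo
assert len(fade(photo, 0)) == 1024   # fresh: full detail
assert len(fade(photo, 2)) == 512    # older: half the detail kept
assert len(fade(photo, 22)) == 0     # ancient: faded into oblivion
```

The timestamps True Time already attaches to data are exactly what such a sweep would need to know an item’s age.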

By Jim Theodoras
