Dynamo DB – Internal Design

Dynamo DB target requirements
1)All writes must not be rejected
2)No hierarchical namespaces or relational schema
3)No multiple hops. All nodes store sufficient info to route to the appropriate node
4)Setup in a trusted environment so security is not a requirement

Partitioning
uses consistent hashing to distribute load across multiple hosts.the output range of a hash function is treated as a ring. ie. the largest value wraps around to the samallest value.
Each node is assigned a random value in this ring which represents its position. When data has to be stored, the key of the data is hashed to yield a position in this ring and then you walk clockwise around the ring till you find the first available node with a position greater than the key’s position.
The advantage is that failure of a node affects only its neighbour nodes.
Random placement of nodes results in non-uniform load distribution. Also consistent hashing does not consider heterogeneity in nodes. So Dynamo uses the concept of virtual node where a physical node may be responsible for more than 1 virtual node. These physical nodes get assigned to different points around the ring.
Using virtual nodes has the following advantages:
if a node goes down, it’s work is shared equally among the remaining nodes.
if the node comes back online, it gets an equivalent amount of work from each available node
a node can be assigned to one or more virtual nodes depending on it’s capacity thereby accounting for heterogeneity

Replication
A key is usually replicated at N nodes where N can be configured per instance. So a key is hashed into a node and also N-1 successor nodes. This list of nodes is called the preference list.
Since the successor nodes could be virtual nodes containing the same physical node, Dynamo skips positions in the ring to ensure the key is stored in distinct physical nodes and not multiple virtual nodes

Versioning
Dynamo supports eventual consistency and writes are async so when a client writes to the database, the value may not reflect immediately in all the replica servers.
Vector clocks are used to reconcile conflicting versions. Simply put the system can never lose a change made by the client. All changes are considered and reconciled using timesstamps, sequence numbers and causal relationships.

Set and get
Dynamo uses a preference list as explained above to store a key and it’s associated value in N nodes. So when a read/write request comes in from a load balancer or the client the top node in the preference list processes the request and then passes it on to the next N-1 nodes.
If R is the minimum number of nodes participating in a read and W is the minimum in a write then using the quorum system R+W > N. So a write is considered successful if once the top coordinator node has completed the write, W-1 nodes respond with a status of success.

Handling failures – hinted handoff
Dynamo uses “sloppy quorum” which means that a write is made to ‘N’ healthy nodes if one or more of the N nodes in the prefernce list is down. Sticking to the traditional quorum rules will reduce durability in the simplest of failures.
A hint is also written along with the data to the interim node and once the node in the preference list comes back online data is sent back to it from the interim node. The interim node may then delete any copy of the write that was made.
Also nodes in the preference list are physically placed in different data centers cnnected by high speed network links

Handling inconsistencies – Merkle trees
A Merkle tree has leaves which reflect the hashes of the keys. Parent nodes are hashes of the children. Dynamo uses Merkle trees to detect if there are inconsistencies by comparing the parent nodes. If they are different then that would mean that the nodes have different values. The values of the children are exchanged to remove the inconsistency and the check is continued up the tree.
.

Internal Implementation
Dynamo has a pluggable local persistence engine. This can be Berkeley DB(BDB) Transactional Data Store, BDB Java edition or MySQL. Depending on an application’s access patterns the persistence engine can be changed. For instance mySQL can handle larger objects than BDB which can handle tens of KBs.
BDB Transactional Data Store is the most prevalent.

References
http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf