A Quick Introduction to Cassandra’s Features

In Database, Distributed Systems, web application by Prabhu Missier

Cassandra is well suited to large-scale deployments that span multiple data centers and require high availability and high throughput. Let us look at some of Cassandra’s top features:

Distributed
A Cassandra cluster can be distributed across multiple data centers. There is no master-slave configuration and a peer-to-peer architecture ensures that you can read from or write to any node. There is no single point of failure.

Elastic Scalability
Cassandra scales horizontally: nodes can be added or removed seamlessly, and adding more nodes increases throughput linearly. For example, eBay reports more than 10 billion write operations per day and claims that Cassandra scales very well.

High Availability and Failover
Cassandra is highly available and supports forward failover: if a node fails, the next available node in the ring automatically takes over. Reads and writes are not affected, since a read or write can go to any node, and the peer-to-peer architecture leaves no single point of failure.
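
The failover idea above can be sketched with a toy ring. This is a simplified model and not Cassandra's actual implementation: the `Ring` class, the `token` helper, the node names, and the use of MD5 are all assumptions made for illustration, and the replication factor is a concept the text above does not spell out. Each row key is hashed to a position on the ring, its copies live on the next few nodes clockwise, and a read falls through to the next replica when one is down.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is used here only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical toy ring: nodes are ordered by the hash of their name."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Sort nodes by token so the ring can be walked clockwise.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, row_key):
        """Return the nodes holding copies of row_key, clockwise from its token."""
        start = bisect.bisect_right(self.tokens, token(row_key)) % len(self.entries)
        count = min(self.replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]

    def read_from(self, row_key, down=()):
        """Return the first live replica for row_key, or None if all are down."""
        for node in self.replicas(row_key):
            if node not in down:
                return node
        return None


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")
print("replicas:", owners)
# With the first replica down, the read is served by the next one in the ring.
print("served by:", ring.read_from("user:42", down={owners[0]}))
```

The sketch leaves out everything a real cluster needs to detect failures and repair data; it only shows why losing one node does not make a row unreachable.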

Tuneable consistency
Cassandra can be configured for high consistency or high availability, and this tuning can be done separately for reads and writes. You can read from or write to any node in Cassandra.
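
As a rough illustration of that trade-off, the usual rule of thumb is that a read is guaranteed to see the latest write when the number of replicas consulted on reads plus the number acknowledged on writes exceeds the replication factor. The function names below are hypothetical, and the replication factor of 3 is an assumed example value; the sketch only does the arithmetic and does not talk to a cluster.

```python
def quorum(replication_factor: int) -> int:
    """Size of a majority of replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3  # assumed replication factor for this example
# Majority writes plus majority reads: the two sets always share a replica.
print(reads_see_latest_write(rf, quorum(rf), quorum(rf)))
# One replica on each side: faster and more available, but reads may be stale.
print(reads_see_latest_write(rf, 1, 1))
```

Lowering either number favors availability and latency; raising them favors consistency, which is the knob the paragraph above describes.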

High performance
Though Cassandra is highly optimized for writes, it can also be configured to optimize reads. All disk writes are sequential, and Cassandra’s threading makes good use of multiple cores and multiple processors.
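
To make "sequential writes" concrete, here is a minimal sketch of an append-only write path, assuming the common log-structured design of a commit log, an in-memory table, and immutable sorted segments. The class `TinyWriteStore` and its `flush_threshold` parameter are hypothetical, an in-memory buffer stands in for the log file, and nothing here is Cassandra's real code.

```python
import io
import json


class TinyWriteStore:
    """Hypothetical toy store: writes only ever append, never update in place."""

    def __init__(self, flush_threshold=3):
        self.commit_log = io.StringIO()  # stands in for an append-only file on disk
        self.memtable = {}               # recent writes, held in memory
        self.segments = []               # immutable sorted segments, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        # 1. Append to the log (sequential I/O), 2. update the in-memory table.
        self.commit_log.write(json.dumps([key, value]) + "\n")
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the whole memtable out as one sorted, immutable segment.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # Newest data wins: check the memtable, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:
                if k == key:
                    return v
        return None


store = TinyWriteStore(flush_threshold=2)
store.write("a", 1)
store.write("b", 2)   # reaches the threshold, so the memtable is flushed
store.write("a", 3)   # newer value for "a" lives in the memtable
print(store.read("a"), store.read("b"), len(store.segments))
```

The read path shows the cost of this design: a read may have to look in several places, which is one reason the write side tends to be the faster one.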

SQL-like interface
Cassandra uses CQL, which has a syntax similar to SQL but is not SQL. You can use SQL-like queries to create tables, insert rows, and so on.

Dynamic Columnar storage
Cassandra is highly suited for large and sparse data sets. Data is stored in hash tables like so:

hashmap<ROWID, map<columnID, columnvalue>>
Every row has a unique key which maps to a set of columns. The row key is also used for partitioning. Each column is identified by a key which has a corresponding value. The same columns need not be present in every row.
Row keys should be hashed so that records are distributed evenly across the cluster. There are no relations (no joins or foreign keys) between tables.
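
The nested-map model and the hashing of row keys can be sketched in a few lines. The sample rows, the `node_for` helper, and the choice of MD5 with a simple modulo are assumptions for illustration only; a real cluster assigns token ranges to nodes rather than taking a modulo.

```python
import hashlib
from collections import Counter

# Sparse rows: each row key maps only to the columns that row actually has.
rows = {
    "user:1": {"name": "alice", "email": "alice@example.com"},
    "user:2": {"name": "bob"},
    "user:3": {"name": "carol", "city": "Chennai", "plan": "free"},
}


def node_for(row_key: str, node_count: int) -> int:
    """Hash a row key to a node index (hypothetical, simplified placement)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


# A column that a row lacks is simply absent; no placeholder is stored.
print("email" in rows["user:2"])

# Hashing spreads sequential-looking keys roughly evenly over four nodes.
placement = Counter(node_for(f"user:{i}", 4) for i in range(10_000))
print(sorted(placement.items()))
```

Because placement depends only on the hash of the row key, any node can work out where a row lives without consulting a central directory.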

So, in conclusion, you would use Cassandra if you have to grapple with large and sparse datasets and plenty of fast write operations, have geographically distributed clusters, require high availability and tuneable consistency, and want to scale seamlessly.