If you can afford the seemingly high cost of DynamoDB, it should be the default choice: its APIs are simple, and you avoid the hassle of scaling up and down and managing replication that an equivalent Cassandra cluster would require. As a managed service, DynamoDB relieves you of the headaches of running a large Cassandra cluster.
NoSQL solutions such as Cassandra and DynamoDB are usually evaluated when the expected number of reads and writes per second is higher than a typical relational database could probably handle, or when the schema must stay flexible enough to adapt to recurring changes in the application. Thanks to their internal architecture, these systems also claim to handle far more concurrent writes per second than a simple relational database such as MySQL.
All of this implies that, at least in production, you will be dealing with a very high read and write load on the underlying database, and will probably run at least one cluster of nodes, partitioned according to the application logic and certainly replicated to tolerate failures. So let us break it down and compare the two systems on several parameters:
- Ease of Maintenance
DynamoDB offers NoSQL database-as-a-service. It might look costlier than hosting an equally robust Cassandra cluster in your own data center, but it removes the need for a highly trained TechOps team to maintain the cluster, and scaling up or down no longer means adding or removing nodes yourself.
With DynamoDB, we just specify the IOPS (I/O operations per second) we need; in other words, we specify the read and write throughput we need, and we can do so per table. DynamoDB handles replication, clustering and everything else internally. We just interact with DynamoDB through either the REST API or the AWS SDK.
- API Support
Feature-wise, Netflix’s Astyanax and the AWS SDK for DynamoDB offer similar features (connection pooling, an annotation-based ORM layer, batch operations, automatic node discovery and so on), but the AWS SDK turns out to be easier and more elegant to use. The Netflix API is more Thrift-based, and you need to understand the internals of a column family to create one. DynamoDB also provides a low-level API, but that looks more difficult to use.
To be fair, DataStax has also released an updated Java driver for Cassandra that supports executing CQL. I have not been able to try it yet, but it definitely looks promising.
- Document Type
Cassandra is essentially a key-value store (more precisely, a wide-column store), while DynamoDB supports both key-value data and rich documents. Both have the notion of sorting on a range key and of fetching a value by ID (obviously, they are key-value stores!), and neither offers SQL-like joins. DynamoDB offers atomic counters, which may turn out to be handy in some situations. Both support sets and maps in addition to the basic types. Cassandra additionally provides time-based UUIDs, which DynamoDB does not offer out of the box, but if you use the AWS SDK for Java, it supports the standard Java types.
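To make the atomic counter idea concrete, here is a minimal sketch of how an increment request might be put together. The table name `page_stats`, the key `page_id` and the attribute `view_count` are made up for illustration; the parameter names follow the DynamoDB `UpdateItem` request as exposed by boto3, and nothing here actually calls AWS.

```python
def counter_increment_params(table_name, key_name, key_value, counter_attr, step=1):
    """Build the parameters for an UpdateItem call that atomically adds `step` to a counter.

    All table and attribute names passed in are the caller's own (hypothetical here).
    """
    return {
        "TableName": table_name,
        "Key": {key_name: {"S": key_value}},
        # ADD increments a numeric attribute server-side, creating it if it is missing.
        "UpdateExpression": "ADD #counter :step",
        "ExpressionAttributeNames": {"#counter": counter_attr},
        "ExpressionAttributeValues": {":step": {"N": str(step)}},
        "ReturnValues": "UPDATED_NEW",
    }


params = counter_increment_params("page_stats", "page_id", "home", "view_count")
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("dynamodb").update_item(**params)
```

Because the addition happens on the server, two clients incrementing at the same time do not need to read the current value first.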
- Secondary Indexes
Both support secondary indexes, but Cassandra does not really encourage creating them, as they come at a cost. DynamoDB, on the other hand, encourages creating local secondary indexes (limited to 5 per table), since an index can reduce the capacity cost of querying the table. As with all indexes, though, it results in additional writes and hence increases the cost.
DynamoDB provides more flexibility in defining secondary indexes: you can create a local or a global secondary index. You can also create a sparse index (à la MongoDB). You can also specify the projection when creating the index, and if it matches your use case, queries become really cheap.
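As a rough illustration of what specifying a projection looks like, the sketch below builds one local secondary index definition in the shape the DynamoDB `CreateTable` request expects. The index and attribute names (`by_status`, `customer_id`, `status`, `total`) are hypothetical, and the helper itself is not part of any SDK.

```python
def local_secondary_index(index_name, hash_key, range_key, projected_attributes=None):
    """Build one entry for the LocalSecondaryIndexes list of a CreateTable request."""
    if projected_attributes:
        # Copy only the listed non-key attributes into the index.
        projection = {
            "ProjectionType": "INCLUDE",
            "NonKeyAttributes": list(projected_attributes),
        }
    else:
        # Smallest (and cheapest to write) index: keys only.
        projection = {"ProjectionType": "KEYS_ONLY"}
    return {
        "IndexName": index_name,
        "KeySchema": [
            # A local secondary index shares the table's hash key
            # and uses a different range key.
            {"AttributeName": hash_key, "KeyType": "HASH"},
            {"AttributeName": range_key, "KeyType": "RANGE"},
        ],
        "Projection": projection,
    }


index = local_secondary_index("by_status", "customer_id", "status", ["total"])
```

A query that needs only `customer_id`, `status` and `total` could then be answered from the index alone, which is where the saving in read capacity comes from.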
- Cost Factor
I have not done the math, but DynamoDB generally costs more than running a Cassandra cluster. Still, from what I have heard and experienced a little, paying that cost seems much better than dealing with Cassandra cluster issues around consistency, replication and scaling.
In DynamoDB, when you create a table you provision its throughput, for example by declaring that the table supports 10,000 writes per second and 5,000 reads per second. Obviously, increasing those numbers increases the cost.
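A minimal sketch of what that provisioning might look like as a `CreateTable` request, using the numbers above. The table name `events` and key `event_id` are invented for the example, and the sketch assumes each read or write fits in a single capacity unit; it only builds the parameters and does not contact AWS.

```python
def create_table_params(table_name, hash_key, read_units, write_units):
    """Build CreateTable parameters with provisioned throughput for one table."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [{"AttributeName": hash_key, "AttributeType": "S"}],
        "KeySchema": [{"AttributeName": hash_key, "KeyType": "HASH"}],
        # Throughput is declared per table and is what you are billed for.
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }


# Hypothetical table sized for 5,000 reads and 10,000 writes per second.
params = create_table_params("events", "event_id", 5000, 10000)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("dynamodb").create_table(**params)
```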
When you run a query (an insert or a read), it consumes some of the table's throughput; the amount consumed is called the capacity units of that query. So if a query consumes 5 read capacity units, then with the provisioned throughput above you can run 5,000 / 5 = 1,000 such queries per second. If you want to run more queries, you have to increase the throughput of the table.
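The arithmetic can be written as a tiny helper. The function name is made up for this post, and the sketch assumes every query consumes the same number of capacity units.

```python
def max_queries_per_second(provisioned_units, units_per_query):
    """How many identical queries fit into the provisioned throughput each second."""
    if units_per_query <= 0:
        raise ValueError("units_per_query must be positive")
    return provisioned_units // units_per_query


# 5,000 provisioned read units, 5 units per query.
print(max_queries_per_second(5000, 5))  # 1000
```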
With DynamoDB, if a read exceeds a certain size (4 KB), it is counted as another read capacity unit. So reading 6 KB of data in one query is treated as 2 read capacity units. Similarly, the capacity units consumed depend on whether you want eventual or strong consistency. All such factors add to the overall cost of using DynamoDB, which is why data modeling is so important before you go ahead and implement the solution.
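Here is a small sketch of that sizing rule. It assumes the 4 KB block size mentioned above and the documented behaviour that an eventually consistent read costs half as much as a strongly consistent one; the function name is invented for illustration.

```python
import math

READ_UNIT_BYTES = 4 * 1024  # one read capacity unit covers up to 4 KB


def read_capacity_units(read_size_bytes, strongly_consistent=True):
    """Estimate the read capacity units consumed by a single read."""
    # Sizes are rounded up to the next 4 KB block, with a minimum of one block.
    blocks = max(1, math.ceil(read_size_bytes / READ_UNIT_BYTES))
    # Eventually consistent reads are charged at half the rate.
    return blocks if strongly_consistent else blocks / 2


print(read_capacity_units(6 * 1024))                             # 2
print(read_capacity_units(6 * 1024, strongly_consistent=False))  # 1.0
```

Keeping items small, or reading only what you need, is therefore a direct way to keep the bill down.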
Cost is definitely an important factor to weigh before making a decision, and Cassandra wins hands down if you look only at the operational cost. But that is hardly the complete picture. It is also wise to estimate the cost of the time DevOps/TechOps invest (or waste) managing and debugging the Cassandra cluster: replication, consistency, software upgrades, node replacements and the like. Initially these may seem minor, but as the cluster grows, so does the pain of managing it.
With DynamoDB, on the other hand, be careful not to create too many secondary indexes (they quickly increase the cost), and if your application really requires heavyweight querying capabilities, it would be more prudent to evaluate a search solution such as CloudSearch or SolrCloud instead.