Only last year I was introduced to Datastax Cassandra and the performance blew me away. Querying 176 million records in under 5 seconds into a tableau dashboard. Amazing.
But then I did my research, how does it do it?
It comes down to NoSQL on a distributed computing paradigm focused on hash-distribution, bloom filters, SSTables and MEMTables.
The original idea I was told was to load data vault into Cassandra and perform the querying using Apache Spark SQL, a marvelous idea I thought, I dug a little deeper...
- How does Cassandra achieve such impressive performance metrics?
INSERT-ONLY, DV2.0 does this.
Consistent hashing, column families and bloom filters.
1) data is sent consistently to different nodes and replicated across other nodes.
2) data is organized vertically and laterally into families across nodes.
3) a memory efficient algorithm that depending on the query let's you know where the data is definitely is not located and maybe where it might be located.
Because all the attributes are available to a key depending on the query all copies of the data must be consistent on the outset. This is an inverse of how a relational database works. Getting your results from an RDMS is a join-on-read operation whereas table in Cassandra the operation creates a join-on-write. The query is executed upfront.
Consensus - Cassandra does this through the Paxos protocol where the majority wins. Sounds obtuse... but imagine if you have nodes everywhere with data replicas and the nearest node to a client responds to a query (Cassandra is masterless) what is the value that is reported to the customer? If the nearest node is not updated what is the correct value? Do we need to hear the response of all nodes to what is the correct value or reach a consensus by majority to decide what is the value to report on? Cassandra can be configured for different replication strategies depending on the operating model, although not 100% consistent it is very close. This is a part of CAP theorem (recommended reading). How does it get all of its nodes to agree on the data values in the presence of non-responsive nodes? Paxos. What about Byzantine generals... ?
- What about indexing?
Secondary indexes - it is possible but not recommended. Secondary indexes are intended for low cardinality columns and even then it still forces the query to make a full cluster scan --- anti-pattern.
Materialized views - yes possible, a nice and convenient way of not having to ensure that multiple tables created for a dashboard are managed explicitly for data transfer. Really depends on the rate of update this could be a hindrance instead of an advantage. The message with MViews is this: you will sacrifice performance for manageability.
Apache Solr ... tightly coupled to Cassandra data nodes is the index developed through Apache Solr, based on lucene it is a trie hierarchy designed for auto completion of your search query as you type it. That's right, the auto-completion you are used to in Google is available in Cassandra.
Like most BigD platforms there is tight integration with Apache Spark, it can be used to perform at-node data analytics at scale and joining of data beyond CQL. In memory joins may be a way to deal with multi-node table joins... but you're still performing work to answer queries.
- What about modelling?
Data modelling in Cassandra feels similar to how you would model any ontology with an additional step, modeling the workflow for the frontend app. To do this you make use of what has been developed as an open source data modelling tool, KDM that uses chetboko notation.
Kashlev Data Modeler: http://kdm.dataview.org/
Here's the thing about Cassandra, depending on the key you define it will discard older versions of data loaded by that key; it is called "compaction". Not an uncommon term but in Cassandra it is in MEMTable meaning once applied all traces of that original record is gone. Bye bye audit.
- Who else does it?
It wouldn't be fair without some competition and there is competition, not only in storage but in methods for technology. The implementation of Apache Cassandra is inspired by a white paper describing DynamoDB before it become GA on AWS. The other mainstream implementation is Riak and it makes use of its own data type called Conflict Free Data Type (CRDTs) and slight deviations to perform the same function - consistent and highly available data everywhere.
- Does it suite to load a data vault in Cassandra?
I'd argue no, horses for courses. Cassandra is built for fast ingestion and fast consumption of the data, both attractive to data vault practitioners. But it gets there through its unique paradigm of replicating data everywhere for every query needed by a dashboard and discarding records loaded by the same key. The paradigm is described as Eventual Consistency (like I did in a previous post) but this is down to an architecture level where the nearest node is expected to serve a client. Read this: every unique query in your dashboard is a table in Cassandra. A drop down by account is a table, a drop down customer is another table...
Should data go to data vault before going to Cassandra? No, it should go to both simultaneously. It serves the mobile apps, dashboards and other real time analytics that as soon as the fact occurs that it gets loaded to the frontend. It must still go to data vault for use cases data vault is designed to serve: agility, automation and audit. Keep a trail of what happened to serve other analytic requirements.
Cassandra is capable of storing time-series data but there are other data stores more suited like InfluxDB and Druid. Would you build a Data Vault over MongoDB?
Welcome to hear your experience on Cassandra, similar tech or where you found DV is not suited.