I've been asked repeatedly over the years: what's the best way to handle key-value pair data, especially at high volume? (Say you're loading 700M rows every hour or so.)
Here are some performance and tuning hints and tips you need to consider (REGARDLESS of platform):
- First: switch to a Key-Value Database that's optimized for this sort of thing, or a Triple-store. If you can't do that, then continue:
- Store all your data as JSON in a giant PSA (persistent staging area) or flat-file landing zone, so you can handle re-loads and schema changes in the future
- For maximum performance in a relational-type DB: PIVOT the most important and most consistent key sets out into columns. In Snowflake DB you can go as high as 1024 columns (I think); each DB has its own upper limit. Pivot the most-accessed keys (by queries, by WHERE clauses, by joins). Leave the rest in the JSON KV format and access it only when needed.
- Leave the columns in the slow JSON KV store "out" of 80% of the queries unless they are DRILL-IN queries. This is a place where data virtualization technology like Stonebond will help.
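To make the landing-zone idea above concrete, here's a minimal Python sketch of landing raw key-value records as JSON Lines flat files, one file per load window, so you can replay any window later after a schema change. The directory layout, file-naming scheme, and `land_records`/`replay` helpers are my own illustrative assumptions, not a specific product's API.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def land_records(records, landing_dir):
    """Append raw key-value records to a JSON Lines file in the landing zone.

    One file per load window keeps re-loads simple: to replay, just
    re-read the files for the window you need.
    """
    landing_dir = Path(landing_dir)
    landing_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one file per UTC hour of loading.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H")
    path = landing_dir / f"kv_load_{stamp}.jsonl"
    with path.open("a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, sort_keys=True) + "\n")
    return path

def replay(path):
    """Re-read a landed file, e.g. to re-load a target after a schema change."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Usage: land two records, then replay them.
with tempfile.TemporaryDirectory() as d:
    p = land_records([{"sensor": "s1", "temp": 21.5},
                      {"sensor": "s2", "temp": 19.0}], d)
    rows = replay(p)
    print(rows[0]["sensor"])  # -> s1
```

Because the landing zone stores the raw JSON untouched, new keys that appear next month cost you nothing today; you only re-run the replay when you decide to promote them.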
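And here's a minimal sketch of the pivot step: promote a short list of "hot" keys (the ones hit by queries, WHERE clauses, and joins) to real columns, and pack everything else into a single JSON overflow column that only drill-in queries ever parse. The `HOT_KEYS` list, the `extra_kv` column name, and `pivot_record` are all hypothetical names for illustration.

```python
import json

# Hypothetical hot-key list: the keys most hit by queries, WHERE clauses, joins.
HOT_KEYS = ["device_id", "event_type", "temp"]

def pivot_record(record, hot_keys=HOT_KEYS):
    """Split one key-value record into pivoted columns plus a JSON overflow.

    Returns a flat dict: hot keys become real columns (None when absent),
    everything else stays packed in a single 'extra_kv' JSON string,
    to be parsed only by drill-in queries.
    """
    row = {k: record.get(k) for k in hot_keys}
    extras = {k: v for k, v in record.items() if k not in hot_keys}
    row["extra_kv"] = json.dumps(extras, sort_keys=True)
    return row

# Usage: two hot keys present, one missing, two keys left in the overflow.
raw = {"device_id": "d42", "event_type": "temp_read",
       "firmware": "1.0.3", "rssi": -67}
row = pivot_record(raw)
print(row["device_id"])  # -> d42
print(row["temp"])       # -> None (hot key absent in this record)
print(row["extra_kv"])   # -> {"firmware": "1.0.3", "rssi": -67}
```

The 80/20 query split then falls out naturally: most queries touch only the pivoted columns, and only drill-in queries pay the cost of parsing `extra_kv`.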
Short of actually leveraging a key-value DB, or pivoting your data, there really isn't a good answer. Even Snowflake says: JSON does perform, but if you want faster performance at higher scales, with better compression, you need to store data in a columnar format. The same goes for SQL Server Azure Big Data Edition, or SQL Server with a columnstore index.
Would love to hear your thoughts...
Hope this helps, Dan