In Episode 2 of Unlocking the Data Vault, I mentioned the term Poison Data. I must admit, it’s a cool term, and I also admit the term is not mine. It stems from espionage: deliberate attempts to poison the data that feeds AI / ML and deep learning algorithms. Poisoning can distort documents, imagery, video, and audio content! I just thought I’d take a minute to give you a few links on the subject. It’s not a new subject; however, I feel it’s extremely relevant, particularly given that the number of hacks and data breaches is on the rise.
Links: How Is This Defined?
I heard the term in a TED Talk. I can’t remember exactly which one, so I’ve linked a few resources below. I encourage you to read, research, and learn about data poisoning, because this is “infected data”: not just bad data, but data that is bad by design, intended to alter your outcomes in AI / ML and deep learning.
- https://www.ted.com/playlists/130/the_dark_side_of_data << TED Talk playlist – I think I heard the term here >>
- https://aimagazine.com/data-and-analytics/data-poisoning-new-front-ai-cyber-war << Really interesting article
- https://www.fireeye.com/blog/threat-research/2019/03/breaking-the-bank-weakness-in-financial-ai-applications.html << I found this particularly interesting – especially with the Exploitation aspects
- https://arxiv.org/abs/1608.08182 << Paper on arXiv (hosted by Cornell University) – incredibly good, a how-to on poisoning the data behind factorization-based models…
These articles (and more) are just the tip of the iceberg. I hope you find them useful, and I hope you understand that Data Vault can actually help you discover poison data by evaluating the patterns of the data in the Raw Data Vault. Looking for outliers, and for patterns in those outliers (as I mentioned in Episode 2 of the podcast), is what’s important here. But beware: Trojan Data (dark horse data) also exists, and can hide in plain sight.
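To make the "outliers, and patterns of outliers" idea a bit more concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not a Data Vault standard: the row layout, the `amount` column, the `record_source` values, and the function names are all hypothetical, and the modified z-score (median and MAD, with the common 3.5 rule-of-thumb cutoff) is just one of many possible outlier tests. The first step flags individual outliers; the second step counts them per record source, which is where a pattern may start to show.

```python
from collections import Counter
from statistics import median


def modified_z_scores(values):
    """Robust z-scores based on the median and the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values are identical, so this test cannot score them.
        # Returning zeros means "nothing flagged", not "nothing wrong".
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(rows, threshold=3.5):
    """Return the rows whose 'amount' has an absolute modified z-score above threshold."""
    scores = modified_z_scores([row["amount"] for row in rows])
    return [row for row, score in zip(rows, scores) if abs(score) > threshold]


def outliers_by_source(rows, threshold=3.5):
    """Count flagged rows per record source to expose patterns in the outliers."""
    return Counter(row["record_source"] for row in flag_outliers(rows, threshold))


if __name__ == "__main__":
    # Hypothetical rows, as if pulled from a Raw Data Vault satellite.
    sample = [
        {"record_source": "vendor_a", "amount": a}
        for a in (100, 102, 98, 101, 99, 103, 97)
    ]
    sample += [{"record_source": "vendor_b", "amount": a} for a in (5000, 4800)]
    print(outliers_by_source(sample))  # Counter({'vendor_b': 2})
```

A single outlier is often just noise; a cluster of outliers that all arrive from the same source or in the same load window is the kind of pattern worth investigating. Keep in mind that a simple test like this will not catch Trojan Data that has been crafted to stay inside normal ranges.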
Hypothetical Question – Poison Data – Who’s Responsible?
What if you’re on the Snowflake Data Marketplace and Poison Data is embedded in what you buy? Who’s responsible for finding, fixing, and combating poison data? Is it the data providers / data sellers? Or is it Snowflake? Or… is it BOTH? I would argue it’s both the data provider and the platform. In my opinion: whoever is taking your money has a fiduciary responsibility to ensure the data is not loaded with poison.
I make this argument because a similar analogy holds with grocery stores. Who’s responsible for food quality (fresh foods, meats, vegetables, fruits)? The wholesaler? The grocery store? Or everyone who moves the food through the system?
Responsibility Lies With Those Taking Money For Selling You Data
Any vendor that offers data-as-a-service (DaaS) should be held legally responsible for re-selling / hosting poison data. Like the grocery store: what happens if you get sick? What happens if it takes your business down (which is the intent, by the way)? If you aren’t thinking this way, you need to be. This is part of good governance over your Data Providers and the Service Level Agreements (SLAs) that you have in place with them. My advice? Don’t contract with a DaaS provider unless they have a written mitigation strategy for handling poison data.
What about Spying?
Now this is interesting: an MIT guide on how to poison YOUR data to keep Big Tech from spying on you!
Hope you found these links useful. I look forward to discussing this with all of you in the #datavault landscape going forward.