Big Data and Security — Oxymoron?
Big data technologies have dramatically changed the world we live in, and in double-quick time. You know that unless you have been living under a Martian rock. We take them for granted in many of our daily interactions, in our personal lives as well as at work. Big data technologies fuel the seemingly never-ending growth of the big tech behemoths: not only the Big Five of Google, Amazon, Facebook, Microsoft and Apple, but also many, many others that owe their growth to big data.
The two-part question we have been posing in our research is:
1. How dependable are such technologies — can we trust life-and-death decisions to these algorithms?
2. And what is the trend line, i.e., are big data technologies becoming more dependable over time, and, if so, at what rate?
This question forms the central theme of our Army Research Laboratory-supported Army Artificial Intelligence Innovation Institute (A2I2) project that started in 2020 and is funded for five years.
Of course, this is a sweeping question to ask for several reasons. What kinds of big data technologies? What application domains are we talking about? What does dependability mean — resilience to what kinds of failures or to what kinds of security attacks? Therefore, this post will generalize across some subsets of these factors and, in a few cases, talk of instances specific to A2I2.
Terminology clarification: I use the term “dependability” rather than “security” because the former encompasses resilience both to malicious attacks and to natural bugs that are introduced without malicious intent.
Why should we care about big data becoming dependable?
We care because we increasingly rely on these systems to make critical decisions, at work and at home in civilian life, and, slowly but surely, in law enforcement and military situations.
For example, at home, the Siris and Alexas of the world are at our beck and call to do our bidding. This also means they are in private spaces with ears, and sometimes eyes, that can hear our private conversations or see our private activities. At work, many companies process large amounts of data, the most prized currency of the digital realm, and the company that knows how to extract the most value out of the data wins. So the winner is morphing from the company that has access to the troves of data to the one that has such access and is also able to extract monetizable insights from it.
On the non-civilian side, cybersecurity increasingly relies on automated bots for defending our systems, as well as for attacking others. Also, decisions about law enforcement, such as where to police or how to decide on sentencing, are being made in part by algorithms (an area that has been the focus of intense debate in political and intellectual circles).
So overall, big data systems have stepped out definitely and defiantly from the playhouse to the house of consequential decisions, even decisions of life and death.
How do we make big data dependable?
In the context of our institute, we achieve dependability of big data by answering the following big questions:
1. Can we build algorithms that are distributed and yet resilient to malicious actors?
2. Can we execute such algorithms in real time on a distributed computing platform? The distributed platform has devices of all hues, static and mobile, from embedded devices through edge computing devices to server-class machines.
3. Can we, as human users of such systems, see into the inscrutable workings of these algorithms? In other words, can we get some explanation for security-related decisions taken by the algorithms?
Here is our approach:
Attack model. First, let us think of what we are trying to make our autonomous systems resilient against. In one dimension, we account for the possibility that our data or models can be corrupted maliciously. Data corruption can happen at training time, while learning the model, or at inference time, while using the model. Model corruption can occur in a targeted manner (e.g., the model always mistakes the color red for green) or in an untargeted, and generally more damaging, manner (e.g., the model has poor accuracy across all classes). In another dimension, we account for what fraction of the agents in the system can act maliciously and, for those malicious agents, what level of collusion is possible.
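To make the targeted versus untargeted distinction concrete, here is a toy sketch of training-time data corruption. It uses label flipping as a stand-in attack; that choice, along with the function names `poison_targeted`, `poison_untargeted`, and `disagreement`, is my assumption for illustration and not a description of the specific attacks studied in A2I2.

```python
import random

def poison_targeted(labels, source, target):
    """Targeted corruption: every `source` label is rewritten as `target`."""
    return [target if y == source else y for y in labels]

def poison_untargeted(labels, classes, fraction, seed=0):
    """Untargeted corruption: a random `fraction` of the labels is
    replaced by some other class, with no particular class singled out."""
    rng = random.Random(seed)
    poisoned = list(labels)
    k = int(len(labels) * fraction)
    # Pick k distinct positions and force each one to a different class.
    for i in rng.sample(range(len(labels)), k):
        poisoned[i] = rng.choice([c for c in classes if c != poisoned[i]])
    return poisoned

def disagreement(clean, poisoned):
    """Fraction of labels that differ between the clean and poisoned sets."""
    return sum(a != b for a, b in zip(clean, poisoned)) / len(clean)

# Hypothetical three-class training labels.
classes = ["red", "green", "blue"]
clean = classes * 4

targeted = poison_targeted(clean, "red", "green")
untargeted = poison_untargeted(clean, classes, fraction=0.5)

print("targeted disagreement:  ", disagreement(clean, targeted))
print("untargeted disagreement:", disagreement(clean, untargeted))
```

A model trained on the targeted set would likely learn to call red things green while behaving normally elsewhere; one trained on the untargeted set would tend to degrade across all classes. The attacker's budget here is the `fraction` parameter, which loosely mirrors the "what fraction of agents is malicious" dimension above.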
Robust algorithms under adversarial settings. We develop algorithms that can provide guarantees around their results even under adversarial conditions. This is challenging as one has to consider unconstrained adversaries, including those who know the inner workings of your algorithms. Our guarantees apply to worst-case behavior while also ensuring that performance under benign conditions is not significantly reduced.
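As a rough illustration of what resilience to malicious agents can look like, the sketch below compares two ways of combining model updates from several agents: a plain average and a coordinate-wise median. The median is a well-known baseline that I am using as an assumption here; it is not the A2I2 algorithm, and it tolerates outliers only when malicious agents are a minority. The helper name `aggregate` and all numbers are hypothetical.

```python
from statistics import mean, median

def aggregate(updates, reducer):
    """Combine per-agent update vectors coordinate by coordinate.

    `updates` is a list of equal-length lists of floats, one per agent.
    `reducer` maps the values of one coordinate to a single number.
    """
    return [reducer(coordinate) for coordinate in zip(*updates)]

# Three honest agents roughly agree; one malicious agent sends garbage.
honest = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1]]
malicious = [[100.0, -100.0]]
updates = honest + malicious

mean_agg = aggregate(updates, mean)      # dragged far away by one attacker
median_agg = aggregate(updates, median)  # stays close to the honest values

print("mean:  ", mean_agg)
print("median:", median_agg)
```

The design point is that a single attacker can move the average arbitrarily far, whereas the median stays within the range of the honest values as long as honest agents form a majority. Real guarantees against adaptive, colluding adversaries need considerably more than this.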
Secure, real-time, distributed execution. We create practical instantiations of the algorithms that execute on distributed computing platforms, typically a heterogeneous mix of low to mid-end embedded and edge devices. This involves parallelization strategies (data decomposition and model decomposition) and right-sizing each part of the algorithm to run on a device according to that device's capabilities. This also means dealing with the fact that a node may not be continuously connected to a backend server farm, and may have only transient connectivity, possibly at low bandwidth.
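One simple way to picture right-sizing under data decomposition is to hand each device a share of the data proportional to what it can handle. The sketch below does that with a largest-remainder split. The function name `split_by_capacity`, the device names, and the capacity scores are all hypothetical, and a real scheduler would also have to account for connectivity and bandwidth, which this ignores.

```python
def split_by_capacity(n_items, capacities):
    """Divide `n_items` among devices in proportion to their capacity.

    `capacities` maps a device name to a positive relative capacity score.
    Returns a dict mapping each device to an integer number of items;
    the shares always sum to `n_items`.
    """
    total = sum(capacities.values())
    # Ideal (fractional) share for each device.
    quotas = {dev: n_items * cap / total for dev, cap in capacities.items()}
    # Start from the rounded-down shares.
    shares = {dev: int(q) for dev, q in quotas.items()}
    leftover = n_items - sum(shares.values())
    # Hand leftover items to the devices with the largest fractional parts.
    by_fraction = sorted(quotas, key=lambda d: quotas[d] - shares[d], reverse=True)
    for dev in by_fraction[:leftover]:
        shares[dev] += 1
    return shares

# Hypothetical mix of devices with rough relative capacities.
devices = {"embedded": 1, "edge": 3, "server": 6}
print(split_by_capacity(100, devices))
print(split_by_capacity(7, devices))
```

Model decomposition would split the layers or parameters of a model across devices instead of splitting the data; the same proportional idea applies, but the unit being divided is different.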
Interpretable operation. Interpretability is an important aspect of our solution approach because it lets users adapt the model as needed and have greater trust in the outputs of the autonomy pipeline. By explaining predictions, we can aid users in determining whether the models have been compromised by an adversary. We design methods for two distinct problems of interpretability in machine learning: (a) algorithmic interpretability (understanding the learning process) and (b) prediction interpretability (explaining the predictions).
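For a feel of what prediction interpretability can mean in practice, here is a minimal occlusion-style sketch: replace one input feature at a time with a baseline value and record how much the model's output moves. This is a generic technique I am using for illustration, not the methods developed in the project; the names `occlusion_attributions` and `linear_model`, and the toy weights, are hypothetical.

```python
def occlusion_attributions(predict, x, baseline):
    """Attribute a prediction to input features by occluding them one at a time.

    `predict` is any function from a list of floats to a float.
    The attribution of feature i is predict(x) minus the prediction
    obtained when x[i] is replaced by baseline[i].
    """
    full = predict(x)
    attributions = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline[i]
        attributions.append(full - predict(occluded))
    return attributions

# Toy stand-in for a trained model: a fixed linear scorer.
WEIGHTS = [2.0, -1.0, 0.5]

def linear_model(features):
    return sum(w * f for w, f in zip(WEIGHTS, features))

x = [1.0, 3.0, 4.0]
baseline = [0.0, 0.0, 0.0]
attributions = occlusion_attributions(linear_model, x, baseline)
print(attributions)
```

If a user sees a large attribution on a feature that should be irrelevant to the decision, that is one possible hint that the model or its training data has been tampered with. For nonlinear models, one-at-a-time occlusion misses interactions between features, so it should be read as a first look and not as a full explanation.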
What are the trend lines and the end game in a five-year outlook?
Assurance of autonomous systems is a fast-moving area, so predictions are risky. Nevertheless, crystal ball gazing indicates that in the next few years, we will broaden the scope of autonomous systems. This will doubtless increase their attack surface — all the different ways in which adversaries can compromise them. But we will make rapid gains in the defenses. We are quickly building up a good understanding of the fundamental characteristics of the algorithms that underlie autonomous systems. And from this understanding will arise foundational defenses that will deliver us from the current seemingly endless cycle of an attack, a defense against it, and an attack to bypass that defense. Rather, the new foundational defenses will erect (practically) impenetrable shields against a wide, and rigorously quantified, set of attacks.
An instructive comparison can be drawn to the world of cryptography used in financial systems. Cryptographic break-ins are rare and executed mainly by powerful nation-state adversaries. Similarly, we will raise the barrier to compromising autonomous systems and use our defenses along with other measures, like human intuitive validation of data and models, to manage the risk of using autonomous systems to acceptable levels. Thus, we will ease into a world where autonomous systems are trustworthy and we can quantify and rigorously prove the level of trust in each such system.
Note: The A2I2 project involves, as thrust leads, Somali Chaterji (Agricultural and Biological Engineering), Mung Chiang (Electrical and Computer Engineering), David Inouye (ECE), Kwang Kim (ECE), and Prateek Mittal (Princeton ECE), with Saurabh Bagchi (ECE and Computer Science) as principal investigator.