At the Intel Science & Technology Center for Adversary-Resilient Security Analytics (ISTC-ARSA), housed at Georgia Tech’s Institute for Information Security & Privacy (IISP), researchers will study the vulnerabilities of machine learning (ML) algorithms and develop new security approaches to improve the resilience of ML applications, including security analytics, search engines, facial and voice recognition, fraud detection, and more.
Already, attackers can launch a causative (or data poisoning) attack, which injects intentionally misleading or false training data so that an ML model becomes ineffective. Intuitively, if the ML algorithm learns from the wrong examples, it will learn the wrong model. Attackers can also launch an exploratory (or evasion) attack to find the blind spots of an ML model and evade detection. For example, if an attacker discovers that a detection model looks for unusually high traffic volume, he can send malicious traffic at a lower rate and simply take more time to complete the attack.
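The traffic-volume example above can be made concrete with a toy sketch (the detector, threshold, and names here are our own illustration, not a real system): a rule-based detector alerts on high packet rates, and the attacker evades it by spreading the same payload over more time.

```python
# Toy illustration of an exploratory (evasion) attack against a
# hypothetical volume-based detector. All names and numbers are assumptions.

THRESHOLD = 1000  # packets per minute the detector flags as "unusually high"

def detector_flags(packets_per_minute: int) -> bool:
    """Hypothetical detector: alert on unusually high traffic volume."""
    return packets_per_minute > THRESHOLD

total_packets = 50_000  # attacker's total payload

# Naive attack: send everything in one minute -- detected.
assert detector_flags(total_packets)

# Evasion: send at (or below) the threshold rate and take longer.
rate = THRESHOLD
minutes_needed = -(-total_packets // rate)  # ceiling division
assert not detector_flags(rate)
print(f"Attack completes undetected in {minutes_needed} minutes")
```

The attack trades time for stealth: the same payload arrives, but no single observation window crosses the detector's decision boundary.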
Researchers at the ISTC-ARSA will systematically evaluate the security and robustness of ML systems in the face of causative and exploratory attacks and develop new algorithms and systems to improve resilience. More specifically, the main research activities include:
In order to determine how adversaries can attack ML-based security analytics, we will study the theoretical vulnerabilities of ML algorithms. We will use the active learning framework, which determines the smallest number of samples required to learn a target function, to quantify the vulnerability of an ML-based system to exploratory attacks. We will use the machine teaching framework, which determines the minimum set of training samples needed for an ML algorithm to learn a desired function, to quantify the vulnerability of an ML-based system to causative attacks. We will also explore how adversaries can “improve” active learning algorithms to smartly select the most uncertain examples and thus optimize exploratory attacks, and how they can use better machine teaching algorithms to launch sophisticated causative attacks even when the choice of ML model and algorithm is not known.
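The exploratory-attack-as-active-learning framing can be sketched in miniature (this is our illustration under simplifying assumptions, not the center's method): the attacker treats the deployed detector as a black-box oracle and always queries the point it is most uncertain about. In one dimension this reduces to binary search on the decision boundary, recovering it in logarithmically few queries.

```python
# Sketch: active-learning style probing of a black-box detector.
# The 1-D threshold detector and all names here are hypothetical.

def make_black_box(threshold: float):
    """Hypothetical deployed detector: flags feature values above threshold."""
    def detector(x: float) -> bool:
        return x > threshold
    return detector

def probe_boundary(detector, lo=0.0, hi=1.0, queries=20) -> float:
    """Always query the midpoint -- the most uncertain point --
    halving the attacker's search space with each oracle call."""
    for _ in range(queries):
        mid = (lo + hi) / 2
        if detector(mid):
            hi = mid  # flagged: boundary lies below mid
        else:
            lo = mid  # not flagged: boundary lies above mid
    return (lo + hi) / 2

secret = 0.73  # unknown to the attacker
estimate = probe_boundary(make_black_box(secret))
assert abs(estimate - secret) < 1e-3  # boundary recovered in 20 queries
```

The query count is exactly the quantity the active learning framework bounds, which is why it serves as a vulnerability measure for exploratory attacks.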
With this understanding of the vulnerabilities of ML algorithms and the capabilities of adversaries, we will develop theoretically grounded approaches to improve the resilience of machine learning. First, we will improve the individual ML models. In particular, we will explore using deep learning on relation graphs of co-occurring events or intrusions to produce models that are difficult to manipulate. We will apply a feature-deletion framework to emerging ML algorithms such as deep learning to make them resilient to attacks that involve feature manipulation. We will explore strategies for generating pseudo training examples to avoid overfitting and defeat sample manipulation, and an adaptive approach for selecting the appropriate interval of model retraining in order to resist both exploratory and causative attacks. Second, we will develop an online ensemble boosting framework that combines the individual models and further improves resilience: its online nature hardens it against causative attacks, and because only a random subset of the ensemble is used at any given time, it is also harder to probe with exploratory attacks.
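The random-subset idea in the ensemble framework can be sketched as follows (a minimal illustration with names and structure of our own choosing; the actual framework also includes online boosting, which is omitted here): each prediction is a majority vote over a freshly sampled subset of base models, so an attacker probing the system never observes one fixed decision surface.

```python
# Minimal sketch of randomized-subset ensemble prediction.
# Base models and thresholds are toy assumptions.

import random

class RandomSubsetEnsemble:
    def __init__(self, models, subset_size, seed=None):
        self.models = models            # list of callables: x -> 0/1
        self.subset_size = subset_size
        self.rng = random.Random(seed)

    def predict(self, x) -> int:
        """Majority vote over a freshly sampled subset of the ensemble."""
        subset = self.rng.sample(self.models, self.subset_size)
        votes = sum(m(x) for m in subset)
        return int(votes * 2 > len(subset))

# Toy base models: each flags inputs above a slightly different threshold.
models = [lambda x, t=t: int(x > t) for t in (0.4, 0.5, 0.6, 0.7, 0.8)]
ens = RandomSubsetEnsemble(models, subset_size=3, seed=0)
print(ens.predict(0.9), ens.predict(0.1))  # clear-cut inputs agree across subsets
```

Clear-cut inputs get stable answers regardless of the subset drawn, while an adversary crafting borderline inputs faces a moving target.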
We will empirically evaluate the vulnerabilities of ML algorithms, as well as our algorithmic improvements, using realistic datasets and security analysis environments. In particular, we will develop an MLSPLOIT tool that, given a malware sample, automates exploratory attacks through trial and error by transforming the malware to behave like legitimate software (e.g., MS Word) and thereby evade detection by an ML-based model. The MLSPLOIT tool also enables malware to perform causative attacks: when a malware sample is detected and analyzed to train an ML-based detection model, the MLSPLOIT-enabled malware will inject noise into its behaviors so that the input data to the ML process is polluted. We will also improve system-level resilience to adversaries, in particular in the context of real-world security analysis environments.
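The trial-and-error loop at the heart of the exploratory mode can be sketched conceptually (this is not the MLSPLOIT tool itself; the detector, feature, and transformation below are toy assumptions): apply behavior-preserving transformations to a sample's observable features until the detector classifies it as benign.

```python
# Conceptual sketch of a trial-and-error evasion loop.
# toy_detector and pad_benign_calls are hypothetical stand-ins.

def toy_detector(features: dict) -> bool:
    """Hypothetical model: flag samples with many suspicious API calls."""
    return features["suspicious_api_calls"] > 10

def pad_benign_calls(f: dict) -> dict:
    """Hypothetical behavior-preserving transform: hide suspicious calls
    among benign ones, lowering the visible signal each application."""
    f = dict(f)
    f["suspicious_api_calls"] -= 2
    return f

def transform_until_evasive(features, transform, max_tries=50):
    """Repeatedly transform the sample until the detector stops flagging it."""
    for attempt in range(1, max_tries + 1):
        if not toy_detector(features):
            return features, attempt
        features = transform(features)
    return features, max_tries

sample = {"suspicious_api_calls": 20}
evasive, tries = transform_until_evasive(sample, pad_benign_calls)
assert not toy_detector(evasive)
```

A real tool would verify after each transformation that the malware's malicious functionality is preserved; that check is elided here.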
We will explore using Intel SGX to hide parts of the ML process, e.g., the selection of the random subset of the online ensemble, from adversaries. More significantly, since our algorithmic improvements include modeling co-occurring events and correlated features, such as certain user actions that accompany normal usage, we can expect adversaries to program their malware to look for these “triggers” before performing malicious actions. We will explore approaches that automatically generate emulated user behaviors based on the features used in an ML model, and use active learning to interrogate a malware’s evasion model and thereby defeat malware evasion.
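The trigger-defeating idea can be sketched in a few lines (all names here are hypothetical illustrations, not project code): malware that waits for signs of a live user before acting can be exposed by emulating exactly those signals inside the analysis environment.

```python
# Sketch: defeating trigger-based evasion by emulating user behavior.
# The malware logic and environment keys are toy assumptions.

def evasive_malware(env: dict) -> str:
    """Hypothetical malware: acts only when 'user trigger' features appear."""
    if env.get("recent_mouse_moves", 0) > 5 and env.get("browser_open"):
        return "malicious"
    return "dormant"

def emulate_user_behavior(env: dict) -> dict:
    """Defender-side emulation of the features the malware keys on."""
    return {**env, "recent_mouse_moves": 20, "browser_open": True}

bare_sandbox = {}
assert evasive_malware(bare_sandbox) == "dormant"          # evades bare analysis
assert evasive_malware(emulate_user_behavior(bare_sandbox)) == "malicious"
```

In practice the defender does not know which triggers the malware checks, which is where active-learning interrogation of the malware's evasion model comes in: candidate emulated behaviors serve as queries against the malware-as-oracle.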
Researchers at the ISTC-ARSA have extensive backgrounds and accomplishments in machine learning, systems and network security, botnet and intrusion detection, and malware analysis. The team is committed to open-sourcing all software and datasets.