(session) durationHTTP request statusis URL encryptedis protocol HTTPSnumber of bytes upnumber of bytes downis URL in ASCIIclient port numberserver port numberuser agent lengthMIME-Type lengthnumber of '/' in pathnumber of '/' in querynumber of '/' in referreris the second-level domain raw IP
Once the input feature space has been established, getting the label is the next challenge. For each observation of the training set, we need to determine if it is a malicious or a benign pattern.
Having a network security expert create labels (on potentially millions of observations) will be expensive and time consuming. Instead we might rely on a semi-supervised approach by leveraging publicly available threat intelligence feeds. MLSec provides a set of tools and resources for gathering and processing various threat intelligence feeds.*
Thus labels are created by doing a join between public blacklists and the collected dataset. Matching can be done on fields like IP addresses, domains, agents, etc. Keep in mind that these identity elements are only used to produce the label and will not be part of the model input (Franc).
Finally once we have the input feature space and the labels, we are ready to train a model. We can anticipate certain characteristics of input space such as class imbalance, non-linearity and noisiness. A modern ensemble method like random forest or gradient boosted trees should overcome these issues with proper parameter tuning (Seni).
Bigger issue is that this is an adversarial use case and model decay will be a significant factor. Since there is an active adversary trying to avoid detection, attack patterns will constantly evolve which will cause data drift for the model.
Some possible solutions to the adversarial issue could be the use of a nonparametric method, using online/active learning (i.e. letting the model evolve on every new observation) or having rigorous tests to determine when the model should be retrained.
To address some of the issues unique to adversarial machine learning, Startup.ML is organizing a one-day special conference on September 10th in San Francisco. Leading practitioners from Google, Coinbase, Ripple, Stripe, Square, etc. will cover their approaches to solving these problems in hands-on workshops and talks.
Franc, Vojtech, Michal Sofka, and Karel Bartos. "Learning detector of malicious network traffic from weak labels." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer International Publishing, 2015.
Seni, Giovanni, and John F. Elder. "Ensemble methods in data mining: improving accuracy through combining predictions." Synthesis Lectures on Data Mining and Knowledge Discovery 2.1 (2010): 1-126.
* This approach is limited to global IP addresses and domains and cannot be used for internal threats.