#BoringButImportantCyberPapers

RiskTeller: Predicting the Risk of Cyber Incidents

Link to the paper: https://acmccs.github.io/papers/p1299-bilgeA.pdf

*Also, just a heads up: there is some crazy cool data being used in this paper, courtesy of Symantec Research Labs

Summary:

The real novelty of this paper is that it goes down the road less traveled in cybersecurity - prediction - as opposed to the three main paths: analysis, detection, and prevention. The paper is also timely: as cyber attacks cause more and more economic harm, companies are in the market for cybersecurity insurance, which often rests on models that are neither accurate nor up to date. The paper therefore builds a novel prediction tool that is able to flag so-called "infected computers" at a rate of 95% with a relatively low false positive rate. The problem, though, is that their definition of "infected" might not actually be valid, given how much of the modeling weight falls on the number of unique files a device has, which just isn't realistic for specific user types such as developers.
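
To make the prediction framing concrete, here is a minimal sketch of the kind of pipeline the paper describes: build a feature vector per machine from its file-event history, label machines using the (imperfect) ground truth, train a classifier, and pick a decision threshold that keeps false positives low. The placeholder features, the Random Forest choice, and the 1% false-positive budget are my assumptions for illustration, not the paper's exact setup.

```python
# Hypothetical sketch of a RiskTeller-style risk-prediction pipeline.
# The features, classifier, and FPR budget are assumptions for illustration;
# the paper's actual feature set (~82 features) and model differ.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)

# One row per machine, e.g. [n_unique_files, n_low_prevalence_files,
# off_hours_download_ratio, n_vulnerable_apps] -- placeholder data here.
X = rng.random((5000, 4))
y = (X[:, 1] + 0.3 * rng.random(5000) > 0.9).astype(int)  # 1 = later infected

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr, y_tr)

# Choose the decision threshold from the ROC curve so the false positive
# rate stays under a budget (1% here), mirroring the "low FPR first" goal.
scores = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)
ok = fpr <= 0.01
threshold = thresholds[ok][-1]  # most permissive threshold within the budget
print(f"TPR at <=1% FPR: {tpr[ok][-1]:.2f}, threshold = {threshold:.2f}")

at_risk = scores >= threshold   # machines flagged for attention
```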

What I liked:

  1. The study uses a lot of very comprehensive information, i.e., 1 year of data spread over 18 enterprises and 600k machines with over 4.4 billion binary file events.

  2. The paper is unique in that past cybersecurity research has focused on analysis, detection, and prevention, while this paper focuses on prediction, which has not seen much work.

  3. When it comes to risk prediction, they use a model that prioritizes a low false positive rate, because that is what enterprises have demanded before a solution gets deployed.

  4. The study goes into factors behind malware incidents, e.g., they check whether the user downloaded files from home or outside office hours, which is an interesting facet.

  5. The study uses and builds on NIST's National Vulnerability Database instead of creating its own scoring system, which makes it simpler to evaluate (a rough sketch of how that could feed machine-level features is below).
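
On that last point, here is a rough sketch of how per-machine vulnerability features could be derived from existing CVSS scores rather than a new scoring system. The app-to-CVE mapping and the machine inventory below are hypothetical stand-ins; the paper does not publish its exact vulnerability features.

```python
# Hypothetical illustration: turning NVD/CVSS data into per-machine features.
# The app-to-CVE mapping and the machine inventory are made up; they stand in
# for whatever vulnerability data the real system has access to.
from statistics import mean

# CVSS base scores of CVEs affecting each (app, version) -- assumed inputs.
app_cvss = {
    ("acrobat", "11.0"): [9.8, 7.5, 6.1],
    ("java", "8u60"):    [8.1, 5.3],
    ("chrome", "90.0"):  [4.3],
}

def vuln_features(installed_apps):
    """Aggregate the CVSS scores of a machine's installed apps into features."""
    scores = [s for app in installed_apps for s in app_cvss.get(app, [])]
    if not scores:
        return {"max_cvss": 0.0, "mean_cvss": 0.0, "n_critical": 0}
    return {
        "max_cvss": max(scores),
        "mean_cvss": round(mean(scores), 2),
        "n_critical": sum(s >= 9.0 for s in scores),  # CVSS "critical" band
    }

machine = [("acrobat", "11.0"), ("chrome", "90.0")]
print(vuln_features(machine))
```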

What I didn’t like:

  1. Very early on, the paper acknowledges that ground truth is the most important part of the study, which is based on observing malicious files and infection records, yet they also acknowledge that a perfect ground truth is nearly impossible to obtain the way they have conducted the study.

  2. In the study setup details, they only go into depth about how the study was set up for Windows, and they never really clarify whether the study encompasses other operating systems or just Windows.

  3. Probably my biggest issue with this study is how they classify clean vs. not-clean devices. Essentially, their methodology goes through the files on a device and penalizes the device for having unique files. But this really doesn't make sense for every user: if the device belongs to a developer, for example, you are simply flagging it for normal use, which seriously undercuts the study.

  4. Again, as part of the user profiles, the prevalence-based features penalize users like developers who create their own kinds of files (see the sketch after this list for why).

  5. Even though the study had so much data, it really didn't produce many visualizations; out of the 82 factors or profile details, they only displayed around 9 graphs.
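
To make the developer complaint in items 3 and 4 concrete, here is a minimal sketch of a prevalence-based feature: the fraction of a machine's files that are rare across the whole fleet. The file names, the fleet, and the rarity cutoff are all invented; the point is just that a developer who compiles fresh binaries looks a lot like a machine full of one-off (potentially malicious) files.

```python
# Minimal sketch of a prevalence-based feature and why it penalizes developers.
# Hashes, machines, and the rarity cutoff are invented for illustration.
from collections import Counter

# file hashes observed on each machine in the fleet
fleet = {
    "office-pc-1": ["winword", "excel", "chrome"],
    "office-pc-2": ["winword", "excel", "chrome"],
    "dev-laptop":  ["winword", "build_a1f3", "build_b77c", "build_c0de"],
    "infected-pc": ["winword", "dropper_9e2f"],
}

# prevalence = on how many machines each file (hash) appears
prevalence = Counter(h for files in fleet.values() for h in set(files))

def low_prevalence_ratio(files, cutoff=1):
    """Fraction of a machine's files seen on <= `cutoff` machines fleet-wide."""
    unique = set(files)
    return sum(prevalence[h] <= cutoff for h in unique) / len(unique)

for machine, files in fleet.items():
    print(f"{machine:12s} low-prevalence ratio = {low_prevalence_ratio(files):.2f}")
# The dev laptop's locally built binaries score as "rare" just like the dropper
# on the infected PC, which is exactly the false-positive worry above.
```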

Points to talk about:

  1. The paper references a technique for figuring out which users are vulnerable to phishing emails and putting extra layers of protection around those users, which is very similar to what one of my classmates is working on.

  2. How does this specific method detect new forms of malware when it only knows what it has seen?

  3. How does the emergence of new malware end up messing with this prediction model? Is it something that is statically priced in at the end, or is there other information that can be drawn on to make the model more dynamic?

  4. Did the system weight any of the factors more than others in determining whether a computer was infected or not?

  5. Is simply updating the dataset enough to prevent concept drift in anti-virus machine learning? (A minimal retraining sketch is below.)
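
On that last question: new data only helps if the model is also retrained on a recent window so that stale benign/malicious patterns age out. Here is a minimal sketch of sliding-window retraining; the window length, the classifier, and the synthetic "drift" are assumptions of mine, not anything the paper specifies.

```python
# Minimal sketch of sliding-window retraining to cope with concept drift.
# Window size, model choice, and the synthetic drift are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
WINDOW_MONTHS = 6  # only the last 6 months of labeled machines are kept

def monthly_batch(month, n=500):
    """Synthetic monthly data whose decision boundary slowly drifts."""
    X = rng.random((n, 3))
    drift = 0.05 * month                      # malware "behavior" shifts over time
    y = (X[:, 0] + drift * X[:, 2] > 0.6).astype(int)
    return X, y

history = []  # one (X, y) batch per month
for month in range(12):
    history.append(monthly_batch(month))
    window = history[-WINDOW_MONTHS:]         # drop stale months
    X_tr = np.vstack([X for X, _ in window])
    y_tr = np.concatenate([y for _, y in window])
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    # Evaluate on the next (unseen) month to see how well we track the drift.
    X_next, y_next = monthly_batch(month + 1)
    print(f"month {month:2d}: accuracy on next month = {model.score(X_next, y_next):.2f}")
```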

New Ideas:

  1. Maybe we could use the same dataset, but reveal the industries these 18 enterprises are in and map how certain industries are susceptible to different types of malware.

  2. The study clearly states that they do not seek to figure out the exact causes of infection, but with the binary event logs it would not be too hard to extrapolate certain causes for a future study.

  3. Given that there are different kinds of users (e.g., developers), could there be more user types in this study to create a more accurate model of what an infected computer looks like? Would this lower effectiveness and/or false positives?

  4. If the system weighted some factors more than others in determining which devices were infected, could this be translated into a priority list for cybersecurity practitioners?

  5. Possibly look into exporting this form of analysis to login data, to detect fraudulent users signing into portals?