DeepXplore: Automated Whitebox Testing of Deep Learning Systems

Summary:

This paper focuses on a novel strategy for detecting safety and security flaws in systems like self-driving vehicles, where there is simply too much data for humans to sort through and label manually. As a result, there is an endless number of test cases that have never been accounted for. The whitebox approach takes similar models and essentially compares their outputs on the same input; when the models disagree, at least one of them has hit an edge case it has not accounted for. Where I think this paper falls short is that the technique doesn't work too well if you don't have a model that is similar enough to the one you want to find edge cases for.
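
The core mechanism can be pictured as a small differential-testing check: feed the same inputs to two (or more) similar models and keep the inputs on which they disagree. Below is a minimal, hypothetical Python sketch of that idea, assuming two Keras-style classifiers (model_a, model_b) and an array of candidate inputs (seed_inputs) as placeholders; the gradient-guided input perturbation that DeepXplore uses to actively provoke disagreement and raise neuron coverage is omitted here.

    import numpy as np

    def find_disagreements(model_a, model_b, seed_inputs):
        # Predicted class labels from each model (assumes softmax-style outputs).
        preds_a = np.argmax(model_a.predict(seed_inputs), axis=1)
        preds_b = np.argmax(model_b.predict(seed_inputs), axis=1)
        # Inputs where the models disagree are candidate edge cases:
        # at least one of the two models must be wrong on them.
        disagree = preds_a != preds_b
        return seed_inputs[disagree], preds_a[disagree], preds_b[disagree]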

 

What I liked:

  1. This paper is really relevant to new technologies (e.g., the self-driving tech in the Teslas around us)

  2. The whitebox technique they use is very fast and doesn't require custom hardware; they were able to get results in one second on a commodity laptop

  3. The paper explains why adversarial image techniques might not be so effective at finding erroneous behaviors in deep learning models

  4. I thought it was a really good idea to compare the responses of one DL model against another's to figure out edge cases

  5. The end result of the experiments was a substantial improvement in neuron coverage (a rough sketch of how neuron coverage can be computed follows this list)
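
Since neuron coverage is the paper's central metric, here is a rough, hypothetical sketch of how it could be computed for a Keras model: the fraction of neurons whose (scaled) activation exceeds a threshold on at least one test input. The per-layer handling and 0-1 scaling below are my own simplifications, not the paper's exact implementation.

    import numpy as np
    from tensorflow import keras

    def neuron_coverage(model, inputs, threshold=0.5):
        covered = []
        for layer in model.layers:
            if not layer.weights:   # skip layers with no neurons (Flatten, Dropout, ...)
                continue
            # Sub-model exposing this layer's activations.
            sub = keras.Model(model.inputs, layer.output)
            acts = sub.predict(inputs).reshape(len(inputs), -1)
            # Scale each neuron's activations to [0, 1] across the batch before thresholding.
            lo, hi = acts.min(axis=0), acts.max(axis=0)
            scaled = (acts - lo) / (hi - lo + 1e-8)
            covered.append((scaled > threshold).any(axis=0))
        # Fraction of all neurons activated above the threshold by at least one input.
        return float(np.concatenate(covered).mean())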

 

 

What I didn't like:

  1. The basis of this paper depends on having a similar DL model for the same application, which raises the issue of what to do when you have a novel application with no comparable model

  2. I don't think it's feasible to test every single corner case of any model; even though this technique does create some improvement, it's impossible to cover every possible input with a finite test set

  3. The computer doesn't know what trend it is missing; the adversarial examples show that it may be looking at features that aren't material to the context

  4. The entire study rests on the idea that these platforms and models use shared datasets, which isn't really true in the private sector

  5. The study never really explains why roughly 3% is the maximum gain from their technique; I found it interesting that it was only 3%

Points for discussion:

  1. Is 3% the size of the incorrect set that the technique was able to catch? Where does this 3% improvement come from?

  2. What causes the disparity between code coverage and neuron coverage?

  3. What improvements can be made in order to prevent this disparity?

  4. How practical would it be for companies to share common testing data?

  5. How will regulation in software testing for autonomous vehicles change this field?

 

New Ideas:

  1. A strategy for sharing data among competitors that doesn't expose the specific edge case. Regulators could use this to prevent accidents: e.g., Google's model is run against a competitor's model, and the competitor's model fails; before the competitor deploys the product, they have to find and fix that edge case

  2. What if we used an old model as the tester to compare against a new model? Would this reveal which edge cases the new model has solved?

  3. How can this DL testing technology be used to find malware among normal computer software?

  4. In what ways can this technology be built into current standards for software testing?

  5. What are the limits of this corner-case optimization, given that in theory the set of corner cases could be endless?