(Deep) Learning to trust

Deep learning technology has proven extremely successful in a variety of applications, achieving superhuman performance by some evaluation metrics. It’s natural for impressive results to inspire novel business applications, but strong test performance does not guarantee trustworthiness. This article discusses the risks of putting too much trust in laboratory results, and looks at how organizations can better assess the usefulness of deep learning projects in a business context.

Trust is a nuanced concept. What it means for a tool to be trustworthy varies, but in the case of machine learning systems that need to ingest real-world data and respond appropriately, there are at least two key facets of trust:

  • The system is transparent: users can understand the system’s behavior and reason about what to expect from it.
  • The system is robust: malicious users cannot cause the system to misbehave.

These requirements are relevant to human decision making in a cyber security context too:

  • A human understands their role as a decision maker, can explain their reasoning for an action or decision, and can accept responsibility.
  • Training and policies can teach humans to resist manipulation by cyber criminals.

Despite this parallel, deep learning systems rarely meet their version of these requirements.


All deep learning systems are, fundamentally, equations, and in most cases enormously complicated ones. Every variable in the equation can be accessed by the people who implement, train, and deploy a deep learning model, yet this access is rarely sufficient to understand the model’s behavior. The scale of the model dooms any attempt to understand its overall behavior through the comprehension of its constituent parts; the complex relationships between variables make it almost impossible to say what any single variable actually represents. Questions like “Why did the model behave this way in this situation?” can almost never be answered in a technical sense.
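The point can be illustrated with a toy sketch (a deliberately tiny, hypothetical network, not a real production model): every parameter is sitting right there in memory and can be printed, yet no individual number explains the behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer network: every "variable" in the equation is
# fully accessible here...
W1 = rng.normal(size=(4, 8))   # first-layer weights
W2 = rng.normal(size=(8, 1))   # second-layer weights

def predict(x):
    hidden = np.maximum(0, x @ W1)  # ReLU activation
    return hidden @ W2

x = rng.normal(size=(1, 4))
output = predict(x)   # ...and the output is fully determined by W1 and W2,
print(W1[0, 0])       # but what does this one weight "mean"? Nobody can say.
```

Scaled up from 40 weights to billions, the same inspection remains possible and the same interpretation remains out of reach.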

This is a trust problem: the lack of transparency prevents the model from being understood. Researchers training a deep learning model must translate a real-world problem into some kind of numerical score. Unfortunately, there is no general way to prove that the numerical score genuinely and completely captures the original problem. For instance, researchers trying to build a model that understands English may train it to predict missing words in English sentences, using the surrounding words as context. This could work, if learning English is the best way to succeed at the task, but no machine learning technique can validate that assumption. This risk of mismatch arises whenever a real-world problem has to be converted into the kind of mathematical form a model can work with.
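The masked-word proxy task can be sketched in a few lines (hypothetical toy data, not a real training pipeline), making the gap explicit: the score a model optimizes is accuracy at filling the blank, not "understanding English".

```python
import random

# One English sentence stands in for a training corpus (toy data).
sentence = "the cat sat on the mat".split()

random.seed(1)
masked_index = random.randrange(len(sentence))
target = sentence[masked_index]          # the label the model must predict
context = sentence[:masked_index] + ["[MASK]"] + sentence[masked_index + 1:]

print(" ".join(context), "->", target)
# A model scores well if it predicts `target` from `context`; whether that
# proxy captures "knowing English" is an assumption the metric cannot test.
```

Everything the training process measures is contained in `(context, target)` pairs like this one; anything about language that the pairs fail to capture is invisible to the score.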

A related problem that has long plagued artificial intelligence research is “specification gaming”. A model that games its specification learns to cheat at its task, for example by exploiting a bug in a physics engine or discovering steganography techniques to leave notes for itself. These flaws make for entertaining stories, but they illustrate a serious problem: models can exploit any mismatch between the problem they were intended to solve and the problem they were actually presented with in training.

What’s more, these flaws are generally discovered only by a researcher who has already noticed something strange in the model’s behavior and gone looking for an explanation. Specification gaming therefore compounds the transparency issue, because non-obvious yet still incorrect behavior can go undetected. Without the ability to analyze a model’s inner workings to confirm they meet expectations, developers can detect only the most egregious exploits.

Even if the model does not learn to exploit its training regime, there are many kinds of risks that could result from the training failing to generalize to reality. For instance, facial recognition tools often reflect racial or gender biases in training data. There is also evidence that bias can result from common training techniques, not merely from data. If a model’s training does not reflect the way it will be deployed, then the model’s high performance is irrelevant—there are no guarantees that it will be an effective real-world system.


If models can’t be easily understood, and if that understanding can’t be guaranteed, trusting them to resist a cyber attack is even harder. The threat model for deep learning varies from application to application. Compare a model used to generate art with one used to detect financial fraud: a successful attack on the latter has far greater impact. It’s reasonable to say that the former is unlikely to be exposed to malicious inputs, and that the correctness of its behavior is unlikely to be mission-critical, so it has limited security needs. But some models absolutely do need to resist malicious users, and the challenge of robustness is no simpler than the challenge of transparency.

Perhaps the most famous attack is the adversarial example: a crafted input that causes a classifier to misbehave. For an image classifier, an adversarial example is an image crafted to be wrongly classified: a baseball classified as a cappuccino, an apple classified as a smartphone, and so on. These attacks are not limited to image classifiers, and they are powerful: they can be mounted without physical access to the target model, and a single example can often fool multiple models at once. There’s even evidence that adversarial examples result from a type of specification gaming, in which the model learns absurd features that technically boost its score in training.
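The mechanics can be shown on a toy linear "classifier" (a hypothetical two-class model, far simpler than any real vision network): nudge each input feature a small step in the direction that most increases the loss, in the style of the fast gradient sign method, and the prediction flips.

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5])   # fixed "trained" weights of the toy model
b = 0.0

def score(x):
    return x @ w + b             # positive -> class 1, negative -> class 0

x = np.array([0.5, -0.5, 1.0])   # clean input, confidently class 1
assert score(x) > 0

# For a linear model, the gradient of the score w.r.t. the input is just `w`,
# so stepping against sign(w) is the worst-case small perturbation.
eps = 0.9
x_adv = x - eps * np.sign(w)

print(score(x), "->", score(x_adv))   # the sign of the score flips
```

Real attacks work the same way, except the gradient is computed through millions of weights and the perturbation is small enough to be invisible to a human.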

Other threats facing deep learning technologies include:

  • Model inversion, which can reveal data used to train the model and cause problems if that data is confidential or otherwise sensitive.
  • Model theft, in which a model is reverse-engineered from its behavior, allowing others to use it without the training investment from the model’s creator.
  • Data poisoning, in which an attacker who controls a small amount of training data (such as text scraped from the Internet) causes the model to misbehave when trained on that data.
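Data poisoning in particular can be surprisingly cheap. As a toy sketch (synthetic one-dimensional data, a hypothetical "spam filter"), consider a nearest-centroid classifier whose threshold is learned from training data: an attacker who contributes only two extreme training points shifts the decision boundary enough to let a malicious input through.

```python
import numpy as np

# Clean training data for a toy nearest-centroid classifier (synthetic).
ham  = np.array([1.0, 2.0, 3.0])
spam = np.array([7.0, 8.0, 9.0])

def threshold(ham, spam):
    return (ham.mean() + spam.mean()) / 2   # midpoint between class centroids

clean_t = threshold(ham, spam)      # 5.0: an input scoring 6.0 is flagged
assert 6.0 > clean_t

# The attacker injects two extreme "ham" examples into the training set...
poisoned_ham = np.append(ham, [30.0, 30.0])
poisoned_t = threshold(poisoned_ham, spam)

print(clean_t, "->", poisoned_t)    # ...and the input scoring 6.0 now passes
```

Real poisoning attacks are subtler, but the principle is identical: whoever influences the training data influences the learned decision rule.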

Understanding which threats apply to a given deep learning system is an important task for security teams responsible for deploying the technology in their organization. It requires experience and expertise, but it is necessary: it is the only way to make informed decisions about which technical defenses to build into the model and its supporting infrastructure.


Bridging the gap between trustworthiness and the “state of the art” in deep learning is not an impossible task. Developers should offer features that answer questions like “How did this part of the input contribute to the result?” Organizations investing in deep learning systems can seek out vendors who are explicit about how their deep learning platforms deal with cyber risk. They can also ask how transparency and robustness were addressed during development and training, and why deep learning is a better choice for the product than alternative algorithms.

Adversarial training is a key consideration for any model that must resist attacker-crafted inputs: adversarial examples are folded into the training process so that the model learns to withstand similar inputs in the wild, making attacks more difficult (though not impossible). Other techniques should be considered based on the model’s use case; for example, differential privacy limits what an attacker can learn about individual training examples. Each technique has costs and benefits, which rules out any one-size-fits-all approach. The best way forward is to make these decisions deliberately, with business context in mind, and to record the reasons for adopting or rejecting a given strategy.
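The core loop of adversarial training can be sketched on a toy logistic regression with synthetic data (a minimal illustration under simplifying assumptions, not a production recipe): at each step, perturb the inputs in their worst-case direction before computing the weight update.

```python
import numpy as np

# Synthetic, linearly separable two-class data (toy example).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2)) + np.where(rng.random(64) < 0.5, -2, 2)[:, None]
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
eps, lr = 0.1, 0.5

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for _ in range(200):
    # Inner step: push each input against its correct label (FGSM-style).
    grad_x = (sigmoid(X @ w) - y)[:, None] * w   # d(loss)/d(input) per example
    X_adv = X + eps * np.sign(grad_x)
    # Outer step: ordinary gradient descent, but on the perturbed batch.
    w -= lr * X_adv.T @ (sigmoid(X_adv @ w) - y) / len(y)

acc = ((sigmoid(X @ w) > 0.5) == y.astype(bool)).mean()
print("clean accuracy after adversarial training:", acc)
```

Training on the perturbed batch forces the model to keep a margin around its decision boundary, which is what makes the eventual real-world attack harder to land.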

Once trained and deployed, deep learning systems always need to be easy to challenge. Wherever possible, developers should invest in human oversight, and responsibility should always rest with a human who can respond accordingly. It is important not to conflate responsibility with blame here; remove any motive for people to argue that they don't need to resolve a problem because it wasn't their fault. This can be achieved at the project planning stage, by designating who is responsible for making amends if the machine fails.

Users and other partners should have an easy point of contact for raising complaints and making suggestions. The starting assumption should be that these complaints and suggestions are legitimate: they need not be trusted blindly, but the system should never be trusted over the person raising them. Deep learning-enabled systems are better treated as being in a perpetual beta test, never quite ready for fully autonomous operation.

These measures will help ensure that nobody falls through the cracks of an imperfect system. When it is not possible to provide deliberate reasoning for design decisions, or to allow user feedback to challenge a machine, alternatives to deep learning should be considered. There are many other technologies within the field of machine learning, many of which can be made much more transparent and robust, even if they are less impressive on a scoreboard.



December 2021

Jason Johnson, Security Consultant