How to interpret the MITRE ATT&CK Evaluation

One of the biggest challenges for organizations is buying and implementing the right tooling to empower their security teams. In general, most organizations have a history of inadvertently buying ineffective tooling or struggling to get value from existing tooling due to noise or complexity. And while organizations such as Gartner provide product guidance, this can often be too high level and not based on real-world benchmarking.

To give organizations a more detailed analysis of tooling, MITRE launched a program in 2017 [1] to evaluate EDR vendors against the MITRE ATT&CK framework, offering a publicly available, impartial benchmark. The initial results were released in 2018 and give a great overview of the kinds of telemetry, alerts, interface and output you get from each product or service listed.

The assessment was based on a real-world threat group, APT3, and provided a rich set of detection cases to measure against, covering all major areas of the cyber kill chain. However, it did not factor in how effective each product would be in a real-world environment, nor did it cover any aspects of responding to attacks. So although the evaluation is a useful starting point, it should form just one aspect of how you assess an EDR product.

In this article we’ll delve into the MITRE testing methodology and compare this against what matters in the real world to give some useful tips for analyzing the evaluation results.

An EDR Product Assessment

The Round 1 MITRE evaluation is essentially a product assessment that is focused on measuring EDR detection capabilities in a controlled environment with the main assessment criteria being telemetry and detections. The output is a list of test cases and results for each, focusing mainly on detection specificity and time to receive the information. Taking a simplified approach like this helps break down a complex problem space like detection into something more manageable. But does this overly simplify the problem?

Often, in the world of detection, it’s not finding the “bad things” that matters, but excluding legitimate activity so your team can more effectively spot anomalous activity. By testing in a noise-free environment, vendors are able to claim to “detect” test cases that would likely have been hidden by noise in the real world. MITRE clearly note this as a limitation, but it’s not all that obvious when reviewing results.

Going beyond the product itself, key areas like the people driving the tool and the process/workflow around it are also noticeably absent from the test, and are often more important than the tool itself. As such, we’d recommend taking a holistic approach: use the MITRE evaluation as a starting point, remain aware of its limitations, and ask your own questions. For example:

What are the false positive rates like in the real world?
Can you demonstrate capabilities that either limit noise or help draw attention to specific activity that closely matches legitimate activity?
Can you demonstrate a real-world end-to-end investigation? From a threat hunting-based detection, to investigation, to time-lining and response?
Can you issue response tasks in order to retrieve forensic data from the machine?
Can you contain and battle an attacker off the network?
Is my detection team technically capable of driving the tool and available 24/7/365?
Could you benefit from a managed service and if so can they demonstrate they are able to detect advanced attacks?

But what can you learn from the existing results? And how should you interpret them?

Each vendor has their own set of results that consist of roughly 100 different test cases, each with an associated Description, Technique ID, Detection Type, and Detection Notes. The first thing to note is that this is a technical assessment with technical results and no high-level scoring mechanism so you may need to ask your technical team members (or an external party) for guidance. We’ve included an example result below.

The test results give some great technical detail, but no obvious score.

The most relevant fields here are the “Detection Type” and “Detection Notes” as they explain how the vendor performed. Together they give a summary of essentially whether the vendor logged any associated telemetry and whether there were any alerts/detections related to the activity.
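As a rough illustration of reading these two fields in bulk, the sketch below tallies how many test cases produced no visibility, telemetry only, or some form of alert. It assumes the results have been copied by hand into a list of dictionaries; the key names, the detection type strings and every record are hypothetical stand-ins, not MITRE’s published data format or any vendor’s real results. The technique IDs are used only as opaque labels.

```python
from collections import Counter

# Detection types the article describes as alerts, least to most specific.
ALERT_TYPES = ("Enrichment", "General Behavior", "Specific Behavior")

# Hypothetical records. The keys loosely mirror the published columns, and the
# outcomes are invented for illustration; they are not real evaluation results.
results = [
    {"technique_id": "T1057", "detection_type": "Telemetry", "notes": "Process listing logged"},
    {"technique_id": "T1082", "detection_type": "None", "notes": ""},
    {"technique_id": "T1003", "detection_type": "Specific Behavior", "notes": "Alert raised"},
    {"technique_id": "T1086", "detection_type": "General Behavior", "notes": "Alert raised"},
    {"technique_id": "T1033", "detection_type": "Enrichment", "notes": "Telemetry tagged"},
]


def summarize(records):
    """Count test cases with no visibility, telemetry only, or an alert."""
    counts = Counter(r["detection_type"] for r in records)
    return {
        "total": len(records),
        "no_visibility": counts["None"],
        "telemetry_only": counts["Telemetry"],
        "alerted": sum(counts[t] for t in ALERT_TYPES),
    }


print(summarize(results))
```

A tally like this only shows where to look first. The “Detection Notes” still need to be read case by case to judge whether the information would have been actionable.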

In the following sections we’ll look at how you can assess the importance of both “Telemetry” and “Detections”.

Understanding Detection Types

Automated alerting allows your team to scale its detection efforts and detect known indicators more reliably. Detections are a key component of the MITRE evaluation, with detection quality captured by classifying alerts as enrichments, general behaviors or specific behaviors. In general, the more specific the indicator the better, as specific indicators create fewer alerts.

Do remember though that detections and alerts are just one component in your detection approach and should not be relied on as a single approach because:

Static detection rules can be bypassed
Attackers are continually innovating and have a long history of bypassing security products – either by using different forms of obfuscation or never-seen-before techniques that existing tools simply can’t identify as malicious. Assume your rules can and will be bypassed.
Alerts are often false positive prone
One of the biggest challenges when handling alerts is false positives. Alerts that fire in their hundreds or thousands a day are far less likely to be acted upon than extremely high-fidelity alerts that are meaningful enough to have a big red alarm go off. Aside from causing missed attacks, poor alert fidelity can also reduce team efficiency and lead to alert fatigue. Fidelity is unfortunately not something that is captured in Round 1 of the MITRE evaluation, and it is extremely difficult to measure effectively outside a real-world, enterprise-scale network. It is therefore worth taking any MITRE detection results with a big pinch of salt.
Alerts are “reactive” instead of “proactive”
When used correctly alerts can help you reliably spot the easy stuff and improve your response times. The risk with taking an alert-based approach is that it can create a reactive culture within your team leading to complacency and a false sense of security. Finding the right balance between reactive alert-based detection and proactive research driven threat hunting will help you catch the anomalies that tools/alerts will often miss.
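To make the fidelity point above concrete, here is a small arithmetic sketch comparing a noisy rule set with a high-fidelity one. All the numbers (alert volumes, true positive counts and the three minutes of triage per alert) are hypothetical assumptions chosen for illustration; none of them come from the MITRE evaluation.

```python
def precision(true_positives, total_alerts):
    """Fraction of alerts that turned out to be real malicious activity."""
    if total_alerts == 0:
        return 0.0
    return true_positives / total_alerts


def triage_hours(total_alerts, minutes_per_alert):
    """Analyst time needed to look at every alert, in hours."""
    return total_alerts * minutes_per_alert / 60


# Hypothetical daily figures for two rule sets (invented for illustration).
noisy = {"alerts": 1000, "true_positives": 5}
high_fidelity = {"alerts": 10, "true_positives": 4}

for name, day in (("noisy", noisy), ("high fidelity", high_fidelity)):
    p = precision(day["true_positives"], day["alerts"])
    hours = triage_hours(day["alerts"], minutes_per_alert=3)
    print(f"{name}: precision={p:.3f}, triage time={hours:.1f} hours/day")
```

Under these assumed figures the noisy rule set catches one more true positive, but at a triage cost that no team could sustain, which is why alert volume matters as much as raw detection counts.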

Comparing solutions

Although MITRE don’t score solutions, they do provide a comparison tool to help you easily see for each use-case how each solution performed.

It’s useful to take a holistic approach when comparing results, giving equal weighting to telemetry, detection, and how quickly results are returned (a low number of “delayed” results), as each of these aspects brings different benefits to security teams. For the detection and managed service components, you want to make sure that adequate information is provided to enable your team to respond to any notifications.
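One way to picture the equal weighting described above is the sketch below, which computes three rates per vendor (test cases with telemetry, test cases with a detection, and test cases that were not delayed) and averages them with equal weight. This is not a MITRE scoring method, since MITRE do not score solutions; the function, the field names and both vendors’ data are hypothetical.

```python
def compare(cases):
    """Return per-aspect rates and their equally weighted average."""
    n = len(cases)
    if n == 0:
        raise ValueError("need at least one test case")
    rates = {
        "telemetry": sum(c["telemetry"] for c in cases) / n,
        "detection": sum(c["detection"] for c in cases) / n,
        "timely": sum(not c["delayed"] for c in cases) / n,
    }
    # Equal weighting: each aspect contributes one third of the total.
    rates["equal_weighted"] = sum(rates.values()) / 3
    return rates


# Hypothetical per-test-case outcomes for two made-up vendors.
vendor_a = [
    {"telemetry": True, "detection": True, "delayed": False},
    {"telemetry": True, "detection": False, "delayed": False},
    {"telemetry": True, "detection": False, "delayed": True},
    {"telemetry": False, "detection": False, "delayed": False},
]
vendor_b = [
    {"telemetry": True, "detection": True, "delayed": False},
    {"telemetry": True, "detection": True, "delayed": False},
    {"telemetry": False, "detection": False, "delayed": False},
    {"telemetry": False, "detection": False, "delayed": False},
]

for name, cases in (("Vendor A", vendor_a), ("Vendor B", vendor_b)):
    print(name, {k: round(v, 2) for k, v in compare(cases).items()})
```

Keeping the three rates visible alongside the average matters: a single number hides whether a vendor is strong on telemetry but weak on alerts, or the reverse, and it says nothing about noise or response.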

Forrester have previously released a scoring tool for MITRE. While this is an interesting approach, the results from the tool are heavily weighted towards detections and, as mentioned already, using detections as your primary evaluation criterion is not a good way of measuring the overall effectiveness of an EDR tool. What matters most in a real-world breach is having the right data, analytics, detections, response features, and, most importantly, a capable team to drive any tool.

The MITRE Evaluation

The MITRE evaluation is a great step forward for the security industry, bringing some much needed visibility and independent testing to the EDR space. MITRE themselves should be applauded for their efforts, as fairly and independently comparing solutions in such a complex problem space is very challenging.

In the short term, we are excited to announce that WithSecure has just completed the Round 1 MITRE evaluation and we will be posting the results when they are ready.

References

[1] https://medium.com/mitre-attack/first-round-of-mitre-att-ck-evaluations-released-15db64ea970d

[2] https://www.endgame.com/blog/technical-blog/putting-mitre-attck-evaluation-context
