WithSecure MITRE Evaluation Results: Exceedingly good EDR
In June 2019, WithSecure completed the MITRE ATT&CK Evaluation, and we’re excited to announce that the results are now publicly available on the MITRE website.
In this post we’re going to reveal how our endpoint detection and response (EDR) agent did across:
- Telemetry coverage
- Detection coverage
- Modifiers – delayed and tainted
We will then provide guidance on other elements you should factor in when sourcing an EDR vendor, as well as our take on what other vendors you could consider.
So how did we do?
The results for WithSecure’s EDR – used in our managed detection and response solution, WithSecure Countercept – were very positive.
We scored highly in many of the tests, showing that the WithSecure Countercept platform provides the necessary datasets and detection logic to comprehensively detect a nation state threat actor such as APT3 (which was the focus of Round 1).
The results themselves are based on 20 attack phases broken down into 105 test cases, which expand to 136 total items against which capabilities can be demonstrated. We’ll walk through some of the key findings.
Telemetry coverage is one of the most useful metrics in the MITRE Evaluation as it shows you how much visibility a tool is likely to give you across different attack vectors. This can be calculated by looking at which out of the 136 test cases had information available (not “none”). The graph below shows how WithSecure Countercept was one of the top performers in terms of the total telemetry coverage with a score of 122/136.
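The counting described above is straightforward. As a minimal sketch (not MITRE’s or WithSecure’s actual tooling, and using hypothetical category labels), telemetry coverage is simply the number of test cases where the product surfaced any information at all:

```python
# Hedged sketch: telemetry coverage = test cases with any information
# available, i.e. a detection category other than "None".

def telemetry_coverage(results):
    """results: list of detection-category strings, one per test case."""
    covered = sum(1 for category in results if category.lower() != "none")
    return covered, len(results)

# Toy data for illustration only; real evaluations have 136 items.
sample = ["Telemetry", "None", "Specific Behavior", "Enrichment", "None"]
covered, total = telemetry_coverage(sample)
print(f"{covered}/{total}")  # 3/5
```

Applied to the full Round 1 data, this kind of count is what yields a score like 122/136.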
Why are our scores here slightly higher than other vendors?
Part of the reason we had slightly higher scores here was because we capture Windows events, WMI and .NET data, which gives us some additional visibility. In general though, you’ll notice that many vendors had quite similar scores as they capture the same datasets such as process, network, file, and PowerShell events.
One subtle point that is not captured in the MITRE Evaluation is that many vendors solely rely on real-time data collection. WithSecure Countercept is unique in that our EDR agent also contains periodic scanners that capture persistence data and memory anomalies. So even in post-breach scenarios we can uncover activity that has happened in the past, without the need for a real-time event to occur.
A large part of Round 1 is focused on assigning detection categories (enrichment/general behavior/specific behavior) to test cases based on how much information is provided by each product.
Some of the limitations of this approach have been discussed in other posts, so we won’t cover them here, but we wanted to highlight that when assessing detection capabilities there’s a big difference between useful high fidelity alerts, low fidelity alerts, and enrichments.
- High-fidelity alerts help you quickly spot genuine malicious activity
- Low-fidelity alerts or enrichments aid you when threat hunting or during an investigation
The challenge in Round 1 is that MITRE collected the enrichment data, but unfortunately not the first, more valuable part – the high-fidelity detections – so it becomes tricky to meaningfully compare the detection capabilities of different products. Comparing investigation capabilities is perhaps easier, although here key factors such as correlation, workflow, and response weren’t measured, making it difficult to accurately compare products.
To measure how vendors performed, Forrester published an evaluation script that counts and scores detections. Using their simple score metric, WithSecure Countercept achieved one of the highest scores, at 376.
Does that mean our EDR is ‘better’?
Potentially, but not necessarily. The scoring in the Forrester script gives more weighting to non-delayed behaviors, which bumped our score. This tells you that our EDR possibly provides more context and is better for investigation, but that doesn’t necessarily equate to better detection.
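To make the weighting point concrete, here is an illustrative sketch of a weighted scoring scheme in the spirit of the Forrester script. The category weights and the delayed penalty below are hypothetical choices for illustration, not Forrester’s actual values:

```python
# Hypothetical weights: richer detection categories score higher,
# and delayed detections are discounted.
WEIGHTS = {"Telemetry": 1, "Enrichment": 2,
           "General Behavior": 3, "Specific Behavior": 4}
DELAYED_PENALTY = 0.5  # assumed discount for delayed detections

def score(detections):
    """detections: list of (category, is_delayed) tuples."""
    total = 0.0
    for category, delayed in detections:
        weight = WEIGHTS.get(category, 0)
        total += weight * (DELAYED_PENALTY if delayed else 1.0)
    return total

print(score([("Specific Behavior", False), ("Enrichment", True)]))  # 5.0
```

Under any scheme like this, a product that produces many non-delayed, context-rich detections scores higher, which is exactly why a high total can reflect investigation context rather than detection quality.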
With the limited data in Round 1, is it even possible to assess detection capability?
One approach would be to look at detection coverage while assuming enrichments and behaviors are equal (in terms of detection potential) and removing delayed detections – these are normally related to managed services whereas here we’re focusing on product only.
Remember that not all test cases are equal. In fact, in Round 1 – based on real world WithSecure data – we’ve estimated that only 25% of the test cases (maybe less) can be used for direct detection; the remaining 75% would require correlation for detection or would be used as enrichments during an investigation.
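The approach outlined above can be sketched in a few lines. This is a minimal illustration under the stated assumptions (enrichments and behaviors weighted equally, delayed detections excluded); the field names are hypothetical:

```python
# Hedged sketch of the detection-coverage approach described above:
# count test cases with a non-delayed detection of any category.

def detection_coverage(test_cases):
    """test_cases: list of dicts with 'category' and 'delayed' keys."""
    detected = [tc for tc in test_cases
                if tc["category"].lower() != "none" and not tc["delayed"]]
    return len(detected) / len(test_cases)

cases = [
    {"category": "Specific Behavior", "delayed": False},
    {"category": "Enrichment", "delayed": True},   # excluded: delayed
    {"category": "None", "delayed": False},        # excluded: no detection
    {"category": "General Behavior", "delayed": False},
]
print(detection_coverage(cases))  # 0.5
```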
Taking this into account we get the following:
WithSecure performed well in detection coverage, as did Palo Alto, FireEye and Carbon Black. What’s more interesting here though are the high-fidelity results, which are a lot lower on average and better reflect the real-world effectiveness of EDR products. Also notice that the absolute differences in high fidelity results are negligible between the top vendors.
What about correlation?
In Round 1 a “tainted” modifier was used to determine if a detection relied on previous activity (which can be both a positive and negative). In Round 1, WithSecure Countercept had no tainted detections because we were able to demonstrate direct detections.
However, our platform does use correlation for detection and investigation, as shown in some of the screenshots, but this was not directly recorded in Round 1. For now, we’ve excluded correlation from our analysis, although the good news is MITRE is adding an explicit correlated modifier in Round 2.
I wanted to finish this section with a quote from MITRE:
“The evaluation focuses on articulating how detections occur, rather than assigning scores to vendor capabilities.”
While it’s tempting to try and score vendors by adding up the total counts of detections, you’ll often get more value by qualitatively analyzing results. Think quality, not quantity.
Limitations in Round 1
The Round 1 evaluation is a great start in providing a high-level generic set of testing that can be applied to any EDR solution. However, there are some limitations:
- All test cases are treated equally (when that’s not the case in the real world)
- It’s in a zero-noise environment
- Investigation workflow isn’t tested
- Response tasks aren’t used
- The human element is omitted
As such, Round 1 shouldn’t be used in isolation as a means to assess EDR products.
We’d recommend using the high-level telemetry and detection metrics, as well as the UI screenshots from Round 1, as a starting point to help you shortlist vendors.
To properly assess a tool, you will likely need to install it and test it yourself (ideally with some simulated attacks and workflow).
What is the right EDR agent for your organization?
Many vendors have made bold claims about why they are better than their competitors. At WithSecure we’re a bit different. We think there are a lot of good EDR products out there that are very similar to what we have developed for WithSecure Countercept. The MITRE Evaluation demonstrates this quite well.
EDR is an essential component of attack detection, with the people behind it making a massive impact on its effectiveness. At WithSecure Countercept, our focus is managed detection and response, pairing our EDR agent with some of the best people in the industry. If you want some of the world’s best threat hunters watching your back, please get in touch.