Technical Note
Designing for Analyst Cognition, Not Detection Throughput
How tool design shapes analyst performance — and what most platforms get wrong.
Security products like to count throughput: files scanned, alerts closed, detections generated, time to verdict. Those numbers are useful for operations. They are much less useful for judging whether an analyst understood the sample.
Malware analysis is a memory-heavy job. The analyst is trying to keep track of why a claim was made, what evidence supports it, what remains ambiguous, and which paths have already been checked. Tooling that speeds up labeling while dropping that context does not make the work easier. It just moves the burden back onto the person at the keyboard.
The harder problem is not producing another label. It is preserving enough of the reasoning trail that a human can trust, challenge, or revise the label.
The real bottleneck
The bottleneck is often the analyst's working memory.
One common loop looks like this: open the binary in Ghidra, inspect a suspicious function, jump to the string view, check xrefs, compare a sandbox note, search prior case notes, then return to the original function and wonder why the tool marked it as credential access. Maybe the label is right. Maybe the string came from a bundled library. Maybe the function is unreachable. Maybe the analyst already rejected the same pattern yesterday.
If the system saved only the final label, that context is gone.
Throughput metrics can miss that cost. A tool can process more samples while creating more work per sample. It can fill a queue with plausible labels that still require manual reconstruction. The analyst then has to check not only whether a claim is true but also how the claim was assembled.
Verdicts are not enough
A verdict is helpful at the start of triage. It is not enough for serious analysis.
For malware behavior, a claim needs a review surface. A persistence claim might come from a registry string, an installer artifact, a startup-folder path, or reviewed code. An injection claim might come from an import list, a process-memory API neighborhood, or a function-level review. Those are not interchangeable.
When a tool hides the source of the claim, the analyst has to reopen the investigation. That means more time in the disassembler, more note comparison, more searches through string neighborhoods, and more skepticism toward model output. Some skepticism is healthy. Forced skepticism caused by missing context is wasted effort.
The interface should show whether a claim came from an import, a string neighborhood, a code region, a graph relationship, a sandbox event, a rule hit, or a previous analyst decision. That one design choice changes the work from "trust this label" to "inspect this evidence."
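One way to picture this is a claim record that carries its evidence sources with it. The sketch below is a minimal illustration, not a description of any existing product. The class names, the `locator` field, and the sample address are hypothetical. The source kinds are the ones listed above.

```python
from dataclasses import dataclass, field
from enum import Enum


class EvidenceSource(Enum):
    """Where a claim came from (the source kinds named in the text)."""
    IMPORT = "import"
    STRING_NEIGHBORHOOD = "string neighborhood"
    CODE_REGION = "code region"
    GRAPH_RELATIONSHIP = "graph relationship"
    SANDBOX_EVENT = "sandbox event"
    RULE_HIT = "rule hit"
    ANALYST_DECISION = "previous analyst decision"


@dataclass(frozen=True)
class Evidence:
    source: EvidenceSource
    locator: str    # hypothetical pointer: function name, address, event id
    note: str = ""  # optional free-text context from the tool or analyst


@dataclass
class Claim:
    behavior: str
    evidence: list[Evidence] = field(default_factory=list)

    def review_surface(self) -> list[str]:
        """Return one inspectable line per piece of evidence."""
        if not self.evidence:
            return [f"{self.behavior}: no evidence recorded"]
        lines = []
        for item in self.evidence:
            suffix = f" ({item.note})" if item.note else ""
            lines.append(
                f"{self.behavior}: {item.source.value} at {item.locator}{suffix}"
            )
        return lines


# Illustrative values only; the address and note are made up.
claim = Claim(
    behavior="persistence",
    evidence=[
        Evidence(EvidenceSource.STRING_NEIGHBORHOOD, "string@0x4021a0", "Run key path"),
        Evidence(EvidenceSource.IMPORT, "RegSetValueExW"),
    ],
)
for line in claim.review_surface():
    print(line)
```

A claim with an empty evidence list says so explicitly, which keeps "no recorded source" visible instead of silently presenting a bare label.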
Context decays quickly
Analysts rarely work on one clean object from start to finish. They get interrupted. They pivot to related files. They compare a family label against a vendor name. They check whether a domain was already seen in another case. They leave a note, come back later, and need to remember why a function mattered.
Tooling often treats that messy trail as secondary metadata. It is not secondary. It is the analysis.
Accepted suggestions, rejected suggestions, explicit negatives, caveats, and "needs context" decisions all matter. A rejected claim can save the next analyst from repeating the same path. An explicit negative can teach the system that a tempting pattern is not enough. A caveat can prevent a weak static indicator from becoming a public overclaim.
Without that trail, each analyst starts from the surface again.
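A review trail of this kind can be sketched as an append-only log keyed by the pattern that was reviewed. This is an assumption about one possible shape, not a prescribed design. The `ReviewTrail` class, the outcome names, and the pattern key are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical outcome vocabulary, mirroring the decision types named above.
OUTCOMES = {"accepted", "rejected", "explicit_negative", "needs_context"}


@dataclass(frozen=True)
class Decision:
    pattern: str   # hypothetical key describing what was reviewed
    outcome: str   # one of OUTCOMES
    analyst: str
    caveat: str
    recorded_at: datetime


class ReviewTrail:
    """Append-only record of analyst decisions; nothing is overwritten."""

    def __init__(self) -> None:
        self._decisions: list[Decision] = []

    def record(self, pattern: str, outcome: str, analyst: str, caveat: str = "") -> Decision:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        decision = Decision(pattern, outcome, analyst, caveat, datetime.now(timezone.utc))
        self._decisions.append(decision)
        return decision

    def history(self, pattern: str) -> list[Decision]:
        """All decisions recorded for a pattern, oldest first."""
        return [d for d in self._decisions if d.pattern == pattern]

    def previously_rejected(self, pattern: str) -> bool:
        """True if any analyst rejected or explicitly negated this pattern."""
        return any(
            d.outcome in {"rejected", "explicit_negative"}
            for d in self.history(pattern)
        )


trail = ReviewTrail()
trail.record(
    "credential string inside bundled library",
    "rejected",
    analyst="analyst_a",
    caveat="library artifact, not live behavior",
)
print(trail.previously_rejected("credential string inside bundled library"))
```

Keeping rejections and caveats in the same log as acceptances is the point of the sketch: the next analyst can ask whether a pattern was already checked before reopening the disassembler.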
Uncertainty needs a place to live
Malware analysis has many legitimate half-states. A string may be suspicious but unused. A function may look like unpacking but lack a clear transfer path. A network artifact may be configuration, a decoy, or real command infrastructure. Static evidence may show capability without proving execution.
Interfaces often force these cases into true or false too early. That creates bad labels and bad habits. Analysts learn either to ignore the tool or to accept language that is stronger than the evidence.
The better pattern is to let uncertainty remain visible. Candidate, reviewed positive, rejected suggestion, explicit negative, and needs-context are different states. They should look different in the product. They should also survive export into a report or handoff note.
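As a rough sketch of how those states could survive export, each state can map to its own report wording so that a candidate never reads like a confirmed finding. The state names come from the paragraph above. The `ReviewState` enum, the phrasing table, and `export_line` are hypothetical names chosen for illustration.

```python
from enum import Enum


class ReviewState(Enum):
    CANDIDATE = "candidate"
    REVIEWED_POSITIVE = "reviewed positive"
    REJECTED_SUGGESTION = "rejected suggestion"
    EXPLICIT_NEGATIVE = "explicit negative"
    NEEDS_CONTEXT = "needs context"


# Hypothetical report wording; each state gets distinct language.
REPORT_PHRASING = {
    ReviewState.CANDIDATE: "Unreviewed candidate: {claim}",
    ReviewState.REVIEWED_POSITIVE: "Confirmed by analyst review: {claim}",
    ReviewState.REJECTED_SUGGESTION: "Suggested by tooling, rejected on review: {claim}",
    ReviewState.EXPLICIT_NEGATIVE: "Checked and not observed: {claim}",
    ReviewState.NEEDS_CONTEXT: "Undetermined, more context needed: {claim}",
}


def export_line(claim: str, state: ReviewState, caveat: str = "") -> str:
    """Render a claim for a report or handoff note without losing its state."""
    line = REPORT_PHRASING[state].format(claim=claim)
    if caveat:
        line += f" [caveat: {caveat}]"
    return line


print(export_line("process injection", ReviewState.NEEDS_CONTEXT,
                  "static evidence only"))
```

Because the wording is looked up by state, adding a state without report language fails loudly with a `KeyError` instead of falling back to confident prose.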
This is not caution for its own sake. It is how a team avoids laundering weak evidence into confident prose.
Explanations need to be checkable
Explanations are often treated like a paragraph attached after detection. That is not enough.
An explanation helps when it reduces the cost of checking the tool. "Credential access likely" is not an explanation. "Credential-access candidate from string neighborhood near reviewed parser logic" is closer. A pointer to the function, block, string, event, or note is better still.
Some visual explanations look useful while giving the analyst little to inspect. A heatmap over bytes, a keyword list, or a generic summary can be a starting point, but the analyst still needs a reviewable object. Where is the evidence? What else is nearby? Has anyone rejected this pattern before? Does the claim depend on runtime behavior that has not been observed?
If checking the tool takes longer than checking the sample, the tool has already lost.
Disagreement is part of the workflow
Analysts will disagree with automated labels. The product should expect that.
If rejecting a claim feels like fighting the interface, the review trail will be incomplete. If revising a label requires a separate note, the correction may never make it back into the system. If the tool only records accepted detections, it loses the boundary cases that would make future suggestions better.
Disagreement is especially useful around behavior claims. A rejected injection candidate, a downgraded persistence claim, or a note that says "library artifact, not live behavior" can prevent repeated false confidence. That material is analyst commentary and training data for the organization, even if no model ever sees it.
Measuring what matters
Analyst-centered tools still need measurement. The measures have to reflect the work.
Useful questions include how often analysts revise labels, which claim types create repeated disagreement, how often reports need caveat correction, and whether another analyst can reconstruct the decision from saved evidence. Time to verdict matters, but only after the verdict is worth keeping.
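Two of those questions, how often labels are revised and which claim types attract repeated disagreement, can be computed from saved review records. The sketch below assumes each record stores the label the tool suggested and the label the analyst kept. The `Review` record and the sample data are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Review(NamedTuple):
    claim_type: str       # e.g. "injection", "persistence"
    suggested_label: str  # what the tool proposed
    final_label: str      # what the analyst kept after review


def revision_rate(reviews: list[Review]) -> float:
    """Fraction of reviewed claims whose label the analyst changed."""
    if not reviews:
        return 0.0
    revised = sum(1 for r in reviews if r.suggested_label != r.final_label)
    return revised / len(reviews)


def disagreement_by_type(reviews: list[Review]) -> Counter:
    """Count revised labels per claim type to surface repeat offenders."""
    return Counter(
        r.claim_type for r in reviews if r.suggested_label != r.final_label
    )


# Made-up records for illustration.
reviews = [
    Review("injection", "process injection", "library artifact"),
    Review("injection", "process injection", "process injection"),
    Review("persistence", "registry persistence", "installer artifact"),
    Review("injection", "process injection", "unreachable code"),
]

print(revision_rate(reviews))
print(disagreement_by_type(reviews).most_common(1))
```

Neither number says whether a verdict was correct. They only show where analysts keep overriding the tool, which is where review effort and tool changes are most likely to pay off.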
Throughput is still part of the picture. A tool that cannot handle volume will not survive operational use. But volume without traceability turns into a backlog of untrusted labels.
The practical goal is modest: keep the evidence, the caveat, and the review decision attached to the claim. When the analyst comes back tomorrow, or when a second analyst picks up the case, the tool should still remember why the claim existed.
That kind of traceability will not look as impressive on a dashboard as a throughput number. It will matter more during review.
References
- Arp et al., Drebin: Effective and Explainable Detection of Android Malware in Your Pocket
- Guo et al., LEMNA: Explaining Deep Learning Based Security Applications
- Warnecke et al., Evaluating Explanation Methods for Deep Learning in Security
- David et al., Feasibility Study for Supporting Static Malware Analysis with LLMs
- NSA, Ghidra Software Reverse Engineering Framework
- MITRE, Malware Behavior Catalog