Technical Note

Behavioral Clustering Across Malware Families

What recurring patterns reveal about campaign infrastructure, and why file hashes aren't enough.

Malware family names are useful, but they are not the ground truth of an intrusion. They are labels applied by humans, vendors, tooling, and history. Some refer to code lineage. Some refer to a payload role. Some survive because the name was convenient long after the activity changed.

For defenders, the practical question is often smaller and more stubborn: what behavior is repeating?

That question matters because operators reuse habits even when file hashes, packers, infrastructure, and family names change. Behavioral clustering tries to group artifacts by the evidence they expose and the jobs they appear built to perform. It does not replace family labeling. It gives analysts another way to see related work.

What is actually clustered

Behavioral clustering can sound vague unless the objects are named.

At the artifact level, a cluster might use imports, strings, section traits, packer traits, config artifacts, rule hits, and coarse file structure. At the code level, it might use function-level evidence, API neighborhoods, call relationships, control-flow patterns, or disassembly-derived features. At the analysis level, it might use analyst-reviewed behavior claims, rejected suggestions, sandbox notes, or prior case annotations.

Those objects should not all carry the same weight. A string match is not equal to a reviewed behavior claim. A packer trait is not equal to payload logic. A family label is not equal to attribution. The clustering system has to keep those distinctions visible, or it will produce groups that look precise but collapse very different kinds of evidence.
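One way to keep those distinctions visible is to tag every piece of evidence with its kind and to give each kind an explicit weight. The sketch below is illustrative only. The kind names, the numeric weights, and the `Evidence` and `weight_by_kind` names are assumptions made for this example, not values taken from any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceKind(Enum):
    # Hypothetical kinds, listed roughly from weakest to strongest.
    PACKER_TRAIT = "packer_trait"
    STRING_MATCH = "string_match"
    RULE_HIT = "rule_hit"
    FUNCTION_FEATURE = "function_feature"
    REVIEWED_CLAIM = "reviewed_claim"


# Illustrative weights only; a real system would tune or learn these.
WEIGHTS = {
    EvidenceKind.PACKER_TRAIT: 0.1,
    EvidenceKind.STRING_MATCH: 0.2,
    EvidenceKind.RULE_HIT: 0.3,
    EvidenceKind.FUNCTION_FEATURE: 0.6,
    EvidenceKind.REVIEWED_CLAIM: 1.0,
}


@dataclass(frozen=True)
class Evidence:
    kind: EvidenceKind
    value: str


def weight_by_kind(shared: list[Evidence]) -> dict[str, float]:
    """Sum the weight of shared evidence per kind so the breakdown stays visible."""
    totals: dict[str, float] = {}
    for ev in shared:
        totals[ev.kind.value] = totals.get(ev.kind.value, 0.0) + WEIGHTS[ev.kind]
    return totals
```

With these assumed weights, two shared strings still total less than one reviewed claim, and the per-kind breakdown shows that instead of folding everything into a single score.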

Where family labels fail

Family labels fail most often when the label names the wrong layer.

A loader family may be used by several unrelated actors. A commodity stealer may be delivered by different campaigns with different infrastructure. A packed sample may inherit a vendor label from the wrapper rather than the payload. Two files may share a family name because they came from the same builder, while their role in the operation is different.

The reverse also happens. Two samples may receive different family names because vendors saw different payloads, different stages, or different naming histories. An analyst who stops at the label may miss that both samples use similar configuration handling, persistence logic, command parsing, or staging behavior.

Family labels remain useful. They help with search, triage, reporting, and leakage control in evaluation, such as keeping samples from one family off both sides of a train/test split. They become risky when they are treated as the only way to organize evidence.

A concrete miss

Consider two Windows artifacts with different vendor family names.

One is tagged as a loader. The other is tagged as a backdoor. On the surface, they belong in separate buckets. During review, both show a similar config grammar, a related decryption routine, the same pattern for resolving network endpoints, and a repeated way of handling failed connections. None of that proves shared operator control. It does suggest the samples may belong in the same analytic neighborhood.

That neighborhood is useful even if the final conclusion is cautious. It tells the analyst what to compare next. It can surface older notes, related infrastructure, or a prior rejection that prevents the same weak claim from being repeated.

The cluster is not the answer. It is a way to keep connected evidence from being scattered across family names.
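The comparison in this example can be sketched as a layer-by-layer similarity check. The code below uses Jaccard similarity over feature sets, which is one simple choice among many. The layer names and feature values are hypothetical, invented only to mirror the loader and backdoor case described above.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: shared items over all items, 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare_by_layer(
    left: dict[str, set[str]], right: dict[str, set[str]]
) -> dict[str, float]:
    """Return one similarity score per evidence layer present in either sample."""
    layers = sorted(set(left) | set(right))
    return {
        layer: jaccard(left.get(layer, set()), right.get(layer, set()))
        for layer in layers
    }


# Hypothetical feature sets for a sample tagged "loader" and one tagged "backdoor".
loader = {
    "family_label": {"vendor:loader_x"},
    "config_grammar": {"kv_pairs", "xor_header", "length_prefixed"},
    "network_handling": {"dns_then_ip_fallback", "retry_backoff"},
}
backdoor = {
    "family_label": {"vendor:backdoor_y"},
    "config_grammar": {"kv_pairs", "xor_header", "length_prefixed"},
    "network_handling": {"dns_then_ip_fallback", "retry_backoff", "proxy_aware"},
}

scores = compare_by_layer(loader, backdoor)
for layer, score in scores.items():
    print(f"{layer}: {score:.2f}")
```

The family label layer shows no overlap while the config and network layers score high. That is a prompt for further comparison, not a conclusion about shared control.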

A concrete overmerge

Now reverse the case.

Two samples share a family label because both use a common loader. One drops a credential theft payload. The other installs a remote administration component. If the cluster is built mainly from the family label or loader traits, it will overmerge them. The analyst may see a single group where there are two different operational roles.

Behavioral clustering helps only when it can separate wrapper evidence from payload evidence, weak indicators from reviewed behavior, and shared tooling from shared activity. Without those boundaries, clustering becomes another source of false confidence.
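A minimal sketch of the overmerge, assuming each sample record keeps wrapper evidence and payload evidence in separate fields. The field names and sample values are hypothetical. Grouping on the wrapper alone produces one group; grouping on payload role produces two.

```python
from collections import defaultdict


def group_by(samples: dict[str, dict[str, str]], key: str) -> dict[str, list[str]]:
    """Group sample IDs by the value of a single evidence field."""
    groups: dict[str, list[str]] = defaultdict(list)
    for sample_id, fields in sorted(samples.items()):
        groups[fields[key]].append(sample_id)
    return dict(groups)


# Hypothetical samples: same loader, different payload roles.
samples = {
    "sample_a": {"wrapper": "common_loader", "payload_role": "credential_theft"},
    "sample_b": {"wrapper": "common_loader", "payload_role": "remote_admin"},
}

by_wrapper = group_by(samples, "wrapper")       # one merged group
by_payload = group_by(samples, "payload_role")  # two separate groups
```

Keeping the two fields apart is the design choice that matters here. If wrapper and payload traits are flattened into one feature bag before grouping, the separation cannot be recovered afterward.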

Infrastructure as behavior evidence

Infrastructure is often tracked as its own graph: domains, IPs, certificates, hosting, redirection, accounts, and timing. That graph matters. It matters more when connected to software behavior.

A domain in a string table is only an indicator. A repeated pattern of configuration parsing, endpoint rotation, fallback handling, and beacon formatting is closer to an operational habit. The difference is important. The first can be changed quickly. The second may survive across builds because it is part of how the tool works.

Behavioral clustering can connect those layers. A hash identifies one artifact. A family name summarizes historical labeling. Infrastructure shows external relationships. Behavior evidence shows what the artifact appears organized to do.

A cluster ID helps only if the analyst can see what pulled the samples together.

Clustering is not attribution

Similarity is not attribution. Shared behavior is not proof of a shared operator. Shared infrastructure is not proof of a single campaign. Shared code is not proof of shared control.

This caveat should be built into the workflow, not buried in a footnote. A behavioral cluster is a hypothesis about related evidence. It may support triage, prioritization, comparison, or intelligence production. It should not silently become an actor claim.

Keeping that boundary visible lets the cluster stay useful without asking it to prove more than it can.
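One way to build that caveat into the workflow is to make the cluster record carry no actor claim by default and to require a named reviewer and a written rationale before one can be attached. The sketch below is an assumption about how such a record might look. The `BehaviorCluster` type and its fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BehaviorCluster:
    """A cluster is a hypothesis about related evidence, not an actor claim."""

    cluster_id: str
    members: list[str]
    caveats: list[str] = field(default_factory=list)
    actor_claim: Optional[str] = None
    actor_claim_reviewer: Optional[str] = None

    def attach_actor_claim(self, actor: str, reviewer: str, rationale: str) -> None:
        """Record an actor claim only with a named reviewer and a written rationale."""
        if not reviewer.strip() or not rationale.strip():
            raise ValueError("actor claims need a reviewer and a rationale")
        self.actor_claim = actor
        self.actor_claim_reviewer = reviewer
        # Keep the rationale next to the claim so later readers can inspect it.
        self.caveats.append(f"actor claim by {reviewer}: {rationale}")
```

The check does not make an actor claim correct. It only ensures that a claim cannot appear on a cluster without someone having made it explicitly.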

What makes a cluster useful

A cluster earns its keep when it gives the analyst inspectable reasons for membership.

The analyst should be able to see whether the cluster is driven by API neighborhoods, strings, function-level evidence, config extraction, packer traits, rule hits, infrastructure overlap, or reviewed behavior claims. If the reason is mostly weak evidence, the cluster should look weak. If reviewed claims anchor the group, that should be visible.
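A minimal sketch of an inspectable membership summary follows. It assumes each sample's membership score can be broken down by evidence kind. The kind names, the set of kinds treated as weak, and the 0.5 threshold are all assumptions chosen for illustration.

```python
# Assumption: these kinds are treated as weak evidence on their own.
WEAK_KINDS = {"strings", "packer_traits", "rule_hits"}


def describe_membership(contributions: dict[str, float]) -> dict[str, object]:
    """Summarize why a sample joined a cluster and whether weak evidence dominates."""
    total = sum(contributions.values())
    if total <= 0:
        return {"strength": "none", "reasons": []}

    # Share of the total score contributed by each evidence kind.
    shares = {kind: score / total for kind, score in contributions.items()}
    weak_share = sum(share for kind, share in shares.items() if kind in WEAK_KINDS)

    if contributions.get("reviewed_claims", 0.0) > 0:
        strength = "anchored"   # at least one reviewed claim supports membership
    elif weak_share > 0.5:
        strength = "weak"       # mostly strings, packer traits, or rule hits
    else:
        strength = "moderate"

    # Largest contributors first, so the analyst sees what pulled the sample in.
    reasons = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return {"strength": strength, "reasons": reasons}
```

For example, `describe_membership({"strings": 3.0, "packer_traits": 1.0, "api_neighborhoods": 1.0})` is labeled weak under these assumptions, because four fifths of the score comes from weak kinds.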

Useful clusters also preserve disagreement. A rejected behavior suggestion can explain why a sample should stay out of a group. A caveat can explain why two samples are similar in loader behavior but different in payload behavior. A prior analyst note can keep the same comparison from being repeated.
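Preserving disagreement can be as simple as storing rejections next to accepted claims and checking them before a suggestion is raised again. The sketch below assumes a small in-memory log. The `Rejection` and `ReviewLog` names and their fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Rejection:
    sample_id: str
    claim: str
    reason: str
    analyst: str


@dataclass
class ReviewLog:
    rejections: list[Rejection] = field(default_factory=list)

    def reject(self, sample_id: str, claim: str, reason: str, analyst: str) -> None:
        """Record that an analyst rejected a suggested behavior claim."""
        self.rejections.append(Rejection(sample_id, claim, reason, analyst))

    def prior_rejection(self, sample_id: str, claim: str) -> Optional[Rejection]:
        """Return the most recent rejection of this claim for this sample, if any."""
        for rejection in reversed(self.rejections):
            if rejection.sample_id == sample_id and rejection.claim == claim:
                return rejection
        return None
```

A clustering step could call `prior_rejection` before proposing a claim, then show the stored reason instead of repeating the suggestion.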

Most teams already have hashes, YARA hits, sandbox notes, AV names, and half-finished analyst comments. They usually have trouble keeping those objects connected after the sample leaves the queue.

Behavioral clustering is useful when it keeps related evidence close without pretending the grouping proves more than it does.

References

Marcos Sebastián, Richard Rivera, Platon Kotzias, and Juan Caballero, AVClass: A Tool for Massive Malware Labeling
Silvia Sebastián and Juan Caballero, AVClass2: Massive Malware Tag Extraction from AV Labels
MITRE, ATT&CK
MITRE, Malware Behavior Catalog
Jordaney et al., Transcend: Detecting Concept Drift in Malware Classification Models
Yang et al., BODMAS: An Open Dataset for Learning Based Temporal Analysis of PE Malware