Compare commits


268 Commits

Author SHA1 Message Date
Willi Ballenthin
7043aae69b plans: add chat sessions and transcripts exploring mapa problem statements 2026-03-17 12:21:35 +01:00
Willi Ballenthin
c8e277e8b4 mapa: html-map show function neighborhood
Fix html tooltip newline escaping
2026-03-17 09:58:30 +01:00
Willi Ballenthin
fe2373161d mapa: html-map assemblage ground truth 2026-03-16 20:34:53 +01:00
Willi Ballenthin
ac49d292cb mapa: consistent naming 2026-03-16 19:54:45 +01:00
Willi Ballenthin
877d8da73c mapa: add --output=html-map
Polish MAPA html split view
2026-03-16 19:54:45 +01:00
Willi Ballenthin
ae3a1b5dbf mapa: add assemblage ground truth rendering 2026-03-16 16:34:04 +01:00
Willi Ballenthin
22d90b157f mapa: QS string tags
mapa: add QS string tags

Vendor QS string databases and tag each string row with right-aligned
database-derived tags (#zlib, #winapi, #capa, #common, #code-junk, etc.).
Tags are matched against raw strings before display trimming. The visible
tag policy suppresses #common when a more-specific tag is present.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 15:58:24 +01:00
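The tag-suppression policy described above can be sketched as follows (function and tag names are illustrative, not capa's actual code):

```python
def visible_tags(tags: set[str]) -> set[str]:
    # suppress the generic "#common" tag whenever a more-specific
    # tag (e.g. "#zlib", "#winapi") is also present on the string
    specific = tags - {"#common"}
    return specific if specific else tags
```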
Willi Ballenthin
6db9e571eb plans: mapa QS string tags 2026-03-16 15:58:10 +01:00
Willi Ballenthin
a364659cc3 mapa: port from Lancelot/BinExport2 to IDALib/ida-domain
Replace the Lancelot/BinExport2 backend with an IDALib-only implementation
using ida-domain as the primary query surface.

New mapa/ package with four layers:
- model.py: backend-neutral dataclasses (MapaReport, MapaFunction, etc.)
- ida_db.py: database lifecycle with SHA-256 caching and flock guards
- collector.py: populates MapaReport from an open ida_domain.Database
- renderer.py: Rich-based text output from MapaReport
- cli.py: argument parsing, capa/assemblage loading, orchestration

Key behaviors preserved from the original:
- Report sections: meta, sections, libraries, functions (modules removed)
- Thunk chain resolution (depth 5, matching capa's THUNK_CHAIN_DEPTH_DELTA)
- Caller forwarding through thunks
- CFG stats with NOEXT|PREDS flags
- String extraction via data-reference chains (depth 10)
- Assemblage overlay and capa match attachment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mapa: suppress Lumina via IdaCommandOptions.plugin_options

Match capa's loader.py behavior: disable primary and secondary Lumina
servers by passing plugin_options through IdaCommandOptions, which maps
to IDA's -O switch. load_resources=True already provides -R.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mapa: add __main__.py for python -m mapa invocation

scripts/mapa.py shadows the mapa package when run directly because
Python adds scripts/ to sys.path. The canonical invocation is now:

    python -m mapa <input_file> [options]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mapa: import idapro before ida_auto

idapro must be imported first because it mutates sys.path to make
ida_auto and other IDA modules available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mapa: guard against InvalidEAError in string/xref lookups

ida-domain raises InvalidEAError for unmapped addresses instead of
returning None. Guard data_refs_from_ea and strings.get_at calls
so the collector handles broken reference chains gracefully.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mapa: change default/key theme color from black to blue

Black text is invisible on dark terminals. Use blue for function names,
keys, and values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mapa: use module.dll!func format for APIs and libraries

IDA strips .dll from PE import module names. Add it back so libraries
render as 'KERNEL32.dll' and API entries as 'KERNEL32.dll!CreateFileW'.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mapa: lowercase module names in libraries and API entries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mapa: use FLOSS/capa regex-based string extraction instead of IDA string list

IDA's built-in string list has a minimum length threshold (~5 chars)
that silently drops short strings like "exec". Replace db.strings and
ida_bytes.get_strlit_contents with regex-based extraction from FLOSS/capa
that scans raw segment bytes for ASCII and UTF-16 LE strings (min 4 chars).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mapa: simplify string extraction to on-demand via get_cstring_at

Replace upfront segment-scanning index with on-demand reads using
db.bytes.get_cstring_at, validated against FLOSS/capa printable ASCII
charset. The index approach missed mid-string references and did
unnecessary work scanning entire segments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mapa: add UTF-16 LE wide string extraction

Read raw bytes at data reference targets and check for both ASCII and
UTF-16 LE strings using FLOSS/capa printability heuristics. Neither
ida_domain's get_cstring_at nor get_string_at handle wide strings, so
we parse the byte patterns directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 15:04:26 +01:00
Willi Ballenthin
2456559bc0 plans: mapa idalib port 2026-03-16 12:15:28 +01:00
Willi Ballenthin
5dd1f49023 import codemap as mapa 2026-03-16 11:23:35 +01:00
dependabot[bot]
7b23834d8e build(deps-dev): bump black from 25.12.0 to 26.3.0 (#2902)
* build(deps-dev): bump black from 25.12.0 to 26.3.0

Bumps [black](https://github.com/psf/black) from 25.12.0 to 26.3.0.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](https://github.com/psf/black/compare/25.12.0...26.3.0)

---
updated-dependencies:
- dependency-name: black
  dependency-version: 26.3.0
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* style: auto-format with black and isort

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
Co-authored-by: Capa Bot <capa-dev@mandiant.com>
2026-03-13 15:46:13 +01:00
Capa Bot
f1800b5eb4 Sync capa rules submodule 2026-03-12 17:41:51 +00:00
Capa Bot
43f556caf9 Sync capa rules submodule 2026-03-12 17:08:39 +00:00
Capa Bot
5f8c06c650 Sync capa rules submodule 2026-03-12 17:04:53 +00:00
Devyansh Somvanshi
ceaa3b6d03 webui: include feature type in global search (match, regex, api, …) (#2906)
* webui: include feature type in global search (match, regex, etc.)

Searching for "match" or "regex" in the capa Explorer web UI produced
no results because PrimeVue's globalFilterFields only included the
name field, while the feature kind (e.g. "match", "regex", "api") is
stored in the separate typeValue field.

Add 'typeValue' to globalFilterFields so that the global search box
matches nodes by both their value (name) and their kind (typeValue).
No change to rendering or data structure; only the set of fields
consulted during filtering is widened.

Fixes #2349.

* changelog: add entry for #2349 webui global search fix
2026-03-12 10:43:49 -06:00
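The actual fix is a one-line change to PrimeVue's globalFilterFields in the Vue component, but the widened matching behavior can be expressed in Python terms as (names hypothetical):

```python
def matches_global_filter(node: dict, query: str,
                          fields: tuple[str, ...] = ("name", "typeValue")) -> bool:
    # match a tree node if the query appears in any configured field:
    # the value (name) or the feature kind (typeValue)
    q = query.lower()
    return any(q in str(node.get(f, "")).lower() for f in fields)
```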
Devyansh Somvanshi
c03d833a84 rules: handle empty or invalid YAML documents in Rule.from_yaml (#2903)
* rules: handle empty or invalid YAML documents in Rule.from_yaml

Empty or whitespace-only .yml files caused a cryptic TypeError in
Rule.from_dict (NoneType not subscriptable) when yaml.load returned None.
This made lint.py abort with a stack trace instead of a clear message.

Add an early guard in Rule.from_yaml that raises InvalidRule with a
descriptive message when the parsed document is None or structurally
invalid.  get_rules() now logs a warning and skips such files so that
scripts/lint.py completes cleanly even when placeholder .yml files
exist in the rules/ or rules/nursery/ directories.

Fixes #2900.

* changelog: add entry for #2900 empty YAML handling

* rules: fix exception check and add get_rules skip test

- Use e.args[0] instead of str(e) to check the error message.
  InvalidRule.__str__ prepends "invalid rule: " so str(e) never
  matched the bare message, causing every InvalidRule to be re-raised.
- Add test_get_rules_skips_empty_yaml to cover the get_rules skip path,
  confirming that an empty file is warned-and-skipped while a valid
  sibling rule is still loaded.

* fix: correct isort import ordering in tests/test_rules.py

Move capa.engine import before capa.rules.cache to satisfy
isort --length-sort ordering.
2026-03-10 15:04:11 -06:00
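A sketch of both the None guard and the `e.args[0]` subtlety described above (simplified names; the real guard lives in Rule.from_yaml):

```python
class InvalidRule(ValueError):
    def __str__(self) -> str:
        # prepends a prefix — which is why str(e) never equals the
        # bare message, and e.args[0] must be compared instead
        return f"invalid rule: {self.args[0]}"

def check_parsed_rule(doc):
    # yaml.load returns None for empty or whitespace-only documents;
    # a structurally valid rule document must be a mapping with "rule"
    if doc is None or not isinstance(doc, dict) or "rule" not in doc:
        raise InvalidRule("empty or invalid rule document")
    return doc
```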
Devyansh Somvanshi
1f4a16cbcc loader: skip PE files with unrealistically large section virtual sizes (#2905)
* loader: skip PE files with unrealistically large section virtual sizes

Some malformed PE samples declare section virtual sizes orders of
magnitude larger than the file itself (e.g. a ~400 KB file with a
900 MB section).  vivisect attempts to map these regions, causing
unbounded CPU and memory consumption (see #1989).

Add _is_probably_corrupt_pe() which uses pefile (fast_load=True) to
check whether any section's Misc_VirtualSize exceeds
max(file_size * 128, 512 MB).  If the check fires, get_workspace()
raises CorruptFile before vivisect is invoked, keeping the existing
exception handling path consistent.

Thresholds are intentionally conservative to avoid false positives on
large but legitimate binaries.  When pefile is unavailable the helper
returns False and behaviour is unchanged.

Fixes #1989.

* changelog: add entry for #1989 corrupt PE large sections

* loader: apply Gemini review improvements

- Extend corrupt-PE check to FORMAT_AUTO so malformed PE files
  cannot bypass the guard when format is auto-detected (the helper
  returns False for non-PE files so there is no false-positive risk).
- Replace magic literals 128 and 512*1024*1024 with named constants
  _VSIZE_FILE_RATIO and _MAX_REASONABLE_VSIZE for clarity.
- Remove redundant int() cast around getattr(Misc_VirtualSize); keep
  the `or 0` guard for corrupt files where pefile may return None.
- Extend test to cover FORMAT_AUTO path alongside FORMAT_PE.

* tests: remove mock-only corrupt PE test per maintainer request

williballenthin noted the test doesn't add real value since it only
exercises the mock, not the actual heuristic. Removing it per feedback.

* fix: resolve flake8 NIC002 implicit string concat and add missing test

Fix the implicit string concatenation across multiple lines that caused
code_style CI to fail. Also add the test_corrupt_pe_with_unrealistic_section_size_short_circuits
test that was described in the PR body but not committed.
2026-03-10 15:03:35 -06:00
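The threshold logic above, separated from the pefile parsing, amounts to something like this (constant names follow the commit; the section sizes would come from pefile's Misc_VirtualSize fields):

```python
_VSIZE_FILE_RATIO = 128
_MAX_REASONABLE_VSIZE = 512 * 1024 * 1024  # 512 MB

def is_probably_corrupt_pe(file_size: int, section_vsizes: list) -> bool:
    # flag files where any section's virtual size exceeds
    # max(file_size * 128, 512 MB); `or 0` guards against pefile
    # returning None for corrupt section headers
    limit = max(file_size * _VSIZE_FILE_RATIO, _MAX_REASONABLE_VSIZE)
    return any((vsize or 0) > limit for vsize in section_vsizes)
```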
Devyansh Somvanshi
2c9e30c3e1 perf: eliminate O(n²) tuple growth and reduce per-match overhead (#2890)
* perf: eliminate O(n²) tuple growth and reduce per-match overhead

Four data-driven performance improvements identified by profiling
the hot paths in capa's rule-matching and capability-finding pipeline:

1. find_static_capabilities / find_dynamic_capabilities (O(n²) → O(n))
   Tuple concatenation with `t += (item,)` copies the entire tuple on
   every iteration. For a binary with N functions this allocates O(N²)
   total objects. Replace with list accumulation and a single
   `tuple(list)` conversion at the end.

2. RuleSet._match: pre-compute rule_index_by_rule_name (O(n) → O(1))
   `_match` is called once per instruction / basic-block / function scope
   (potentially millions of times). Previously it rebuilt the name→index
   dict on every call. The dict is now computed once in `__init__` and
   stored as `_rule_index_by_scope`, reducing each call to a dict lookup.

3. RuleSet._match: candidate_rules.pop(0) → deque.popleft() (O(n) → O(1))
   `list.pop(0)` is O(n) because it shifts every remaining element.
   Switch to `collections.deque` for O(1) left-side consumption.

4. RuleSet._extract_subscope_rules: list.pop(0) → deque.popleft() (O(n²) → O(n))
   Same issue: BFS over rules used list.pop(0), making the whole loop
   quadratic. Changed to a deque queue for linear-time processing.

Fixes #2880

* perf: use sorted merge instead of full re-sort for new rule candidates

When a rule matches and introduces new dependent candidates into
_match's work queue, the previous approach converted the deque to a
list, extended it with the new items, and re-sorted the whole
collection — O((k+m) log(k+m)).

Because the existing deque is already topologically sorted, we only
need to sort the new additions — O(m log m) — and then merge the two
sorted sequences in O(k+m) using heapq.merge.

Also adds a CHANGELOG entry for the performance improvements in #2890.

* perf: simplify candidate_rules to LIFO list, revert heapq.merge

Address reviewer feedback:
- Replace deque+popleft with list+pop (LIFO stack) in _extract_subscope_rules;
  processing order doesn't affect correctness, and list.pop() is O(1).
- Replace deque+popleft with list+pop (LIFO stack) in _match; sort candidate
  rules descending so pop() from the end yields the topologically-first rule.
- Revert heapq.merge back to the simpler extend+re-sort pattern; the added
  complexity wasn't justified given the typically small candidate set.
- Remove now-unused `import heapq`.
2026-03-10 14:21:48 -06:00
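The first and third fixes above can be illustrated in isolation (toy functions, not capa's actual code):

```python
from collections import deque

def accumulate_quadratic(items):
    # before: each `t += (item,)` copies the whole tuple — O(n^2) overall
    t = ()
    for item in items:
        t += (item,)
    return t

def accumulate_linear(items):
    # after: append to a list, convert once at the end — O(n)
    acc = []
    for item in items:
        acc.append(item)
    return tuple(acc)

# likewise, deque.popleft() consumes from the front in O(1), where
# list.pop(0) shifts every remaining element — O(n) per pop
queue = deque([1, 2, 3])
first = queue.popleft()
```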
Devyansh Somvanshi
8c138e3d22 loader: handle struct.error from dnfile and raise CorruptFile with a clear message (#2872)
* loader: handle struct.error from dnfile and show clear CorruptFile message

When .NET metadata is truncated or invalid, dnfile can raise struct.error.
Catch it in DnfileFeatureExtractor and DotnetFileFeatureExtractor and
re-raise as CorruptFile with a user-friendly message.

Fixes #2442

* dnfile: centralize dnPE() loading in load_dotnet_image helper

Add load_dotnet_image() to dnfile/helpers.py that calls dnfile.dnPE()
and catches struct.error, raising CorruptFile with the original error
message included (f"Invalid or truncated .NET metadata: {e}").

Both DnfileFeatureExtractor and DotnetFileFeatureExtractor now call the
helper instead of duplicating the try/except block, and their direct
import of struct is removed.

Addresses review feedback on #2872.

* style: reformat dnfile files with black --line-length=120

Fixes CI code_style failure by applying the project's configured
line length (120) instead of black's default (88).

---------

Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2026-03-10 14:20:48 -06:00
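The centralized helper described above follows a simple wrap-and-reraise shape; a sketch (here `parse` stands in for dnfile.dnPE so the example is self-contained):

```python
import struct

class CorruptFile(Exception):
    pass

def load_dotnet_image(parse, path):
    # wrap the dnfile load so truncated/invalid .NET metadata surfaces
    # as CorruptFile with a clear message instead of a raw struct.error
    try:
        return parse(path)
    except struct.error as e:
        raise CorruptFile(f"Invalid or truncated .NET metadata: {e}") from e
```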
Moritz
a11a03bc30 build(deps): bump minimatch and editorconfig in /web/explorer (#2892)
Bumps [minimatch](https://github.com/isaacs/minimatch) and [editorconfig](https://github.com/editorconfig/editorconfig-core-js). These dependencies needed to be updated together.

Updates `minimatch` from 3.1.2 to 3.1.5
- [Changelog](https://github.com/isaacs/minimatch/blob/main/changelog.md)
- [Commits](https://github.com/isaacs/minimatch/compare/v3.1.2...v3.1.5)

Updates `minimatch` from 9.0.5 to 9.0.9
- [Changelog](https://github.com/isaacs/minimatch/blob/main/changelog.md)
- [Commits](https://github.com/isaacs/minimatch/compare/v9.0.5...v9.0.9)

Updates `editorconfig` from 1.0.4 to 1.0.7
- [Release notes](https://github.com/editorconfig/editorconfig-core-js/releases)
- [Changelog](https://github.com/editorconfig/editorconfig-core-js/blob/main/CHANGELOG.md)
- [Commits](https://github.com/editorconfig/editorconfig-core-js/compare/v1.0.4...v1.0.7)

---
updated-dependencies:
- dependency-name: minimatch
  dependency-version: 3.1.5
  dependency-type: indirect
- dependency-name: minimatch
  dependency-version: 9.0.9
  dependency-type: indirect
- dependency-name: editorconfig
  dependency-version: 1.0.7
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-10 12:14:56 +01:00
dependabot[bot]
1173dc5fa5 build(deps): bump protobuf from 6.33.5 to 7.34.0 (#2891)
Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 6.33.5 to 7.34.0.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Commits](https://github.com/protocolbuffers/protobuf/commits)

---
updated-dependencies:
- dependency-name: protobuf
  dependency-version: 7.34.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2026-03-05 15:23:16 +01:00
Priyank Patel
e53f6abc1e ci: add black auto-format workflow (#2827) (#2883)
* ci: add black auto-format workflow (#2827)

Signed-off-by: priyank <priyank8445@gmail.com>

* ci: use pre-commit to run black and isort (#2827)

* ci: fix install dependencies to include dev extras

---------

Signed-off-by: priyank <priyank8445@gmail.com>
2026-03-05 12:29:33 +01:00
Aditya Pandey
038c46da16 features: fix Regex.get_value_str() returning escaped pattern, breaking capa2yara #1909 (#2886)
Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
2026-03-05 12:14:27 +01:00
Capa Bot
563071349f Sync capa rules submodule 2026-03-04 20:30:04 +00:00
Capa Bot
517dfe154a Sync capa rules submodule 2026-03-03 07:48:23 +00:00
Devyansh Somvanshi
2e36f67e11 doc: add table comparing ways to consume capa output (#2874)
* doc: add table comparing ways to consume capa output

Add a short table to usage.md for CLI, IDA, Ghidra, CAPE, and web.

Fixes #2273

* doc: add links to each option in the ways-to-consume table

Addresses reviewer feedback to provide a link to learn more for each
consumption method (IDA Pro, Ghidra, CAPE, Web/capa Explorer).

Refs #2273

* doc: add Binary Ninja to ways-to-consume table

Fixes #2273
2026-03-02 10:17:53 -07:00
dependabot[bot]
7bd04fe297 build(deps): bump minimatch and editorconfig in /web/explorer
Bumps [minimatch](https://github.com/isaacs/minimatch) and [editorconfig](https://github.com/editorconfig/editorconfig-core-js). These dependencies needed to be updated together.

Updates `minimatch` from 3.1.2 to 3.1.5
- [Changelog](https://github.com/isaacs/minimatch/blob/main/changelog.md)
- [Commits](https://github.com/isaacs/minimatch/compare/v3.1.2...v3.1.5)

Updates `minimatch` from 9.0.5 to 9.0.9
- [Changelog](https://github.com/isaacs/minimatch/blob/main/changelog.md)
- [Commits](https://github.com/isaacs/minimatch/compare/v9.0.5...v9.0.9)

Updates `editorconfig` from 1.0.4 to 1.0.7
- [Release notes](https://github.com/editorconfig/editorconfig-core-js/releases)
- [Changelog](https://github.com/editorconfig/editorconfig-core-js/blob/main/CHANGELOG.md)
- [Commits](https://github.com/editorconfig/editorconfig-core-js/compare/v1.0.4...v1.0.7)

---
updated-dependencies:
- dependency-name: minimatch
  dependency-version: 3.1.5
  dependency-type: indirect
- dependency-name: minimatch
  dependency-version: 9.0.9
  dependency-type: indirect
- dependency-name: editorconfig
  dependency-version: 1.0.7
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-02 16:08:23 +00:00
dependabot[bot]
9f781ec21b build(deps): bump rollup from 4.36.0 to 4.59.0 in /web/explorer (#2885)
Bumps [rollup](https://github.com/rollup/rollup) from 4.36.0 to 4.59.0.
- [Release notes](https://github.com/rollup/rollup/releases)
- [Changelog](https://github.com/rollup/rollup/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rollup/rollup/compare/v4.36.0...v4.59.0)

---
updated-dependencies:
- dependency-name: rollup
  dependency-version: 4.59.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
2026-03-02 17:07:20 +01:00
kamran ul haq
da1abed3f8 ci: pin pip-audit action SHAs and update to v1.1.0 (#2884)
* Fix conftest imports to use relative imports

After adding tests/__init__.py, conftest.py needs relative imports

* ci: pin pip-audit action SHAs and update to v1.1.0

* revert: remove unrelated conftest.py changes
2026-02-27 17:36:28 +01:00
Capa Bot
3bce2a9b62 Sync capa rules submodule 2026-02-26 16:44:29 +00:00
Devyansh Somvanshi
d97b61551d webui: show error when JSON does not follow expected result document schema (#2871)
* webui: show error when JSON does not follow expected schema

Validate result document has required fields (meta, meta.version,
meta.analysis, meta.analysis.layout, rules) after parse. Show
user-friendly error; for URL loads suggest reanalyzing (e.g. VT).

Fixes #2363

* webui: fix array validation bug and deduplicate VT suggestion string

- introduce isInvalidObject() helper (checks !v || typeof !== "object" || Array.isArray)
  so that arrays are correctly rejected in schema validation
- extract VT_REANALYZE_SUGGESTION constant to eliminate the duplicated string
  in loadRdoc()

Addresses review feedback on #2871

* webui: address review - validate feature_counts, hoist VT_REANALYZE_SUGGESTION

- Add validation for meta.analysis.feature_counts in validateRdocSchema()
  so parseFunctionCapabilities and other consumers do not hit missing/invalid
  feature_counts at runtime.
- Require feature_counts to have either 'functions' or 'processes' array
  (static vs dynamic result documents).
- Move VT_REANALYZE_SUGGESTION to module top level to avoid redefining
  on every loadRdoc call.

* webui: allow file-scoped-only result documents in schema validation

- Validation: allow feature_counts without functions/processes arrays; if
  present they must be arrays.
- rdocParser: default feature_counts.functions to [] when missing so
  file-scoped-only docs do not throw.

* webui: remove leading space from VT_REANALYZE_SUGGESTION constant

Per review feedback: the concatenation at call sites handles spacing,
so the constant should not carry a leading space.
2026-02-26 09:35:24 -07:00
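The webui change is JavaScript, but the final validation shape — required fields, arrays rejected where objects are expected, feature_counts optional — translates to a Python sketch like this (hypothetical analog, not the shipped code):

```python
def validate_rdoc_schema(doc) -> bool:
    # require meta, meta.version, meta.analysis, meta.analysis.layout,
    # and rules; reject non-mapping values (the JS helper additionally
    # rejects arrays, which pass `typeof === "object"` there)
    def bad(v):
        return not isinstance(v, dict)
    if bad(doc) or bad(doc.get("meta")):
        return False
    meta = doc["meta"]
    if "version" not in meta or bad(meta.get("analysis")):
        return False
    if bad(meta["analysis"].get("layout")):
        return False
    return isinstance(doc.get("rules"), dict)
```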
Devyansh Somvanshi
e1ffa1dd09 ida-explorer: fix TypeError when sorting locations with mixed address types (#2867)
* ida-explorer: fix TypeError when sorting mixed address types

When a feature has multiple locations and those locations contain a mix
of integer-based addresses (e.g. AbsoluteVirtualAddress) and non-integer
addresses (e.g. _NoAddress), calling sorted() raises a TypeError because
Python falls back to the reflected comparison (__gt__) which is not
defined on _NoAddress.

Add a sort key to sorted() that places integer-based addresses first
(sorted by value) and non-integer addresses last, avoiding the
cross-type comparison.

Fixes #2195

* ida-explorer: fix comparison at source so sorted(locations) works everywhere

Implement the gt solution per review: fix comparison for all addresses
so we can use sorted(locations) / sorted(addrs) consistently without
per-call-site sort keys.

- Add _NoAddress.__gt__ so mixed-type comparison works: (real_address <
  NO_ADDRESS) invokes it and NoAddress sorts last. Avoids TypeError
  when sorting AbsoluteVirtualAddress with _NoAddress.
- In ida/plugin/model.py, use sorted(locations) instead of a custom
  key. view.py (lines 1054, 1077) already use sorted(); they now work
  with mixed address types without change.

Fixes #2195

* changelog: move address sort fix to Bug Fixes section

Per maintainer feedback: fix applies beyond ida-explorer.
2026-02-26 09:33:05 -07:00
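The reflected-comparison mechanism described above can be demonstrated in miniature (plain ints stand in for AbsoluteVirtualAddress, which is int-based):

```python
class _NoAddress:
    # when Python evaluates `some_int < no_address`, int.__lt__ returns
    # NotImplemented and Python falls back to the reflected comparison
    # no_address.__gt__(some_int) — defining it makes _NoAddress sort last
    def __gt__(self, other):
        return True

    def __lt__(self, other):
        return False

NO_ADDRESS = _NoAddress()
```

With `__gt__` defined, `sorted()` (which only uses `<`) handles mixed lists without a per-call-site key.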
kamran ul haq
10dfd287b4 ci: cache vivisect workspaces between CI runs to speed up tests (#2881) 2026-02-25 23:15:13 +01:00
Devyansh Somvanshi
e9b3311338 binja: add mypy config for top-level binaryninja module (#2877)
Allow mypy to skip missing 'binaryninja' (not just binaryninja.*)
when the Binary Ninja API is not installed.

Fixes #2399
2026-02-24 08:57:46 -07:00
Moritz
54cc4ee7a3 Merge pull request #2873 from devs6186/fix/1865-vv-api-color 2026-02-23 23:01:39 +01:00
Capa Bot
12863ab4f2 Sync capa rules submodule 2026-02-23 20:52:03 +00:00
Devyansh Somvanshi
e41b5fb150 webui: fix 404 for "View rule in capa-rules" by using proper URL encoding (#2868)
* webui: fix 404 for "View rule in capa-rules" links

The createCapaRulesUrl function was constructing URLs by lowercasing
the rule name and replacing spaces with hyphens, which produced URLs
like /rules/packaged-as-single-file-.net-application/ (404).

The capa-rules website uses the original rule name with URL encoding
(e.g. /rules/packaged%20as%20single-file%20.NET%20application/).

Use encodeURIComponent() on the rule name to produce correct URLs.

Fixes #2482

* refactor: extract baseUrl constant in createCapaRulesUrl per code review
2026-02-23 13:10:31 -07:00
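The fix is in JavaScript (`encodeURIComponent`), but the equivalent in Python is `urllib.parse.quote` with `safe=""`; a sketch with an assumed base URL:

```python
from urllib.parse import quote

def create_capa_rules_url(rule_name: str) -> str:
    # preserve the original rule name and percent-encode it, so
    # "packaged as single-file .NET application" becomes
    # "packaged%20as%20single-file%20.NET%20application"
    base_url = "https://mandiant.github.io/capa/rules"  # assumed base
    return f"{base_url}/{quote(rule_name, safe='')}/"
```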
dependabot[bot]
4697902310 build(deps-dev): bump isort from 7.0.0 to 8.0.0 (#2879)
Bumps [isort](https://github.com/PyCQA/isort) from 7.0.0 to 8.0.0.
- [Release notes](https://github.com/PyCQA/isort/releases)
- [Changelog](https://github.com/PyCQA/isort/blob/main/CHANGELOG.md)
- [Commits](https://github.com/PyCQA/isort/compare/7.0.0...8.0.0)

---
updated-dependencies:
- dependency-name: isort
  dependency-version: 8.0.0
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-02-23 12:50:28 -07:00
Capa Bot
ed0783c31e Sync capa rules submodule 2026-02-23 16:33:25 +00:00
devs6186
f03ee75d69 doc: document that default output shows top-level matches only; -v/-vv show nested matches (#2875) 2026-02-22 07:41:15 +01:00
devs6186
f526357def main: suggest --os flag when ELF OS detection fails (#2869)
* main: suggest --os flag when OS detection fails for ELF files

When capa cannot detect the target OS of an ELF file, it exits with an
error. Some ELF files lack the standard metadata capa uses for OS
detection (GNU ABI tag, OSABI field, library dependencies, etc.) even
though they do target a valid OS (e.g. a stripped Linux binary using
only raw syscalls).

Add a hint to the unsupported-OS error message telling users they can
specify the OS explicitly with the --os flag, matching the workaround
recommended in the issue.

Fixes #2577
2026-02-20 14:28:43 +01:00
Moritz
c1ec826a9f Merge pull request #2866 from devs6186/fix/2699-rich-markup-escape
render: escape sample-controlled strings to prevent rich MarkupError
2026-02-20 14:06:45 +01:00
devs6186
5ef4ad96ee doc: fix typo and add documentation links in README
- usage.md: fix 'occurance' -> 'occurrence'
- README: add short doc links (usage, installation, limitations, FAQ)

Fixes #2274
2026-02-20 11:15:01 +01:00
Capa Bot
8aef630a7f Sync capa rules submodule 2026-02-19 20:33:40 +00:00
Moritz
d1c9d20668 Merge pull request #2865 from mandiant/lsc-1771433500.551532
Refactor Github Action per b/485167538
2026-02-19 21:32:29 +01:00
devs6186
8ccd35d0cf render: use default styling for dynamic -vv API/call details
Stop wrapping call name and arguments in mute (dim) so API details
remain readable in -vv output.

Fixes #1865
2026-02-20 00:52:14 +05:30
devs6186
3f72b43f48 render: escape sample-controlled strings to prevent rich MarkupError
Strings extracted from analyzed samples may contain bracket characters
that Rich interprets as markup (e.g. [/tag]). When these are embedded
directly in markup templates like f"[dim]{s}", Rich raises a
MarkupError if the brackets form an invalid tag.

Use rich.markup.escape() to sanitize all user-controlled strings before
embedding them in Rich markup templates in bold(), bold2(), mute(), and
warn().

Fixes #2699
2026-02-19 03:42:05 +05:30
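The escape-before-embed pattern above looks roughly like this (a minimal stand-in for rich.markup.escape, which additionally avoids double-escaping; `mute` mirrors the template shape from the commit):

```python
def escape_markup(text: str) -> str:
    # prefix "[" with a backslash so Rich renders a literal bracket
    # instead of trying to parse a markup tag like [/tag]
    return text.replace("[", "\\[")

def mute(s: str) -> str:
    # escape sample-controlled input before embedding it in the template
    return f"[dim]{escape_markup(s)}[/dim]"
```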
Ben Knutson
f7bb889f30 Refactor Github Action per b/485167538 2026-02-18 16:51:42 +00:00
Capa Bot
e0bd6d5ea6 Sync capa rules submodule 2026-02-17 21:19:08 +00:00
Capa Bot
239bafd285 Sync capa-testfiles submodule 2026-02-17 21:10:09 +00:00
dependabot[bot]
2033c4ab83 build(deps-dev): bump pyinstaller from 6.18.0 to 6.19.0 (#2856)
Bumps [pyinstaller](https://github.com/pyinstaller/pyinstaller) from 6.18.0 to 6.19.0.
- [Release notes](https://github.com/pyinstaller/pyinstaller/releases)
- [Changelog](https://github.com/pyinstaller/pyinstaller/blob/develop/doc/CHANGES.rst)
- [Commits](https://github.com/pyinstaller/pyinstaller/compare/v6.18.0...v6.19.0)

---
updated-dependencies:
- dependency-name: pyinstaller
  dependency-version: 6.19.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-02-17 13:40:23 -07:00
dependabot[bot]
cbe005ae0f bump ruff from 0.14.7 to 0.15.0 (#2853)
---
updated-dependencies:
- dependency-name: ruff
  dependency-version: 0.15.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-02-09 13:55:24 -07:00
kamran ul haq
26aba8067f loader: handle SegmentationViolation for malformed ELF files (#2799)
Catch envi.exc.SegmentationViolation raised by vivisect when processing
malformed ELF files with invalid relocations and convert it to a
CorruptFile exception with a descriptive message.

Closes #2794

Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2026-02-05 12:24:48 -07:00
Aditya Pandey
3582bce6fd vmray: skip processes with invalid PID or missing filename (#2807) (#2845)
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2026-02-05 12:11:26 -07:00
dependabot[bot]
535faf281d build(deps): bump protobuf from 6.33.1 to 6.33.5 (#2851)
Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 6.33.1 to 6.33.5.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Commits](https://github.com/protocolbuffers/protobuf/commits)

---
updated-dependencies:
- dependency-name: protobuf
  dependency-version: 6.33.5
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2026-02-05 10:55:26 -07:00
dependabot[bot]
fe27335136 build(deps): bump pip from 25.3 to 26.0 (#2847)
Bumps [pip](https://github.com/pypa/pip) from 25.3 to 26.0.
- [Changelog](https://github.com/pypa/pip/blob/main/NEWS.rst)
- [Commits](https://github.com/pypa/pip/compare/25.3...26.0)

---
updated-dependencies:
- dependency-name: pip
  dependency-version: '26.0'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2026-02-05 10:53:55 -07:00
dependabot[bot]
a40ae162ef build(deps): bump dnfile from 0.17.0 to 0.18.0 (#2848)
Bumps [dnfile](https://github.com/malwarefrank/dnfile) from 0.17.0 to 0.18.0.
- [Changelog](https://github.com/malwarefrank/dnfile/blob/master/HISTORY.rst)
- [Commits](https://github.com/malwarefrank/dnfile/compare/v0.17.0...v0.18.0)

---
updated-dependencies:
- dependency-name: dnfile
  dependency-version: 0.18.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2026-02-05 10:50:00 -07:00
dependabot[bot]
1500a34984 build(deps): bump rich from 14.2.0 to 14.3.2 (#2849)
* build(deps): bump rich from 14.2.0 to 14.3.2

Bumps [rich](https://github.com/Textualize/rich) from 14.2.0 to 14.3.2.
- [Release notes](https://github.com/Textualize/rich/releases)
- [Changelog](https://github.com/Textualize/rich/blob/master/CHANGELOG.md)
- [Commits](https://github.com/Textualize/rich/compare/v14.2.0...v14.3.2)

---
updated-dependencies:
- dependency-name: rich
  dependency-version: 14.3.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* add hiddenimports for rich module

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2026-02-05 09:31:15 -07:00
Daniel Adeboye
77440c03f5 vmray: extract number features for registry key handles (#2835)
* vmray: extract number features for whitelisted void_ptr parameters

* added changelog

* Apply suggestions from code review

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* fix lint

* fix lint

* fix test

* remove unused import

* Add hKey parameter extraction and tests

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2026-01-30 15:10:57 -07:00
Capa Bot
26fd6b8569 Sync capa rules submodule 2026-01-30 17:41:05 +00:00
Capa Bot
2540dd688b Sync capa rules submodule 2026-01-30 17:04:59 +00:00
Moritz
ff8e7ef52f Add AI usage checkbox (#2844)
* Add AI usage checkbox

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-29 09:12:54 -07:00
Capa Bot
6f078734c3 Sync capa rules submodule 2026-01-28 17:43:11 +00:00
Capa Bot
93c11d2d4e Sync capa-testfiles submodule 2026-01-28 16:22:42 +00:00
Capa Bot
89c71f4d81 Sync capa rules submodule 2026-01-26 16:41:20 +00:00
dependabot[bot]
9599fbac02 build(deps): bump setuptools from 80.9.0 to 80.10.1 (#2837)
Bumps [setuptools](https://github.com/pypa/setuptools) from 80.9.0 to 80.10.1.
- [Release notes](https://github.com/pypa/setuptools/releases)
- [Changelog](https://github.com/pypa/setuptools/blob/main/NEWS.rst)
- [Commits](https://github.com/pypa/setuptools/compare/v80.9.0...v80.10.1)

---
updated-dependencies:
- dependency-name: setuptools
  dependency-version: 80.10.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-23 12:41:04 -07:00
dependabot[bot]
b4c0f1369e build(deps): bump pycparser from 2.23 to 3.0 (#2838)
Bumps [pycparser](https://github.com/eliben/pycparser) from 2.23 to 3.0.
- [Release notes](https://github.com/eliben/pycparser/releases)
- [Commits](https://github.com/eliben/pycparser/compare/release_v2.23...release_v3.00)

---
updated-dependencies:
- dependency-name: pycparser
  dependency-version: '3.0'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-23 12:37:46 -07:00
Daniel Adeboye
37f2a897ff tests: remove redundant test_ida_features.py (#2834) 2026-01-23 09:46:58 -07:00
Maijin
e39e610f66 Create a vivisect group in dependabot.yml (#2830)
* Add msgpack group in dependabot.yml

Add msgpack group in dependabot.yml

* Change to make a vivisect group

Change to make a vivisect group

* Update dependabot.yml
2026-01-23 09:37:04 -07:00
Maijin
073760f279 fix(lint): disable rule caching during linting (#2817) 2026-01-22 09:27:02 -07:00
dependabot[bot]
52a761ebb3 build(deps-dev): bump lodash from 4.17.21 to 4.17.23 in /web/explorer (#2833)
Bumps [lodash](https://github.com/lodash/lodash) from 4.17.21 to 4.17.23.
- [Release notes](https://github.com/lodash/lodash/releases)
- [Commits](https://github.com/lodash/lodash/compare/4.17.21...4.17.23)

---
updated-dependencies:
- dependency-name: lodash
  dependency-version: 4.17.23
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-22 08:56:03 -07:00
Moritz
2a44482076 Merge pull request #2821 from mandiant/dependabot/pip/mypy-protobuf-5.0.0
build(deps-dev): bump mypy-protobuf from 4.0.0 to 5.0.0
2026-01-20 10:31:57 +01:00
Moritz
a359745765 build(deps-dev): bump pyinstaller from 6.17.0 to 6.18.0 (#2822)
Bumps [pyinstaller](https://github.com/pyinstaller/pyinstaller) from 6.17.0 to 6.18.0.
- [Release notes](https://github.com/pyinstaller/pyinstaller/releases)
- [Changelog](https://github.com/pyinstaller/pyinstaller/blob/develop/doc/CHANGES.rst)
- [Commits](https://github.com/pyinstaller/pyinstaller/compare/v6.17.0...v6.18.0)

---
updated-dependencies:
- dependency-name: pyinstaller
  dependency-version: 6.18.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-20 10:31:35 +01:00
Maijin
203cc0aa0c Merge pull request #2824 from Maijin/patch-1
Group pyasn modules and vivisect in dependabot.yml
2026-01-20 10:18:35 +01:00
Moritz
3642ca94a6 Merge pull request #2820 from mandiant/dependabot/pip/vivisect-1.3.0
build(deps): bump vivisect from 1.2.1 to 1.3.0
2026-01-19 20:57:00 +01:00
dependabot[bot]
8e233ca69d build(deps-dev): bump pyinstaller from 6.17.0 to 6.18.0
Bumps [pyinstaller](https://github.com/pyinstaller/pyinstaller) from 6.17.0 to 6.18.0.
- [Release notes](https://github.com/pyinstaller/pyinstaller/releases)
- [Changelog](https://github.com/pyinstaller/pyinstaller/blob/develop/doc/CHANGES.rst)
- [Commits](https://github.com/pyinstaller/pyinstaller/compare/v6.17.0...v6.18.0)

---
updated-dependencies:
- dependency-name: pyinstaller
  dependency-version: 6.18.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-01-19 16:45:40 +00:00
dependabot[bot]
d5c23486e3 build(deps-dev): bump mypy-protobuf from 4.0.0 to 5.0.0
Bumps [mypy-protobuf](https://github.com/nipunn1313/mypy-protobuf) from 4.0.0 to 5.0.0.
- [Changelog](https://github.com/nipunn1313/mypy-protobuf/blob/main/CHANGELOG.md)
- [Commits](https://github.com/nipunn1313/mypy-protobuf/commits)

---
updated-dependencies:
- dependency-name: mypy-protobuf
  dependency-version: 5.0.0
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-01-19 16:45:32 +00:00
dependabot[bot]
7600dd077b build(deps): bump vivisect from 1.2.1 to 1.3.0
Bumps [vivisect](https://github.com/vivisect/vivisect) from 1.2.1 to 1.3.0.
- [Changelog](https://github.com/vivisect/vivisect/blob/master/CHANGELOG.rst)
- [Commits](https://github.com/vivisect/vivisect/compare/v1.2.1...v1.3.0)

---
updated-dependencies:
- dependency-name: vivisect
  dependency-version: 1.3.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-01-19 16:45:26 +00:00
Moritz
3de84eff1b Merge pull request #2813 from doomedraven/patch-1
Add '2.5-CAPE' to tested versions
2026-01-16 20:28:39 +01:00
doomedraven
7e16ed741c Add '2.5-CAPE' to tested versions
Hello, we just released CAPE v2.5; there are no behavior/structural changes. It is focused on webgui improvements and some other improvements that don't impact CAPA.
2026-01-16 14:58:48 +00:00
Mike Hunhoff
5a5545aa14 ghidra: fix unit tests (#2812)
* ghidra: fix unit tests

* fix formatting
2026-01-15 12:34:43 -07:00
Moritz
6ad4fbbb9b Merge pull request #2742 from mandiant/idalib-tests 2026-01-13 21:48:30 +01:00
dependabot[bot]
8105214dc6 build(deps-dev): bump build from 1.3.0 to 1.4.0 (#2809)
Bumps [build](https://github.com/pypa/build) from 1.3.0 to 1.4.0.
- [Release notes](https://github.com/pypa/build/releases)
- [Changelog](https://github.com/pypa/build/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pypa/build/compare/1.3.0...1.4.0)

---
updated-dependencies:
- dependency-name: build
  dependency-version: 1.4.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-13 09:08:58 -07:00
Willi Ballenthin
d1fc8446f6 pyproject: ida: silence SWIG related warnings from IDA bindings 2026-01-13 16:15:31 +01:00
Willi Ballenthin
0686305f43 ida: loader: load resource sections to help discovery of embedded files 2026-01-13 16:15:31 +01:00
Willi Ballenthin
8d6b878e79 ida: fix return value from open_database 2026-01-13 16:15:31 +01:00
Willi Ballenthin
3646fcefa2 ida: helpers: refactor discovery of alternative names 2026-01-13 16:15:31 +01:00
Willi Ballenthin
ce67d99e49 ida: skip function-name features for default names (sub_*) 2026-01-13 16:15:31 +01:00
Willi Ballenthin
c89871f257 ci: pin setup-uv 2026-01-13 16:15:31 +01:00
Willi Ballenthin
03cc901f7b tests: idalib: xfail resource test on 9.0 2026-01-13 16:15:31 +01:00
Willi Ballenthin
412ab62c42 ida: pep8 2026-01-13 16:15:31 +01:00
Willi Ballenthin
f72bd49a5f ci: enable testing of IDA 9.0 2026-01-13 16:15:31 +01:00
Willi Ballenthin
1d561bd038 tests: idalib: xfail two tests on 9.0 and 9.1 2026-01-13 16:15:31 +01:00
Willi Ballenthin
c5808c4c41 tests: idalib: use 9.1 instead of 9.0 as min ver
9.0 doesn't support disabling lumina (or loading resources, for that
matter)
2026-01-13 16:15:31 +01:00
Willi Ballenthin
200c8037dd tests: fix logging message 2026-01-13 16:15:31 +01:00
mr-tz
4fb6ac0d1b add ida version to test matrix name 2026-01-13 16:15:31 +01:00
mr-tz
87fb96d08b load resource for test sample 2026-01-13 16:15:31 +01:00
Willi Ballenthin
e1fd184805 ida: function: extract function name
somehow we were extracting alternate names but not function names
2026-01-13 16:15:31 +01:00
Willi Ballenthin
82be20be64 loader: idalib: disable lumina
see #2742 in which Lumina names overwrote names provided by debug info
2026-01-13 16:15:31 +01:00
Willi Ballenthin
132e64a991 tests: idalib: better detect missing idapro package 2026-01-13 16:15:31 +01:00
Willi Ballenthin
9c6db00775 ci: add configuration for idalib tests 2026-01-13 16:15:31 +01:00
Moritz
7bdd1f11bb Merge branch 'master' into idalib-tests 2026-01-13 16:15:31 +01:00
kamran ul haq
7f3e35ee62 loader: gracefully handle ELF files with unsupported architectures (#2800)
* loader: gracefully handle ELF files with unsupported architectures

When analyzing ELF files with unsupported architectures (e.g., ARM64 variant),
vivisect raises a generic Exception with message 'Unsupported Architecture: %d'.
This was not caught by existing error handlers, causing capa to crash with an
unfriendly error message.

This change adds exception handling to detect the 'Unsupported Architecture'
error message and convert it to a user-friendly CorruptFile exception,
following the same pattern as the existing 'Couldn't convert rva' handler.

The architecture number is extracted from the exception args and included
in the error message to help users understand what went wrong.

closes #2793

* loader: address review feedback for PR #2800

- Add e.args check to prevent IndexError when accessing exception arguments
- Use error_msg variable instead of directly accessing e.args[0]
- Update CHANGELOG to reference PR #2800 instead of issue #2793

Addresses feedback from @mike-hunhoff and gemini-code-assist bot

* chore: move unsupported architecture bug fix to master (unreleased) section
2026-01-09 16:20:43 -07:00
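The pattern described in the commit above (detecting vivisect's generic "Unsupported Architecture" Exception and converting it into a user-friendly error) can be sketched roughly as follows. This is a minimal illustration, not capa's actual loader code: the `CorruptFile` class name comes from the commit message, but `load_workspace` and the exact message format are assumptions.

```python
class CorruptFile(Exception):
    """Raised when an input file cannot be analyzed (name taken from the commit above)."""


def load_workspace(load):
    # hypothetical wrapper around a vivisect workspace-loading callable
    try:
        return load()
    except Exception as e:
        # check e.args before indexing to avoid an IndexError,
        # per the review feedback noted in the commit message
        error_msg = str(e.args[0]) if e.args else str(e)
        if "Unsupported Architecture" in error_msg:
            # surface the architecture number embedded in the original message
            raise CorruptFile(f"cannot analyze file: {error_msg}") from e
        raise
```

The key detail from the review feedback is guarding the `e.args[0]` access, since a generic `Exception` may be raised with no arguments at all.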
Capa Bot
80c085b08b Sync capa rules submodule 2026-01-06 17:02:03 +00:00
Capa Bot
bfd1b09176 Sync capa-testfiles submodule 2026-01-06 16:50:00 +00:00
dependabot[bot]
dc47de1439 build(deps): bump ruamel-yaml from 0.18.6 to 0.19.1 (#2803)
Bumps ruamel-yaml from 0.18.6 to 0.19.1.

---
updated-dependencies:
- dependency-name: ruamel-yaml
  dependency-version: 0.19.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-05 10:08:43 -07:00
dependabot[bot]
2f7db1f446 build(deps-dev): bump flake8-simplify from 0.22.0 to 0.30.0 (#2804)
Bumps [flake8-simplify](https://github.com/MartinThoma/flake8-simplify) from 0.22.0 to 0.30.0.
- [Release notes](https://github.com/MartinThoma/flake8-simplify/releases)
- [Changelog](https://github.com/MartinThoma/flake8-simplify/blob/main/CHANGELOG.md)
- [Commits](https://github.com/MartinThoma/flake8-simplify/commits)

---
updated-dependencies:
- dependency-name: flake8-simplify
  dependency-version: 0.30.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-05 10:08:35 -07:00
dependabot[bot]
0908343ca1 build(deps): bump intervaltree from 3.1.0 to 3.2.1 (#2805)
Bumps [intervaltree](https://github.com/chaimleib/intervaltree) from 3.1.0 to 3.2.1.
- [Release notes](https://github.com/chaimleib/intervaltree/releases)
- [Changelog](https://github.com/chaimleib/intervaltree/blob/master/CHANGELOG.md)
- [Commits](https://github.com/chaimleib/intervaltree/compare/3.1.0...3.2.1)

---
updated-dependencies:
- dependency-name: intervaltree
  dependency-version: 3.2.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-05 10:07:36 -07:00
dependabot[bot]
342cb9d15a build(deps-dev): bump psutil from 7.1.2 to 7.2.1 (#2806)
Bumps [psutil](https://github.com/giampaolo/psutil) from 7.1.2 to 7.2.1.
- [Changelog](https://github.com/giampaolo/psutil/blob/master/HISTORY.rst)
- [Commits](https://github.com/giampaolo/psutil/compare/release-7.1.2...release-7.2.1)

---
updated-dependencies:
- dependency-name: psutil
  dependency-version: 7.2.1
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-05 10:07:17 -07:00
Capa Bot
9aad2591c4 Sync capa rules submodule 2025-12-29 17:21:22 +00:00
dependabot[bot]
1153ca4cf7 build(deps-dev): bump types-psutil from 7.1.3.20251202 to 7.2.0.20251228 (#2801)
Bumps [types-psutil](https://github.com/typeshed-internal/stub_uploader) from 7.1.3.20251202 to 7.2.0.20251228.
- [Commits](https://github.com/typeshed-internal/stub_uploader/commits)

---
updated-dependencies:
- dependency-name: types-psutil
  dependency-version: 7.2.0.20251228
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-29 09:48:14 -07:00
dependabot[bot]
4500dd80b3 build(deps-dev): bump pygithub from 2.6.0 to 2.8.1 (#2798)
Bumps [pygithub](https://github.com/pygithub/pygithub) from 2.6.0 to 2.8.1.
- [Release notes](https://github.com/pygithub/pygithub/releases)
- [Changelog](https://github.com/PyGithub/PyGithub/blob/main/doc/changes.rst)
- [Commits](https://github.com/pygithub/pygithub/compare/v2.6.0...v2.8.1)

---
updated-dependencies:
- dependency-name: pygithub
  dependency-version: 2.8.1
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-29 09:46:44 -07:00
dependabot[bot]
a35379d32b build(deps): bump humanize from 4.14.0 to 4.15.0 (#2797)
Bumps [humanize](https://github.com/python-humanize/humanize) from 4.14.0 to 4.15.0.
- [Release notes](https://github.com/python-humanize/humanize/releases)
- [Commits](https://github.com/python-humanize/humanize/compare/4.14.0...4.15.0)

---
updated-dependencies:
- dependency-name: humanize
  dependency-version: 4.15.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-29 09:44:53 -07:00
dependabot[bot]
29a8fa263e build(deps-dev): bump mypy-protobuf from 3.6.0 to 4.0.0 (#2796)
Bumps [mypy-protobuf](https://github.com/nipunn1313/mypy-protobuf) from 3.6.0 to 4.0.0.
- [Changelog](https://github.com/nipunn1313/mypy-protobuf/blob/main/CHANGELOG.md)
- [Commits](https://github.com/nipunn1313/mypy-protobuf/compare/v3.6.0...v4.0.0)

---
updated-dependencies:
- dependency-name: mypy-protobuf
  dependency-version: 4.0.0
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-29 09:44:12 -07:00
dependabot[bot]
5dcf98b1af build(deps-dev): bump pytest from 8.0.0 to 9.0.2 (#2795)
Bumps [pytest](https://github.com/pytest-dev/pytest) from 8.0.0 to 9.0.2.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pytest-dev/pytest/compare/8.0.0...9.0.2)

---
updated-dependencies:
- dependency-name: pytest
  dependency-version: 9.0.2
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-29 09:43:11 -07:00
dependabot[bot]
0ad45bfdcc build(deps-dev): bump mypy from 1.17.1 to 1.19.1 (#2789)
Bumps [mypy](https://github.com/python/mypy) from 1.17.1 to 1.19.1.
- [Changelog](https://github.com/python/mypy/blob/master/CHANGELOG.md)
- [Commits](https://github.com/python/mypy/compare/v1.17.1...v1.19.1)

---
updated-dependencies:
- dependency-name: mypy
  dependency-version: 1.19.1
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-12-19 10:29:59 -07:00
dependabot[bot]
acad501b07 build(deps-dev): bump pyinstaller from 6.16.0 to 6.17.0 (#2790)
Bumps [pyinstaller](https://github.com/pyinstaller/pyinstaller) from 6.16.0 to 6.17.0.
- [Release notes](https://github.com/pyinstaller/pyinstaller/releases)
- [Changelog](https://github.com/pyinstaller/pyinstaller/blob/develop/doc/CHANGES.rst)
- [Commits](https://github.com/pyinstaller/pyinstaller/compare/v6.16.0...v6.17.0)

---
updated-dependencies:
- dependency-name: pyinstaller
  dependency-version: 6.17.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-12-19 09:37:59 -07:00
dependabot[bot]
6da6035e7e build(deps-dev): bump isort from 6.0.0 to 7.0.0 (#2791)
Bumps [isort](https://github.com/PyCQA/isort) from 6.0.0 to 7.0.0.
- [Release notes](https://github.com/PyCQA/isort/releases)
- [Changelog](https://github.com/PyCQA/isort/blob/main/CHANGELOG.md)
- [Commits](https://github.com/PyCQA/isort/compare/6.0.0...7.0.0)

---
updated-dependencies:
- dependency-name: isort
  dependency-version: 7.0.0
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-12-19 09:36:51 -07:00
Mike Hunhoff
66dc70a775 ghidra: support PyGhidra (#2788)
* ghidra: init commit switch to PyGhidra

* update CHANGELOG and PyGhidra version requirements

* Update capa/features/extractors/ghidra/helpers.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* fix black errors

* support Ghidra v12

* remove deprecated APIs

* refactor outdated code

* fix pyinstaller, code refactoring

* address PR feedback

* add back capa_explorer.py

* beef up capa_explorer.py script

* refactor README

* refactor README

* fix #2747

* add sha256 check for workflows

* add sha256 check for workflows

* add sha256 check for workflows

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-18 17:55:49 -07:00
Moritz
50300f1c8e Merge pull request #2792 from mandiant/dependabot/pip/deptry-0.24.0 2025-12-16 21:08:55 +01:00
dependabot[bot]
03f94536ca build(deps-dev): bump deptry from 0.23.0 to 0.24.0
Bumps [deptry](https://github.com/fpgmaas/deptry) from 0.23.0 to 0.24.0.
- [Release notes](https://github.com/fpgmaas/deptry/releases)
- [Changelog](https://github.com/fpgmaas/deptry/blob/main/CHANGELOG.md)
- [Commits](https://github.com/fpgmaas/deptry/compare/0.23.0...0.24.0)

---
updated-dependencies:
- dependency-name: deptry
  dependency-version: 0.24.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-12-15 14:02:35 +00:00
mr-tz
dc08843e2d address idalib-based test fails 2025-12-11 14:18:13 +00:00
Moritz
40b01f0998 Merge pull request #2787 from mandiant/dependabot/pip/msgspec-0.20.0
build(deps): bump msgspec from 0.19.0 to 0.20.0
2025-12-11 11:17:57 +01:00
Moritz
b96a3b6b23 Merge pull request #2786 from mandiant/dependabot/pip/black-25.12.0
build(deps-dev): bump black from 25.11.0 to 25.12.0
2025-12-11 11:17:33 +01:00
Moritz
43e5e60901 Merge pull request #2785 from mandiant/dependabot/pip/types-psutil-7.1.3.20251202
build(deps-dev): bump types-psutil from 7.0.0.20250218 to 7.1.3.20251202
2025-12-11 11:17:14 +01:00
Moritz
0f9f72dbd5 build(deps-dev): bump flake8-bugbear from 25.10.21 to 25.11.29 (#2784)
Bumps [flake8-bugbear](https://github.com/PyCQA/flake8-bugbear) from 25.10.21 to 25.11.29.
- [Release notes](https://github.com/PyCQA/flake8-bugbear/releases)
- [Commits](https://github.com/PyCQA/flake8-bugbear/compare/25.10.21...25.11.29)

---
updated-dependencies:
- dependency-name: flake8-bugbear
  dependency-version: 25.11.29
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-11 11:16:51 +01:00
dependabot[bot]
fd9f584cc4 build(deps): bump msgspec from 0.19.0 to 0.20.0
Bumps [msgspec](https://github.com/jcrist/msgspec) from 0.19.0 to 0.20.0.
- [Release notes](https://github.com/jcrist/msgspec/releases)
- [Changelog](https://github.com/jcrist/msgspec/blob/main/docs/changelog.md)
- [Commits](https://github.com/jcrist/msgspec/compare/0.19.0...0.20.0)

---
updated-dependencies:
- dependency-name: msgspec
  dependency-version: 0.20.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-12-08 14:02:34 +00:00
dependabot[bot]
c3b785e217 build(deps-dev): bump black from 25.11.0 to 25.12.0
Bumps [black](https://github.com/psf/black) from 25.11.0 to 25.12.0.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](https://github.com/psf/black/compare/25.11.0...25.12.0)

---
updated-dependencies:
- dependency-name: black
  dependency-version: 25.12.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-12-08 14:02:27 +00:00
dependabot[bot]
6ae17f7ef4 build(deps-dev): bump types-psutil from 7.0.0.20250218 to 7.1.3.20251202
Bumps [types-psutil](https://github.com/typeshed-internal/stub_uploader) from 7.0.0.20250218 to 7.1.3.20251202.
- [Commits](https://github.com/typeshed-internal/stub_uploader/commits)

---
updated-dependencies:
- dependency-name: types-psutil
  dependency-version: 7.1.3.20251202
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-12-08 14:02:22 +00:00
dependabot[bot]
13297ad324 build(deps-dev): bump flake8-bugbear from 25.10.21 to 25.11.29
Bumps [flake8-bugbear](https://github.com/PyCQA/flake8-bugbear) from 25.10.21 to 25.11.29.
- [Release notes](https://github.com/PyCQA/flake8-bugbear/releases)
- [Commits](https://github.com/PyCQA/flake8-bugbear/compare/25.10.21...25.11.29)

---
updated-dependencies:
- dependency-name: flake8-bugbear
  dependency-version: 25.11.29
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-12-08 14:02:19 +00:00
dependabot[bot]
9b42b45d21 build(deps-dev): bump flake8-bugbear from 24.12.12 to 25.10.21 (#2773)
* build(deps-dev): bump flake8-bugbear from 24.12.12 to 25.10.21

Bumps [flake8-bugbear](https://github.com/PyCQA/flake8-bugbear) from 24.12.12 to 25.10.21.
- [Release notes](https://github.com/PyCQA/flake8-bugbear/releases)
- [Commits](https://github.com/PyCQA/flake8-bugbear/compare/24.12.12...25.10.21)

---
updated-dependencies:
- dependency-name: flake8-bugbear
  dependency-version: 25.10.21
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix flake8 raised bugs

* use super

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: mr-tz <moritz.raabe@mandiant.com>
Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
2025-12-04 10:19:16 -07:00
Capa Bot
d17264c928 Sync capa rules submodule 2025-12-04 17:17:51 +00:00
Capa Bot
f313852e70 Sync capa rules submodule 2025-12-04 12:11:09 +00:00
Capa Bot
c0ae1352c6 Sync capa-testfiles submodule 2025-12-03 21:00:48 +00:00
dependabot[bot]
ccb3e6de74 build(deps-dev): bump flake8-comprehensions from 3.16.0 to 3.17.0 (#2782)
Bumps [flake8-comprehensions](https://github.com/adamchainz/flake8-comprehensions) from 3.16.0 to 3.17.0.
- [Changelog](https://github.com/adamchainz/flake8-comprehensions/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/adamchainz/flake8-comprehensions/compare/3.16.0...3.17.0)

---
updated-dependencies:
- dependency-name: flake8-comprehensions
  dependency-version: 3.17.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-01 11:27:08 -07:00
dependabot[bot]
26c6ffd62d build(deps-dev): bump ruff from 0.12.0 to 0.14.7 (#2781)
Bumps [ruff](https://github.com/astral-sh/ruff) from 0.12.0 to 0.14.7.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](https://github.com/astral-sh/ruff/compare/0.12.0...0.14.7)

---
updated-dependencies:
- dependency-name: ruff
  dependency-version: 0.14.7
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-01 11:26:34 -07:00
Capa Bot
18923601c7 Sync capa rules submodule 2025-11-25 20:39:18 +00:00
0x1622
1568ce4832 Use SafeLoader for YAML (#2776) 2025-11-25 07:01:23 -07:00
Mike Hunhoff
ffce77b13d ci: deprecate macos-13 runner and use Python v3.13 for testing (#2777) 2025-11-24 19:53:39 -07:00
Moritz
074f7c742c Merge branch 'master' into idalib-tests 2025-11-24 19:52:40 +01:00
dependabot[bot]
895b2440c0 build(deps-dev): bump pre-commit from 4.2.0 to 4.5.0 (#2772)
Bumps [pre-commit](https://github.com/pre-commit/pre-commit) from 4.2.0 to 4.5.0.
- [Release notes](https://github.com/pre-commit/pre-commit/releases)
- [Changelog](https://github.com/pre-commit/pre-commit/blob/main/CHANGELOG.md)
- [Commits](https://github.com/pre-commit/pre-commit/compare/v4.2.0...v4.5.0)

---
updated-dependencies:
- dependency-name: pre-commit
  dependency-version: 4.5.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-24 08:40:13 -07:00
dependabot[bot]
c901f809a2 build(deps-dev): bump black from 25.1.0 to 25.11.0 (#2771)
Bumps [black](https://github.com/psf/black) from 25.1.0 to 25.11.0.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](https://github.com/psf/black/compare/25.1.0...25.11.0)

---
updated-dependencies:
- dependency-name: black
  dependency-version: 25.11.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-24 08:39:17 -07:00
dependabot[bot]
308b3e5c1c build(deps): bump xmltodict from 0.14.2 to 1.0.2 (#2774)
Bumps [xmltodict](https://github.com/martinblech/xmltodict) from 0.14.2 to 1.0.2.
- [Release notes](https://github.com/martinblech/xmltodict/releases)
- [Changelog](https://github.com/martinblech/xmltodict/blob/master/CHANGELOG.md)
- [Commits](https://github.com/martinblech/xmltodict/compare/v0.14.2...v1.0.2)

---
updated-dependencies:
- dependency-name: xmltodict
  dependency-version: 1.0.2
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-24 08:30:16 -07:00
Mike Hunhoff
7844ebb144 v9.3.1 (#2769) 2025-11-20 08:37:49 -07:00
dependabot[bot]
e393cff0e1 build(deps): bump glob from 10.4.2 to 10.5.0 in /web/explorer (#2766)
Bumps [glob](https://github.com/isaacs/node-glob) from 10.4.2 to 10.5.0.
- [Changelog](https://github.com/isaacs/node-glob/blob/main/changelog.md)
- [Commits](https://github.com/isaacs/node-glob/compare/v10.4.2...v10.5.0)

---
updated-dependencies:
- dependency-name: glob
  dependency-version: 10.5.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-19 08:34:25 -07:00
Mike Hunhoff
7780b9e8a8 explorer: add missing ida-netnode dependency to project.toml (#2765) 2025-11-18 08:55:57 -07:00
Mike Hunhoff
8d39765e7b ci: bump binja minor version (#2763) 2025-11-17 11:10:46 -07:00
dependabot[bot]
dec0bcfe79 build(deps-dev): bump js-yaml from 4.1.0 to 4.1.1 in /web/explorer (#2758)
Bumps [js-yaml](https://github.com/nodeca/js-yaml) from 4.1.0 to 4.1.1.
- [Changelog](https://github.com/nodeca/js-yaml/blob/master/CHANGELOG.md)
- [Commits](https://github.com/nodeca/js-yaml/compare/4.1.0...4.1.1)

---
updated-dependencies:
- dependency-name: js-yaml
  dependency-version: 4.1.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-17 10:02:54 -07:00
dependabot[bot]
99ccecba4e build(deps): bump humanize from 4.13.0 to 4.14.0 (#2762)
Bumps [humanize](https://github.com/python-humanize/humanize) from 4.13.0 to 4.14.0.
- [Release notes](https://github.com/python-humanize/humanize/releases)
- [Commits](https://github.com/python-humanize/humanize/compare/4.13.0...4.14.0)

---
updated-dependencies:
- dependency-name: humanize
  dependency-version: 4.14.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-17 09:10:40 -07:00
dependabot[bot]
af27463c37 build(deps-dev): bump pyinstaller from 6.14.1 to 6.16.0 (#2761)
Bumps [pyinstaller](https://github.com/pyinstaller/pyinstaller) from 6.14.1 to 6.16.0.
- [Release notes](https://github.com/pyinstaller/pyinstaller/releases)
- [Changelog](https://github.com/pyinstaller/pyinstaller/blob/develop/doc/CHANGES.rst)
- [Commits](https://github.com/pyinstaller/pyinstaller/compare/v6.14.1...v6.16.0)

---
updated-dependencies:
- dependency-name: pyinstaller
  dependency-version: 6.16.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-17 09:10:30 -07:00
dependabot[bot]
f4f47b4d55 build(deps): bump protobuf from 6.31.1 to 6.33.1 (#2760)
Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 6.31.1 to 6.33.1.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Changelog](https://github.com/protocolbuffers/protobuf/blob/main/protobuf_release.bzl)
- [Commits](https://github.com/protocolbuffers/protobuf/commits)

---
updated-dependencies:
- dependency-name: protobuf
  dependency-version: 6.33.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-17 09:10:19 -07:00
dependabot[bot]
adc2401136 build(deps): bump pycparser from 2.22 to 2.23 (#2759)
Bumps [pycparser](https://github.com/eliben/pycparser) from 2.22 to 2.23.
- [Release notes](https://github.com/eliben/pycparser/releases)
- [Changelog](https://github.com/eliben/pycparser/blob/main/CHANGES)
- [Commits](https://github.com/eliben/pycparser/compare/release_v2.22...release_v2.23)

---
updated-dependencies:
- dependency-name: pycparser
  dependency-version: '2.23'
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-17 09:10:09 -07:00
Moritz
0ff7855467 Release v9.3.0 2025-11-12 17:30:07 +01:00
Capa Bot
d5411cadad Sync capa rules submodule 2025-11-12 10:26:46 +00:00
dependabot[bot]
cbd6d2a189 build(deps): bump pydantic from 2.11.4 to 2.12.4 (#2755)
Bumps [pydantic](https://github.com/pydantic/pydantic) from 2.11.4 to 2.12.4.
- [Release notes](https://github.com/pydantic/pydantic/releases)
- [Changelog](https://github.com/pydantic/pydantic/blob/v2.12.4/HISTORY.md)
- [Commits](https://github.com/pydantic/pydantic/compare/v2.11.4...v2.12.4)

---
updated-dependencies:
- dependency-name: pydantic
  dependency-version: 2.12.4
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-10 09:24:02 -07:00
dependabot[bot]
29af6dc875 build(deps-dev): bump build from 1.2.2 to 1.3.0 (#2754)
Bumps [build](https://github.com/pypa/build) from 1.2.2 to 1.3.0.
- [Release notes](https://github.com/pypa/build/releases)
- [Changelog](https://github.com/pypa/build/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pypa/build/compare/1.2.2...1.3.0)

---
updated-dependencies:
- dependency-name: build
  dependency-version: 1.3.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-10 09:23:14 -07:00
Moritz
66a3aac815 Merge pull request #2753 from mandiant/fix/capa-explorer-qt-load
delay import to not load Qt* when running under idalib
2025-11-10 16:51:14 +01:00
mr-tz
7525de7bbd delay import to not load Qt* when running under idalib
closes #2752
2025-11-10 12:13:35 +00:00
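The delayed-import fix referenced above can be illustrated with a small sketch (hypothetical function names, not capa's exact code): by moving the Qt import into the body of the GUI-only function, a headless idalib run never loads any Qt module.

```python
import sys

def print_results(results):
    # headless path: touches no Qt modules at all
    return "\n".join(str(r) for r in results)

def show_results_window(results):
    # GUI path: the Qt import is deferred into the function body, so it
    # only executes when a window is actually requested
    from PyQt5 import QtWidgets  # deferred import; unavailable headless
    raise NotImplementedError("GUI rendering elided in this sketch")

# under idalib only the headless path runs, so Qt stays unloaded
output = print_results(["rule-1", "rule-2"])
```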
Capa Bot
cbd1cb2b7d Sync capa rules submodule 2025-11-07 07:40:05 +00:00
Matthew Haigh
503c34b8f9 added mailinglist cta (#2744)
* added mailinglist cta

* Update README.md

Added mailto: link for better user experience

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Matt Haigh <matthaigh@google.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-03 09:11:14 -07:00
dependabot[bot]
888295b37a build(deps): bump types-protobuf from 6.30.2.20250516 to 6.32.1.20250918 (#2733)
Bumps [types-protobuf](https://github.com/typeshed-internal/stub_uploader) from 6.30.2.20250516 to 6.32.1.20250918.
- [Commits](https://github.com/typeshed-internal/stub_uploader/commits)

---
updated-dependencies:
- dependency-name: types-protobuf
  dependency-version: 6.32.1.20250918
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-03 09:07:26 -07:00
dependabot[bot]
5f9c908315 build(deps): bump pip from 25.2 to 25.3 (#2741)
Bumps [pip](https://github.com/pypa/pip) from 25.2 to 25.3.
- [Changelog](https://github.com/pypa/pip/blob/main/NEWS.rst)
- [Commits](https://github.com/pypa/pip/compare/25.2...25.3)

---
updated-dependencies:
- dependency-name: pip
  dependency-version: '25.3'
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
2025-11-03 09:07:02 -07:00
Willi Ballenthin
cb2e2323f9 explorer: add support for IDA 9.2 (#2723)
* ida: add Qt compatibility layer for PyQt5 and PySide6

Introduce a new module `qt_compat.py` providing a unified import
interface and API compatibility for Qt modules. It handles differences between
PyQt5 (used in IDA <9.2) and PySide6 (used in IDA >=9.2). Update all
plugin modules to import Qt components via this compatibility layer
instead of directly importing from PyQt5. This enhances plugin
compatibility across different IDA versions.

thanks @mike-hunhoff!

changelog

* qt_compat: use __all__ rather than noqa

---------

Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
2025-11-03 13:29:06 +01:00
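The compatibility-layer approach described in this commit can be sketched as follows. This is a minimal illustration of the pattern, not capa's actual `qt_compat.py`; the attribute names are assumptions:

```python
# Minimal sketch of a Qt binding compatibility shim: prefer PySide6
# (IDA >= 9.2), fall back to PyQt5 (IDA < 9.2).
try:
    from PySide6 import QtCore, QtGui, QtWidgets
    QT_BINDING = "PySide6"
except ImportError:
    try:
        from PyQt5 import QtCore, QtGui, QtWidgets
        QT_BINDING = "PyQt5"
    except ImportError:
        # neither binding is available (e.g. a headless environment)
        QtCore = QtGui = QtWidgets = None
        QT_BINDING = None

# consumers import Qt names from this module only, never from
# PySide6/PyQt5 directly, so binding differences stay in one place
__all__ = ["QtCore", "QtGui", "QtWidgets", "QT_BINDING"]
```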
Willi Ballenthin
cf463676b2 fixtures: remove dups 2025-11-03 12:47:12 +01:00
Willi Ballenthin
b5e5840a63 lints 2025-10-29 20:29:08 +01:00
Willi Ballenthin
f252b6bbd0 changelog 2025-10-29 20:23:12 +01:00
Willi Ballenthin
eda53ab3c1 tests: add feature tests for idalib 2025-10-29 20:20:57 +01:00
Willi Ballenthin
5ea63770ba Merge pull request #2724 from HexRays-plugin-contributions/ida-plugin-json
add `ida-plugin.json`
2025-10-29 17:55:49 +01:00
Capa Bot
6795813fbe Sync capa rules submodule 2025-10-28 15:21:05 +00:00
Capa Bot
ca708ca52e Sync capa-testfiles submodule 2025-10-28 15:15:42 +00:00
Capa Bot
68cf74d60c Sync capa rules submodule 2025-10-28 13:12:29 +00:00
Moritz
5a0c47419f Merge pull request #2735 from mandiant/dependabot/npm_and_yarn/web/explorer/vite-6.4.1
build(deps-dev): bump vite from 6.4.0 to 6.4.1 in /web/explorer
2025-10-24 12:32:50 +02:00
Moritz
4dbdd9dcfa Merge branch 'master' into dependabot/npm_and_yarn/web/explorer/vite-6.4.1 2025-10-24 12:30:15 +02:00
Moritz
82cbfd33db Merge pull request #2732 from xusheng6/test_fix_binja_crash
binja: fix crash in binja feature extraction when MLIL is unavailable…
2025-10-24 12:29:51 +02:00
dependabot[bot]
5906bb3ecf build(deps-dev): bump vite from 6.4.0 to 6.4.1 in /web/explorer
Bumps [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite) from 6.4.0 to 6.4.1.
- [Release notes](https://github.com/vitejs/vite/releases)
- [Changelog](https://github.com/vitejs/vite/blob/main/packages/vite/CHANGELOG.md)
- [Commits](https://github.com/vitejs/vite/commits/create-vite@6.4.1/packages/vite)

---
updated-dependencies:
- dependency-name: vite
  dependency-version: 6.4.1
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-10-21 04:19:00 +00:00
Moritz
08319f598f Merge pull request #2730 from mandiant/dependabot/npm_and_yarn/web/explorer/vite-6.4.0
build(deps-dev): bump vite from 6.3.4 to 6.4.0 in /web/explorer
2025-10-20 17:28:58 +02:00
Capa Bot
e6df6ad0cd Sync capa rules submodule 2025-10-20 15:27:46 +00:00
Capa Bot
add09df061 Sync capa-testfiles submodule 2025-10-20 15:18:32 +00:00
Mike Hunhoff
acb34e88d6 Update CHANGELOG.md
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-20 09:05:30 -06:00
Xusheng
0099e75704 binja: fix crash in binja feature extraction when MLIL is unavailable. Fix https://github.com/mandiant/capa/issues/2714 2025-10-20 18:46:53 +08:00
dependabot[bot]
da0803b671 build(deps-dev): bump vite from 6.3.4 to 6.4.0 in /web/explorer
Bumps [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite) from 6.3.4 to 6.4.0.
- [Release notes](https://github.com/vitejs/vite/releases)
- [Changelog](https://github.com/vitejs/vite/blob/main/packages/vite/CHANGELOG.md)
- [Commits](https://github.com/vitejs/vite/commits/create-vite@6.4.0/packages/vite)

---
updated-dependencies:
- dependency-name: vite
  dependency-version: 6.4.0
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-10-16 10:58:41 +00:00
Moritz
789747282d Merge pull request #2728 from mandiant/dependabot/pip/rich-14.2.0
build(deps): bump rich from 14.0.0 to 14.2.0
2025-10-16 12:57:18 +02:00
Capa Bot
3bc2d9915c Sync capa-testfiles submodule 2025-10-13 18:52:26 +00:00
dependabot[bot]
5974440ab7 build(deps): bump rich from 14.0.0 to 14.2.0
Bumps [rich](https://github.com/Textualize/rich) from 14.0.0 to 14.2.0.
- [Release notes](https://github.com/Textualize/rich/releases)
- [Changelog](https://github.com/Textualize/rich/blob/master/CHANGELOG.md)
- [Commits](https://github.com/Textualize/rich/compare/v14.0.0...v14.2.0)

---
updated-dependencies:
- dependency-name: rich
  dependency-version: 14.2.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-10-13 14:25:06 +00:00
dependabot[bot]
b9d517a70b build(deps): bump pip from 25.1.1 to 25.2 (#2717)
Bumps [pip](https://github.com/pypa/pip) from 25.1.1 to 25.2.
- [Changelog](https://github.com/pypa/pip/blob/main/NEWS.rst)
- [Commits](https://github.com/pypa/pip/compare/25.1.1...25.2)

---
updated-dependencies:
- dependency-name: pip
  dependency-version: '25.2'
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-10-06 08:32:13 -06:00
dependabot[bot]
e5b8788620 build(deps): bump humanize from 4.12.0 to 4.13.0 (#2716)
Bumps [humanize](https://github.com/python-humanize/humanize) from 4.12.0 to 4.13.0.
- [Release notes](https://github.com/python-humanize/humanize/releases)
- [Commits](https://github.com/python-humanize/humanize/compare/4.12.0...4.13.0)

---
updated-dependencies:
- dependency-name: humanize
  dependency-version: 4.13.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-10-06 08:31:46 -06:00
axelmierczuk
ec411f1552 Update pyproject.toml
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-01 19:00:26 +02:00
axelmierczuk
6871adc9dc Pin ida-settings version to 2.1.0 2025-10-01 19:00:26 +02:00
Capa Bot
07880c1418 Sync capa rules submodule 2025-09-23 20:18:16 +00:00
Capa Bot
5a6c8ca7c1 Sync capa rules submodule 2025-09-09 19:22:11 +00:00
Capa Bot
3bd8371d0c Sync capa rules submodule 2025-09-03 16:27:26 +00:00
dependabot[bot]
d0c87ef32c build(deps): bump markdown-it-py from 3.0.0 to 4.0.0 (#2711)
Bumps [markdown-it-py](https://github.com/executablebooks/markdown-it-py) from 3.0.0 to 4.0.0.
- [Release notes](https://github.com/executablebooks/markdown-it-py/releases)
- [Changelog](https://github.com/executablebooks/markdown-it-py/blob/master/CHANGELOG.md)
- [Commits](https://github.com/executablebooks/markdown-it-py/compare/v3.0.0...v4.0.0)

---
updated-dependencies:
- dependency-name: markdown-it-py
  dependency-version: 4.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-09-03 10:11:25 -06:00
dependabot[bot]
bd2731f87f build(deps): bump pytest-sugar from 1.0.0 to 1.1.1 (#2710)
Bumps [pytest-sugar](https://github.com/Teemu/pytest-sugar) from 1.0.0 to 1.1.1.
- [Release notes](https://github.com/Teemu/pytest-sugar/releases)
- [Changelog](https://github.com/Teemu/pytest-sugar/blob/main/CHANGES.rst)
- [Commits](https://github.com/Teemu/pytest-sugar/compare/v1.0.0...v1.1.1)

---
updated-dependencies:
- dependency-name: pytest-sugar
  dependency-version: 1.1.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-09-03 10:10:32 -06:00
Capa Bot
4a167d7188 Sync capa rules submodule 2025-09-03 16:08:58 +00:00
Capa Bot
c01bc346fc Sync capa rules submodule 2025-09-03 16:05:36 +00:00
Capa Bot
826330f511 Sync capa-testfiles submodule 2025-09-03 15:58:45 +00:00
Capa Bot
40e5095577 Sync capa-testfiles submodule 2025-09-03 15:55:29 +00:00
Capa Bot
c7eede3c53 Sync capa-testfiles submodule 2025-09-03 15:51:51 +00:00
Capa Bot
1a5f50195a Sync capa rules submodule 2025-08-25 19:08:17 +00:00
Capa Bot
aafca2e00a Sync capa-testfiles submodule 2025-08-25 18:59:27 +00:00
Capa Bot
3a24fabeb6 Sync capa rules submodule 2025-08-22 14:58:24 +00:00
Capa Bot
2f81bb79f9 Sync capa rules submodule 2025-08-21 14:57:07 +00:00
Capa Bot
fc83b7b0a1 Sync capa rules submodule 2025-08-21 14:56:48 +00:00
dependabot[bot]
d430aea04e build(deps): bump dnfile from 0.15.0 to 0.16.4 (#2700)
---
updated-dependencies:
- dependency-name: dnfile
  dependency-version: 0.16.4
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-08-20 15:11:17 -06:00
dependabot[bot]
1eb42599cf build(deps): bump mypy from 1.16.0 to 1.17.1 (#2704)
Bumps [mypy](https://github.com/python/mypy) from 1.16.0 to 1.17.1.
- [Changelog](https://github.com/python/mypy/blob/master/CHANGELOG.md)
- [Commits](https://github.com/python/mypy/compare/v1.16.0...v1.17.1)

---
updated-dependencies:
- dependency-name: mypy
  dependency-version: 1.17.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-08-20 15:10:52 -06:00
dependabot[bot]
618ae2111b build(deps): bump form-data from 4.0.0 to 4.0.4 in /web/explorer (#2702)
Bumps [form-data](https://github.com/form-data/form-data) from 4.0.0 to 4.0.4.
- [Release notes](https://github.com/form-data/form-data/releases)
- [Changelog](https://github.com/form-data/form-data/blob/master/CHANGELOG.md)
- [Commits](https://github.com/form-data/form-data/compare/v4.0.0...v4.0.4)

---
updated-dependencies:
- dependency-name: form-data
  dependency-version: 4.0.4
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-08-20 12:11:46 -06:00
Mike Hunhoff
42b6d8106a binja: update core version info check (#2709) 2025-08-20 11:56:56 -06:00
Capa Bot
78a020e1ac Sync capa rules submodule 2025-08-20 16:04:49 +00:00
Capa Bot
a80f85aab4 Sync capa-testfiles submodule 2025-08-20 15:57:15 +00:00
Capa Bot
f94f554d15 Sync capa-testfiles submodule 2025-08-20 15:32:08 +00:00
Capa Bot
d456d52e81 Sync capa rules submodule 2025-08-14 20:59:31 +00:00
Capa Bot
2a18b08a80 Sync capa rules submodule 2025-08-14 15:11:56 +00:00
Capa Bot
dd2e350a1a Sync capa-testfiles submodule 2025-08-14 15:08:18 +00:00
Capa Bot
164a7bdfb5 Sync capa rules submodule 2025-08-13 14:40:23 +00:00
Capa Bot
d7c896bbc6 Sync capa rules submodule 2025-08-12 16:21:29 +00:00
Capa Bot
8185ac4dde Sync capa rules submodule 2025-08-12 15:43:50 +00:00
Capa Bot
92a6ddff99 Sync capa rules submodule 2025-08-12 15:42:57 +00:00
Capa Bot
af87fae036 Sync capa-testfiles submodule 2025-08-12 15:38:12 +00:00
Capa Bot
c774db26f0 Sync capa-testfiles submodule 2025-08-12 15:37:46 +00:00
Capa Bot
eb0afc806e Sync capa rules submodule 2025-08-04 18:41:07 +00:00
dependabot[bot]
9a09f667bf build(deps): bump flake8 from 7.2.0 to 7.3.0 (#2697)
Bumps [flake8](https://github.com/pycqa/flake8) from 7.2.0 to 7.3.0.
- [Commits](https://github.com/pycqa/flake8/compare/7.2.0...7.3.0)

---
updated-dependencies:
- dependency-name: flake8
  dependency-version: 7.3.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-24 09:30:33 -06:00
dependabot[bot]
15a1dc3409 build(deps): bump ruff from 0.11.0 to 0.12.0 (#2698)
Bumps [ruff](https://github.com/astral-sh/ruff) from 0.11.0 to 0.12.0.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](https://github.com/astral-sh/ruff/compare/0.11.0...0.12.0)

---
updated-dependencies:
- dependency-name: ruff
  dependency-version: 0.12.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-24 09:30:20 -06:00
Capa Bot
a18fe34d01 Sync capa rules submodule 2025-06-20 16:16:27 +00:00
Capa Bot
edcea18c52 Sync capa-testfiles submodule 2025-06-17 19:17:09 +00:00
dependabot[bot]
92f0306f96 build(deps): bump mypy from 1.15.0 to 1.16.0 (#2696)
Bumps [mypy](https://github.com/python/mypy) from 1.15.0 to 1.16.0.
- [Changelog](https://github.com/python/mypy/blob/master/CHANGELOG.md)
- [Commits](https://github.com/python/mypy/compare/v1.15.0...v1.16.0)

---
updated-dependencies:
- dependency-name: mypy
  dependency-version: 1.16.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-16 12:00:21 -06:00
dependabot[bot]
f2ed75c339 build(deps): bump protobuf from 6.30.1 to 6.31.1 (#2686)
Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 6.30.1 to 6.31.1.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Changelog](https://github.com/protocolbuffers/protobuf/blob/main/protobuf_release.bzl)
- [Commits](https://github.com/protocolbuffers/protobuf/compare/v6.30.1...v6.31.1)

---
updated-dependencies:
- dependency-name: protobuf
  dependency-version: 6.31.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-06-11 16:28:23 -06:00
dependabot[bot]
6e18657ca7 build(deps): bump types-protobuf from 5.29.1.20241207 to 6.30.2.20250516 (#2687)
Bumps [types-protobuf](https://github.com/typeshed-internal/stub_uploader) from 5.29.1.20241207 to 6.30.2.20250516.
- [Commits](https://github.com/typeshed-internal/stub_uploader/commits)

---
updated-dependencies:
- dependency-name: types-protobuf
  dependency-version: 6.30.2.20250516
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-06-11 16:28:11 -06:00
dependabot[bot]
8ba48d11d0 build(deps): bump pyinstaller from 6.12.0 to 6.14.1 (#2688)
Bumps [pyinstaller](https://github.com/pyinstaller/pyinstaller) from 6.12.0 to 6.14.1.
- [Release notes](https://github.com/pyinstaller/pyinstaller/releases)
- [Changelog](https://github.com/pyinstaller/pyinstaller/blob/develop/doc/CHANGES.rst)
- [Commits](https://github.com/pyinstaller/pyinstaller/compare/v6.12.0...v6.14.1)

---
updated-dependencies:
- dependency-name: pyinstaller
  dependency-version: 6.14.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-06-11 16:27:52 -06:00
Mike Hunhoff
d6f442b5bd ci: remove redundant "test_run" action (#2692)
* ci: remove redundant "test_run" action

* update CHANGELOG
2025-06-10 16:43:10 -06:00
Spencer Heywood
0da5d7c5b5 add support for arm64 binary releases (#2691) 2025-06-10 15:52:55 -06:00
Mike Hunhoff
fa5d9a9302 v9.2.1 (#2685) 2025-06-06 14:16:25 -06:00
Mike Hunhoff
30fb4751f6 fix build issues discovered during and after v9.2.0 release (#2684)
* ci: downgrade Ubuntu version to accommodate older GLIBC versions

* ci: upgrade Windows version to avoid deprecation

* ci: exclude pkg_resources from PyInstaller build

* update CHANGELOG

* update spec file

* ci: check if build runs without warnings or errors

* update CHANGELOG

* update build commands

* update build commands

* update build commands

* update build commands

* update build commands
2025-06-06 12:02:06 -06:00
Mike Hunhoff
a8eab7ddf0 update "publish package" workflow to 1.12.4 (#2680) 2025-06-05 13:44:30 -06:00
Mike Hunhoff
5ad1dda918 v9.2.0 (#2677) 2025-06-05 09:03:56 -06:00
Capa Bot
eabb2cc809 Sync capa rules submodule 2025-06-02 19:16:57 +00:00
Capa Bot
a34c3ecc57 Sync capa rules submodule 2025-06-02 18:47:17 +00:00
dependabot[bot]
d22de5cf7f build(deps): bump rich from 13.9.2 to 14.0.0 (#2675)
Bumps [rich](https://github.com/Textualize/rich) from 13.9.2 to 14.0.0.
- [Release notes](https://github.com/Textualize/rich/releases)
- [Changelog](https://github.com/Textualize/rich/blob/master/CHANGELOG.md)
- [Commits](https://github.com/Textualize/rich/compare/v13.9.2...v14.0.0)

---
updated-dependencies:
- dependency-name: rich
  dependency-version: 14.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-02 12:39:31 -06:00
dependabot[bot]
8f78834cae build(deps): bump setuptools from 78.1.1 to 80.9.0 (#2674)
Bumps [setuptools](https://github.com/pypa/setuptools) from 78.1.1 to 80.9.0.
- [Release notes](https://github.com/pypa/setuptools/releases)
- [Changelog](https://github.com/pypa/setuptools/blob/main/NEWS.rst)
- [Commits](https://github.com/pypa/setuptools/compare/v78.1.1...v80.9.0)

---
updated-dependencies:
- dependency-name: setuptools
  dependency-version: 80.9.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-02 12:39:08 -06:00
dependabot[bot]
08dbb0e02d build(deps): bump flake8-simplify from 0.21.0 to 0.22.0 (#2673)
Bumps [flake8-simplify](https://github.com/MartinThoma/flake8-simplify) from 0.21.0 to 0.22.0.
- [Release notes](https://github.com/MartinThoma/flake8-simplify/releases)
- [Changelog](https://github.com/MartinThoma/flake8-simplify/blob/main/CHANGELOG.md)
- [Commits](https://github.com/MartinThoma/flake8-simplify/commits)

---
updated-dependencies:
- dependency-name: flake8-simplify
  dependency-version: 0.22.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-02 12:38:41 -06:00
Mike Hunhoff
98725c52dc vmray: do not restrict analysis to PE and ELF files, e.g. docx (#2672)
* vmray: do not restrict analysis to PE and ELF files, e.g. docx

* update CHANGELOG
2025-05-30 13:53:12 -06:00
dependabot[bot]
eb87153064 build(deps): bump setuptools from 76.0.0 to 78.1.1 (#2667)
Bumps [setuptools](https://github.com/pypa/setuptools) from 76.0.0 to 78.1.1.
- [Release notes](https://github.com/pypa/setuptools/releases)
- [Changelog](https://github.com/pypa/setuptools/blob/main/NEWS.rst)
- [Commits](https://github.com/pypa/setuptools/compare/v76.0.0...v78.1.1)

---
updated-dependencies:
- dependency-name: setuptools
  dependency-version: 78.1.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-05-30 12:35:35 -06:00
dependabot[bot]
56aa7176b0 build(deps): bump pip from 25.0 to 25.1.1 (#2666)
Bumps [pip](https://github.com/pypa/pip) from 25.0 to 25.1.1.
- [Changelog](https://github.com/pypa/pip/blob/main/NEWS.rst)
- [Commits](https://github.com/pypa/pip/compare/25.0...25.1.1)

---
updated-dependencies:
- dependency-name: pip
  dependency-version: 25.1.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-05-30 12:35:15 -06:00
dependabot[bot]
8b41671409 build(deps): bump flake8 from 7.1.1 to 7.2.0 (#2648)
* build(deps): bump flake8 from 7.1.1 to 7.2.0

Bumps [flake8](https://github.com/pycqa/flake8) from 7.1.1 to 7.2.0.
- [Commits](https://github.com/pycqa/flake8/compare/7.1.1...7.2.0)

---
updated-dependencies:
- dependency-name: flake8
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix lints (#2671)

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-05-30 12:09:16 -06:00
dependabot[bot]
5dbbc2b468 build(deps): bump pre-commit from 4.1.0 to 4.2.0 (#2646)
Bumps [pre-commit](https://github.com/pre-commit/pre-commit) from 4.1.0 to 4.2.0.
- [Release notes](https://github.com/pre-commit/pre-commit/releases)
- [Changelog](https://github.com/pre-commit/pre-commit/blob/main/CHANGELOG.md)
- [Commits](https://github.com/pre-commit/pre-commit/compare/v4.1.0...v4.2.0)

---
updated-dependencies:
- dependency-name: pre-commit
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-05-30 11:28:19 -06:00
Mike Hunhoff
96d1eb64c3 update binja core version (#2670)
* update binja core version

* update CHANGELOG
2025-05-30 10:52:56 -06:00
Capa Bot
9234b33051 Sync capa rules submodule 2025-05-28 20:44:44 +00:00
Capa Bot
51f5114ad7 Sync capa rules submodule 2025-05-22 18:53:54 +00:00
Capa Bot
4b72f8a872 Sync capa-testfiles submodule 2025-05-22 17:48:58 +00:00
Moritz
8206a97b0f Merge pull request #2659 from mandiant/dependabot/npm_and_yarn/web/explorer/vite-6.3.4
build(deps-dev): bump vite from 6.2.3 to 6.3.4 in /web/explorer
2025-05-20 16:47:50 +02:00
Capa Bot
5a33b4b2a8 Sync capa rules submodule 2025-05-19 18:21:38 +00:00
Capa Bot
fcfdeec377 Sync capa rules submodule 2025-05-19 18:21:13 +00:00
Capa Bot
37a63a751c Sync capa-testfiles submodule 2025-05-19 18:12:00 +00:00
zdw@
3a9f2136bb lint: log the failed example+rule (#2661)
* lint: log the failed example+rule

* Update scripts/lint.py

Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>

* fix lint

---------

Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
2025-05-12 15:11:22 -06:00
Capa Bot
390e2a6315 Sync capa-testfiles submodule 2025-05-12 16:17:27 +00:00
Capa Bot
6a43084915 Sync capa-testfiles submodule 2025-05-12 16:06:51 +00:00
dependabot[bot]
6d7ca57fa9 build(deps): bump pydantic from 2.10.1 to 2.11.4 (#2660)
Bumps [pydantic](https://github.com/pydantic/pydantic) from 2.10.1 to 2.11.4.
- [Release notes](https://github.com/pydantic/pydantic/releases)
- [Changelog](https://github.com/pydantic/pydantic/blob/main/HISTORY.md)
- [Commits](https://github.com/pydantic/pydantic/compare/v2.10.1...v2.11.4)

---
updated-dependencies:
- dependency-name: pydantic
  dependency-version: 2.11.4
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-05-12 09:48:46 -06:00
Ana María Martínez Gómez
d1090e8391 ci: Update Ubuntu version in actions (#2656)
ubuntu-20.04 has been deprecated causing several GH actions to fail:
https://github.com/actions/runner-images/issues/11101
2025-05-09 15:40:59 -06:00
dependabot[bot]
b07efe773b build(deps-dev): bump vite from 6.2.3 to 6.3.4 in /web/explorer
Bumps [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite) from 6.2.3 to 6.3.4.
- [Release notes](https://github.com/vitejs/vite/releases)
- [Changelog](https://github.com/vitejs/vite/blob/main/packages/vite/CHANGELOG.md)
- [Commits](https://github.com/vitejs/vite/commits/v6.3.4/packages/vite)

---
updated-dependencies:
- dependency-name: vite
  dependency-version: 6.3.4
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-04-30 18:53:28 +00:00
Moritz
9d3d3be21d Merge pull request #2644 from mandiant/dependabot/npm_and_yarn/web/explorer/vite-6.2.3 2025-03-25 22:06:15 +01:00
dependabot[bot]
8251a4c16f build(deps-dev): bump vite from 6.2.2 to 6.2.3 in /web/explorer
Bumps [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite) from 6.2.2 to 6.2.3.
- [Release notes](https://github.com/vitejs/vite/releases)
- [Changelog](https://github.com/vitejs/vite/blob/v6.2.3/packages/vite/CHANGELOG.md)
- [Commits](https://github.com/vitejs/vite/commits/v6.2.3/packages/vite)

---
updated-dependencies:
- dependency-name: vite
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-03-25 15:30:15 +00:00
Mike Hunhoff
7407cb39ca add lint for registry control set regex that is not complete (#2643)
* add lint for registry control set regex that is not complete

* update CHANGELOG
2025-03-24 12:17:12 -06:00
Capa Bot
0162e447fd Sync capa rules submodule 2025-03-24 16:38:44 +00:00
Capa Bot
829dae388f Sync capa rules submodule 2025-03-21 16:15:53 +00:00
Capa Bot
2a4d0ae080 Sync capa rules submodule 2025-03-21 14:40:08 +00:00
Capa Bot
d9a754730c Sync capa rules submodule 2025-03-20 15:06:54 +00:00
Capa Bot
4acacba9d6 Sync capa rules submodule 2025-03-20 15:00:54 +00:00
Capa Bot
d00f172973 Sync capa rules submodule 2025-03-19 17:29:32 +00:00
182 changed files with 9003 additions and 2358 deletions

27
.bumpversion.toml Normal file

@@ -0,0 +1,27 @@
[tool.bumpversion]
current_version = "9.3.1"
[[tool.bumpversion.files]]
filename = "capa/version.py"
search = '__version__ = "{current_version}"'
replace = '__version__ = "{new_version}"'
[[tool.bumpversion.files]]
filename = "capa/ida/plugin/ida-plugin.json"
search = '"version": "{current_version}"'
replace = '"version": "{new_version}"'
[[tool.bumpversion.files]]
filename = "capa/ida/plugin/ida-plugin.json"
search = '"flare-capa=={current_version}"'
replace = '"flare-capa=={new_version}"'
[[tool.bumpversion.files]]
filename = "CHANGELOG.md"
search = "v{current_version}...master"
replace = "v{current_version}...{new_version}"
[[tool.bumpversion.files]]
filename = "CHANGELOG.md"
search = "master (unreleased)"
replace = "v{new_version}"
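The bump-my-version config above drives a plain search-and-replace across the listed files: each `search`/`replace` template is filled in with the current and new versions, then applied literally. A minimal sketch of that mechanic (the `bump` helper is illustrative, not part of the tool's API; the templates come from the config above):

```python
def bump(content: str, search: str, replace: str,
         current_version: str, new_version: str) -> str:
    # Templates like '__version__ = "{current_version}"' are filled in
    # first, then the literal substitution is applied to the file content.
    old = search.format(current_version=current_version)
    new = replace.format(current_version=current_version,
                         new_version=new_version)
    return content.replace(old, new)

print(bump('__version__ = "9.3.1"',
           '__version__ = "{current_version}"',
           '__version__ = "{new_version}"',
           "9.3.1", "9.3.2"))
# → __version__ = "9.3.2"
```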


@@ -4,6 +4,13 @@ updates:
directory: "/"
schedule:
interval: "weekly"
groups:
vivisect:
patterns:
- "vivisect"
- "pyasn1"
- "pyasn1-modules"
- "msgpack"
ignore:
- dependency-name: "*"
update-types: ["version-update:semver-patch"]

2
.github/flake8.ini vendored

@@ -33,8 +33,6 @@ per-file-ignores =
scripts/*: T201
# capa.exe is meant to print output
capa/main.py: T201
# IDA tests emit results to output window so need to print
tests/test_ida_features.py: T201
# utility used to find the Binary Ninja API via invoking python.exe
capa/features/extractors/binja/find_binja_api.py: T201


@@ -63,6 +63,9 @@ ignore_missing_imports = True
[mypy-PyQt5.*]
ignore_missing_imports = True
[mypy-binaryninja]
ignore_missing_imports = True
[mypy-binaryninja.*]
ignore_missing_imports = True


@@ -20,3 +20,5 @@ closes #issue_number
- [ ] No new tests needed
<!-- Please help us keep capa documentation up-to-date -->
- [ ] No documentation update needed
<!-- Please indicate if and how you have used AI to generate (parts of) your code submission. Include your prompt, model, tool, etc. -->
- [ ] This submission includes AI-generated code and I have provided details in the description.


@@ -17,6 +17,8 @@ import sys
import capa.rules.cache
from PyInstaller.utils.hooks import collect_submodules
from pathlib import Path
# SPECPATH is a global variable which points to .spec file path
@@ -34,6 +36,7 @@ a = Analysis(
["../../capa/main.py"],
pathex=["capa"],
binaries=None,
hiddenimports=collect_submodules('rich'),
datas=[
# when invoking pyinstaller from the project root,
# this gets invoked from the directory of the spec file,
@@ -74,6 +77,10 @@ a = Analysis(
# only be installed locally.
"binaryninja",
"ida",
"ghidra",
# remove once https://github.com/mandiant/capa/issues/2681 has
# been addressed by PyInstaller
"pkg_resources",
],
)

62
.github/workflows/black-format.yml vendored Normal file

@@ -0,0 +1,62 @@
name: black auto-format
on:
pull_request:
branches: [ master ]
paths-ignore:
- 'web/**'
- 'doc/**'
- '**.md'
workflow_dispatch: # allow manual trigger
permissions:
contents: write
jobs:
black-format:
# only run on dependabot PRs or manual trigger
if: github.actor == 'dependabot[bot]' || github.event_name == 'workflow_dispatch'
runs-on: ubuntu-22.04
steps:
- name: Checkout repository
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
with:
ref: ${{ github.head_ref }}
# need a token with write access to push the commit
token: ${{ secrets.GITHUB_TOKEN }}
- name: Set up Python 3.13
uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c # v5.0.0
with:
python-version: "3.13"
- name: Install dependencies
run: |
pip install -r requirements.txt
pip install -e .[dev,scripts]
- name: Run isort
run: pre-commit run isort --all-files
- name: Run black/continue
# black returns non-zero error code after formatting, which is what we expect
continue-on-error: true
run: pre-commit run black --all-files
- name: Check for changes
id: changes
run: |
if git diff --quiet; then
echo "has_changes=false" >> "$GITHUB_OUTPUT"
else
echo "has_changes=true" >> "$GITHUB_OUTPUT"
fi
- name: Commit and push formatting changes
if: steps.changes.outputs.has_changes == 'true'
run: |
git config user.name "${GITHUB_ACTOR}"
git config user.email "${GITHUB_ACTOR_ID}+${GITHUB_ACTOR}@users.noreply.github.com"
git add -A
git commit -m "style: auto-format with black and isort"
git push
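The `git diff --quiet` check in the workflow above keys off git's exit status: 0 means the worktree is clean, 1 means there are unstaged changes. The same check, mirrored in Python (the helper name is ours, a minimal sketch):

```python
import subprocess

def worktree_has_changes(repo: str) -> bool:
    # `git diff --quiet` exits 0 for a clean worktree and 1 when there
    # are unstaged modifications, mirroring the workflow step above.
    return subprocess.run(["git", "diff", "--quiet"], cwd=repo).returncode != 0
```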


@@ -9,6 +9,7 @@ on:
- '**.md'
release:
types: [edited, published]
workflow_dispatch: # manual trigger for testing
permissions:
contents: write
@@ -22,24 +23,43 @@ jobs:
fail-fast: true
matrix:
include:
- os: ubuntu-20.04
- os: ubuntu-22.04
# use old linux so that the shared library versioning is more portable
artifact_name: capa
asset_name: linux
python_version: '3.10'
- os: ubuntu-20.04
# for Ghidra
java-version: '21'
ghidra-version: '12.0'
public-version: 'PUBLIC_20251205'
ghidra-sha256: 'af43e8cfb2fa4490cf6020c3a2bde25c159d83f45236a0542688a024e8fc1941'
- os: ubuntu-22.04-arm
artifact_name: capa
asset_name: linux-arm64
python_version: '3.10'
- os: ubuntu-22.04
artifact_name: capa
asset_name: linux-py312
python_version: '3.12'
- os: windows-2019
- os: windows-2022
artifact_name: capa.exe
asset_name: windows
python_version: '3.10'
- os: macos-13
# use older macOS for assumed better portability
# Windows 11 ARM64 complains of conflicting package version
# Additionally, there is no ARM64 build of Python for Python 3.10 on Windows 11 ARM: https://raw.githubusercontent.com/actions/python-versions/main/versions-manifest.json
#- os: windows-11-arm
# artifact_name: capa.exe
# asset_name: windows-arm64
# python_version: '3.12'
- os: macos-15-intel
# macos-15-intel is the lowest native intel build
artifact_name: capa
asset_name: macos
python_version: '3.10'
- os: macos-14
artifact_name: capa
asset_name: macos-arm64
python_version: '3.10'
steps:
- name: Checkout capa
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
@@ -49,7 +69,7 @@ jobs:
uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c # v5.0.0
with:
python-version: ${{ matrix.python_version }}
- if: matrix.os == 'ubuntu-20.04'
- if: matrix.os == 'ubuntu-22.04' || matrix.os == 'ubuntu-22.04-arm'
run: sudo apt-get install -y libyaml-dev
- name: Upgrade pip, setuptools
run: python -m pip install --upgrade pip setuptools
@@ -59,6 +79,28 @@ jobs:
pip install -e .[build]
- name: Build standalone executable
run: pyinstaller --log-level DEBUG .github/pyinstaller/pyinstaller.spec
- name: Does it run without warnings or errors?
shell: bash
run: |
if [[ "${{ matrix.os }}" == "windows-2022" ]] || [[ "${{ matrix.os }}" == "windows-11-arm" ]]; then
EXECUTABLE=".\\dist\\capa"
else
EXECUTABLE="./dist/capa"
fi
output=$(${EXECUTABLE} --version 2>&1)
exit_code=$?
echo "${output}"
echo "${exit_code}"
if echo "${output}" | grep -iE 'error|warning'; then
exit 1
fi
if [[ "${exit_code}" -ne 0 ]]; then
exit 1
fi
- name: Does it run (PE)?
run: dist/capa -d "tests/data/Practical Malware Analysis Lab 01-01.dll_"
- name: Does it run (Shellcode)?
@@ -69,39 +111,29 @@ jobs:
run: |
7z e "tests/data/dynamic/cape/v2.2/d46900384c78863420fb3e297d0a2f743cd2b6b3f7f82bf64059a168e07aceb7.json.gz"
dist/capa -d "d46900384c78863420fb3e297d0a2f743cd2b6b3f7f82bf64059a168e07aceb7.json"
- name: Set up Java ${{ matrix.java-version }}
if: matrix.os == 'ubuntu-22.04' && matrix.python_version == '3.10'
uses: actions/setup-java@387ac29b308b003ca37ba93a6cab5eb57c8f5f93 # v4.0.0
with:
distribution: 'temurin'
java-version: ${{ matrix.java-version }}
- name: Install Ghidra ${{ matrix.ghidra-version }}
if: matrix.os == 'ubuntu-22.04' && matrix.python_version == '3.10'
run: |
mkdir ./.github/ghidra
wget "https://github.com/NationalSecurityAgency/ghidra/releases/download/Ghidra_${{ matrix.ghidra-version }}_build/ghidra_${{ matrix.ghidra-version }}_${{ matrix.public-version }}.zip" -O ./.github/ghidra/ghidra_${{ matrix.ghidra-version }}_PUBLIC.zip
echo "${{ matrix.ghidra-sha256 }} ./.github/ghidra/ghidra_${{ matrix.ghidra-version }}_PUBLIC.zip" | sha256sum -c -
unzip .github/ghidra/ghidra_${{ matrix.ghidra-version }}_PUBLIC.zip -d .github/ghidra/
- name: Does it run (Ghidra)?
if: matrix.os == 'ubuntu-22.04' && matrix.python_version == '3.10'
env:
GHIDRA_INSTALL_DIR: ${{ github.workspace }}/.github/ghidra/ghidra_${{ matrix.ghidra-version }}_PUBLIC
run: dist/capa -b ghidra -d "tests/data/Practical Malware Analysis Lab 01-01.dll_"
- uses: actions/upload-artifact@5d5d22a31266ced268874388b861e4b58bb5c2f3 # v4.3.1
with:
name: ${{ matrix.asset_name }}
path: dist/${{ matrix.artifact_name }}
test_run:
name: Test run on ${{ matrix.os }} / ${{ matrix.asset_name }}
runs-on: ${{ matrix.os }}
needs: [build]
strategy:
matrix:
include:
# OSs not already tested above
- os: ubuntu-22.04
artifact_name: capa
asset_name: linux
- os: ubuntu-22.04
artifact_name: capa
asset_name: linux-py312
- os: windows-2022
artifact_name: capa.exe
asset_name: windows
steps:
- name: Download ${{ matrix.asset_name }}
uses: actions/download-artifact@eaceaf801fd36c7dee90939fad912460b18a1ffe # v4.1.2
with:
name: ${{ matrix.asset_name }}
- name: Set executable flag
if: matrix.os != 'windows-2022'
run: chmod +x ${{ matrix.artifact_name }}
- name: Run capa
run: ./${{ matrix.artifact_name }} -h
zip_and_upload:
# upload zipped binaries to Release page
if: github.event_name == 'release'
@@ -113,12 +145,18 @@ jobs:
include:
- asset_name: linux
artifact_name: capa
- asset_name: linux-arm64
artifact_name: capa
- asset_name: linux-py312
artifact_name: capa
- asset_name: windows
artifact_name: capa.exe
#- asset_name: windows-arm64
# artifact_name: capa.exe
- asset_name: macos
artifact_name: capa
- asset_name: macos-arm64
artifact_name: capa
steps:
- name: Download ${{ matrix.asset_name }}
uses: actions/download-artifact@eaceaf801fd36c7dee90939fad912460b18a1ffe # v4.1.2
@@ -129,7 +167,7 @@ jobs:
- name: Set zip name
run: echo "zip_name=capa-${GITHUB_REF#refs/tags/}-${{ matrix.asset_name }}.zip" >> $GITHUB_ENV
- name: Zip ${{ matrix.artifact_name }} into ${{ env.zip_name }}
run: zip ${{ env.zip_name }} ${{ matrix.artifact_name }}
run: zip ${ZIP_NAME} ${{ matrix.artifact_name }}
- name: Upload ${{ env.zip_name }} to GH Release
uses: svenstaro/upload-release-action@2728235f7dc9ff598bd86ce3c274b74f802d2208 # v2
with:

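The Ghidra install step above verifies the downloaded archive with `echo "<sha256>  <file>" | sha256sum -c -`. The equivalent check in Python (a hedged sketch; `verify_sha256` is our name, not part of the workflow or capa):

```python
import hashlib
from pathlib import Path

def verify_sha256(path: Path, expected: str) -> bool:
    # Equivalent of: echo "<expected>  <path>" | sha256sum -c -
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == expected.lower()
```

Failing the job on a mismatch (as `sha256sum -c -` does via its exit code) is what guards the pipeline against a tampered or truncated download.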

@@ -14,8 +14,8 @@ jobs:
steps:
- name: Check out repository code
uses: actions/checkout@v4
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
- uses: pypa/gh-action-pip-audit@v1.0.8
- uses: pypa/gh-action-pip-audit@1220774d901786e6f652ae159f7b6bc8fea6d266 # v1.1.0
with:
inputs: .


@@ -35,7 +35,7 @@ jobs:
with:
path: dist/*
- name: publish package
uses: pypa/gh-action-pypi-publish@f5622bde02b04381239da3573277701ceca8f6a0 # release/v1
uses: pypa/gh-action-pypi-publish@76f52bc884231f62b9a034ebfe128415bbaabdfc # release/v1.12.4
with:
skip-existing: true
verbose: true


@@ -21,8 +21,10 @@ jobs:
# user information is needed to create annotated tags (with a message)
git config user.email 'capa-dev@mandiant.com'
git config user.name 'Capa Bot'
name=${{ github.event.release.tag_name }}
name=${GITHUB_EVENT_RELEASE_TAG_NAME}
git tag $name -m "https://github.com/mandiant/capa/releases/$name"
env:
GITHUB_EVENT_RELEASE_TAG_NAME: ${{ github.event.release.tag_name }}
# TODO update branch name
major=${name%%.*}
- name: Push tag to capa-rules
uses: ad-m/github-push-action@d91a481090679876dfc4178fef17f286781251df # v0.8.0


@@ -42,10 +42,10 @@ jobs:
- name: Checkout capa
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
# use latest available python to take advantage of best performance
- name: Set up Python 3.12
- name: Set up Python 3.13
uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c # v5.0.0
with:
python-version: "3.12"
python-version: "3.13"
- name: Install dependencies
run: |
pip install -r requirements.txt
@@ -70,10 +70,10 @@ jobs:
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
with:
submodules: recursive
- name: Set up Python 3.12
- name: Set up Python 3.13
uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c # v5.0.0
with:
python-version: "3.12"
python-version: "3.13"
- name: Install capa
run: |
pip install -r requirements.txt
@@ -88,16 +88,14 @@ jobs:
strategy:
fail-fast: false
matrix:
os: [ubuntu-20.04, windows-2019, macos-13]
os: [ubuntu-22.04, ubuntu-22.04-arm, windows-2022, macos-15-intel, macos-14]
# across all operating systems
python-version: ["3.10", "3.11"]
python-version: ["3.10", "3.13"]
include:
# on Ubuntu run these as well
- os: ubuntu-20.04
python-version: "3.10"
- os: ubuntu-20.04
- os: ubuntu-22.04
python-version: "3.11"
- os: ubuntu-20.04
- os: ubuntu-22.04
python-version: "3.12"
steps:
- name: Checkout capa with submodules
@@ -109,12 +107,17 @@ jobs:
with:
python-version: ${{ matrix.python-version }}
- name: Install pyyaml
if: matrix.os == 'ubuntu-20.04'
if: matrix.os == 'ubuntu-22.04'
run: sudo apt-get install -y libyaml-dev
- name: Install capa
run: |
pip install -r requirements.txt
pip install -e .[dev,scripts]
- name: Cache vivisect workspaces
uses: actions/cache@0057852bfaa89a56745cba8c7296529d2fc39830 # v4.3.0
with:
path: tests/data/**/*.viv
key: viv-${{ runner.os }}-${{ runner.arch }}-${{ matrix.python-version }}-${{ hashFiles('**/requirements.txt') }}
- name: Run tests (fast)
# this set of tests runs about 80% of the cases in 20% of the time,
# and should catch most errors quickly.
@@ -131,7 +134,7 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ["3.10", "3.11"]
python-version: ["3.10", "3.13"]
steps:
- name: Checkout capa with submodules
# do only run if BN_SERIAL is available, have to do this in every step, see https://github.com/orgs/community/discussions/26726#discussioncomment-3253118
@@ -157,7 +160,7 @@ jobs:
run: |
mkdir ./.github/binja
curl "https://raw.githubusercontent.com/Vector35/binaryninja-api/6812c97/scripts/download_headless.py" -o ./.github/binja/download_headless.py
python ./.github/binja/download_headless.py --serial ${{ env.BN_SERIAL }} --output .github/binja/BinaryNinja-headless.zip
python ./.github/binja/download_headless.py --serial ${BN_SERIAL} --output .github/binja/BinaryNinja-headless.zip
unzip .github/binja/BinaryNinja-headless.zip -d .github/binja/
python .github/binja/binaryninja/scripts/install_api.py --install-on-root --silent
- name: Run tests
@@ -168,16 +171,16 @@ jobs:
ghidra-tests:
name: Ghidra tests for ${{ matrix.python-version }}
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
needs: [tests]
strategy:
fail-fast: false
matrix:
python-version: ["3.10", "3.11"]
java-version: ["17"]
ghidra-version: ["11.0.1"]
public-version: ["PUBLIC_20240130"] # for ghidra releases
ghidrathon-version: ["4.0.0"]
python-version: ["3.10", "3.13"]
java-version: ["21"]
ghidra-version: ["12.0"]
public-version: ["PUBLIC_20251205"] # for ghidra releases
ghidra-sha256: ['af43e8cfb2fa4490cf6020c3a2bde25c159d83f45236a0542688a024e8fc1941']
steps:
- name: Checkout capa with submodules
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
@@ -196,26 +199,66 @@ jobs:
run: |
mkdir ./.github/ghidra
wget "https://github.com/NationalSecurityAgency/ghidra/releases/download/Ghidra_${{ matrix.ghidra-version }}_build/ghidra_${{ matrix.ghidra-version }}_${{ matrix.public-version }}.zip" -O ./.github/ghidra/ghidra_${{ matrix.ghidra-version }}_PUBLIC.zip
echo "${{ matrix.ghidra-sha256 }} ./.github/ghidra/ghidra_${{ matrix.ghidra-version }}_PUBLIC.zip" | sha256sum -c -
unzip .github/ghidra/ghidra_${{ matrix.ghidra-version }}_PUBLIC.zip -d .github/ghidra/
- name: Install Ghidrathon
run : |
mkdir ./.github/ghidrathon
wget "https://github.com/mandiant/Ghidrathon/releases/download/v${{ matrix.ghidrathon-version }}/Ghidrathon-v${{ matrix.ghidrathon-version}}.zip" -O ./.github/ghidrathon/ghidrathon-v${{ matrix.ghidrathon-version }}.zip
unzip .github/ghidrathon/ghidrathon-v${{ matrix.ghidrathon-version }}.zip -d .github/ghidrathon/
python -m pip install -r .github/ghidrathon/requirements.txt
python .github/ghidrathon/ghidrathon_configure.py $(pwd)/.github/ghidra/ghidra_${{ matrix.ghidra-version }}_PUBLIC
unzip .github/ghidrathon/Ghidrathon-v${{ matrix.ghidrathon-version }}.zip -d .github/ghidra/ghidra_${{ matrix.ghidra-version }}_PUBLIC/Ghidra/Extensions
- name: Install pyyaml
run: sudo apt-get install -y libyaml-dev
- name: Install capa with Ghidra extra
run: |
pip install -e .[dev,ghidra]
- name: Run tests
env:
GHIDRA_INSTALL_DIR: ${{ github.workspace }}/.github/ghidra/ghidra_${{ matrix.ghidra-version }}_PUBLIC
run: pytest -v tests/test_ghidra_features.py
idalib-tests:
name: IDA ${{ matrix.ida.version }} tests for ${{ matrix.python-version }}
runs-on: ubuntu-22.04
needs: [tests]
env:
IDA_LICENSE_ID: ${{ secrets.IDA_LICENSE_ID }}
strategy:
fail-fast: false
matrix:
python-version: ["3.10", "3.13"]
ida:
- version: 9.0
slug: "release/9.0/ida-essential/ida-essential_90_x64linux.run"
- version: 9.1
slug: "release/9.1/ida-essential/ida-essential_91_x64linux.run"
- version: 9.2
slug: "release/9.2/ida-essential/ida-essential_92_x64linux.run"
steps:
- name: Checkout capa with submodules
# do only run if IDA_LICENSE_ID is available, have to do this in every step, see https://github.com/orgs/community/discussions/26726#discussioncomment-3253118
if: ${{ env.IDA_LICENSE_ID != 0 }}
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
with:
submodules: recursive
- name: Set up Python ${{ matrix.python-version }}
if: ${{ env.IDA_LICENSE_ID != 0 }}
uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c # v5.0.0
with:
python-version: ${{ matrix.python-version }}
- name: Setup uv
if: ${{ env.IDA_LICENSE_ID != 0 }}
uses: astral-sh/setup-uv@61cb8a9741eeb8a550a1b8544337180c0fc8476b # v7.2.0
- name: Install dependencies
if: ${{ env.IDA_LICENSE_ID != 0 }}
run: sudo apt-get install -y libyaml-dev
- name: Install capa
if: ${{ env.IDA_LICENSE_ID != 0 }}
run: |
pip install -r requirements.txt
pip install -e .[dev,scripts]
pip install idapro
- name: Install IDA ${{ matrix.ida.version }}
if: ${{ env.IDA_LICENSE_ID != 0 }}
run: |
uv run hcli --disable-updates ida install --download-id ${{ matrix.ida.slug }} --license-id ${{ secrets.IDA_LICENSE_ID }} --set-default --yes
env:
HCLI_API_KEY: ${{ secrets.HCLI_API_KEY }}
IDA_LICENSE_ID: ${{ secrets.IDA_LICENSE_ID }}
- name: Run tests
run: |
mkdir ./.github/ghidra/project
.github/ghidra/ghidra_${{ matrix.ghidra-version }}_PUBLIC/support/analyzeHeadless .github/ghidra/project ghidra_test -Import ./tests/data/mimikatz.exe_ -ScriptPath ./tests/ -PostScript test_ghidra_features.py > ../output.log
cat ../output.log
exit_code=$(cat ../output.log | grep exit | awk '{print $NF}')
exit $exit_code
if: ${{ env.IDA_LICENSE_ID != 0 }}
run: pytest -v tests/test_idalib_features.py # explicitly refer to the idalib tests for performance. other tests run above.


@@ -18,14 +18,18 @@ jobs:
- uses: actions/checkout@v4
- name: Set release name
run: echo "RELEASE_NAME=capa-explorer-web-v${{ github.event.inputs.version }}-${GITHUB_SHA::7}" >> $GITHUB_ENV
run: echo "RELEASE_NAME=capa-explorer-web-v${GITHUB_EVENT_INPUTS_VERSION}-${GITHUB_SHA::7}" >> $GITHUB_ENV
env:
GITHUB_EVENT_INPUTS_VERSION: ${{ github.event.inputs.version }}
- name: Check if release already exists
run: |
if ls web/explorer/releases/capa-explorer-web-v${{ github.event.inputs.version }}-* 1> /dev/null 2>&1; then
echo "::error:: A release with version ${{ github.event.inputs.version }} already exists"
if ls web/explorer/releases/capa-explorer-web-v${GITHUB_EVENT_INPUTS_VERSION}-* 1> /dev/null 2>&1; then
echo "::error:: A release with version ${GITHUB_EVENT_INPUTS_VERSION} already exists"
exit 1
fi
env:
GITHUB_EVENT_INPUTS_VERSION: ${{ github.event.inputs.version }}
- name: Set up Node.js
uses: actions/setup-node@0a44ba7841725637a19e28fa30b79a866c81b0a6 # v4.0.4
@@ -43,24 +47,24 @@ jobs:
working-directory: web/explorer
- name: Compress bundle
run: zip -r ${{ env.RELEASE_NAME }}.zip capa-explorer-web
run: zip -r ${RELEASE_NAME}.zip capa-explorer-web
working-directory: web/explorer
- name: Create releases directory
run: mkdir -vp web/explorer/releases
- name: Move release to releases folder
run: mv web/explorer/${{ env.RELEASE_NAME }}.zip web/explorer/releases
run: mv web/explorer/${RELEASE_NAME}.zip web/explorer/releases
- name: Compute release SHA256 hash
run: |
echo "RELEASE_SHA256=$(sha256sum web/explorer/releases/${{ env.RELEASE_NAME }}.zip | awk '{print $1}')" >> $GITHUB_ENV
echo "RELEASE_SHA256=$(sha256sum web/explorer/releases/${RELEASE_NAME}.zip | awk '{print $1}')" >> $GITHUB_ENV
- name: Update CHANGELOG.md
run: |
echo "## ${{ env.RELEASE_NAME }}" >> web/explorer/releases/CHANGELOG.md
echo "## ${RELEASE_NAME}" >> web/explorer/releases/CHANGELOG.md
echo "- Release Date: $(date -u '+%Y-%m-%d %H:%M:%S UTC')" >> web/explorer/releases/CHANGELOG.md
echo "- SHA256: ${{ env.RELEASE_SHA256 }}" >> web/explorer/releases/CHANGELOG.md
echo "- SHA256: ${RELEASE_SHA256}" >> web/explorer/releases/CHANGELOG.md
echo "" >> web/explorer/releases/CHANGELOG.md
cat web/explorer/releases/CHANGELOG.md
@@ -73,7 +77,7 @@ jobs:
run: |
git config --local user.email "capa-dev@mandiant.com"
git config --local user.name "Capa Bot"
git add -f web/explorer/releases/${{ env.RELEASE_NAME }}.zip web/explorer/releases/CHANGELOG.md
git add -f web/explorer/releases/${RELEASE_NAME}.zip web/explorer/releases/CHANGELOG.md
git add -u web/explorer/releases/
- name: Create Pull Request

1
.gitignore vendored

@@ -122,6 +122,7 @@ scripts/perf/*.zip
*/.DS_Store
Pipfile
Pipfile.lock
uv.lock
/cache/
.github/binja/binaryninja
.github/binja/download_headless.py


@@ -136,8 +136,8 @@ repos:
- "tests/"
- "--ignore=tests/test_binja_features.py"
- "--ignore=tests/test_ghidra_features.py"
- "--ignore=tests/test_ida_features.py"
- "--ignore=tests/test_viv_features.py"
- "--ignore=tests/test_idalib_features.py"
- "--ignore=tests/test_main.py"
- "--ignore=tests/test_scripts.py"
always_run: true


@@ -4,28 +4,214 @@
### New Features
- ghidra: support PyGhidra @mike-hunhoff #2788
- vmray: extract number features from whitelisted void_ptr parameters (hKey, hKeyRoot) @adeboyedn #2835
### Breaking Changes
### New Rules (4)
### New Rules (23)
- communication/socket/connect-socket moritz.raabe@mandiant.com joakim@intezer.com mrhafizfarhad@gmail.com
- communication/socket/udp/connect-udp-socket mrhafizfarhad@gmail.com
- nursery/enter-debug-mode-in-dotnet @v1bh475u
- nursery/run-as-nodejs-native-module mehunhoff@google.com
- nursery/inject-shellcode-using-thread-pool-work-insertion-with-tp_io still@teamt5.org
- nursery/inject-shellcode-using-thread-pool-work-insertion-with-tp_timer still@teamt5.org
- nursery/inject-shellcode-using-thread-pool-work-insertion-with-tp_work still@teamt5.org
- data-manipulation/encryption/hc-256/encrypt-data-using-hc-256 wballenthin@hex-rays.com
- anti-analysis/anti-llm/terminate-anthropic-session-via-magic-strings wballenthin@hex-rays.com
- nursery/access-aws-credentials maximemorin@google.com
- nursery/access-cloudflare-credentials maximemorin@google.com
- nursery/access-docker-credentials maximemorin@google.com
- nursery/access-gcp-credentials maximemorin@google.com
- nursery/access-kubernetes-credentials maximemorin@google.com
- nursery/enumerate-aws-cloudformation maximemorin@google.com
- nursery/enumerate-aws-cloudtrail maximemorin@google.com
- nursery/enumerate-aws-direct-connect maximemorin@google.com
- nursery/enumerate-aws-ec2 maximemorin@google.com
- nursery/enumerate-aws-iam maximemorin@google.com
- nursery/enumerate-aws-s3 maximemorin@google.com
- nursery/enumerate-aws-support-cases maximemorin@google.com
- persistence/registry/persist-via-shellserviceobjectdelayload-registry-key xpzhxhm@gmail.com
- nursery/get-http-response-date @cosmoworker
- host-interaction/process/create/create-process-in-dotnet moritz.raabe@mandiant.com social.tarang@gmail.com
- nursery/read-file-in-dotnet moritz.raabe@mandiant.com anushka.virgaonkar@mandiant.com
- nursery/write-file-in-dotnet william.ballenthin@mandiant.com anushka.virgaonkar@mandiant.com
### Bug Fixes
- cape: make some fields optional @williballenthin #2631 #2632
- lint: add WARN for regex features that contain unescaped dot #2635
- main: suggest --os flag in unsupported OS error message to help users override ELF OS detection @devs6186 #2577
- render: escape sample-controlled strings before passing to Rich to prevent MarkupError @devs6186 #2699
- rules: handle empty or invalid YAML documents gracefully in `Rule.from_yaml` and `get_rules` @devs6186 #2900
- Fixed insecure deserialization vulnerability in YAML loading @0x1622 (#2770)
- loader: gracefully handle ELF files with unsupported architectures kamranulhaq2002@gmail.com #2800
- loader: handle SegmentationViolation for malformed ELF files @kami922 #2799
- lint: disable rule caching during linting @Maijin #2817
- vmray: skip processes with invalid PID or missing filename @EclipseAditya #2807
- features: fix Regex.get_value_str() returning escaped pattern instead of raw regex @EclipseAditya #1909
- render: use default styling for dynamic -vv API/call details so they are easier to see @devs6186 #1865
- loader: handle struct.error from dnfile and show clear CorruptFile message @devs6186 #2442
- address: fix TypeError when sorting locations containing mixed address types @devs6186 #2195
- loader: skip PE files with unrealistically large section virtual sizes to prevent resource exhaustion @devs6186 #1989
### capa Explorer Web
- webui: fix 404 for "View rule in capa-rules" by using encodeURIComponent for rule name in URL @devs6186 #2482
- webui: show error when JSON does not follow expected result document schema; suggest reanalyzing for VT URLs @devs6186 #2363
- webui: fix global search to match feature types (match, regex, api, …) @devs6186 #2349
### capa Explorer IDA Pro plugin
### Performance
- perf: eliminate O(n²) tuple growth and reduce per-match overhead @devs6186 #2890
### Development
- doc: document that default output shows top-level matches only; -v/-vv show nested matches @devs6186 #1410
- doc: fix typo in usage.md, add documentation links to README @devs6186 #2274
- doc: add table comparing ways to consume capa output (CLI, IDA, Ghidra, dynamic sandbox, web) @devs6186 #2273
- binja: add mypy config for top-level binaryninja module to fix mypy issues @devs6186 #2399
- ci: deprecate macos-13 runner and use Python v3.13 for testing @mike-hunhoff #2777
- ci: pin pip-audit action SHAs and update to v1.1.0 @kami922 #1131
### Raw diffs
- [capa v9.3.1...master](https://github.com/mandiant/capa/compare/v9.3.1...master)
- [capa-rules v9.3.1...master](https://github.com/mandiant/capa-rules/compare/v9.3.1...master)
## v9.3.1
This patch release fixes a missing import for the capa explorer plugin for IDA Pro.
### Bug Fixes
- add missing ida-netnode dependency to project.toml @mike-hunhoff #2765
### Development
- ci: bump binja min version @mike-hunhoff #2763
### Raw diffs
- [capa v9.3.0...master](https://github.com/mandiant/capa/compare/v9.3.0...master)
- [capa-rules v9.3.0...master](https://github.com/mandiant/capa-rules/compare/v9.3.0...master)
## v9.3.0
capa v9.3.0 comes with over 20 new and/or improved rules.
For IDA users, the capa explorer plugin is now available via the IDA Pro plugin repository and contains a Qt compatibility layer for PyQt5 and PySide6 support.
Additionally, a Binary Ninja bug has been fixed. Released binaries now include ARM64 builds (Linux and macOS).
### New Features
- ci: add support for arm64 binary releases
- tests: run tests against IDA via idalib @williballenthin #2742
### Breaking Changes
### New Rules (24)
- anti-analysis/anti-vm/vm-detection/detect-mouse-movement-via-activity-checks-on-windows tevajdr@gmail.com
- nursery/create-executable-heap moritz.raabe@mandiant.com
- anti-analysis/packer/dxpack/packed-with-dxpack jakubjozwiak@google.com
- anti-analysis/anti-av/patch-bitdefender-hooking-dll-function jakubjozwiak@google.com
- nursery/acquire-load-driver-privileges mehunhoff@google.com
- nursery/communicate-using-ftp mehunhoff@google.com
- linking/static/eclipse-paho-mqtt-c/linked-against-eclipse-paho-mqtt-c jakubjozwiak@google.com
- linking/static/qmqtt/linked-against-qmqtt jakubjozwiak@google.com
- anti-analysis/anti-forensic/disable-powershell-transcription jakubjozwiak@google.com
- host-interaction/powershell/bypass-powershell-constrained-language-mode-via-getsystemlockdownpolicy-patch jakubjozwiak@google.com
- linking/static/grpc/linked-against-grpc jakubjozwiak@google.com
- linking/static/hp-socket/linked-against-hp-socket jakubjozwiak@google.com
- load-code/execute-jscript-via-vsaengine-in-dotnet jakubjozwiak@google.com
- linking/static/funchook/linked-against-funchook jakubjozwiak@google.com
- linking/static/plthook/linked-against-plthook jakubjozwiak@google.com
- host-interaction/network/enumerate-tcp-connections-via-wmi-com-api jakubjozwiak@google.com
- host-interaction/network/routing-table/create-routing-table-entry jakubjozwiak@google.com
- host-interaction/network/routing-table/get-routing-table michael.hunhoff@mandiant.com
- host-interaction/file-system/use-io_uring-io-interface-on-linux jakubjozwiak@google.com
- collection/keylog/log-keystrokes-via-direct-input zeze-zeze
- nursery/compiled-from-fsharp mehunhoff@google.com
- nursery/decrypt-data-using-aes-via-dotnet mehunhoff@google.com
- nursery/get-dotnet-assembly-entry-point mehunhoff@google.com
### Bug Fixes
- binja: fix a crash during feature extraction when the MLIL is unavailable @xusheng6 #2714
### capa Explorer Web
### capa Explorer IDA Pro plugin
- add `ida-plugin.json` for inclusion in the IDA Pro plugin repository @williballenthin
- ida plugin: add Qt compatibility layer for PyQt5 and PySide6 support @williballenthin #2707
- delay import to not load Qt* when running under idalib @mr-tz #2752
### Development
- ci: remove redundant "test_run" action from build workflow @mike-hunhoff #2692
- dev: add bumpmyversion to bump and sync versions across the project @mr-tz
### Raw diffs
- [capa v9.1.0...master](https://github.com/mandiant/capa/compare/v9.1.0...master)
- [capa-rules v9.1.0...master](https://github.com/mandiant/capa-rules/compare/v9.1.0...master)
- [capa v9.2.1...v9.3.0](https://github.com/mandiant/capa/compare/v9.2.1...v9.3.0)
- [capa-rules v9.2.1...v9.3.0](https://github.com/mandiant/capa-rules/compare/v9.2.1...v9.3.0)
## v9.2.1
This point release fixes bugs, including removing an unnecessary PyInstaller warning message and enabling the standalone binary to run on systems with older versions of glibc.
### Bug Fixes
- ci: exclude pkg_resources from PyInstaller build @mike-hunhoff #2684
- ci: downgrade Ubuntu version to accommodate older glibc versions @mike-hunhoff #2684
### Development
- ci: upgrade Windows version to avoid deprecation @mike-hunhoff #2684
- ci: check if build runs without warnings or errors @mike-hunhoff #2684
### Raw diffs
- [capa v9.2.0...v9.2.1](https://github.com/mandiant/capa/compare/v9.2.0...v9.2.1)
- [capa-rules v9.2.0...v9.2.1](https://github.com/mandiant/capa-rules/compare/v9.2.0...v9.2.1)
## v9.2.0
This release improves several aspects of dynamic analysis, including relaxed field validation across many CAPE versions and processing of additional VMRay submission file types.
It also includes an updated rule pack containing new rules and rule fixes.
### New Features
- vmray: do not restrict analysis to PE and ELF files, e.g. docx @mike-hunhoff #2672
### Breaking Changes
### New Rules (22)
- communication/socket/connect-socket moritz.raabe@mandiant.com joakim@intezer.com mrhafizfarhad@gmail.com
- communication/socket/udp/connect-udp-socket mrhafizfarhad@gmail.com
- nursery/enter-debug-mode-in-dotnet @v1bh475u
- nursery/decrypt-data-using-tripledes-in-dotnet 0xRavenspar
- nursery/encrypt-data-using-tripledes-in-dotnet 0xRavenspar
- nursery/disable-system-features-via-registry-on-windows mehunhoff@google.com
- data-manipulation/encryption/chaskey/encrypt-data-using-chaskey still@teamt5.org
- data-manipulation/encryption/speck/encrypt-data-using-speck still@teamt5.org
- load-code/dotnet/load-assembly-via-iassembly still@teamt5.org
- malware-family/donut-loader/load-shellcode-via-donut still@teamt5.org
- nursery/disable-device-guard-features-via-registry-on-windows mehunhoff@google.com
- nursery/disable-firewall-features-via-registry-on-windows mehunhoff@google.com
- nursery/disable-system-restore-features-via-registry-on-windows mehunhoff@google.com
- nursery/disable-windows-defender-features-via-registry-on-windows mehunhoff@google.com
- host-interaction/file-system/write/clear-file-content jakeperalta7
- host-interaction/filter/unload-minifilter-driver JakePeralta7
- exploitation/enumeration/make-suspicious-ntquerysysteminformation-call zdw@google.com
- exploitation/gadgets/load-ntoskrnl zdw@google.com
- exploitation/gadgets/resolve-ntoskrnl-gadgets zdw@google.com
- exploitation/spraying/make-suspicious-ntfscontrolfile-call zdw@google.com
- anti-analysis/anti-forensic/unload-sysmon JakePeralta7
### Bug Fixes
- cape: make some fields optional @williballenthin #2631 #2632
- lint: add WARN for regex features that contain unescaped dot #2635
- lint: add ERROR for incomplete registry control set regex #2643
- binja: update unit test core version #2670
### Raw diffs
- [capa v9.1.0...v9.2.0](https://github.com/mandiant/capa/compare/v9.1.0...v9.2.0)
- [capa-rules v9.1.0...v9.2.0](https://github.com/mandiant/capa-rules/compare/v9.1.0...v9.2.0)
## v9.1.0


@@ -87,6 +87,33 @@ Download stable releases of the standalone capa binaries [here](https://github.c
To use capa as a library or integrate with another tool, see [doc/installation.md](https://github.com/mandiant/capa/blob/master/doc/installation.md) for further setup instructions.
**Documentation:** [Usage and tips](doc/usage.md) · [Installation](doc/installation.md) · [Limitations](doc/limitations.md) · [FAQ](doc/faq.md)
# mapa html map
mapa can render a standalone HTML overview of functions, string tags, and referenced strings.
```bash
python -m mapa binaries/01/16/mpbindump.exe --output html-map > report.html
```
On the `mpbindump.exe` sample, the generated report contains:
```text
doctype <!doctype html>
functions 1406
tags 12
strings 81
```
Use `--open` to write the report to a temporary file and open it in your browser.
```bash
python -m mapa binaries/01/16/mpbindump.exe --output html-map --open
```
A reproducible demonstration of this example is recorded in `doc/demos/mapa-html-map.md`.
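A minimal sketch of how such a standalone report can be assembled (the real renderer lives in `mapa/renderer.py`; the helper name and field layout below are illustrative only):

```python
import html

def render_html_map(functions, strings, tags):
    """render a self-contained HTML overview; all markup is inlined so the
    file opens offline without extra assets."""
    func_rows = "\n".join(f"<tr><td>{html.escape(f)}</td></tr>" for f in functions)
    # pair each string with its database-derived tag, escaping both for HTML
    string_rows = "\n".join(
        f"<li>{html.escape(s)} <span class='tag'>{html.escape(t)}</span></li>"
        for s, t in zip(strings, tags)
    )
    return (
        "<!doctype html>\n"
        "<html><head><meta charset='utf-8'><title>mapa report</title></head>\n"
        f"<body><h1>functions ({len(functions)})</h1><table>{func_rows}</table>\n"
        f"<h2>strings ({len(strings)})</h2><ul>{string_rows}</ul></body></html>"
    )

report = render_html_map(["sub_401000", "main"], ["kernel32.dll"], ["#winapi"])
assert report.startswith("<!doctype html>")
```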
# capa Explorer Web
The [capa Explorer Web](https://mandiant.github.io/capa/explorer/) enables you to interactively explore capa results in your web browser. Besides the online version you can download a standalone HTML file for local offline usage.
@@ -291,11 +318,17 @@ It also uses your local changes to the .idb to extract better features, such as
![capa + IDA Pro integration](https://github.com/mandiant/capa/blob/master/doc/img/explorer_expanded.png)
# Ghidra integration
If you use Ghidra, then you can use the [capa + Ghidra integration](/capa/ghidra/) to run capa's analysis directly on your Ghidra database and render the results in Ghidra's user interface.
capa supports using Ghidra (via [PyGhidra](https://github.com/NationalSecurityAgency/ghidra/tree/master/Ghidra/Features/PyGhidra)) as a feature extraction backend. This allows you to run capa against binaries using Ghidra's analysis engine.
You can run and view capa results in the Ghidra UI using [capa explorer for Ghidra](https://github.com/mandiant/capa/tree/master/capa/ghidra/plugin).
<img src="https://github.com/mandiant/capa/assets/66766340/eeae33f4-99d4-42dc-a5e8-4c1b8c661492" width=300>
You can also run capa from the command line using the [Ghidra backend](https://github.com/mandiant/capa/tree/master/capa/ghidra).
# blog posts
- [Riding Dragons: capa Harnesses Ghidra](https://www.mandiant.com/resources/blog/capa-harnesses-ghidra)
- [Dynamic capa: Exploring Executable Run-Time Behavior with the CAPE Sandbox](https://www.mandiant.com/resources/blog/dynamic-capa-executable-behavior-cape-sandbox)
- [capa v4: casting a wider .NET](https://www.mandiant.com/resources/blog/capa-v4-casting-wider-net) (.NET support)
- [ELFant in the Room capa v3](https://www.mandiant.com/resources/elfant-in-the-room-capa-v3) (ELF support)
@@ -315,3 +348,6 @@ If you use Ghidra, then you can use the [capa + Ghidra integration](/capa/ghidra
## capa testfiles
The [capa-testfiles repository](https://github.com/mandiant/capa-testfiles) contains the data we use to test capa's code and rules.
## mailing list
Subscribe to the FLARE mailing list for community announcements! Email "subscribe" to [flare-external@google.com](mailto:flare-external@google.com?subject=subscribe).


@@ -277,7 +277,9 @@ def find_dynamic_capabilities(
all_span_matches: MatchResults = collections.defaultdict(list)
all_call_matches: MatchResults = collections.defaultdict(list)
feature_counts = rdoc.DynamicFeatureCounts(file=0, processes=())
# Accumulate into a list to avoid O(n²) tuple concatenation.
# Tuples are immutable, so `t += (x,)` copies the entire tuple each time.
process_feature_counts: list[rdoc.ProcessFeatureCount] = []
assert isinstance(extractor, DynamicFeatureExtractor)
processes: list[ProcessHandle] = list(extractor.get_processes())
@@ -289,10 +291,10 @@ def find_dynamic_capabilities(
task = pbar.add_task("matching", total=n_processes, unit="processes")
for p in processes:
process_capabilities = find_process_capabilities(ruleset, extractor, p)
feature_counts.processes += (
process_feature_counts.append(
rdoc.ProcessFeatureCount(
address=frz.Address.from_capa(p.address), count=process_capabilities.feature_count
),
)
)
for rule_name, res in process_capabilities.process_matches.items():
@@ -317,7 +319,11 @@ def find_dynamic_capabilities(
capa.engine.index_rule_matches(process_and_lower_features, rule, locations)
all_file_capabilities = find_file_capabilities(ruleset, extractor, process_and_lower_features)
feature_counts.file = all_file_capabilities.feature_count
feature_counts = rdoc.DynamicFeatureCounts(
file=all_file_capabilities.feature_count,
processes=tuple(process_feature_counts),
)
matches = dict(
itertools.chain(


@@ -156,8 +156,11 @@ def find_static_capabilities(
all_bb_matches: MatchResults = collections.defaultdict(list)
all_insn_matches: MatchResults = collections.defaultdict(list)
feature_counts = rdoc.StaticFeatureCounts(file=0, functions=())
library_functions: tuple[rdoc.LibraryFunction, ...] = ()
# Accumulate into lists to avoid O(n²) tuple concatenation.
# Tuples are immutable, so `t += (x,)` copies the entire tuple each time.
# For binaries with thousands of functions this becomes quadratic in memory work.
function_feature_counts: list[rdoc.FunctionFeatureCount] = []
library_functions_list: list[rdoc.LibraryFunction] = []
assert isinstance(extractor, StaticFeatureExtractor)
functions: list[FunctionHandle] = list(extractor.get_functions())
@@ -176,20 +179,20 @@ def find_static_capabilities(
if extractor.is_library_function(f.address):
function_name = extractor.get_function_name(f.address)
logger.debug("skipping library function 0x%x (%s)", f.address, function_name)
library_functions += (
rdoc.LibraryFunction(address=frz.Address.from_capa(f.address), name=function_name),
library_functions_list.append(
rdoc.LibraryFunction(address=frz.Address.from_capa(f.address), name=function_name)
)
n_libs = len(library_functions)
n_libs = len(library_functions_list)
percentage = round(100 * (n_libs / n_funcs))
pbar.update(task, postfix=f"skipped {n_libs} library functions, {percentage}%")
pbar.advance(task)
continue
code_capabilities = find_code_capabilities(ruleset, extractor, f)
feature_counts.functions += (
function_feature_counts.append(
rdoc.FunctionFeatureCount(
address=frz.Address.from_capa(f.address), count=code_capabilities.feature_count
),
)
)
t1 = time.time()
@@ -230,7 +233,11 @@ def find_static_capabilities(
capa.engine.index_rule_matches(function_and_lower_features, rule, locations)
all_file_capabilities = find_file_capabilities(ruleset, extractor, function_and_lower_features)
feature_counts.file = all_file_capabilities.feature_count
feature_counts = rdoc.StaticFeatureCounts(
file=all_file_capabilities.feature_count,
functions=tuple(function_feature_counts),
)
matches: MatchResults = dict(
itertools.chain(
@@ -244,4 +251,4 @@ def find_static_capabilities(
)
)
return Capabilities(matches, feature_counts, library_functions)
return Capabilities(matches, feature_counts, tuple(library_functions_list))
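The quadratic cost called out in the comments above is easy to demonstrate in isolation (a stand-alone sketch; the names are illustrative, not capa's):

```python
def accumulate_tuple(n: int) -> tuple[int, ...]:
    # t += (x,) builds a brand-new tuple each iteration: O(n^2) total copying.
    t: tuple[int, ...] = ()
    for i in range(n):
        t += (i,)
    return t

def accumulate_list(n: int) -> tuple[int, ...]:
    # list.append is amortized O(1); convert to a tuple once at the end.
    items: list[int] = []
    for i in range(n):
        items.append(i)
    return tuple(items)

# both produce identical results; only the amount of copying differs
assert accumulate_tuple(1000) == accumulate_list(1000)
```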


@@ -189,6 +189,11 @@ class _NoAddress(Address):
def __lt__(self, other):
return False
def __gt__(self, other):
# Mixed-type comparison: `real_address < NO_ADDRESS` falls back to this method,
# so NO_ADDRESS sorts after every real address.
return other is not self
def __hash__(self):
return hash(0)
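A reduced sketch of the sorting contract (this NoAddress is a stand-in for capa's _NoAddress sentinel):

```python
class NoAddress:
    """Sentinel address that sorts after every real address."""
    def __lt__(self, other):
        return False  # never less than anything: sinks to the end
    def __gt__(self, other):
        # mixed-type compare: `real < NO_ADDRESS` falls back to
        # `NO_ADDRESS > real`, which must be True for any real address.
        return other is not self
    def __eq__(self, other):
        return other is self
    def __hash__(self):
        return hash(0)

NO_ADDRESS = NoAddress()
addrs = [NO_ADDRESS, 0x401000, 0x1000]
# sorted() only needs `<`; ints compare among themselves as usual,
# and int-vs-NoAddress comparisons route through __gt__ above.
assert sorted(addrs) == [0x1000, 0x401000, NO_ADDRESS]
```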


@@ -369,6 +369,12 @@ class Regex(String):
else:
return Result(False, _MatchedRegex(self, {}), [])
def get_value_str(self) -> str:
# return the raw regex pattern, not the escaped version from String.get_value_str().
# see #1909.
assert isinstance(self.value, str)
return self.value
def __str__(self):
assert isinstance(self.value, str)
return f"regex(string =~ {self.value})"
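The raw-pattern behavior matters because capa rule regexes are written as `/pattern/` with an optional `i` flag; a stand-alone illustration using `re` directly (the delimiter parsing below is simplified, not capa's matcher):

```python
import re

pattern = r"/SOFTWARE\\Microsoft\\/i"  # capa-style regex feature value
# strip the leading "/", then split off the trailing "/<flags>" part
body, _, flags = pattern[1:].rpartition("/")
compiled = re.compile(body, re.IGNORECASE if "i" in flags else 0)
assert compiled.search("software\\microsoft\\windows")
# get_value_str() should show the raw pattern, not a repr()-escaped copy
print(pattern)
```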


@@ -20,6 +20,7 @@ Proto files generated via protobuf v24.4:
from BinExport2 at 6916731d5f6693c4a4f0a052501fd3bd92cfd08b
https://github.com/google/binexport/blob/6916731/binexport2.proto
"""
import io
import hashlib
import logging


@@ -84,16 +84,14 @@ def extract_insn_number_features(
yield OperandOffset(i, value), ih.address
OFFSET_PATTERNS = BinExport2InstructionPatternMatcher.from_str(
"""
OFFSET_PATTERNS = BinExport2InstructionPatternMatcher.from_str("""
ldr|ldrb|ldrh|ldrsb|ldrsh|ldrex|ldrd|str|strb|strh|strex|strd reg, [reg(not-stack), #int] ; capture #int
ldr|ldrb|ldrh|ldrsb|ldrsh|ldrex|ldrd|str|strb|strh|strex|strd reg, [reg(not-stack), #int]! ; capture #int
ldr|ldrb|ldrh|ldrsb|ldrsh|ldrex|ldrd|str|strb|strh|strex|strd reg, [reg(not-stack)], #int ; capture #int
ldp|ldpd|stp|stpd reg, reg, [reg(not-stack), #int] ; capture #int
ldp|ldpd|stp|stpd reg, reg, [reg(not-stack), #int]! ; capture #int
ldp|ldpd|stp|stpd reg, reg, [reg(not-stack)], #int ; capture #int
"""
)
""")
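Each pattern line above has the same shape: a `|`-alternation of mnemonics, an operand template, and an optional `; capture` directive. A simplified parser (not the real BinExport2InstructionPatternMatcher) can split a line mechanically:

```python
def parse_pattern_line(line):
    """split 'mnem1|mnem2 operands ; capture #int' into
    (mnemonics, operand template, captured term or None)."""
    text, _, comment = line.partition(";")
    capture = None
    if comment.strip().startswith("capture "):
        capture = comment.strip().removeprefix("capture ").strip()
    mnemonics, _, operands = text.strip().partition(" ")
    return mnemonics.split("|"), operands.strip(), capture

mnems, ops, cap = parse_pattern_line(
    "ldr|ldrb|ldrh reg, [reg(not-stack), #int] ; capture #int"
)
assert mnems == ["ldr", "ldrb", "ldrh"]
assert cap == "#int"
```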
def extract_insn_offset_features(
@@ -117,12 +115,10 @@ def extract_insn_offset_features(
yield OperandOffset(match.operand_index, value), ih.address
NZXOR_PATTERNS = BinExport2InstructionPatternMatcher.from_str(
"""
NZXOR_PATTERNS = BinExport2InstructionPatternMatcher.from_str("""
eor reg, reg, reg
eor reg, reg, #int
"""
)
""")
def extract_insn_nzxor_characteristic_features(
@@ -144,11 +140,9 @@ def extract_insn_nzxor_characteristic_features(
yield Characteristic("nzxor"), ih.address
INDIRECT_CALL_PATTERNS = BinExport2InstructionPatternMatcher.from_str(
"""
INDIRECT_CALL_PATTERNS = BinExport2InstructionPatternMatcher.from_str("""
blx|bx|blr reg
"""
)
""")
def extract_function_indirect_call_characteristic_features(


@@ -34,17 +34,14 @@ from capa.features.extractors.binexport2.arch.intel.helpers import SECURITY_COOK
logger = logging.getLogger(__name__)
IGNORE_NUMBER_PATTERNS = BinExport2InstructionPatternMatcher.from_str(
"""
IGNORE_NUMBER_PATTERNS = BinExport2InstructionPatternMatcher.from_str("""
ret #int
retn #int
add reg(stack), #int
sub reg(stack), #int
"""
)
""")
NUMBER_PATTERNS = BinExport2InstructionPatternMatcher.from_str(
"""
NUMBER_PATTERNS = BinExport2InstructionPatternMatcher.from_str("""
push #int0 ; capture #int0
# it's a little tedious to enumerate all the address forms
@@ -64,8 +61,7 @@ NUMBER_PATTERNS = BinExport2InstructionPatternMatcher.from_str(
# imagine reg is zero'd out, then this is like `mov reg, #int`
# which is not uncommon.
lea reg, [reg + #int] ; capture #int
"""
)
""")
def extract_insn_number_features(
@@ -100,8 +96,7 @@ def extract_insn_number_features(
yield OperandOffset(match.operand_index, value), ih.address
OFFSET_PATTERNS = BinExport2InstructionPatternMatcher.from_str(
"""
OFFSET_PATTERNS = BinExport2InstructionPatternMatcher.from_str("""
mov|movzx|movsb|cmp [reg + reg * #int + #int0], #int ; capture #int0
mov|movzx|movsb|cmp [reg * #int + #int0], #int ; capture #int0
mov|movzx|movsb|cmp [reg + reg + #int0], #int ; capture #int0
@@ -114,18 +109,15 @@ OFFSET_PATTERNS = BinExport2InstructionPatternMatcher.from_str(
mov|movzx|movsb|cmp|lea reg, [reg * #int + #int0] ; capture #int0
mov|movzx|movsb|cmp|lea reg, [reg + reg + #int0] ; capture #int0
mov|movzx|movsb|cmp|lea reg, [reg(not-stack) + #int0] ; capture #int0
"""
)
""")
# these are patterns that access offset 0 from some pointer
# (pointer is not the stack pointer).
OFFSET_ZERO_PATTERNS = BinExport2InstructionPatternMatcher.from_str(
"""
OFFSET_ZERO_PATTERNS = BinExport2InstructionPatternMatcher.from_str("""
mov|movzx|movsb [reg(not-stack)], reg
mov|movzx|movsb [reg(not-stack)], #int
lea reg, [reg(not-stack)]
"""
)
""")
def extract_insn_offset_features(
@@ -189,12 +181,10 @@ def is_security_cookie(
return False
NZXOR_PATTERNS = BinExport2InstructionPatternMatcher.from_str(
"""
NZXOR_PATTERNS = BinExport2InstructionPatternMatcher.from_str("""
xor|xorpd|xorps|pxor reg, reg
xor|xorpd|xorps|pxor reg, #int
"""
)
""")
def extract_insn_nzxor_characteristic_features(
@@ -228,8 +218,7 @@ def extract_insn_nzxor_characteristic_features(
yield Characteristic("nzxor"), ih.address
INDIRECT_CALL_PATTERNS = BinExport2InstructionPatternMatcher.from_str(
"""
INDIRECT_CALL_PATTERNS = BinExport2InstructionPatternMatcher.from_str("""
call|jmp reg0
call|jmp [reg + reg * #int + #int]
call|jmp [reg + reg * #int]
@@ -237,8 +226,7 @@ INDIRECT_CALL_PATTERNS = BinExport2InstructionPatternMatcher.from_str(
call|jmp [reg + reg + #int]
call|jmp [reg + #int]
call|jmp [reg]
"""
)
""")
def extract_function_indirect_call_characteristic_features(


@@ -19,7 +19,6 @@ from binaryninja import (
Function,
BinaryView,
SymbolType,
ILException,
RegisterValueType,
VariableSourceType,
LowLevelILOperation,
@@ -192,9 +191,8 @@ def extract_stackstring(fh: FunctionHandle):
if bv is None:
return
try:
mlil = func.mlil
except ILException:
mlil = func.mlil
if mlil is None:
return
for block in mlil.basic_blocks:


@@ -21,9 +21,9 @@ import capa.features.extractors.cape.file
import capa.features.extractors.cape.thread
import capa.features.extractors.cape.global_
import capa.features.extractors.cape.process
from capa.exceptions import EmptyReportError
from capa.exceptions import EmptyReportError, UnsupportedFormatError
from capa.features.common import Feature
from capa.features.address import NO_ADDRESS, Address, AbsoluteVirtualAddress, _NoAddress
from capa.features.address import Address, AbsoluteVirtualAddress, _NoAddress
from capa.features.extractors.cape.models import Call, Static, Process, CapeReport
from capa.features.extractors.base_extractor import (
CallHandle,
@@ -35,7 +35,7 @@ from capa.features.extractors.base_extractor import (
logger = logging.getLogger(__name__)
TESTED_VERSIONS = {"2.2-CAPE", "2.4-CAPE"}
TESTED_VERSIONS = {"2.2-CAPE", "2.4-CAPE", "2.5-CAPE"}
class CapeExtractor(DynamicFeatureExtractor):
@@ -53,14 +53,9 @@ class CapeExtractor(DynamicFeatureExtractor):
self.global_features = list(capa.features.extractors.cape.global_.extract_features(self.report))
def get_base_address(self) -> Union[AbsoluteVirtualAddress, _NoAddress, None]:
if self.report.static is None:
return NO_ADDRESS
if self.report.static.pe is None:
# TODO: handle ELF
return NO_ADDRESS
# value according to the PE header, the actual trace may use a different imagebase
assert self.report.static is not None
assert self.report.static.pe is not None
return AbsoluteVirtualAddress(self.report.static.pe.imagebase)
def extract_global_features(self) -> Iterator[tuple[Feature, Address]]:
@@ -125,10 +120,8 @@ class CapeExtractor(DynamicFeatureExtractor):
parts.append(" -> ")
if call.pretty_return:
parts.append(call.pretty_return)
elif call.return_:
parts.append(hex(call.return_))
else:
parts.append("?")
parts.append(hex(call.return_))
return "".join(parts)
@@ -139,11 +132,25 @@ class CapeExtractor(DynamicFeatureExtractor):
if cr.info.version not in TESTED_VERSIONS:
logger.warning("CAPE version '%s' not tested/supported yet", cr.info.version)
# TODO(mr-tz): support more file types
# https://github.com/mandiant/capa/issues/1933
if "PE" not in cr.target.file.type:
logger.error(
"capa currently only supports PE target files, this target file's type is: '%s'.\nPlease report this at: https://github.com/mandiant/capa/issues/1933",
cr.target.file.type,
)
# observed in 2.4-CAPE reports from capesandbox.com
if cr.static is None and cr.target.file.pe is not None:
cr.static = Static()
cr.static.pe = cr.target.file.pe
if cr.static is None:
raise UnsupportedFormatError("CAPE report missing static analysis")
if cr.static.pe is None:
raise UnsupportedFormatError("CAPE report missing PE analysis")
if len(cr.behavior.processes) == 0:
raise EmptyReportError("CAPE did not capture any processes")


@@ -32,13 +32,7 @@ def get_processes(report: CapeReport) -> Iterator[ProcessHandle]:
"""
seen_processes = {}
for process in report.behavior.processes:
if process.parent_id is None:
# on CAPE for Linux, the root process may have no parent id, so we set that to 0
ppid = 0
else:
ppid = process.parent_id
addr = ProcessAddress(pid=process.process_id, ppid=ppid)
addr = ProcessAddress(pid=process.process_id, ppid=process.parent_id)
yield ProcessHandle(address=addr, inner=process)
# check for pid and ppid reuse
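The pid/ppid reuse check referenced in the comment can be sketched by keying processes on their full (pid, ppid) address (dataclass stand-ins; field names follow the report model shown above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProcessAddress:
    pid: int
    ppid: int

def dedupe_processes(processes):
    """yield each (pid, ppid) pair once; seeing a pair again means the
    OS reused a pid, so the duplicate handle is skipped."""
    seen: set[ProcessAddress] = set()
    out: list[ProcessAddress] = []
    for p in processes:
        addr = ProcessAddress(pid=p["process_id"], ppid=p["parent_id"])
        if addr in seen:
            continue  # pid/ppid reuse: skip the duplicate
        seen.add(addr)
        out.append(addr)
    return out

procs = [
    {"process_id": 100, "parent_id": 1},
    {"process_id": 200, "parent_id": 100},
    {"process_id": 100, "parent_id": 1},  # reused pid/ppid pair
]
assert len(dedupe_processes(procs)) == 2
```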
@@ -58,13 +52,7 @@ def extract_import_names(report: CapeReport) -> Iterator[tuple[Feature, Address]
"""
extract imported function names
"""
if report.static is None:
return
if report.static.pe is None:
# TODO: elf
return
assert report.static is not None and report.static.pe is not None
imports = report.static.pe.imports
if isinstance(imports, dict):
@@ -82,25 +70,13 @@ def extract_import_names(report: CapeReport) -> Iterator[tuple[Feature, Address]
def extract_export_names(report: CapeReport) -> Iterator[tuple[Feature, Address]]:
if report.static is None:
return
if report.static.pe is None:
# TODO: elf
return
assert report.static is not None and report.static.pe is not None
for function in report.static.pe.exports:
yield Export(function.name), AbsoluteVirtualAddress(function.address)
def extract_section_names(report: CapeReport) -> Iterator[tuple[Feature, Address]]:
if report.static is None:
return
if report.static.pe is None:
# TODO: elf
return
assert report.static is not None and report.static.pe is not None
for section in report.static.pe.sections:
yield Section(section.name), AbsoluteVirtualAddress(section.virtual_address)


@@ -42,6 +42,9 @@ def extract_arch(report: CapeReport) -> Iterator[tuple[Feature, Address]]:
yield Arch(ARCH_AMD64), NO_ADDRESS
else:
logger.warning("unrecognized Architecture: %s", report.target.file.type)
raise ValueError(
f"unrecognized Architecture from the CAPE report; output of file command: {report.target.file.type}"
)
def extract_format(report: CapeReport) -> Iterator[tuple[Feature, Address]]:
@@ -51,6 +54,9 @@ def extract_format(report: CapeReport) -> Iterator[tuple[Feature, Address]]:
yield Format(FORMAT_ELF), NO_ADDRESS
else:
logger.warning("unknown file format, file command output: %s", report.target.file.type)
raise ValueError(
f"unrecognized file format from the CAPE report; output of file command: {report.target.file.type}"
)
def extract_os(report: CapeReport) -> Iterator[tuple[Feature, Address]]:
@@ -74,10 +80,7 @@ def extract_os(report: CapeReport) -> Iterator[tuple[Feature, Address]]:
else:
# if the operating system information is missing from the cape report, it's likely a bug
logger.warning("unrecognized OS: %s", file_output)
elif report.info.machine and report.info.machine.platform == "windows":
yield OS(OS_WINDOWS), NO_ADDRESS
elif report.info.machine and report.info.machine.platform == "linux":
yield OS(OS_LINUX), NO_ADDRESS
raise ValueError(f"unrecognized OS from the CAPE report; output of file command: {file_output}")
else:
# the sample is shellcode
logger.debug("unsupported file format, file command output: %s", file_output)


@@ -29,26 +29,8 @@ def validate_hex_bytes(value):
return bytes.fromhex(value) if isinstance(value, str) else value
def validate_status_code(value):
if isinstance(value, str):
if value == "?":
# TODO: check for this in the return handling
return None
# like: -1 EINVAL (Invalid argument)
# like: 0 (Timeout)
# like: 0x8002 (flags O_RDWR|O_LARGEFILE)
assert value.endswith(")")
num = value.partition(" ")[0]
return int(num, 16) if num.startswith("0x") else int(num, 10)
else:
return value
HexInt = Annotated[int, BeforeValidator(validate_hex_int)]
HexBytes = Annotated[bytes, BeforeValidator(validate_hex_bytes)]
# this is a status code, such as those returned by CAPE for Linux, like: "0 (Timeout)" or "0x8002 (flags O_RDWR|O_LARGEFILE)"
StatusCode = Annotated[int | None, BeforeValidator(validate_status_code)]
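For reference, the removed validator's parsing rule in isolation, reconstructed from the lines above (the pydantic Annotated/BeforeValidator wiring is omitted):

```python
def validate_status_code(value):
    """parse CAPE-for-Linux status strings such as '-1 EINVAL (Invalid argument)',
    '0 (Timeout)', or '0x8002 (flags O_RDWR|O_LARGEFILE)' into an int."""
    if not isinstance(value, str):
        return value  # already numeric
    if value == "?":
        return None  # unknown status
    assert value.endswith(")")
    # the numeric part is everything before the first space
    num = value.partition(" ")[0]
    return int(num, 16) if num.startswith("0x") else int(num, 10)

assert validate_status_code("-1 EINVAL (Invalid argument)") == -1
assert validate_status_code("0x8002 (flags O_RDWR|O_LARGEFILE)") == 0x8002
```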
# a model that *cannot* have extra fields
@@ -89,13 +71,8 @@ Emptydict: TypeAlias = BaseModel
EmptyList: TypeAlias = list[Any]
class Machine(FlexibleModel):
platform: Optional[str] = None
class Info(FlexibleModel):
version: str
machine: Optional[Machine] = None
class ImportedSymbol(FlexibleModel):
@@ -310,38 +287,16 @@ class Argument(FlexibleModel):
pretty_value: Optional[str] = None
def validate_argument(value):
if isinstance(value, str):
# for a few calls on CAPE for Linux, we see arguments like in this call:
#
# timestamp: "18:12:17.199276"
# category: "misc"
# api: "uname"
# return: "0"
# ▽ arguments:
# [0]: "{sysname=\"Linux\", nodename=\"laptop\", ...}"
#
# which is just a string with a JSON-like thing inside,
# that we want to map to a default unnamed argument.
return Argument(name="", value=value)
else:
return value
# mypy isn't happy about assigning to type
Argument = Annotated[Argument, BeforeValidator(validate_argument)] # type: ignore
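A dataclass stand-in shows the coercion the validator performs (pydantic's Annotated/BeforeValidator wiring is omitted):

```python
from dataclasses import dataclass

@dataclass
class Argument:
    name: str
    value: str

def validate_argument(value):
    # CAPE for Linux sometimes emits a bare string instead of a
    # {name, value} object; wrap it as a default unnamed argument.
    if isinstance(value, str):
        return Argument(name="", value=value)
    return value

arg = validate_argument('{sysname="Linux", nodename="laptop"}')
assert arg.name == "" and "Linux" in arg.value
```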
class Call(FlexibleModel):
# timestamp: str
thread_id: int | None = None
thread_id: int
# category: str
api: str
arguments: list[Argument]
# status: bool
return_: HexInt | StatusCode = Field(alias="return")
return_: HexInt = Field(alias="return")
pretty_return: Optional[str] = None
# repeated: int
@@ -360,12 +315,12 @@ class Call(FlexibleModel):
class Process(FlexibleModel):
process_id: int
process_name: str
parent_id: int | None
parent_id: int
# module_path: str
# first_seen: str
calls: list[Call]
threads: list[int] | None = None # this can be None for CAPE for Linux, which doesn't track threads.
environ: dict[str, str] = Field(default_factory=dict) # type: ignore
threads: list[int]
environ: dict[str, str]
"""


@@ -29,13 +29,6 @@ def get_threads(ph: ProcessHandle) -> Iterator[ThreadHandle]:
get the threads associated with a given process
"""
process: Process = ph.inner
if not process.threads:
# CAPE for linux doesn't record threads
# so we return a default 0 value
yield ThreadHandle(address=ThreadAddress(process=ph.address, tid=0), inner={})
return
threads: list[int] = process.threads
for thread in threads:
@@ -49,9 +42,6 @@ def extract_environ_strings(ph: ProcessHandle) -> Iterator[tuple[Feature, Addres
"""
process: Process = ph.inner
if not process.environ:
return
for value in (value for value in process.environ.values() if value):
yield String(value), ph.address


@@ -29,16 +29,8 @@ def get_calls(ph: ProcessHandle, th: ThreadHandle) -> Iterator[CallHandle]:
tid = th.address.tid
for call_index, call in enumerate(process.calls):
if call.thread_id is None:
# CAPE for linux doesn't record threads
# so this must be the 0 value
# and we'll enumerate all the calls in this process
assert tid == 0
else:
if call.thread_id != tid:
continue
if call.thread_id != tid:
continue
for symbol in generate_symbols("", call.api):
call.api = symbol


@@ -27,7 +27,12 @@ import capa.features.extractors.dnfile.file
import capa.features.extractors.dnfile.insn
import capa.features.extractors.dnfile.function
from capa.features.common import Feature
from capa.features.address import NO_ADDRESS, Address, DNTokenAddress, DNTokenOffsetAddress
from capa.features.address import (
NO_ADDRESS,
Address,
DNTokenAddress,
DNTokenOffsetAddress,
)
from capa.features.extractors.dnfile.types import DnType, DnUnmanagedMethod
from capa.features.extractors.base_extractor import (
BBHandle,
@@ -39,6 +44,7 @@ from capa.features.extractors.base_extractor import (
from capa.features.extractors.dnfile.helpers import (
get_dotnet_types,
get_dotnet_fields,
load_dotnet_image,
get_dotnet_managed_imports,
get_dotnet_managed_methods,
get_dotnet_unmanaged_imports,
@@ -83,7 +89,7 @@ class DnFileFeatureExtractorCache:
class DnfileFeatureExtractor(StaticFeatureExtractor):
def __init__(self, path: Path):
self.pe: dnfile.dnPE = dnfile.dnPE(str(path))
self.pe = load_dotnet_image(path)
super().__init__(hashes=SampleHashes.from_bytes(path.read_bytes()))
# pre-compute .NET token lookup tables; each .NET method has access to this cache for feature extraction
@@ -112,7 +118,12 @@ class DnfileFeatureExtractor(StaticFeatureExtractor):
fh: FunctionHandle = FunctionHandle(
address=DNTokenAddress(token),
inner=method,
ctx={"pe": self.pe, "calls_from": set(), "calls_to": set(), "cache": self.token_cache},
ctx={
"pe": self.pe,
"calls_from": set(),
"calls_to": set(),
"cache": self.token_cache,
},
)
# method tokens should be unique


@@ -15,8 +15,10 @@
from __future__ import annotations
import struct
import logging
from typing import Union, Iterator, Optional
from pathlib import Path
import dnfile
from dncil.cil.body import CilMethodBody
@@ -30,6 +32,16 @@ from capa.features.extractors.dnfile.types import DnType, DnUnmanagedMethod
logger = logging.getLogger(__name__)
def load_dotnet_image(path: Path) -> dnfile.dnPE:
"""load a .NET PE file, raising CorruptFile on struct.error with the original error message."""
try:
return dnfile.dnPE(str(path))
except struct.error as e:
from capa.loader import CorruptFile
raise CorruptFile(f"Invalid or truncated .NET metadata: {e}") from e
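The same wrap-and-chain pattern applies to any parser that surfaces low-level struct errors; a self-contained sketch (this CorruptFile is a local stand-in for capa.loader.CorruptFile, and parse_header is illustrative):

```python
import struct

class CorruptFile(ValueError):
    """raised when an input cannot be parsed as the expected format."""

def parse_header(data: bytes) -> int:
    """parse a little-endian u32 length prefix, wrapping truncation errors."""
    try:
        return struct.unpack_from("<I", data)[0]
    except struct.error as e:
        # chain with `from e` so the original parse error stays in the traceback
        raise CorruptFile(f"Invalid or truncated header: {e}") from e

assert parse_header(b"\x2a\x00\x00\x00") == 42
try:
    parse_header(b"\x01")  # too short: struct.error becomes CorruptFile
except CorruptFile as e:
    assert e.__cause__ is not None
```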
class DnfileMethodBodyReader(CilMethodBodyReaderBase):
def __init__(self, pe: dnfile.dnPE, row: dnfile.mdtable.MethodDefRow):
self.pe: dnfile.dnPE = pe
@@ -151,7 +163,9 @@ def get_dotnet_managed_imports(pe: dnfile.dnPE) -> Iterator[DnType]:
)
def get_dotnet_methoddef_property_accessors(pe: dnfile.dnPE) -> Iterator[tuple[int, str]]:
def get_dotnet_methoddef_property_accessors(
pe: dnfile.dnPE,
) -> Iterator[tuple[int, str]]:
"""get MethodDef methods used to access properties
see https://www.ntcore.com/files/dotnetformat.htm
@@ -226,7 +240,13 @@ def get_dotnet_managed_methods(pe: dnfile.dnPE) -> Iterator[DnType]:
typedefnamespace, typedefname = resolve_nested_typedef_name(nested_class_table, rid, typedef, pe)
yield DnType(token, typedefname, namespace=typedefnamespace, member=method_name, access=access)
yield DnType(
token,
typedefname,
namespace=typedefnamespace,
member=method_name,
access=access,
)
def get_dotnet_fields(pe: dnfile.dnPE) -> Iterator[DnType]:
@@ -259,7 +279,9 @@ def get_dotnet_fields(pe: dnfile.dnPE) -> Iterator[DnType]:
yield DnType(token, typedefname, namespace=typedefnamespace, member=field.row.Name)
def get_dotnet_managed_method_bodies(pe: dnfile.dnPE) -> Iterator[tuple[int, CilMethodBody]]:
def get_dotnet_managed_method_bodies(
pe: dnfile.dnPE,
) -> Iterator[tuple[int, CilMethodBody]]:
"""get managed methods from MethodDef table"""
for rid, method_def in iter_dotnet_table(pe, dnfile.mdtable.MethodDef.number):
assert isinstance(method_def, dnfile.mdtable.MethodDefRow)
@@ -338,7 +360,10 @@ def get_dotnet_table_row(pe: dnfile.dnPE, table_index: int, row_index: int) -> O
def resolve_nested_typedef_name(
nested_class_table: dict, index: int, typedef: dnfile.mdtable.TypeDefRow, pe: dnfile.dnPE
nested_class_table: dict,
index: int,
typedef: dnfile.mdtable.TypeDefRow,
pe: dnfile.dnPE,
) -> tuple[str, tuple[str, ...]]:
"""Resolves all nested TypeDef class names. Returns the namespace as a str and the nested TypeRef name as a tuple"""


@@ -42,6 +42,7 @@ from capa.features.extractors.dnfile.types import DnType
from capa.features.extractors.base_extractor import SampleHashes, StaticFeatureExtractor
from capa.features.extractors.dnfile.helpers import (
iter_dotnet_table,
load_dotnet_image,
is_dotnet_mixed_mode,
get_dotnet_managed_imports,
get_dotnet_managed_methods,
@@ -184,8 +185,8 @@ GLOBAL_HANDLERS = (
class DotnetFileFeatureExtractor(StaticFeatureExtractor):
def __init__(self, path: Path):
super().__init__(hashes=SampleHashes.from_bytes(path.read_bytes()))
self.path: Path = path
self.pe: dnfile.dnPE = dnfile.dnPE(str(path))
self.path = path
self.pe = load_dotnet_image(path)
def get_base_address(self):
return NO_ADDRESS
@@ -217,7 +218,10 @@ class DotnetFileFeatureExtractor(StaticFeatureExtractor):
assert self.pe.net.struct.MajorRuntimeVersion is not None
assert self.pe.net.struct.MinorRuntimeVersion is not None
return self.pe.net.struct.MajorRuntimeVersion, self.pe.net.struct.MinorRuntimeVersion
return (
self.pe.net.struct.MajorRuntimeVersion,
self.pe.net.struct.MinorRuntimeVersion,
)
def get_meta_version_string(self) -> str:
assert self.pe.net is not None


@@ -83,7 +83,7 @@ def bb_contains_stackstring(bb: ghidra.program.model.block.CodeBlock) -> bool:
true if basic block contains enough moves of constant bytes to the stack
"""
count = 0
for insn in currentProgram().getListing().getInstructions(bb, True): # type: ignore [name-defined] # noqa: F821
for insn in capa.features.extractors.ghidra.helpers.get_current_program().getListing().getInstructions(bb, True):
if is_mov_imm_to_stack(insn):
count += get_printable_len(insn.getScalar(1))
if count > MIN_STACKSTRING_LEN:
@@ -96,7 +96,9 @@ def _bb_has_tight_loop(bb: ghidra.program.model.block.CodeBlock):
parse tight loops, true if last instruction in basic block branches to bb start
"""
# Reverse Ordered, first InstructionDB
last_insn = currentProgram().getListing().getInstructions(bb, False).next() # type: ignore [name-defined] # noqa: F821
last_insn = (
capa.features.extractors.ghidra.helpers.get_current_program().getListing().getInstructions(bb, False).next()
)
if last_insn.getFlowType().isJump():
return last_insn.getAddress(0) == bb.getMinAddress()
@@ -140,20 +142,3 @@ def extract_features(fh: FunctionHandle, bbh: BBHandle) -> Iterator[tuple[Featur
for bb_handler in BASIC_BLOCK_HANDLERS:
for feature, addr in bb_handler(fh, bbh):
yield feature, addr
def main():
features = []
from capa.features.extractors.ghidra.extractor import GhidraFeatureExtractor
for fh in GhidraFeatureExtractor().get_functions():
for bbh in capa.features.extractors.ghidra.helpers.get_function_blocks(fh):
features.extend(list(extract_features(fh, bbh)))
import pprint
pprint.pprint(features) # noqa: T203
if __name__ == "__main__":
main()


@@ -0,0 +1,44 @@
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Optional
class GhidraContext:
"""
State holder for the Ghidra backend to avoid passing state to every function.
PyGhidra uses a context manager to set up the Ghidra environment (program, transaction, etc.).
We store the relevant objects here to allow easy access throughout the extractor
without needing to pass them as arguments to every feature extraction method.
"""
def __init__(self, program, flat_api, monitor):
self.program = program
self.flat_api = flat_api
self.monitor = monitor
_context: Optional[GhidraContext] = None
def set_context(program, flat_api, monitor):
global _context
_context = GhidraContext(program, flat_api, monitor)
def get_context() -> GhidraContext:
if _context is None:
raise RuntimeError("GhidraContext not initialized")
return _context
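
The context-holder pattern above can be exercised without Ghidra at all. A minimal self-contained sketch follows; the dummy objects stand in for the real program, flat API, and monitor that PyGhidra would supply:

```python
from typing import Optional


class GhidraContext:
    """Minimal re-creation of the context holder, for illustration only."""

    def __init__(self, program, flat_api, monitor):
        self.program = program
        self.flat_api = flat_api
        self.monitor = monitor


_context: Optional[GhidraContext] = None


def set_context(program, flat_api, monitor):
    global _context
    _context = GhidraContext(program, flat_api, monitor)


def get_context() -> GhidraContext:
    # fail loudly if the backend is used before PyGhidra set things up
    if _context is None:
        raise RuntimeError("GhidraContext not initialized")
    return _context


# before initialization, get_context() raises
try:
    get_context()
except RuntimeError:
    print("not initialized")

# any objects work here; in the real extractor these come from PyGhidra
set_context(program="dummy-program", flat_api=object(), monitor=object())
print(get_context().program)
```

This is why the extractor modules can call `get_current_program()` and friends freely: the context is process-global state, set once at startup.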

View File

@@ -12,11 +12,14 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import weakref
import contextlib
from typing import Iterator
import capa.features.extractors.ghidra.file
import capa.features.extractors.ghidra.insn
import capa.features.extractors.ghidra.global_
import capa.features.extractors.ghidra.helpers as ghidra_helpers
import capa.features.extractors.ghidra.function
import capa.features.extractors.ghidra.basicblock
from capa.features.common import Feature
@@ -31,19 +34,20 @@ from capa.features.extractors.base_extractor import (
class GhidraFeatureExtractor(StaticFeatureExtractor):
def __init__(self):
import capa.features.extractors.ghidra.helpers as ghidra_helpers
def __init__(self, ctx_manager=None, tmpdir=None):
self.ctx_manager = ctx_manager
self.tmpdir = tmpdir
super().__init__(
SampleHashes(
md5=capa.ghidra.helpers.get_file_md5(),
md5=ghidra_helpers.get_current_program().getExecutableMD5(),
# ghidra doesn't expose this hash.
# https://ghidra.re/ghidra_docs/api/ghidra/program/model/listing/Program.html
#
# the hashes are stored in the database, not computed on the fly,
# so it's probably not trivial to add SHA1.
sha1="",
sha256=capa.ghidra.helpers.get_file_sha256(),
sha256=ghidra_helpers.get_current_program().getExecutableSHA256(),
)
)
@@ -55,8 +59,14 @@ class GhidraFeatureExtractor(StaticFeatureExtractor):
self.externs = ghidra_helpers.get_file_externs()
self.fakes = ghidra_helpers.map_fake_import_addrs()
# Register cleanup to run when the extractor is garbage collected or when the program exits.
# We use weakref.finalize instead of __del__ to avoid issues with reference cycles and
# to ensure deterministic cleanup on interpreter shutdown.
if self.ctx_manager or self.tmpdir:
weakref.finalize(self, cleanup, self.ctx_manager, self.tmpdir)
def get_base_address(self):
return AbsoluteVirtualAddress(currentProgram().getImageBase().getOffset()) # type: ignore [name-defined] # noqa: F821
return AbsoluteVirtualAddress(ghidra_helpers.get_current_program().getImageBase().getOffset())
def extract_global_features(self):
yield from self.global_features
@@ -65,7 +75,6 @@ class GhidraFeatureExtractor(StaticFeatureExtractor):
yield from capa.features.extractors.ghidra.file.extract_features()
def get_functions(self) -> Iterator[FunctionHandle]:
import capa.features.extractors.ghidra.helpers as ghidra_helpers
for fhandle in ghidra_helpers.get_function_symbols():
fh: FunctionHandle = FunctionHandle(
@@ -77,14 +86,14 @@ class GhidraFeatureExtractor(StaticFeatureExtractor):
@staticmethod
def get_function(addr: int) -> FunctionHandle:
func = getFunctionContaining(toAddr(addr)) # type: ignore [name-defined] # noqa: F821
func = ghidra_helpers.get_flat_api().getFunctionContaining(ghidra_helpers.get_flat_api().toAddr(addr))
return FunctionHandle(address=AbsoluteVirtualAddress(func.getEntryPoint().getOffset()), inner=func)
def extract_function_features(self, fh: FunctionHandle) -> Iterator[tuple[Feature, Address]]:
yield from capa.features.extractors.ghidra.function.extract_features(fh)
def get_basic_blocks(self, fh: FunctionHandle) -> Iterator[BBHandle]:
import capa.features.extractors.ghidra.helpers as ghidra_helpers
yield from ghidra_helpers.get_function_blocks(fh)
@@ -92,9 +101,17 @@ class GhidraFeatureExtractor(StaticFeatureExtractor):
yield from capa.features.extractors.ghidra.basicblock.extract_features(fh, bbh)
def get_instructions(self, fh: FunctionHandle, bbh: BBHandle) -> Iterator[InsnHandle]:
import capa.features.extractors.ghidra.helpers as ghidra_helpers
yield from ghidra_helpers.get_insn_in_range(bbh)
def extract_insn_features(self, fh: FunctionHandle, bbh: BBHandle, ih: InsnHandle):
yield from capa.features.extractors.ghidra.insn.extract_features(fh, bbh, ih)
def cleanup(ctx_manager, tmpdir):
if ctx_manager:
with contextlib.suppress(Exception):
ctx_manager.__exit__(None, None, None)
if tmpdir:
with contextlib.suppress(Exception):
tmpdir.cleanup()
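
The `weakref.finalize` approach above can be demonstrated standalone. In this sketch, `FakeCtx` and `FakeTmpDir` are stand-ins for the real PyGhidra context manager and temporary directory:

```python
import weakref
import contextlib


def cleanup(ctx_manager, tmpdir):
    # mirror the extractor's cleanup: best-effort, exceptions suppressed
    if ctx_manager:
        with contextlib.suppress(Exception):
            ctx_manager.__exit__(None, None, None)
    if tmpdir:
        with contextlib.suppress(Exception):
            tmpdir.cleanup()


class FakeCtx:
    def __init__(self):
        self.exited = False

    def __exit__(self, *args):
        self.exited = True


class FakeTmpDir:
    def __init__(self):
        self.cleaned = False

    def cleanup(self):
        self.cleaned = True


class Extractor:
    def __init__(self, ctx, tmp):
        # cleanup runs when the extractor is garbage collected or at exit;
        # finalize avoids the reference-cycle pitfalls of __del__
        weakref.finalize(self, cleanup, ctx, tmp)


ctx, tmp = FakeCtx(), FakeTmpDir()
ex = Extractor(ctx, tmp)
del ex  # in CPython, the finalizer fires as soon as the object is collected
print(ctx.exited, tmp.cleaned)
```

Note that `weakref.finalize` also runs at interpreter shutdown if the object is still alive, which is the deterministic-cleanup guarantee the comment above relies on.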

View File

@@ -80,22 +80,54 @@ def extract_file_embedded_pe() -> Iterator[tuple[Feature, Address]]:
for i in range(256)
]
for block in currentProgram().getMemory().getBlocks(): # type: ignore [name-defined] # noqa: F821
for block in capa.features.extractors.ghidra.helpers.get_current_program().getMemory().getBlocks():
if not all((block.isLoaded(), block.isInitialized(), "Headers" not in block.getName())):
continue
for off, _ in find_embedded_pe(capa.features.extractors.ghidra.helpers.get_block_bytes(block), mz_xor):
# add offset back to block start
ea: int = block.getStart().add(off).getOffset()
ea_addr = block.getStart().add(off)
ea = ea_addr.getOffset()
f_offset = capa.features.extractors.ghidra.helpers.get_file_offset(ea_addr)
if f_offset != -1:
ea = f_offset
yield Characteristic("embedded pe"), FileOffsetAddress(ea)
def extract_file_export_names() -> Iterator[tuple[Feature, Address]]:
"""extract function exports"""
st = currentProgram().getSymbolTable() # type: ignore [name-defined] # noqa: F821
program = capa.features.extractors.ghidra.helpers.get_current_program()
st = program.getSymbolTable()
for addr in st.getExternalEntryPointIterator():
yield Export(st.getPrimarySymbol(addr).getName()), AbsoluteVirtualAddress(addr.getOffset())
sym = st.getPrimarySymbol(addr)
name = sym.getName()
# Check for forwarded export
is_forwarded = False
refs = program.getReferenceManager().getReferencesFrom(addr)
for ref in refs:
if ref.getToAddress().isExternalAddress():
ext_sym = st.getPrimarySymbol(ref.getToAddress())
if ext_sym:
ext_loc = program.getExternalManager().getExternalLocation(ext_sym)
if ext_loc:
# It is a forwarded export
libname = ext_loc.getLibraryName()
if libname.lower().endswith(".dll"):
libname = libname[:-4]
forwarded_name = f"{libname}.{ext_loc.getLabel()}"
forwarded_name = capa.features.extractors.helpers.reformat_forwarded_export_name(forwarded_name)
yield Export(forwarded_name), AbsoluteVirtualAddress(addr.getOffset())
yield Characteristic("forwarded export"), AbsoluteVirtualAddress(addr.getOffset())
is_forwarded = True
break
if not is_forwarded:
yield Export(name), AbsoluteVirtualAddress(addr.getOffset())
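
The name normalization in the forwarded-export branch is easy to check in isolation. A sketch of just the library-suffix handling; `reformat_forwarded_export_name`, which applies capa's final canonicalization, is omitted here:

```python
def forwarded_export_name(libname: str, label: str) -> str:
    # drop a trailing ".dll" (case-insensitive) from the library name,
    # then join it with the exported label, as in the extractor above
    if libname.lower().endswith(".dll"):
        libname = libname[:-4]
    return f"{libname}.{label}"


print(forwarded_export_name("NTDLL.DLL", "RtlDeleteCriticalSection"))
```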
def extract_file_import_names() -> Iterator[tuple[Feature, Address]]:
@@ -110,7 +142,7 @@ def extract_file_import_names() -> Iterator[tuple[Feature, Address]]:
- importname
"""
for f in currentProgram().getFunctionManager().getExternalFunctions(): # type: ignore [name-defined] # noqa: F821
for f in capa.features.extractors.ghidra.helpers.get_current_program().getFunctionManager().getExternalFunctions():
for r in f.getSymbol().getReferences():
if r.getReferenceType().isData():
addr = r.getFromAddress().getOffset() # gets pointer to fake external addr
@@ -126,14 +158,14 @@ def extract_file_import_names() -> Iterator[tuple[Feature, Address]]:
def extract_file_section_names() -> Iterator[tuple[Feature, Address]]:
"""extract section names"""
for block in currentProgram().getMemory().getBlocks(): # type: ignore [name-defined] # noqa: F821
for block in capa.features.extractors.ghidra.helpers.get_current_program().getMemory().getBlocks():
yield Section(block.getName()), AbsoluteVirtualAddress(block.getStart().getOffset())
def extract_file_strings() -> Iterator[tuple[Feature, Address]]:
"""extract ASCII and UTF-16 LE strings"""
for block in currentProgram().getMemory().getBlocks(): # type: ignore [name-defined] # noqa: F821
for block in capa.features.extractors.ghidra.helpers.get_current_program().getMemory().getBlocks():
if not block.isInitialized():
continue
@@ -153,7 +185,8 @@ def extract_file_function_names() -> Iterator[tuple[Feature, Address]]:
extract the names of statically-linked library functions.
"""
for sym in currentProgram().getSymbolTable().getAllSymbols(True): # type: ignore [name-defined] # noqa: F821
for sym in capa.features.extractors.ghidra.helpers.get_current_program().getSymbolTable().getAllSymbols(True):
# .isExternal() alone misses symbols that this combination of checks catches
if sym.getSymbolType() == SymbolType.FUNCTION and sym.getSource() == SourceType.ANALYSIS and sym.isGlobal():
name = sym.getName() # starts to resolve names based on Ghidra's FidDB
@@ -170,7 +203,7 @@ def extract_file_function_names() -> Iterator[tuple[Feature, Address]]:
def extract_file_format() -> Iterator[tuple[Feature, Address]]:
ef = currentProgram().getExecutableFormat() # type: ignore [name-defined] # noqa: F821
ef = capa.features.extractors.ghidra.helpers.get_current_program().getExecutableFormat()
if "PE" in ef:
yield Format(FORMAT_PE), NO_ADDRESS
elif "ELF" in ef:
@@ -198,14 +231,3 @@ FILE_HANDLERS = (
extract_file_function_names,
extract_file_format,
)
def main():
""" """
import pprint
pprint.pprint(list(extract_features())) # noqa: T203
if __name__ == "__main__":
main()

View File

@@ -26,21 +26,25 @@ from capa.features.extractors.base_extractor import FunctionHandle
def extract_function_calls_to(fh: FunctionHandle):
"""extract callers to a function"""
f: ghidra.program.database.function.FunctionDB = fh.inner
f: "ghidra.program.database.function.FunctionDB" = fh.inner
for ref in f.getSymbol().getReferences():
if ref.getReferenceType().isCall():
yield Characteristic("calls to"), AbsoluteVirtualAddress(ref.getFromAddress().getOffset())
def extract_function_loop(fh: FunctionHandle):
f: ghidra.program.database.function.FunctionDB = fh.inner
f: "ghidra.program.database.function.FunctionDB" = fh.inner
edges = []
for block in SimpleBlockIterator(BasicBlockModel(currentProgram()), f.getBody(), monitor()): # type: ignore [name-defined] # noqa: F821
dests = block.getDestinations(monitor()) # type: ignore [name-defined] # noqa: F821
for block in SimpleBlockIterator(
BasicBlockModel(capa.features.extractors.ghidra.helpers.get_current_program()),
f.getBody(),
capa.features.extractors.ghidra.helpers.get_monitor(),
):
dests = block.getDestinations(capa.features.extractors.ghidra.helpers.get_monitor())
s_addrs = block.getStartAddresses()
while dests.hasNext(): # For loop throws Python TypeError
while dests.hasNext():
for addr in s_addrs:
edges.append((addr.getOffset(), dests.next().getDestinationAddress().getOffset()))
@@ -49,32 +53,17 @@ def extract_function_loop(fh: FunctionHandle):
def extract_recursive_call(fh: FunctionHandle):
f: ghidra.program.database.function.FunctionDB = fh.inner
f: "ghidra.program.database.function.FunctionDB" = fh.inner
for func in f.getCalledFunctions(monitor()): # type: ignore [name-defined] # noqa: F821
for func in f.getCalledFunctions(capa.features.extractors.ghidra.helpers.get_monitor()):
if func.getEntryPoint().getOffset() == f.getEntryPoint().getOffset():
yield Characteristic("recursive call"), AbsoluteVirtualAddress(f.getEntryPoint().getOffset())
def extract_features(fh: FunctionHandle) -> Iterator[tuple[Feature, Address]]:
for func_handler in FUNCTION_HANDLERS:
for feature, addr in func_handler(fh):
for function_handler in FUNCTION_HANDLERS:
for feature, addr in function_handler(fh):
yield feature, addr
FUNCTION_HANDLERS = (extract_function_calls_to, extract_function_loop, extract_recursive_call)
def main():
""" """
features = []
for fhandle in capa.features.extractors.ghidra.helpers.get_function_symbols():
features.extend(list(extract_features(fhandle)))
import pprint
pprint.pprint(features) # noqa: T203
if __name__ == "__main__":
main()

View File

@@ -26,7 +26,7 @@ logger = logging.getLogger(__name__)
def extract_os() -> Iterator[tuple[Feature, Address]]:
format_name: str = currentProgram().getExecutableFormat() # type: ignore [name-defined] # noqa: F821
format_name: str = capa.features.extractors.ghidra.helpers.get_current_program().getExecutableFormat()
if "PE" in format_name:
yield OS(OS_WINDOWS), NO_ADDRESS
@@ -53,7 +53,7 @@ def extract_os() -> Iterator[tuple[Feature, Address]]:
def extract_arch() -> Iterator[tuple[Feature, Address]]:
lang_id = currentProgram().getMetadata().get("Language ID") # type: ignore [name-defined] # noqa: F821
lang_id = capa.features.extractors.ghidra.helpers.get_current_program().getMetadata().get("Language ID")
if "x86" in lang_id and "64" in lang_id:
yield Arch(ARCH_AMD64), NO_ADDRESS

View File

@@ -22,9 +22,22 @@ from ghidra.program.model.symbol import SourceType, SymbolType
from ghidra.program.model.address import AddressSpace
import capa.features.extractors.helpers
import capa.features.extractors.ghidra.context as ghidra_context
from capa.features.common import THUNK_CHAIN_DEPTH_DELTA
from capa.features.address import AbsoluteVirtualAddress
from capa.features.extractors.base_extractor import BBHandle, InsnHandle, FunctionHandle
from capa.features.extractors.base_extractor import BBHandle, InsnHandle
def get_current_program():
return ghidra_context.get_context().program
def get_monitor():
return ghidra_context.get_context().monitor
def get_flat_api():
return ghidra_context.get_context().flat_api
def ints_to_bytes(bytez: list[int]) -> bytes:
@@ -36,7 +49,7 @@ def ints_to_bytes(bytez: list[int]) -> bytes:
return bytes([b & 0xFF for b in bytez])
def find_byte_sequence(addr: ghidra.program.model.address.Address, seq: bytes) -> Iterator[int]:
def find_byte_sequence(addr: "ghidra.program.model.address.Address", seq: bytes) -> Iterator[int]:
"""yield all ea of a given byte sequence
args:
@@ -44,12 +57,25 @@ def find_byte_sequence(addr: ghidra.program.model.address.Address, seq: bytes) -
seq: bytes to search e.g. b"\x01\x03"
"""
seqstr = "".join([f"\\x{b:02x}" for b in seq])
eas = findBytes(addr, seqstr, java.lang.Integer.MAX_VALUE, 1) # type: ignore [name-defined] # noqa: F821
eas = get_flat_api().findBytes(addr, seqstr, java.lang.Integer.MAX_VALUE, 1)
yield from eas
def get_bytes(addr: ghidra.program.model.address.Address, length: int) -> bytes:
def get_file_offset(addr: "ghidra.program.model.address.Address") -> int:
"""get file offset for an address"""
block = get_current_program().getMemory().getBlock(addr)
if not block:
return -1
for info in block.getSourceInfos():
if info.contains(addr):
return info.getFileBytesOffset(addr)
return -1
def get_bytes(addr: "ghidra.program.model.address.Address", length: int) -> bytes:
"""yield length bytes at addr
args:
@@ -57,12 +83,12 @@ def get_bytes(addr: ghidra.program.model.address.Address, length: int) -> bytes:
length: length of bytes to pull
"""
try:
return ints_to_bytes(getBytes(addr, length)) # type: ignore [name-defined] # noqa: F821
except RuntimeError:
return ints_to_bytes(get_flat_api().getBytes(addr, int(length)))
except Exception:
return b""
def get_block_bytes(block: ghidra.program.model.mem.MemoryBlock) -> bytes:
def get_block_bytes(block: "ghidra.program.model.mem.MemoryBlock") -> bytes:
"""yield all bytes in a given block
args:
@@ -73,20 +99,21 @@ def get_block_bytes(block: ghidra.program.model.mem.MemoryBlock) -> bytes:
def get_function_symbols():
"""yield all non-external function symbols"""
yield from currentProgram().getFunctionManager().getFunctionsNoStubs(True) # type: ignore [name-defined] # noqa: F821
yield from get_current_program().getFunctionManager().getFunctionsNoStubs(True)
def get_function_blocks(fh: FunctionHandle) -> Iterator[BBHandle]:
"""yield BBHandle for each bb in a given function"""
def get_function_blocks(fh: "capa.features.extractors.base_extractor.FunctionHandle") -> Iterator[BBHandle]:
"""
yield the basic blocks of the function
"""
func: ghidra.program.database.function.FunctionDB = fh.inner
for bb in SimpleBlockIterator(BasicBlockModel(currentProgram()), func.getBody(), monitor()): # type: ignore [name-defined] # noqa: F821
yield BBHandle(address=AbsoluteVirtualAddress(bb.getMinAddress().getOffset()), inner=bb)
for block in SimpleBlockIterator(BasicBlockModel(get_current_program()), fh.inner.getBody(), get_monitor()):
yield BBHandle(address=AbsoluteVirtualAddress(block.getMinAddress().getOffset()), inner=block)
def get_insn_in_range(bbh: BBHandle) -> Iterator[InsnHandle]:
"""yield InshHandle for each insn in a given basicblock"""
for insn in currentProgram().getListing().getInstructions(bbh.inner, True): # type: ignore [name-defined] # noqa: F821
for insn in get_current_program().getListing().getInstructions(bbh.inner, True):
yield InsnHandle(address=AbsoluteVirtualAddress(insn.getAddress().getOffset()), inner=insn)
@@ -95,7 +122,7 @@ def get_file_imports() -> dict[int, list[str]]:
import_dict: dict[int, list[str]] = {}
for f in currentProgram().getFunctionManager().getExternalFunctions(): # type: ignore [name-defined] # noqa: F821
for f in get_current_program().getFunctionManager().getExternalFunctions():
for r in f.getSymbol().getReferences():
if r.getReferenceType().isData():
addr = r.getFromAddress().getOffset() # gets pointer to fake external addr
@@ -133,7 +160,7 @@ def get_file_externs() -> dict[int, list[str]]:
extern_dict: dict[int, list[str]] = {}
for sym in currentProgram().getSymbolTable().getAllSymbols(True): # type: ignore [name-defined] # noqa: F821
for sym in get_current_program().getSymbolTable().getAllSymbols(True):
# .isExternal() alone misses symbols that this combination of checks catches
if sym.getSymbolType() == SymbolType.FUNCTION and sym.getSource() == SourceType.ANALYSIS and sym.isGlobal():
name = sym.getName() # starts to resolve names based on Ghidra's FidDB
@@ -171,7 +198,7 @@ def map_fake_import_addrs() -> dict[int, list[int]]:
"""
fake_dict: dict[int, list[int]] = {}
for f in currentProgram().getFunctionManager().getExternalFunctions(): # type: ignore [name-defined] # noqa: F821
for f in get_current_program().getFunctionManager().getExternalFunctions():
for r in f.getSymbol().getReferences():
if r.getReferenceType().isData():
fake_dict.setdefault(f.getEntryPoint().getOffset(), []).append(r.getFromAddress().getOffset())
@@ -180,7 +207,7 @@ def map_fake_import_addrs() -> dict[int, list[int]]:
def check_addr_for_api(
addr: ghidra.program.model.address.Address,
addr: "ghidra.program.model.address.Address",
fakes: dict[int, list[int]],
imports: dict[int, list[str]],
externs: dict[int, list[str]],
@@ -202,18 +229,18 @@ def check_addr_for_api(
return False
def is_call_or_jmp(insn: ghidra.program.database.code.InstructionDB) -> bool:
def is_call_or_jmp(insn: "ghidra.program.database.code.InstructionDB") -> bool:
return any(mnem in insn.getMnemonicString() for mnem in ["CALL", "J"]) # JMP, JNE, JNZ, etc
def is_sp_modified(insn: ghidra.program.database.code.InstructionDB) -> bool:
def is_sp_modified(insn: "ghidra.program.database.code.InstructionDB") -> bool:
for i in range(insn.getNumOperands()):
if insn.getOperandType(i) == OperandType.REGISTER:
return "SP" in insn.getRegister(i).getName() and insn.getOperandRefType(i).isWrite()
return False
def is_stack_referenced(insn: ghidra.program.database.code.InstructionDB) -> bool:
def is_stack_referenced(insn: "ghidra.program.database.code.InstructionDB") -> bool:
"""generic catch-all for stack references"""
for i in range(insn.getNumOperands()):
if insn.getOperandType(i) == OperandType.REGISTER:
@@ -225,7 +252,7 @@ def is_stack_referenced(insn: ghidra.program.database.code.InstructionDB) -> boo
return any(ref.isStackReference() for ref in insn.getReferencesFrom())
def is_zxor(insn: ghidra.program.database.code.InstructionDB) -> bool:
def is_zxor(insn: "ghidra.program.database.code.InstructionDB") -> bool:
# assume XOR insn
# XOR's against the same operand zero out
ops = []
@@ -241,29 +268,29 @@ def is_zxor(insn: ghidra.program.database.code.InstructionDB) -> bool:
return all(n == operands[0] for n in operands)
def handle_thunk(addr: ghidra.program.model.address.Address):
def handle_thunk(addr: "ghidra.program.model.address.Address"):
"""Follow thunk chains down to a reasonable depth"""
ref = addr
for _ in range(THUNK_CHAIN_DEPTH_DELTA):
thunk_jmp = getInstructionAt(ref) # type: ignore [name-defined] # noqa: F821
thunk_jmp = get_flat_api().getInstructionAt(ref)
if thunk_jmp and is_call_or_jmp(thunk_jmp):
if OperandType.isAddress(thunk_jmp.getOperandType(0)):
ref = thunk_jmp.getAddress(0)
else:
thunk_dat = getDataContaining(ref) # type: ignore [name-defined] # noqa: F821
thunk_dat = get_flat_api().getDataContaining(ref)
if thunk_dat and thunk_dat.isDefined() and thunk_dat.isPointer():
ref = thunk_dat.getValue()
break # end of thunk chain reached
return ref
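
The bounded thunk-chain walk above can be modeled abstractly. A sketch using a plain dict as the jump-target map; `THUNK_CHAIN_DEPTH_DELTA` is the same depth constant (5) imported from `capa.features.common`:

```python
THUNK_CHAIN_DEPTH_DELTA = 5  # matches capa.features.common


def resolve_thunk(addr: int, jumps: dict[int, int]) -> int:
    # follow at most THUNK_CHAIN_DEPTH_DELTA hops; stop as soon as the
    # current address has no outgoing jump (end of the thunk chain)
    ref = addr
    for _ in range(THUNK_CHAIN_DEPTH_DELTA):
        if ref in jumps:
            ref = jumps[ref]
        else:
            break  # end of thunk chain reached
    return ref


# 0x1000 -> 0x2000 -> 0x3000 (the real function)
chain = {0x1000: 0x2000, 0x2000: 0x3000}
print(hex(resolve_thunk(0x1000, chain)))
```

The fixed depth bound means a cyclic chain (a thunk jumping to itself) still terminates, which is the reason for the loop instead of unbounded recursion.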
def dereference_ptr(insn: ghidra.program.database.code.InstructionDB):
def dereference_ptr(insn: "ghidra.program.database.code.InstructionDB"):
addr_code = OperandType.ADDRESS | OperandType.CODE
to_deref = insn.getAddress(0)
dat = getDataContaining(to_deref) # type: ignore [name-defined] # noqa: F821
dat = get_flat_api().getDataContaining(to_deref)
if insn.getOperandType(0) == addr_code:
thfunc = getFunctionContaining(to_deref) # type: ignore [name-defined] # noqa: F821
thfunc = get_flat_api().getFunctionContaining(to_deref)
if thfunc and thfunc.isThunk():
return handle_thunk(to_deref)
else:
@@ -294,7 +321,7 @@ def find_data_references_from_insn(insn, max_depth: int = 10):
to_addr = reference.getToAddress()
for _ in range(max_depth - 1):
data = getDataAt(to_addr) # type: ignore [name-defined] # noqa: F821
data = get_flat_api().getDataAt(to_addr)
if data and data.isPointer():
ptr_value = data.getValue()

View File

@@ -234,7 +234,7 @@ def extract_insn_bytes_features(fh: FunctionHandle, bb: BBHandle, ih: InsnHandle
push offset iid_004118d4_IShellLinkA ; riid
"""
for addr in capa.features.extractors.ghidra.helpers.find_data_references_from_insn(ih.inner):
data = getDataAt(addr) # type: ignore [name-defined] # noqa: F821
data = capa.features.extractors.ghidra.helpers.get_flat_api().getDataAt(addr)
if data and not data.hasStringValue():
extracted_bytes = capa.features.extractors.ghidra.helpers.get_bytes(addr, MAX_BYTES_FEATURE_SIZE)
if extracted_bytes and not capa.features.extractors.helpers.all_zeros(extracted_bytes):
@@ -249,9 +249,9 @@ def extract_insn_string_features(fh: FunctionHandle, bb: BBHandle, ih: InsnHandl
push offset aAcr ; "ACR > "
"""
for addr in capa.features.extractors.ghidra.helpers.find_data_references_from_insn(ih.inner):
data = getDataAt(addr) # type: ignore [name-defined] # noqa: F821
data = capa.features.extractors.ghidra.helpers.get_flat_api().getDataAt(addr)
if data and data.hasStringValue():
yield String(data.getValue()), ih.address
yield String(str(data.getValue())), ih.address
def extract_insn_mnemonic_features(
@@ -361,8 +361,8 @@ def extract_insn_cross_section_cflow(
if capa.features.extractors.ghidra.helpers.check_addr_for_api(ref, fakes, imports, externs):
return
this_mem_block = getMemoryBlock(insn.getAddress()) # type: ignore [name-defined] # noqa: F821
ref_block = getMemoryBlock(ref) # type: ignore [name-defined] # noqa: F821
this_mem_block = capa.features.extractors.ghidra.helpers.get_flat_api().getMemoryBlock(insn.getAddress())
ref_block = capa.features.extractors.ghidra.helpers.get_flat_api().getMemoryBlock(ref)
if ref_block != this_mem_block:
yield Characteristic("cross section flow"), ih.address
@@ -425,19 +425,19 @@ def check_nzxor_security_cookie_delta(
Check if insn within last addr of last bb - delta
"""
model = SimpleBlockModel(currentProgram()) # type: ignore [name-defined] # noqa: F821
model = SimpleBlockModel(capa.features.extractors.ghidra.helpers.get_current_program())
insn_addr = insn.getAddress()
func_asv = fh.getBody()
first_addr = func_asv.getMinAddress()
if insn_addr < first_addr.add(SECURITY_COOKIE_BYTES_DELTA):
first_bb = model.getFirstCodeBlockContaining(first_addr, monitor()) # type: ignore [name-defined] # noqa: F821
first_bb = model.getFirstCodeBlockContaining(first_addr, capa.features.extractors.ghidra.helpers.get_monitor())
if first_bb.contains(insn_addr):
return True
last_addr = func_asv.getMaxAddress()
if insn_addr > last_addr.add(SECURITY_COOKIE_BYTES_DELTA * -1):
last_bb = model.getFirstCodeBlockContaining(last_addr, monitor()) # type: ignore [name-defined] # noqa: F821
last_bb = model.getFirstCodeBlockContaining(last_addr, capa.features.extractors.ghidra.helpers.get_monitor())
if last_bb.contains(insn_addr):
return True
@@ -488,22 +488,3 @@ INSTRUCTION_HANDLERS = (
extract_function_calls_from,
extract_function_indirect_call_characteristic_features,
)
def main():
""" """
features = []
from capa.features.extractors.ghidra.extractor import GhidraFeatureExtractor
for fh in GhidraFeatureExtractor().get_functions():
for bb in capa.features.extractors.ghidra.helpers.get_function_blocks(fh):
for insn in capa.features.extractors.ghidra.helpers.get_insn_in_range(bb):
features.extend(list(extract_features(fh, bb, insn)))
import pprint
pprint.pprint(features) # noqa: T203
if __name__ == "__main__":
main()

View File

@@ -18,6 +18,7 @@ import idaapi
import idautils
import capa.features.extractors.ida.helpers
from capa.features.file import FunctionName
from capa.features.common import Feature, Characteristic
from capa.features.address import Address, AbsoluteVirtualAddress
from capa.features.extractors import loops
@@ -50,10 +51,39 @@ def extract_recursive_call(fh: FunctionHandle):
yield Characteristic("recursive call"), fh.address
def extract_function_name(fh: FunctionHandle) -> Iterator[tuple[Feature, Address]]:
ea = fh.inner.start_ea
name = idaapi.get_name(ea)
if name.startswith("sub_"):
# skip default names, like "sub_401000"
return
yield FunctionName(name), fh.address
if name.startswith("_"):
# some linkers may prefix linked routines with a `_` to avoid name collisions.
# extract features for both the mangled and un-mangled representations.
# e.g. `_fwrite` -> `fwrite`
# see: https://stackoverflow.com/a/2628384/87207
yield FunctionName(name[1:]), fh.address
def extract_function_alternative_names(fh: FunctionHandle):
"""Get all alternative names for an address."""
for aname in capa.features.extractors.ida.helpers.get_function_alternative_names(fh.inner.start_ea):
yield FunctionName(aname), fh.address
def extract_features(fh: FunctionHandle) -> Iterator[tuple[Feature, Address]]:
for func_handler in FUNCTION_HANDLERS:
for feature, addr in func_handler(fh):
yield feature, addr
FUNCTION_HANDLERS = (extract_function_calls_to, extract_function_loop, extract_recursive_call)
FUNCTION_HANDLERS = (
extract_function_calls_to,
extract_function_loop,
extract_recursive_call,
extract_function_name,
extract_function_alternative_names,
)
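
The name-feature logic above (skip default `sub_` names, emit both mangled and unmangled forms for `_`-prefixed routines) is pure string handling. A self-contained sketch of that policy:

```python
def function_name_features(name: str) -> list[str]:
    # skip disassembler default names, like "sub_401000"
    if name.startswith("sub_"):
        return []
    names = [name]
    # some linkers prefix linked routines with "_" to avoid name collisions;
    # emit both representations, e.g. _fwrite -> fwrite
    if name.startswith("_"):
        names.append(name[1:])
    return names


print(function_name_features("_fwrite"))
print(function_name_features("sub_401000"))
```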

View File

@@ -20,6 +20,7 @@ import idaapi
import ida_nalt
import idautils
import ida_bytes
import ida_funcs
import ida_segment
from capa.features.address import AbsoluteVirtualAddress
@@ -436,3 +437,16 @@ def is_basic_block_return(bb: idaapi.BasicBlock) -> bool:
def has_sib(oper: idaapi.op_t) -> bool:
# via: https://reverseengineering.stackexchange.com/a/14300
return oper.specflag1 == 1
def find_alternative_names(cmt: str):
for line in cmt.split("\n"):
if line.startswith("Alternative name is '") and line.endswith("'"):
name = line[len("Alternative name is '") : -1] # Extract name between quotes
yield name
def get_function_alternative_names(fva: int):
"""Get all alternative names for an address."""
yield from find_alternative_names(ida_bytes.get_cmt(fva, False) or "")
yield from find_alternative_names(ida_funcs.get_func_cmt(idaapi.get_func(fva), False) or "")
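
The comment parsing above is plain string handling and can be tested without IDA. A self-contained sketch with a sample comment in the format the helper expects:

```python
def find_alternative_names(cmt: str):
    # yield each name recorded in a comment line of the form:
    #   Alternative name is '<name>'
    for line in cmt.split("\n"):
        if line.startswith("Alternative name is '") and line.endswith("'"):
            yield line[len("Alternative name is '") : -1]


cmt = "Alternative name is 'MsgWaitForMultipleObjects'\nsome other note"
print(list(find_alternative_names(cmt)))
```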

View File

@@ -22,6 +22,7 @@ import idautils
import capa.features.extractors.helpers
import capa.features.extractors.ida.helpers
from capa.features.file import FunctionName
from capa.features.insn import API, MAX_STRUCTURE_SIZE, Number, Offset, Mnemonic, OperandNumber, OperandOffset
from capa.features.common import MAX_BYTES_FEATURE_SIZE, THUNK_CHAIN_DEPTH_DELTA, Bytes, String, Feature, Characteristic
from capa.features.address import Address, AbsoluteVirtualAddress
@@ -129,8 +130,8 @@ def extract_insn_api_features(fh: FunctionHandle, bbh: BBHandle, ih: InsnHandle)
# not a function (start)
return
if target_func.flags & idaapi.FUNC_LIB:
name = idaapi.get_name(target_func.start_ea)
name = idaapi.get_name(target_func.start_ea)
if target_func.flags & idaapi.FUNC_LIB or not name.startswith("sub_"):
yield API(name), ih.address
if name.startswith("_"):
# some linkers may prefix linked routines with a `_` to avoid name collisions.
@@ -139,6 +140,10 @@ def extract_insn_api_features(fh: FunctionHandle, bbh: BBHandle, ih: InsnHandle)
# see: https://stackoverflow.com/a/2628384/87207
yield API(name[1:]), ih.address
for altname in capa.features.extractors.ida.helpers.get_function_alternative_names(target_func.start_ea):
yield FunctionName(altname), ih.address
yield API(altname), ih.address
def extract_insn_number_features(
fh: FunctionHandle, bbh: BBHandle, ih: InsnHandle

View File

@@ -56,7 +56,7 @@ def get_previous_instructions(vw: VivWorkspace, va: int) -> list[int]:
if ploc is not None:
# from vivisect.const:
# location: (L_VA, L_SIZE, L_LTYPE, L_TINFO)
(pva, _, ptype, pinfo) = ploc
pva, _, ptype, pinfo = ploc
if ptype == LOC_OP and not (pinfo & IF_NOFALL):
ret.append(pva)

View File

@@ -176,7 +176,7 @@ def extract_insn_api_features(fh: FunctionHandle, bb, ih: InsnHandle) -> Iterato
elif isinstance(insn.opers[0], envi.archs.i386.disasm.i386RegOper):
try:
(_, target) = resolve_indirect_call(f.vw, insn.va, insn=insn)
_, target = resolve_indirect_call(f.vw, insn.va, insn=insn)
except NotFoundError:
# not able to resolve the indirect call, sorry
return

View File

@@ -96,14 +96,7 @@ class VMRayAnalysis:
% (self.submission_name, self.submission_type)
)
if self.submission_static is not None:
if self.submission_static.pe is None and self.submission_static.elf is None:
# we only support static analysis for PE and ELF files for now
raise UnsupportedFormatError(
"archive does not contain a supported file format (submission_name: %s, submission_type: %s)"
% (self.submission_name, self.submission_type)
)
else:
if self.submission_static is None:
# VMRay may not record static analysis for certain file types, e.g. MSI, but we'd still like to match dynamic
# execution so we continue without and accept that the results may be incomplete
logger.warning(

View File

@@ -26,6 +26,16 @@ from capa.features.extractors.base_extractor import CallHandle, ThreadHandle, Pr
logger = logging.getLogger(__name__)
VOID_PTR_NUMBER_PARAMS = frozenset(
{
"hKey",
"hKeyRoot",
"hkResult",
"samDesired",
}
)
def get_call_param_features(param: Param, ch: CallHandle) -> Iterator[tuple[Feature, Address]]:
if param.deref is not None:
# pointer types contain a special "deref" member that stores the deref'd value
@@ -39,10 +49,31 @@ def get_call_param_features(param: Param, ch: CallHandle) -> Iterator[tuple[Feat
# parsing the data up to here results in double-escaped backslashes, remove those here
yield String(param.deref.value.replace("\\\\", "\\")), ch.address
else:
logger.debug("skipping deref param type %s", param.deref.type_)
if param.name in VOID_PTR_NUMBER_PARAMS:
try:
yield Number(hexint(param.deref.value)), ch.address
except (ValueError, TypeError) as e:
logger.debug(
"failed to parse whitelisted void_ptr param %s value %s: %s",
param.name,
param.deref.value,
e,
)
else:
logger.debug("skipping deref param type %s", param.deref.type_)
elif param.value is not None:
if param.type_ in PARAM_TYPE_INT:
yield Number(hexint(param.value)), ch.address
elif param.type_ == "void_ptr" and param.name in VOID_PTR_NUMBER_PARAMS:
try:
yield Number(hexint(param.value)), ch.address
except (ValueError, TypeError) as e:
logger.debug(
"failed to parse whitelisted void_ptr param %s value %s: %s",
param.name,
param.value,
e,
)
def extract_call_features(ph: ProcessHandle, th: ThreadHandle, ch: CallHandle) -> Iterator[tuple[Feature, Address]]:
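
The whitelist logic for pointer-typed handle parameters can be sketched in isolation. Here `hexint` is a stand-in for capa's helper, assumed to accept decimal and `0x`-prefixed strings; the names and behavior below are illustrative, not the exact extractor code:

```python
VOID_PTR_NUMBER_PARAMS = frozenset({"hKey", "hKeyRoot", "hkResult", "samDesired"})


def hexint(value: str) -> int:
    # stand-in for capa's helper: accept decimal or 0x-prefixed hex strings
    return int(value, 16) if value.startswith("0x") else int(value)


def param_number(name: str, type_: str, value: str):
    # emit a Number only for whitelisted handle-like void_ptr params;
    # on unparseable values the extractor logs and continues, so return None
    if type_ == "void_ptr" and name in VOID_PTR_NUMBER_PARAMS:
        try:
            return hexint(value)
        except (ValueError, TypeError):
            return None
    return None


print(param_number("hKey", "void_ptr", "0x80000002"))  # HKEY_LOCAL_MACHINE
print(param_number("lpData", "void_ptr", "0x0"))       # not whitelisted
```

The whitelist keeps arbitrary pointer values from polluting the feature set while still letting rules match well-known handle constants like registry hives.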

View File

@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from typing import Iterator
from pathlib import Path
@@ -39,6 +39,8 @@ from capa.features.extractors.base_extractor import (
DynamicFeatureExtractor,
)
logger = logging.getLogger(__name__)
def get_formatted_params(params: ParamList) -> list[str]:
params_list: list[str] = []
@@ -87,6 +89,16 @@ class VMRayExtractor(DynamicFeatureExtractor):
def get_processes(self) -> Iterator[ProcessHandle]:
for monitor_process in self.analysis.monitor_processes.values():
# skip invalid/incomplete monitor process entries, see #2807
if monitor_process.pid == 0 or not monitor_process.filename:
logger.debug(
"skipping incomplete process entry: pid=%d, filename=%s, monitor_id=%d",
monitor_process.pid,
monitor_process.filename,
monitor_process.monitor_id,
)
continue
address: ProcessAddress = ProcessAddress(pid=monitor_process.pid, ppid=monitor_process.ppid)
yield ProcessHandle(address, inner=monitor_process)

View File

@@ -1,107 +1,75 @@
<div align="center">
<img src="../../doc/img/ghidra_backend_logo.png" width=240 height=125>
</div>
# capa analysis using Ghidra
# capa + Ghidra
capa supports using Ghidra (via [PyGhidra](https://github.com/NationalSecurityAgency/ghidra/tree/master/Ghidra/Features/PyGhidra)) as a feature extraction backend. This enables you to run capa against binaries using Ghidra's analysis engine.
[capa](https://github.com/mandiant/capa) is the FLARE team's open-source tool that detects capabilities in executable files. [Ghidra](https://github.com/NationalSecurityAgency/ghidra) is an open-source software reverse engineering framework created and maintained by the National Security Agency Research Directorate. capa + Ghidra brings capa's detection capabilities directly to Ghidra's user interface, helping speed up your reverse engineering tasks by identifying what parts of a program suggest interesting behavior, such as setting a registry value. You can execute the included Python 3 scripts [capa_explorer.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_explorer.py) or [capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py) to run capa's analysis and view the results in Ghidra. You may be asking yourself, “Python 3 scripts in Ghidra?”. You read that correctly. This integration is written entirely in Python 3 and relies on [Ghidrathon](https://github.com/mandiant/ghidrathon), an open-source Ghidra extension that adds Python 3 scripting to Ghidra.
Check out our capa + Ghidra blog posts:
* [Riding Dragons: capa Harnesses Ghidra](https://www.mandiant.com/resources/blog/capa-harnesses-ghidra)
## UI Integration
[capa_explorer.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_explorer.py) renders capa results in Ghidra's UI to help you quickly navigate them. This includes adding matched functions to Ghidra's Symbol Tree and Bookmarks windows and adding comments to functions that indicate matched capabilities and features. You can execute this script using Ghidra's Script Manager window.
### Symbol Tree Window
Matched functions are added to Ghidra's Symbol Tree window under a custom namespace that maps to the capabilities' [capa namespace](https://github.com/mandiant/capa-rules/blob/master/doc/format.md#rule-namespace).
<div align="center">
<img src="https://github.com/mandiant/capa/assets/66766340/eeae33f4-99d4-42dc-a5e8-4c1b8c661492" width=300>
</div>
### Comments
Comments are added at the beginning of matched functions indicating matched capabilities, and inline comments are added to functions indicating matched features. You can view these comments in Ghidra's Disassembly Listing and Decompile windows.
<div align="center">
<img src="https://github.com/mandiant/capa/assets/66766340/bb2b4170-7fd4-45fc-8c7b-ff8f2e2f101b" width=1000>
</div>
### Bookmarks
Bookmarks are added to functions that matched a capability that is mapped to a MITRE ATT&CK and/or Malware Behavior Catalog (MBC) technique. You can view these bookmarks in Ghidra's Bookmarks window.
<div align="center">
<img src="https://github.com/mandiant/capa/assets/66766340/7f9a66a9-7be7-4223-91c6-4b8fc4651336" width=825>
</div>
## Text-based Integration
[capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py) outputs text-based capa results that mirror the output of capa's standalone tool. You can execute this script using Ghidra's Script Manager and view its output in Ghidra's Console window.
<div align="center">
<img src="../../doc/img/ghidra_script_mngr_output.png" width=700>
</div>
You can also execute [capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py) using Ghidra's Headless Analyzer to view its output in a terminal window.
<div align="center">
<img src="../../doc/img/ghidra_headless_analyzer.png">
</div>
# Getting Started
## Requirements
| Tool | Version | Source |
|------------|---------|--------|
| capa | `>= 7.0.0` | https://github.com/mandiant/capa/releases |
| Ghidrathon | `>= 3.0.0` | https://github.com/mandiant/Ghidrathon/releases |
| Ghidra | `>= 10.3.2` | https://github.com/NationalSecurityAgency/ghidra/releases |
| Python | `>= 3.10.0` | https://www.python.org/downloads |
## Installation
**Note**: capa + Ghidra relies on [Ghidrathon](https://github.com/mandiant/ghidrathon) to execute Python 3 code in Ghidra. You must first install and configure Ghidrathon using the [steps outlined in its README](https://github.com/mandiant/ghidrathon?tab=readme-ov-file#installing-ghidrathon). Then, you must use the Python 3 interpreter that you configured with Ghidrathon to complete the following steps:
1. Install capa and its dependencies from PyPI using the following command:
```bash
$ pip install flare-capa
$ capa -b ghidra Practical\ Malware\ Analysis\ Lab\ 01-01.exe_
┌──────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ md5 │ bb7425b82141a1c0f7d60e5106676bb1 │
│ sha1 │ │
│ sha256 │ 58898bd42c5bd3bf9b1389f0eee5b39cd59180e8370eb9ea838a0b327bd6fe47 │
│ analysis │ static │
│ os │ windows │
│ format │ pe │
│ arch │ i386 │
│ path │ ~/Documents/capa/tests/data/Practical Malware Analysis Lab 01-01.exe_ │
└──────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ATT&CK Tactic ┃ ATT&CK Technique ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ DISCOVERY │ File and Directory Discovery [T1083]
└────────────────────────────────────┴─────────────────────────────────────────────────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ MBC Objective ┃ MBC Behavior ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ DISCOVERY │ File and Directory Discovery [E1083]
│ FILE SYSTEM │ Copy File [C0045]
│ │ Read File [C0051]
│ PROCESS │ Terminate Process [C0018]
└────────────────────────────────────┴─────────────────────────────────────────────────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Capability ┃ Namespace ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ copy file │ host-interaction/file-system/copy │
│ enumerate files recursively │ host-interaction/file-system/files/list │
│ read file via mapping (2 matches) │ host-interaction/file-system/read │
│ terminate process (2 matches) │ host-interaction/process/terminate │
│ resolve function by parsing PE exports │ load-code/pe │
└────────────────────────────────────────────────┴─────────────────────────────────────────────────┘
```
2. Download and extract the [official capa rules](https://github.com/mandiant/capa-rules/releases) that match the capa version you have installed. You can use the following command to view the version of capa you have installed:
## getting started
### requirements
- [Ghidra](https://github.com/NationalSecurityAgency/ghidra) >= 12.0 must be installed and available via the `GHIDRA_INSTALL_DIR` environment variable.
#### standalone binary (recommended)
The capa [standalone binary](https://github.com/mandiant/capa/releases) is the preferred way to run capa with the Ghidra backend.
Although the binary does not bundle the Java environment or Ghidra itself, it will dynamically load them at runtime.
#### python package
You can also use the Ghidra backend with the capa Python package by installing `flare-capa` with the `ghidra` extra.
```bash
$ pip show flare-capa
OR
$ capa --version
$ pip install "flare-capa[ghidra]"
```
3. Copy [capa_explorer.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_explorer.py) and [capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py) to your `ghidra_scripts` directory or manually add the parent directory of each script using Ghidra's Script Manager.
### usage
## Usage
To use the Ghidra backend, specify it with the `-b` or `--backend` flag:
You can execute [capa_explorer.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_explorer.py) and [capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py) using Ghidra's Script Manager. [capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py) can also be executed using Ghidra's Headless Analyzer.
### Execution using Ghidra's Script Manager
You can execute [capa_explorer.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_explorer.py) and [capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py) using Ghidra's Script Manager as follows:
1. Navigate to `Window > Script Manager`
2. Expand the `Python 3 > capa` category
3. Double-click a script to execute it
Both scripts ask you to provide the path of your capa rules directory (see installation step 2). [capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py) also asks you to choose one of the `default`, `verbose`, and `vverbose` output formats, which mirror the output formats of capa's standalone tool.
### Execution using Ghidra's Headless Analyzer
You can execute [capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py) using Ghidra's Headless Analyzer by invoking the `analyzeHeadless` script included with Ghidra in its `support` directory. The following arguments must be provided:
| Argument | Description |
|----|----|
|`<project_path>`| Path to Ghidra project|
| `<project_name>`| Name of Ghidra Project|
| `-Process <sample_name>` OR `-Import <sample_path>`| Name of sample `<sample_name>` already imported into `<project_name>` OR absolute path of sample `<sample_path>` to import into `<project_name>`|
| `-ScriptPath <script_path>`| OPTIONAL parent directory `<script_path>` of [capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py)|
| `-PostScript capa_ghidra.py`| Execute [capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py) after Ghidra analysis|
| `"<script_args>"`| Quoted string `"<script_args>"` containing script arguments passed to [capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py) that must specify a capa rules path and optionally the output format (`--verbose`, `--vverbose`, `--json`) you can specify `”help”` to view the scripts help message |
The following is an example of combining these arguments into a single `analyzeHeadless` script command:
```bash
$ analyzeHeadless /home/wumbo/demo demo -Import /home/wumbo/capa/tests/data/Practical\ Malware\ Analysis\ Lab\ 01-01.dll_ -PostScript capa_ghidra.py "/home/wumbo/capa/rules --verbose"
$ capa -b ghidra /path/to/sample
```
capa will:
1. Initialize a headless Ghidra instance.
2. Create a temporary project.
3. Import and analyze the sample.
4. Extract features and match rules.
5. Clean up the temporary project.
**Note:** The first time you run this, it may take a few moments to initialize the Ghidra environment.

View File

@@ -1,174 +0,0 @@
# Run capa against loaded Ghidra database and render results in Ghidra Console window
# @author Mike Hunhoff (mehunhoff@google.com)
# @category Python 3.capa
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import logging
import pathlib
import argparse
import capa
import capa.main
import capa.rules
import capa.ghidra.helpers
import capa.render.default
import capa.capabilities.common
import capa.features.extractors.ghidra.extractor
logger = logging.getLogger("capa_ghidra")
def run_headless():
parser = argparse.ArgumentParser(description="The FLARE team's open-source tool to integrate capa with Ghidra.")
parser.add_argument(
"rules",
type=str,
help="path to rule file or directory",
)
parser.add_argument(
"-v", "--verbose", action="store_true", help="enable verbose result document (no effect with --json)"
)
parser.add_argument(
"-vv", "--vverbose", action="store_true", help="enable very verbose result document (no effect with --json)"
)
parser.add_argument("-d", "--debug", action="store_true", help="enable debugging output on STDERR")
parser.add_argument("-q", "--quiet", action="store_true", help="disable all output but errors")
parser.add_argument("-j", "--json", action="store_true", help="emit JSON instead of text")
script_args = list(getScriptArgs()) # type: ignore [name-defined] # noqa: F821
if not script_args or len(script_args) > 1:
script_args = []
else:
script_args = script_args[0].split()
for idx, arg in enumerate(script_args):
if arg.lower() == "help":
script_args[idx] = "--help"
args = parser.parse_args(args=script_args)
if args.quiet:
logging.basicConfig(level=logging.WARNING)
logging.getLogger().setLevel(logging.WARNING)
elif args.debug:
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
else:
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
logger.debug("running in Ghidra headless mode")
rules_path = pathlib.Path(args.rules)
logger.debug("rule path: %s", rules_path)
rules = capa.rules.get_rules([rules_path])
meta = capa.ghidra.helpers.collect_metadata([rules_path])
extractor = capa.features.extractors.ghidra.extractor.GhidraFeatureExtractor()
capabilities = capa.capabilities.common.find_capabilities(rules, extractor, False)
meta.analysis.feature_counts = capabilities.feature_counts
meta.analysis.library_functions = capabilities.library_functions
meta.analysis.layout = capa.loader.compute_layout(rules, extractor, capabilities.matches)
if capa.capabilities.common.has_static_limitation(rules, capabilities, is_standalone=True):
logger.info("capa encountered warnings during analysis")
if args.json:
print(capa.render.json.render(meta, rules, capabilities.matches)) # noqa: T201
elif args.vverbose:
print(capa.render.vverbose.render(meta, rules, capabilities.matches)) # noqa: T201
elif args.verbose:
print(capa.render.verbose.render(meta, rules, capabilities.matches)) # noqa: T201
else:
print(capa.render.default.render(meta, rules, capabilities.matches)) # noqa: T201
return 0
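The `getScriptArgs()` normalization at the top of `run_headless()` can be exercised in isolation: Ghidra passes all script arguments as a single quoted string, which is split on whitespace, with a bare `help` rewritten to argparse's `--help`. The `normalize_script_args` helper name is hypothetical.

```python
def normalize_script_args(raw_args: list[str]) -> list[str]:
    # Ghidra hands the script zero or more argument strings; only the
    # single-quoted-string form is supported, everything else yields no args
    if not raw_args or len(raw_args) > 1:
        return []
    args = raw_args[0].split()
    # allow users to type "help" without the leading dashes
    return ["--help" if a.lower() == "help" else a for a in args]


print(normalize_script_args(["/home/user/capa/rules --verbose"]))  # ['/home/user/capa/rules', '--verbose']
print(normalize_script_args(["help"]))                             # ['--help']
print(normalize_script_args([]))                                   # []
```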
def run_ui():
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
rules_dir: str = ""
try:
selected_dir = askDirectory("Choose capa rules directory", "Ok") # type: ignore [name-defined] # noqa: F821
if selected_dir:
rules_dir = selected_dir.getPath()
except RuntimeError:
# RuntimeError thrown when user selects "Cancel"
pass
if not rules_dir:
logger.info("You must choose a capa rules directory before running capa.")
return capa.main.E_MISSING_RULES
verbose = askChoice( # type: ignore [name-defined] # noqa: F821
"capa output verbosity", "Choose capa output verbosity", ["default", "verbose", "vverbose"], "default"
)
rules_path: pathlib.Path = pathlib.Path(rules_dir)
logger.info("running capa using rules from %s", str(rules_path))
rules = capa.rules.get_rules([rules_path])
meta = capa.ghidra.helpers.collect_metadata([rules_path])
extractor = capa.features.extractors.ghidra.extractor.GhidraFeatureExtractor()
capabilities = capa.capabilities.common.find_capabilities(rules, extractor, True)
meta.analysis.feature_counts = capabilities.feature_counts
meta.analysis.library_functions = capabilities.library_functions
meta.analysis.layout = capa.loader.compute_layout(rules, extractor, capabilities.matches)
if capa.capabilities.common.has_static_limitation(rules, capabilities, is_standalone=False):
logger.info("capa encountered warnings during analysis")
if verbose == "vverbose":
print(capa.render.vverbose.render(meta, rules, capabilities.matches)) # noqa: T201
elif verbose == "verbose":
print(capa.render.verbose.render(meta, rules, capabilities.matches)) # noqa: T201
else:
print(capa.render.default.render(meta, rules, capabilities.matches)) # noqa: T201
return 0
def main():
if not capa.ghidra.helpers.is_supported_ghidra_version():
return capa.main.E_UNSUPPORTED_GHIDRA_VERSION
if not capa.ghidra.helpers.is_supported_file_type():
return capa.main.E_INVALID_FILE_TYPE
if not capa.ghidra.helpers.is_supported_arch_type():
return capa.main.E_INVALID_FILE_ARCH
if isRunningHeadless(): # type: ignore [name-defined] # noqa: F821
return run_headless()
else:
return run_ui()
if __name__ == "__main__":
if sys.version_info < (3, 10):
from capa.exceptions import UnsupportedRuntimeError
raise UnsupportedRuntimeError("This version of capa can only be used with Python 3.10+")
sys.exit(main())

View File

@@ -22,6 +22,7 @@ import capa.version
import capa.features.common
import capa.features.freeze
import capa.render.result_document as rdoc
import capa.features.extractors.ghidra.context as ghidra_context
import capa.features.extractors.ghidra.helpers
from capa.features.address import AbsoluteVirtualAddress
@@ -31,6 +32,18 @@ logger = logging.getLogger("capa")
SUPPORTED_FILE_TYPES = ("Executable and Linking Format (ELF)", "Portable Executable (PE)", "Raw Binary")
def get_current_program():
return ghidra_context.get_context().program
def get_flat_api():
return ghidra_context.get_context().flat_api
def get_monitor():
return ghidra_context.get_context().monitor
class GHIDRAIO:
"""
An object that acts as a file-like object,
@@ -48,7 +61,12 @@ class GHIDRAIO:
self.offset = offset
def read(self, size):
logger.debug("reading 0x%x bytes at 0x%x (ea: 0x%x)", size, self.offset, currentProgram().getImageBase().add(self.offset).getOffset()) # type: ignore [name-defined] # noqa: F821
logger.debug(
"reading 0x%x bytes at 0x%x (ea: 0x%x)",
size,
self.offset,
get_current_program().getImageBase().add(self.offset).getOffset(),
)
if size > len(self.bytes_) - self.offset:
logger.debug("cannot read 0x%x bytes at 0x%x (ea: BADADDR)", size, self.offset)
@@ -60,7 +78,7 @@ class GHIDRAIO:
return
def get_bytes(self):
file_bytes = currentProgram().getMemory().getAllFileBytes()[0] # type: ignore [name-defined] # noqa: F821
file_bytes = get_current_program().getMemory().getAllFileBytes()[0]
# getOriginalByte() allows for raw file parsing on the Ghidra side
# other functions will fail as Ghidra will think that it's reading uninitialized memory
@@ -70,21 +88,32 @@ class GHIDRAIO:
def is_supported_ghidra_version():
version = float(getGhidraVersion()[:4]) # type: ignore [name-defined] # noqa: F821
if version < 10.2:
warning_msg = "capa does not support this Ghidra version"
logger.warning(warning_msg)
logger.warning("Your Ghidra version is: %s. Supported versions are: Ghidra >= 10.2", version)
import ghidra.framework
version = ghidra.framework.Application.getApplicationVersion()
try:
# version format example: "11.1.2" or "11.4"
major, minor = map(int, version.split(".")[:2])
if major < 12:
logger.error("-" * 80)
logger.error(" Ghidra version %s is not supported.", version)
logger.error(" ")
logger.error(" capa requires Ghidra 12.0 or higher.")
logger.error("-" * 80)
return False
except ValueError:
logger.warning("could not parse Ghidra version: %s", version)
return False
return True
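The major/minor version gate above is runnable outside Ghidra; a minimal sketch, assuming the same `"major.minor[.patch]"` version format:

```python
MIN_MAJOR = 12  # capa requires Ghidra 12.0 or higher per the check above


def is_supported_version(version: str) -> bool:
    # version format example: "11.1.2" or "12.0"
    try:
        major, _minor = map(int, version.split(".")[:2])
    except ValueError:
        # unparseable version strings are treated as unsupported
        return False
    return major >= MIN_MAJOR


print(is_supported_version("12.0"))   # True
print(is_supported_version("11.4"))   # False
print(is_supported_version("beta"))   # False
```

Catching `ValueError` covers both non-numeric components and a version string with fewer than two components (the two-target unpack fails the same way).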
def is_running_headless():
return isRunningHeadless() # type: ignore [name-defined] # noqa: F821
return True # PyGhidra is always headless in this context
def is_supported_file_type():
file_info = currentProgram().getExecutableFormat() # type: ignore [name-defined] # noqa: F821
file_info = get_current_program().getExecutableFormat()
if file_info not in SUPPORTED_FILE_TYPES:
logger.error("-" * 80)
logger.error(" Input file does not appear to be a supported file type.")
@@ -99,7 +128,7 @@ def is_supported_file_type():
def is_supported_arch_type():
lang_id = str(currentProgram().getLanguageID()).lower() # type: ignore [name-defined] # noqa: F821
lang_id = str(get_current_program().getLanguageID()).lower()
if not all((lang_id.startswith("x86"), any(arch in lang_id for arch in ("32", "64")))):
logger.error("-" * 80)
@@ -112,18 +141,18 @@ def is_supported_arch_type():
def get_file_md5():
return currentProgram().getExecutableMD5() # type: ignore [name-defined] # noqa: F821
return get_current_program().getExecutableMD5()
def get_file_sha256():
return currentProgram().getExecutableSHA256() # type: ignore [name-defined] # noqa: F821
return get_current_program().getExecutableSHA256()
def collect_metadata(rules: list[Path]):
md5 = get_file_md5()
sha256 = get_file_sha256()
info = currentProgram().getLanguageID().toString() # type: ignore [name-defined] # noqa: F821
info = get_current_program().getLanguageID().toString()
if "x86" in info and "64" in info:
arch = "x86_64"
elif "x86" in info and "32" in info:
@@ -131,11 +160,11 @@ def collect_metadata(rules: list[Path]):
else:
arch = "unknown arch"
format_name: str = currentProgram().getExecutableFormat() # type: ignore [name-defined] # noqa: F821
format_name: str = get_current_program().getExecutableFormat()
if "PE" in format_name:
os = "windows"
elif "ELF" in format_name:
with contextlib.closing(capa.ghidra.helpers.GHIDRAIO()) as f:
with contextlib.closing(GHIDRAIO()) as f:
os = capa.features.extractors.elf.detect_elf_os(f)
else:
os = "unknown os"
@@ -148,16 +177,18 @@ def collect_metadata(rules: list[Path]):
md5=md5,
sha1="",
sha256=sha256,
path=currentProgram().getExecutablePath(), # type: ignore [name-defined] # noqa: F821
path=get_current_program().getExecutablePath(),
),
flavor=rdoc.Flavor.STATIC,
analysis=rdoc.StaticAnalysis(
format=currentProgram().getExecutableFormat(), # type: ignore [name-defined] # noqa: F821
format=get_current_program().getExecutableFormat(),
arch=arch,
os=os,
extractor="ghidra",
rules=tuple(r.resolve().absolute().as_posix() for r in rules),
base_address=capa.features.freeze.Address.from_capa(AbsoluteVirtualAddress(currentProgram().getImageBase().getOffset())), # type: ignore [name-defined] # noqa: F821
base_address=capa.features.freeze.Address.from_capa(
AbsoluteVirtualAddress(get_current_program().getImageBase().getOffset())
),
layout=rdoc.StaticLayout(
functions=(),
),
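The LanguageID/format mapping in `collect_metadata()` can be sketched as follows. This is a simplification: the 32-bit branch is assumed to map to `i386` (matching the sample output earlier in this page), and real ELF OS detection parses the file bytes via `detect_elf_os`, which is elided here.

```python
def detect_arch(lang_id: str) -> str:
    # Ghidra LanguageID strings look like "x86:LE:64:default"
    lang_id = lang_id.lower()
    if "x86" in lang_id and "64" in lang_id:
        return "x86_64"
    if "x86" in lang_id and "32" in lang_id:
        return "i386"
    return "unknown arch"


def detect_os(format_name: str) -> str:
    # format_name comes from Program.getExecutableFormat()
    if "PE" in format_name:
        return "windows"
    # ELF would require inspecting the file header (OSABI etc.), elided here
    return "unknown os"


print(detect_arch("x86:LE:64:default"))       # x86_64
print(detect_os("Portable Executable (PE)"))  # windows
```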

View File

@@ -0,0 +1,54 @@
<div align="center">
<img src="https://github.com/mandiant/capa/blob/master/doc/img/ghidra_backend_logo.png" width=240 height=125>
</div>
# capa explorer for Ghidra
capa explorer for Ghidra brings capa's detection capabilities directly to Ghidra's user interface, helping speed up your reverse engineering tasks by identifying what parts of a program suggest interesting behavior, such as setting a registry value. You can execute (via [PyGhidra](https://github.com/NationalSecurityAgency/ghidra/tree/master/Ghidra/Features/PyGhidra)) the script [capa_explorer.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/plugin/capa_explorer.py) using Ghidra's Script Manager window to run capa's analysis and view the results in Ghidra.
## ui integration
[capa_explorer.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_explorer.py) renders capa results in Ghidra's UI to help you quickly navigate them. This includes adding matched functions to Ghidra's Symbol Tree and Bookmarks windows and adding comments to functions that indicate matched capabilities and features. You can execute this script using Ghidra's Script Manager window.
### symbol tree window
Matched functions are added to Ghidra's Symbol Tree window under a custom namespace that maps to the capabilities' [capa namespace](https://github.com/mandiant/capa-rules/blob/master/doc/format.md#rule-namespace).
<div align="center">
<img src="https://github.com/mandiant/capa/assets/66766340/eeae33f4-99d4-42dc-a5e8-4c1b8c661492" width=300>
</div>
### comments
Comments are added at the beginning of matched functions indicating matched capabilities, and inline comments are added to functions indicating matched features. You can view these comments in Ghidra's Disassembly Listing and Decompile windows.
<div align="center">
<img src="https://github.com/mandiant/capa/assets/66766340/bb2b4170-7fd4-45fc-8c7b-ff8f2e2f101b" width=1000>
</div>
### bookmarks
Bookmarks are added to functions that matched a capability that is mapped to a MITRE ATT&CK and/or Malware Behavior Catalog (MBC) technique. You can view these bookmarks in Ghidra's Bookmarks window.
<div align="center">
<img src="https://github.com/mandiant/capa/assets/66766340/7f9a66a9-7be7-4223-91c6-4b8fc4651336" width=825>
</div>
# getting started
## requirements
- [Ghidra](https://github.com/NationalSecurityAgency/ghidra) >= 12.0 must be installed.
- [flare-capa](https://pypi.org/project/flare-capa/) >= 10.0 must be installed (virtual environment recommended) with the `ghidra` extra (e.g., `pip install "flare-capa[ghidra]"`).
- [capa rules](https://github.com/mandiant/capa-rules) must be downloaded for the version of capa you are using.
## execution
### 1. run Ghidra with PyGhidra
You must start Ghidra using the `pyghidraRun` script provided in the `support` directory of your Ghidra installation to ensure the Python environment is correctly loaded. You should execute `pyghidraRun` from within the Python environment that you used to install capa.
```bash
<ghidra_install>/support/pyghidraRun
```
### 2. run capa_explorer.py
1. Open your Ghidra project and CodeBrowser.
2. Open the Script Manager.
3. Add [capa_explorer.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/plugin/capa_explorer.py) to the script directories.
4. Filter for capa and run the script.
5. When prompted, select the directory containing the downloaded capa rules.

View File

@@ -1,7 +1,3 @@
# Run capa against loaded Ghidra database and render results in Ghidra UI
# @author Colton Gabertan (gabertan.colton@gmail.com)
# @category Python 3.capa
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -16,36 +12,63 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
# Run capa against loaded Ghidra database and render results in Ghidra UI
# @author Colton Gabertan (gabertan.colton@gmail.com)
# @category capa
# @runtime PyGhidra
import json
import logging
import pathlib
from typing import Any
from java.util import ArrayList
from ghidra.util import Msg
from ghidra.app.cmd.label import AddLabelCmd, CreateNamespacesCmd
from ghidra.util.exception import CancelledException
from ghidra.program.flatapi import FlatProgramAPI
from ghidra.program.model.symbol import Namespace, SourceType, SymbolType
import capa
import capa.main
import capa.rules
import capa.version
import capa.render.json
import capa.ghidra.helpers
import capa.capabilities.common
import capa.features.extractors.ghidra.context
import capa.features.extractors.ghidra.extractor
logger = logging.getLogger("capa_explorer")
def show_monitor_message(msg):
capa.ghidra.helpers.get_monitor().checkCanceled()
capa.ghidra.helpers.get_monitor().setMessage(msg)
def show_error(msg):
Msg.showError(None, None, "capa explorer", msg)
def show_warn(msg):
Msg.showWarn(None, None, "capa explorer", msg)
def show_info(msg):
Msg.showInfo(None, None, "capa explorer", msg)
def add_bookmark(addr, txt, category="CapaExplorer"):
"""create bookmark at addr"""
currentProgram().getBookmarkManager().setBookmark(addr, "Info", category, txt) # type: ignore [name-defined] # noqa: F821
capa.ghidra.helpers.get_current_program().getBookmarkManager().setBookmark(addr, "Info", category, txt)
def create_namespace(namespace_str):
"""create new Ghidra namespace for each capa namespace"""
cmd = CreateNamespacesCmd(namespace_str, SourceType.USER_DEFINED)
cmd.applyTo(currentProgram()) # type: ignore [name-defined] # noqa: F821
cmd.applyTo(capa.ghidra.helpers.get_current_program())
return cmd.getNamespace()
@@ -53,7 +76,7 @@ def create_label(ghidra_addr, name, capa_namespace):
"""custom label cmd to overlay symbols under capa-generated namespaces"""
# prevent duplicate labels under the same capa-generated namespace
symbol_table = currentProgram().getSymbolTable() # type: ignore [name-defined] # noqa: F821
symbol_table = capa.ghidra.helpers.get_current_program().getSymbolTable()
for sym in symbol_table.getSymbols(ghidra_addr):
if sym.getName(True) == capa_namespace.getName(True) + Namespace.DELIMITER + name:
return
@@ -61,7 +84,7 @@ def create_label(ghidra_addr, name, capa_namespace):
# create SymbolType.LABEL at addr
# prioritize capa-generated namespace (duplicate match @ new addr), else put under global Ghidra one (new match)
cmd = AddLabelCmd(ghidra_addr, name, True, SourceType.USER_DEFINED)
cmd.applyTo(currentProgram()) # type: ignore [name-defined] # noqa: F821
cmd.applyTo(capa.ghidra.helpers.get_current_program())
# assign new match overlay label to capa-generated namespace
cmd.getSymbol().setNamespace(capa_namespace)
@@ -92,8 +115,8 @@ class CapaMatchData:
return
for key in self.matches.keys():
addr = toAddr(hex(key)) # type: ignore [name-defined] # noqa: F821
func = getFunctionContaining(addr) # type: ignore [name-defined] # noqa: F821
addr = capa.ghidra.helpers.get_flat_api().toAddr(hex(key))
func = capa.ghidra.helpers.get_flat_api().getFunctionContaining(addr)
# bookmark & tag MITRE ATT&CK tactics & MBC @ function scope
if func is not None:
@@ -117,140 +140,160 @@ class CapaMatchData:
def set_plate_comment(self, ghidra_addr):
"""set plate comments at matched functions"""
comment = getPlateComment(ghidra_addr) # type: ignore [name-defined] # noqa: F821
comment = capa.ghidra.helpers.get_flat_api().getPlateComment(ghidra_addr)
rule_path = self.namespace.replace(Namespace.DELIMITER, "/")
# 2 calls to avoid duplicate comments via subsequent script runs
if comment is None:
# first comment @ function
comment = rule_path + "\n"
setPlateComment(ghidra_addr, comment) # type: ignore [name-defined] # noqa: F821
capa.ghidra.helpers.get_flat_api().setPlateComment(ghidra_addr, comment)
elif rule_path not in comment:
comment = comment + rule_path + "\n"
setPlateComment(ghidra_addr, comment) # type: ignore [name-defined] # noqa: F821
capa.ghidra.helpers.get_flat_api().setPlateComment(ghidra_addr, comment)
else:
return
def set_pre_comment(self, ghidra_addr, sub_type, description):
"""set pre comments at subscoped matches of main rules"""
comment = getPreComment(ghidra_addr) # type: ignore [name-defined] # noqa: F821
comment = capa.ghidra.helpers.get_flat_api().getPreComment(ghidra_addr)
if comment is None:
comment = "capa: " + sub_type + "(" + description + ")" + ' matched in "' + self.capability + '"\n'
setPreComment(ghidra_addr, comment) # type: ignore [name-defined] # noqa: F821
capa.ghidra.helpers.get_flat_api().setPreComment(ghidra_addr, comment)
elif self.capability not in comment:
comment = (
comment + "capa: " + sub_type + "(" + description + ")" + ' matched in "' + self.capability + '"\n'
)
setPreComment(ghidra_addr, comment) # type: ignore [name-defined] # noqa: F821
capa.ghidra.helpers.get_flat_api().setPreComment(ghidra_addr, comment)
else:
return
def label_matches(self):
def label_matches(self, do_namespaces, do_comments):
"""label findings at function scopes and comment on subscope matches"""
capa_namespace = create_namespace(self.namespace)
symbol_table = currentProgram().getSymbolTable() # type: ignore [name-defined] # noqa: F821
capa_namespace = None
if do_namespaces:
capa_namespace = create_namespace(self.namespace)
symbol_table = capa.ghidra.helpers.get_current_program().getSymbolTable()
# handle function main scope of matched rule
# these will typically contain further matches within
if self.scope == "function":
for addr in self.matches.keys():
ghidra_addr = toAddr(hex(addr)) # type: ignore [name-defined] # noqa: F821
ghidra_addr = capa.ghidra.helpers.get_flat_api().toAddr(hex(addr))
# classify new function label under capa-generated namespace
sym = symbol_table.getPrimarySymbol(ghidra_addr)
if sym is not None:
if sym.getSymbolType() == SymbolType.FUNCTION:
create_label(ghidra_addr, sym.getName(), capa_namespace)
self.set_plate_comment(ghidra_addr)
if do_namespaces:
sym = symbol_table.getPrimarySymbol(ghidra_addr)
if sym is not None:
if sym.getSymbolType() == SymbolType.FUNCTION:
create_label(ghidra_addr, sym.getName(), capa_namespace)
# parse the corresponding nodes, and pre-comment subscope matched features
# under the encompassing function(s)
for sub_match in self.matches.get(addr):
for loc, node in sub_match.items():
sub_ghidra_addr = toAddr(hex(loc)) # type: ignore [name-defined] # noqa: F821
if sub_ghidra_addr == ghidra_addr:
# skip duplicates
continue
if do_comments:
self.set_plate_comment(ghidra_addr)
# precomment subscope matches under the function
if node != {}:
for sub_type, description in parse_node(node):
self.set_pre_comment(sub_ghidra_addr, sub_type, description)
# parse the corresponding nodes, and pre-comment subscope matched features
# under the encompassing function(s)
for sub_match in self.matches.get(addr):
for loc, node in sub_match.items():
sub_ghidra_addr = capa.ghidra.helpers.get_flat_api().toAddr(hex(loc))
if sub_ghidra_addr == ghidra_addr:
# skip duplicates
continue
# precomment subscope matches under the function
if node != {} and do_comments:
for sub_type, description in parse_node(node):
self.set_pre_comment(sub_ghidra_addr, sub_type, description)
else:
# resolve the encompassing function for the capa namespace
# of non-function scoped main matches
for addr in self.matches.keys():
ghidra_addr = toAddr(hex(addr)) # type: ignore [name-defined] # noqa: F821
ghidra_addr = capa.ghidra.helpers.get_flat_api().toAddr(hex(addr))
# basic block / insn scoped main matches
# Ex. See "Create Process on Windows" Rule
func = getFunctionContaining(ghidra_addr) # type: ignore [name-defined] # noqa: F821
func = capa.ghidra.helpers.get_flat_api().getFunctionContaining(ghidra_addr)
if func is not None:
func_addr = func.getEntryPoint()
create_label(func_addr, func.getName(), capa_namespace)
self.set_plate_comment(func_addr)
if do_namespaces:
create_label(func_addr, func.getName(), capa_namespace)
if do_comments:
self.set_plate_comment(func_addr)
# create subscope match precomments
for sub_match in self.matches.get(addr):
for loc, node in sub_match.items():
sub_ghidra_addr = toAddr(hex(loc)) # type: ignore [name-defined] # noqa: F821
sub_ghidra_addr = capa.ghidra.helpers.get_flat_api().toAddr(hex(loc))
if node != {}:
if func is not None:
# basic block / insn scope under resolved function
for sub_type, description in parse_node(node):
self.set_pre_comment(sub_ghidra_addr, sub_type, description)
if do_comments:
for sub_type, description in parse_node(node):
self.set_pre_comment(sub_ghidra_addr, sub_type, description)
else:
# this would be a global/file scoped main match
# try to resolve the encompassing function via the subscope match, instead
# Ex. "run as service" rule
sub_func = getFunctionContaining(sub_ghidra_addr) # type: ignore [name-defined] # noqa: F821
sub_func = capa.ghidra.helpers.get_flat_api().getFunctionContaining(sub_ghidra_addr)
if sub_func is not None:
sub_func_addr = sub_func.getEntryPoint()
# place function in capa namespace & create the subscope match label in Ghidra's global namespace
create_label(sub_func_addr, sub_func.getName(), capa_namespace)
self.set_plate_comment(sub_func_addr)
for sub_type, description in parse_node(node):
self.set_pre_comment(sub_ghidra_addr, sub_type, description)
if do_namespaces:
create_label(sub_func_addr, sub_func.getName(), capa_namespace)
if do_comments:
self.set_plate_comment(sub_func_addr)
if do_comments:
for sub_type, description in parse_node(node):
self.set_pre_comment(sub_ghidra_addr, sub_type, description)
else:
# addr is in some other file section like .data
# represent this location with a label symbol under the capa namespace
# Ex. See "Reference Base64 String" rule
for sub_type, description in parse_node(node):
# in many cases, these will be ghidra-labeled data, so just add the existing
# label symbol to the capa namespace
for sym in symbol_table.getSymbols(sub_ghidra_addr):
if sym.getSymbolType() == SymbolType.LABEL:
sym.setNamespace(capa_namespace)
self.set_pre_comment(sub_ghidra_addr, sub_type, description)
if do_namespaces:
for _sub_type, _description in parse_node(node):
# in many cases, these will be ghidra-labeled data, so just add the existing
# label symbol to the capa namespace
for sym in symbol_table.getSymbols(sub_ghidra_addr):
if sym.getSymbolType() == SymbolType.LABEL:
sym.setNamespace(capa_namespace)
if do_comments:
for sub_type, description in parse_node(node):
self.set_pre_comment(sub_ghidra_addr, sub_type, description)
def get_capabilities():
rules_dir: str = ""
try:
selected_dir = askDirectory("Choose capa rules directory", "Ok") # type: ignore [name-defined] # noqa: F821
if selected_dir:
rules_dir = selected_dir.getPath()
except RuntimeError:
# RuntimeError thrown when user selects "Cancel"
pass
rules_dir = ""
show_monitor_message(f"requesting capa {capa.version.__version__} rules directory")
selected_dir = askDirectory(f"choose capa {capa.version.__version__} rules directory", "Ok") # type: ignore [name-defined] # noqa: F821
if selected_dir:
rules_dir = selected_dir.getPath()
if not rules_dir:
logger.info("You must choose a capa rules directory before running capa.")
return "" # return empty str to avoid handling both int and str types
raise CancelledException
rules_path: pathlib.Path = pathlib.Path(rules_dir)
logger.info("running capa using rules from %s", str(rules_path))
show_monitor_message(f"loading rules from {rules_path}")
rules = capa.rules.get_rules([rules_path])
meta = capa.ghidra.helpers.collect_metadata([rules_path])
extractor = capa.features.extractors.ghidra.extractor.GhidraFeatureExtractor()
show_monitor_message("collecting binary metadata")
meta = capa.ghidra.helpers.collect_metadata([rules_path])
show_monitor_message("running capa analysis")
extractor = capa.features.extractors.ghidra.extractor.GhidraFeatureExtractor()
capabilities = capa.capabilities.common.find_capabilities(rules, extractor, True)
show_monitor_message("checking for static limitations")
if capa.capabilities.common.has_static_limitation(rules, capabilities, is_standalone=False):
popup("capa explorer encountered warnings during analysis. Please check the console output for more information.") # type: ignore [name-defined] # noqa: F821
logger.info("capa encountered warnings during analysis")
show_warn(
"capa explorer encountered warnings during analysis. Please check the console output for more information.",
)
show_monitor_message("rendering results")
return capa.render.json.render(meta, rules, capabilities.matches)
@@ -328,12 +371,12 @@ def parse_json(capa_data):
# this requires the correct delimiter used by Ghidra
# Ex. 'communication/named-pipe/create/create pipe' -> capa::communication::named-pipe::create::create-pipe
namespace_str = Namespace.DELIMITER.join(meta["namespace"].split("/"))
namespace = "capa" + Namespace.DELIMITER + namespace_str + fmt_rule
namespace = "capa_explorer" + Namespace.DELIMITER + namespace_str + fmt_rule
else:
# lib rules via the official rules repo will not contain data
# for the "namespace" key, so format using the rule name itself
# Ex. 'contain loop' -> capa::lib::contain-loop
namespace = "capa" + Namespace.DELIMITER + "lib" + fmt_rule
namespace = "capa_explorer" + Namespace.DELIMITER + "lib" + fmt_rule
yield CapaMatchData(namespace, scope, rule, rule_matches, attack, mbc)
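The path-to-namespace conversion handled in `parse_json` above can be sketched on its own. The `"::"` delimiter is assumed to match Ghidra's `Namespace.DELIMITER`, and `to_capa_namespace` is an illustrative helper, not capa API:

```python
DELIMITER = "::"  # stand-in for Ghidra's Namespace.DELIMITER

def to_capa_namespace(rule_namespace, rule_name):
    # hyphenate the rule name: "create pipe" -> "create-pipe"
    fmt_rule = DELIMITER + rule_name.replace(" ", "-")
    if rule_namespace:
        # swap the rule-path delimiter "/" for Ghidra's "::"
        return "capa_explorer" + DELIMITER + DELIMITER.join(rule_namespace.split("/")) + fmt_rule
    # lib rules carry no namespace, so file them under "lib"
    return "capa_explorer" + DELIMITER + "lib" + fmt_rule

print(to_capa_namespace("communication/named-pipe/create", "create pipe"))
# -> capa_explorer::communication::named-pipe::create::create-pipe
print(to_capa_namespace(None, "contain loop"))
# -> capa_explorer::lib::contain-loop
```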
@@ -342,44 +385,79 @@ def main():
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
if isRunningHeadless(): # type: ignore [name-defined] # noqa: F821
logger.error("unsupported Ghidra execution mode")
return capa.main.E_UNSUPPORTED_GHIDRA_EXECUTION_MODE
choices = ["namespaces", "bookmarks", "comments"]
# use ArrayList to resolve ambiguous askChoices overloads ((List) vs (List, List)) in PyGhidra
choices_java = ArrayList()
for c in choices:
choices_java.add(c)
choice_labels = [
'add "capa_explorer" namespace for matched functions',
"add bookmarks for matched functions",
"add comments to matched functions",
]
# use ArrayList to resolve ambiguous askChoices overloads ((List) vs (List, List)) in PyGhidra
choice_labels_java = ArrayList()
for c in choice_labels:
choice_labels_java.add(c)
selected = list(askChoices("capa explorer", "select actions:", choices_java, choice_labels_java)) # type: ignore [name-defined] # noqa: F821
do_namespaces = "namespaces" in selected
do_comments = "comments" in selected
do_bookmarks = "bookmarks" in selected
if not any((do_namespaces, do_comments, do_bookmarks)):
raise CancelledException("no actions selected")
# initialize the context for the extractor/helpers
capa.features.extractors.ghidra.context.set_context(
currentProgram, # type: ignore [name-defined] # noqa: F821
FlatProgramAPI(currentProgram), # type: ignore [name-defined] # noqa: F821
monitor, # type: ignore [name-defined] # noqa: F821
)
show_monitor_message("checking supported Ghidra version")
if not capa.ghidra.helpers.is_supported_ghidra_version():
logger.error("unsupported Ghidra version")
show_error("unsupported Ghidra version")
return capa.main.E_UNSUPPORTED_GHIDRA_VERSION
show_monitor_message("checking supported file type")
if not capa.ghidra.helpers.is_supported_file_type():
logger.error("unsupported file type")
show_error("unsupported file type")
return capa.main.E_INVALID_FILE_TYPE
show_monitor_message("checking supported file architecture")
if not capa.ghidra.helpers.is_supported_arch_type():
logger.error("unsupported file architecture")
show_error("unsupported file architecture")
return capa.main.E_INVALID_FILE_ARCH
# capa_data will always contain {'meta':..., 'rules':...}
# if the 'rules' key contains no values, then there were no matches
capa_data = json.loads(get_capabilities())
if capa_data.get("rules") is None:
logger.info("capa explorer found no matches")
popup("capa explorer found no matches.") # type: ignore [name-defined] # noqa: F821
show_info("capa explorer found no matches.")
return capa.main.E_EMPTY_REPORT
show_monitor_message("processing matches")
for item in parse_json(capa_data):
item.bookmark_functions()
item.label_matches()
logger.info("capa explorer analysis complete")
popup("capa explorer analysis complete.\nPlease see results in the Bookmarks Window and Namespaces section of the Symbol Tree Window.") # type: ignore [name-defined] # noqa: F821
if do_bookmarks:
show_monitor_message("adding bookmarks")
item.bookmark_functions()
if do_namespaces or do_comments:
show_monitor_message("adding labels")
item.label_matches(do_namespaces, do_comments)
show_info("capa explorer analysis complete.")
return 0
if __name__ == "__main__":
if sys.version_info < (3, 10):
from capa.exceptions import UnsupportedRuntimeError
raise UnsupportedRuntimeError("This version of capa can only be used with Python 3.10+")
exit_code = main()
if exit_code != 0:
popup("capa explorer encountered errors during analysis. Please check the console output for more information.") # type: ignore [name-defined] # noqa: F821
sys.exit(exit_code)
try:
if main() != 0:
show_error(
"capa explorer encountered errors during analysis. Please check the console output for more information.",
)
except CancelledException:
show_info("capa explorer analysis cancelled.")


@@ -96,11 +96,7 @@ def is_runtime_ida():
def is_runtime_ghidra():
try:
currentProgram # type: ignore [name-defined] # noqa: F821
except NameError:
return False
return True
return importlib.util.find_spec("ghidra") is not None
def assert_never(value) -> NoReturn:
@@ -331,6 +327,9 @@ def log_unsupported_os_error():
logger.error(" ")
logger.error(" capa currently only analyzes executables for some operating systems")
logger.error(" (including Windows, Linux, and Android).")
logger.error(" ")
logger.error(" If you know the target OS, you can specify it explicitly, for example:")
logger.error(" capa --os linux <sample>")
logger.error("-" * 80)


@@ -17,7 +17,6 @@ import logging
import idaapi
import ida_kernwin
from capa.ida.plugin.form import CapaExplorerForm
from capa.ida.plugin.icon import ICON
logger = logging.getLogger(__name__)
@@ -74,6 +73,9 @@ class CapaExplorerPlugin(idaapi.plugin_t):
arg (int): bitflag. Setting LSB enables automatic analysis upon
loading. The other bits are currently undefined. See `form.Options`.
"""
# delay import to not trigger load of Qt components when not running in idaq, i.e., in idalib
from capa.ida.plugin.form import CapaExplorerForm
if not self.form:
self.form = CapaExplorerForm(self.PLUGIN_NAME, arg)
else:


@@ -14,9 +14,9 @@
import ida_kernwin
from PyQt5 import QtCore
from capa.ida.plugin.error import UserCancelledError
from capa.ida.plugin.qt_compat import QtCore, Signal
from capa.features.extractors.ida.extractor import IdaFeatureExtractor
from capa.features.extractors.base_extractor import FunctionHandle
@@ -24,7 +24,7 @@ from capa.features.extractors.base_extractor import FunctionHandle
class CapaExplorerProgressIndicator(QtCore.QObject):
"""implement progress signal, used during feature extraction"""
progress = QtCore.pyqtSignal(str)
progress = Signal(str)
def update(self, text):
"""emit progress update


@@ -23,7 +23,6 @@ from pathlib import Path
import idaapi
import ida_kernwin
import ida_settings
from PyQt5 import QtGui, QtCore, QtWidgets
import capa.main
import capa.rules
@@ -51,6 +50,7 @@ from capa.ida.plugin.hooks import CapaExplorerIdaHooks
from capa.ida.plugin.model import CapaExplorerDataModel
from capa.ida.plugin.proxy import CapaExplorerRangeProxyModel, CapaExplorerSearchProxyModel
from capa.ida.plugin.extractor import CapaExplorerFeatureExtractor
from capa.ida.plugin.qt_compat import QtGui, QtCore, QtWidgets
from capa.features.extractors.base_extractor import FunctionHandle
logger = logging.getLogger(__name__)
@@ -1358,7 +1358,7 @@ class CapaExplorerForm(idaapi.PluginForm):
@param state: checked state
"""
if state == QtCore.Qt.Checked:
if state:
self.limit_results_to_function(idaapi.get_func(idaapi.get_screen_ea()))
else:
self.range_model_proxy.reset_address_range_filter()
@@ -1367,7 +1367,7 @@ class CapaExplorerForm(idaapi.PluginForm):
def slot_checkbox_limit_features_by_ea(self, state):
""" """
if state == QtCore.Qt.Checked:
if state:
self.view_rulegen_features.filter_items_by_ea(idaapi.get_screen_ea())
else:
self.view_rulegen_features.show_all_items()


@@ -0,0 +1,38 @@
{
"IDAMetadataDescriptorVersion": 1,
"plugin": {
"name": "capa",
"entryPoint": "capa_explorer.py",
"version": "9.3.1",
"idaVersions": ">=7.4",
"description": "Identify capabilities in executable files using FLARE's capa framework",
"license": "Apache-2.0",
"categories": [
"malware-analysis",
"api-scripting-and-automation",
"ui-ux-and-visualization"
],
"pythonDependencies": ["flare-capa==9.3.1"],
"urls": {
"repository": "https://github.com/mandiant/capa"
},
"authors": [
{"name": "Willi Ballenthin", "email": "wballenthin@hex-rays.com"},
{"name": "Moritz Raabe", "email": "moritzraabe@google.com"},
{"name": "Mike Hunhoff", "email": "mike.hunhoff@gmail.com"},
{"name": "Yacine Elhamer", "email": "elhamer.yacine@gmail.com"}
],
"keywords": [
"capability-detection",
"malware-analysis",
"behavior-analysis",
"reverse-engineering",
"att&ck",
"rule-engine",
"feature-extraction",
"yara-like-rules",
"static-analysis",
"dynamic-analysis"
]
}
}


@@ -18,10 +18,10 @@ from typing import Iterator, Optional
import idc
import idaapi
from PyQt5 import QtCore
import capa.ida.helpers
from capa.features.address import Address, FileOffsetAddress, AbsoluteVirtualAddress
from capa.ida.plugin.qt_compat import QtCore, qt_get_item_flag_tristate
def info_to_name(display):
@@ -55,7 +55,7 @@ class CapaExplorerDataItem:
self.flags = QtCore.Qt.ItemIsEnabled | QtCore.Qt.ItemIsSelectable
if self._can_check:
self.flags = self.flags | QtCore.Qt.ItemIsUserCheckable | QtCore.Qt.ItemIsTristate
self.flags = self.flags | QtCore.Qt.ItemIsUserCheckable | qt_get_item_flag_tristate()
if self.pred:
self.pred.appendChild(self)


@@ -18,7 +18,6 @@ from collections import deque
import idc
import idaapi
from PyQt5 import QtGui, QtCore
import capa.rules
import capa.ida.helpers
@@ -42,6 +41,7 @@ from capa.ida.plugin.item import (
CapaExplorerInstructionViewItem,
)
from capa.features.address import Address, AbsoluteVirtualAddress
from capa.ida.plugin.qt_compat import QtGui, QtCore
# default highlight color used in IDA window
DEFAULT_HIGHLIGHT = 0xE6C700
@@ -269,7 +269,7 @@ class CapaExplorerDataModel(QtCore.QAbstractItemModel):
visited.add(child_index)
for idx in range(self.rowCount(child_index)):
stack.append(child_index.child(idx, 0))
stack.append(self.index(idx, 0, child_index))
def reset_ida_highlighting(self, item, checked):
"""reset IDA highlight for item


@@ -12,10 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from PyQt5 import QtCore
from PyQt5.QtCore import Qt
from capa.ida.plugin.model import CapaExplorerDataModel
from capa.ida.plugin.qt_compat import Qt, QtCore
class CapaExplorerRangeProxyModel(QtCore.QSortFilterProxyModel):


@@ -0,0 +1,79 @@
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Qt compatibility layer for capa IDA Pro plugin.
Handles PyQt5 (IDA < 9.2) vs PySide6 (IDA >= 9.2) differences.
This module provides a unified import interface for Qt modules and handles
API changes between Qt5 and Qt6.
"""
try:
# IDA 9.2+ uses PySide6
from PySide6 import QtGui, QtCore, QtWidgets
from PySide6.QtGui import QAction
QT_LIBRARY = "PySide6"
Signal = QtCore.Signal
except ImportError:
# Older IDA versions use PyQt5
try:
from PyQt5 import QtGui, QtCore, QtWidgets
from PyQt5.QtWidgets import QAction
QT_LIBRARY = "PyQt5"
Signal = QtCore.pyqtSignal
except ImportError:
raise ImportError("Neither PySide6 nor PyQt5 is available. Cannot initialize capa IDA plugin.")
Qt = QtCore.Qt
def qt_get_item_flag_tristate():
"""
Get the tristate item flag compatible with Qt5 and Qt6.
Qt5 (PyQt5): Uses Qt.ItemIsTristate
Qt6 (PySide6): Qt.ItemIsTristate was removed, uses Qt.ItemIsAutoTristate
ItemIsAutoTristate automatically manages tristate based on child checkboxes,
matching the original ItemIsTristate behavior where parent checkboxes reflect
the check state of their children.
Returns:
int: The appropriate flag value for the Qt version
Raises:
AttributeError: If the tristate flag cannot be found in the Qt library
"""
if QT_LIBRARY == "PySide6":
# Qt6: ItemIsTristate was removed, replaced with ItemIsAutoTristate
# Try different possible locations (API varies slightly across PySide6 versions)
if hasattr(Qt, "ItemIsAutoTristate"):
return Qt.ItemIsAutoTristate
elif hasattr(Qt, "ItemFlag") and hasattr(Qt.ItemFlag, "ItemIsAutoTristate"):
return Qt.ItemFlag.ItemIsAutoTristate
else:
raise AttributeError(
"Cannot find ItemIsAutoTristate in PySide6. "
+ "Your PySide6 version may be incompatible with capa. "
+ f"Available Qt attributes: {[attr for attr in dir(Qt) if 'Item' in attr]}"
)
else:
# Qt5: Use the original ItemIsTristate flag
return Qt.ItemIsTristate
__all__ = ["qt_get_item_flag_tristate", "Signal", "QAction", "QtGui", "QtCore", "QtWidgets"]


@@ -18,7 +18,6 @@ from collections import Counter
import idc
import idaapi
from PyQt5 import QtGui, QtCore, QtWidgets
import capa.rules
import capa.engine
@@ -28,6 +27,7 @@ import capa.features.basicblock
from capa.ida.plugin.item import CapaExplorerFunctionItem
from capa.features.address import AbsoluteVirtualAddress, _NoAddress
from capa.ida.plugin.model import CapaExplorerDataModel
from capa.ida.plugin.qt_compat import QtGui, QtCore, Signal, QAction, QtWidgets
MAX_SECTION_SIZE = 750
@@ -147,7 +147,7 @@ def calc_item_depth(o):
def build_action(o, display, data, slot):
""" """
action = QtWidgets.QAction(display, o)
action = QAction(display, o)
action.setData(data)
action.triggered.connect(lambda checked: slot(action))
@@ -312,7 +312,7 @@ class CapaExplorerRulegenPreview(QtWidgets.QTextEdit):
class CapaExplorerRulegenEditor(QtWidgets.QTreeWidget):
updated = QtCore.pyqtSignal()
updated = Signal()
def __init__(self, preview, parent=None):
""" """


@@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import io
import os
import logging
import datetime
@@ -23,24 +22,13 @@ from pathlib import Path
from rich.console import Console
from typing_extensions import assert_never
import capa.perf
import capa.rules
import capa.engine
import capa.helpers
import capa.version
import capa.render.json
import capa.rules.cache
import capa.render.default
import capa.render.verbose
import capa.features.common
import capa.features.freeze as frz
import capa.render.vverbose
import capa.features.extractors
import capa.render.result_document
import capa.render.result_document as rdoc
import capa.features.extractors.common
import capa.features.extractors.base_extractor
import capa.features.extractors.cape.extractor
from capa.rules import RuleSet
from capa.engine import MatchResults
from capa.exceptions import UnsupportedOSError, UnsupportedArchError, UnsupportedFormatError
@@ -79,6 +67,7 @@ BACKEND_VMRAY = "vmray"
BACKEND_FREEZE = "freeze"
BACKEND_BINEXPORT2 = "binexport2"
BACKEND_IDA = "ida"
BACKEND_GHIDRA = "ghidra"
class CorruptFile(ValueError):
@@ -137,6 +126,57 @@ def get_meta_str(vw):
return f"{', '.join(meta)}, number of functions: {len(vw.getFunctions())}"
def _is_probably_corrupt_pe(path: Path) -> bool:
"""
Heuristic check for obviously malformed PE samples that provoke
pathological behavior in vivisect (see GH-1989).
We treat a PE as "probably corrupt" when any section declares an
unrealistically large virtual size compared to the file size, e.g.
hundreds of megabytes in a tiny file. Such cases lead vivisect to
try to map enormous regions and can exhaust CPU/memory.
"""
try:
import pefile
except Exception:
# If pefile is unavailable, fall back to existing behavior.
return False
try:
pe = pefile.PE(str(path), fast_load=True)
except pefile.PEFormatError:
# Not a PE file (or badly formed); let existing checks handle it.
return False
except Exception:
return False
try:
file_size = path.stat().st_size
except OSError:
return False
if file_size <= 0:
return False
# Flag sections whose declared virtual size is wildly disproportionate
# to the file size (e.g. 900MB section in a ~400KB sample).
_VSIZE_FILE_RATIO = 128
_MAX_REASONABLE_VSIZE = 512 * 1024 * 1024 # 512 MB
max_reasonable = max(file_size * _VSIZE_FILE_RATIO, _MAX_REASONABLE_VSIZE)
for section in getattr(pe, "sections", []):
vsize = getattr(section, "Misc_VirtualSize", 0) or 0
if vsize > max_reasonable:
logger.debug(
"detected unrealistic PE section virtual size: 0x%x (file size: 0x%x), treating as corrupt",
vsize,
file_size,
)
return True
return False
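The threshold arithmetic of the heuristic above can be checked in isolation. Constants are copied from `_is_probably_corrupt_pe`; `is_suspicious_vsize` is an illustrative extraction, not capa API:

```python
VSIZE_FILE_RATIO = 128
MAX_REASONABLE_VSIZE = 512 * 1024 * 1024  # 512 MB absolute floor

def is_suspicious_vsize(vsize, file_size):
    # suspicious only when the declared virtual size exceeds both
    # 128x the file size and the 512 MB floor
    return vsize > max(file_size * VSIZE_FILE_RATIO, MAX_REASONABLE_VSIZE)

# ~400 KB sample declaring a ~900 MB section (the GH-1989 scenario): flagged
assert is_suspicious_vsize(900 * 1024 * 1024, 400 * 1024)
# an 8 MB section in the same 400 KB sample is unremarkable
assert not is_suspicious_vsize(8 * 1024 * 1024, 400 * 1024)
```

Note how the `max(...)` keeps the bar proportional for legitimately large inputs: a 100 MB file may declare sections up to 12.8 GB before being flagged.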
def get_workspace(path: Path, input_format: str, sigpaths: list[Path]):
"""
load the program at the given path into a vivisect workspace using the given format.
@@ -154,11 +194,18 @@ def get_workspace(path: Path, input_format: str, sigpaths: list[Path]):
"""
# lazy import enables us to not require viv if user wants another backend.
import envi.exc
import viv_utils
import viv_utils.flirt
logger.debug("generating vivisect workspace for: %s", path)
if input_format in (FORMAT_PE, FORMAT_AUTO) and _is_probably_corrupt_pe(path):
raise CorruptFile(
"PE file appears to contain unrealistically large sections and is likely corrupt"
+ " - skipping analysis to avoid excessive resource usage."
)
try:
if input_format == FORMAT_AUTO:
if not is_supported_format(path):
@@ -175,11 +222,20 @@ def get_workspace(path: Path, input_format: str, sigpaths: list[Path]):
vw = viv_utils.getShellcodeWorkspaceFromFile(str(path), arch="amd64", analyze=False)
else:
raise ValueError("unexpected format: " + input_format)
except envi.exc.SegmentationViolation as e:
raise CorruptFile(f"Invalid memory access during binary parsing: {e}") from e
except Exception as e:
# vivisect raises raw Exception instances, and we don't want
# to do a subclass check via isinstance.
if type(e) is Exception and "Couldn't convert rva" in e.args[0]:
raise CorruptFile(e.args[0]) from e
if type(e) is Exception and e.args:
error_msg = str(e.args[0])
if "Couldn't convert rva" in error_msg:
raise CorruptFile(error_msg) from e
elif "Unsupported Architecture" in error_msg:
# Extract architecture number if available
arch_info = e.args[1] if len(e.args) > 1 else "unknown"
raise CorruptFile(f"Unsupported architecture: {arch_info}") from e
raise
viv_utils.flirt.register_flirt_signature_analyzers(vw, [str(s) for s in sigpaths])
@@ -338,12 +394,24 @@ def get_extractor(
import capa.features.extractors.ida.extractor
logger.debug("idalib: opening database...")
# idalib writes to stdout (ugh), so we have to capture that
# so as not to screw up structured output.
with capa.helpers.stdout_redirector(io.BytesIO()):
with console.status("analyzing program...", spinner="dots"):
if idapro.open_database(str(input_path), run_auto_analysis=True):
raise RuntimeError("failed to analyze input file")
idapro.enable_console_messages(False)
with console.status("analyzing program...", spinner="dots"):
# we set the primary and secondary Lumina servers to 0.0.0.0 to disable Lumina,
# which sometimes provides bad names, including overwriting names from debug info.
#
# use -R to load resources, which can help us analyze embedded PE files.
#
# return values from open_database:
# 0 - Success
# 2 - User cancelled or 32-64 bit conversion failed
# 4 - Database initialization failed
# -1 - Generic errors (database already open, auto-analysis failed, etc.)
# -2 - User cancelled operation
ret = idapro.open_database(
str(input_path), run_auto_analysis=True, args="-Olumina:host=0.0.0.0 -Osecondary_lumina:host=0.0.0.0 -R"
)
if ret != 0:
raise RuntimeError("failed to analyze input file")
logger.debug("idalib: waiting for analysis...")
ida_auto.auto_wait()
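A sketch of how the documented `open_database` return codes could map to messages. The code table is copied from the comment above; `check_open_result` is hypothetical — the actual change simply treats any nonzero value as failure:

```python
# return codes documented for idapro.open_database (0 means success)
OPEN_DATABASE_ERRORS = {
    2: "user cancelled or 32-64 bit conversion failed",
    4: "database initialization failed",
    -1: "generic error (database already open, auto-analysis failed, ...)",
    -2: "user cancelled operation",
}

def check_open_result(ret):
    # mirror the strict `ret != 0` check, but attach a human-readable reason
    if ret != 0:
        reason = OPEN_DATABASE_ERRORS.get(ret, "unknown error")
        raise RuntimeError(f"failed to analyze input file: {reason}")

check_open_result(0)  # success: returns silently
try:
    check_open_result(4)
except RuntimeError as e:
    print(e)  # failed to analyze input file: database initialization failed
```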
@@ -351,6 +419,69 @@ def get_extractor(
return capa.features.extractors.ida.extractor.IdaFeatureExtractor()
elif backend == BACKEND_GHIDRA:
import pyghidra
with console.status("analyzing program...", spinner="dots"):
if not pyghidra.started():
pyghidra.start()
import capa.ghidra.helpers
if not capa.ghidra.helpers.is_supported_ghidra_version():
raise RuntimeError("unsupported Ghidra version")
import tempfile
tmpdir = tempfile.TemporaryDirectory()
project_cm = pyghidra.open_project(tmpdir.name, "CapaProject", create=True)
project = project_cm.__enter__()
try:
from ghidra.util.task import TaskMonitor
monitor = TaskMonitor.DUMMY
# Import file
loader = pyghidra.program_loader().project(project).source(str(input_path)).name(input_path.name)
with loader.load() as load_results:
load_results.save(monitor)
# Open program
program, consumer = pyghidra.consume_program(project, "/" + input_path.name)
# Analyze
pyghidra.analyze(program, monitor)
from ghidra.program.flatapi import FlatProgramAPI
flat_api = FlatProgramAPI(program)
import capa.features.extractors.ghidra.context as ghidra_context
ghidra_context.set_context(program, flat_api, monitor)
# Wrapper to handle cleanup of program (consumer) and project
class GhidraContextWrapper:
def __init__(self, project_cm, program, consumer):
self.project_cm = project_cm
self.program = program
self.consumer = consumer
def __exit__(self, exc_type, exc_val, exc_tb):
self.program.release(self.consumer)
self.project_cm.__exit__(exc_type, exc_val, exc_tb)
cm = GhidraContextWrapper(project_cm, program, consumer)
except Exception:
project_cm.__exit__(None, None, None)
tmpdir.cleanup()
raise
import capa.features.extractors.ghidra.extractor
return capa.features.extractors.ghidra.extractor.GhidraFeatureExtractor(ctx_manager=cm, tmpdir=tmpdir)
else:
raise ValueError("unexpected backend: " + backend)
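The `GhidraContextWrapper` teardown order above (release the program consumer first, then close the enclosing project) can be demonstrated with a generic stand-in; all names here are illustrative:

```python
events = []

class Project:
    # stand-in for the pyghidra project context manager
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        events.append("project closed")

class NestedCleanup:
    """release the inner resource before exiting the outer context manager,
    mirroring GhidraContextWrapper.__exit__ above."""
    def __init__(self, outer_cm, release):
        self.outer_cm = outer_cm
        self.release = release  # callable freeing the inner resource
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.release()                        # e.g. program.release(consumer)
        self.outer_cm.__exit__(exc_type, exc_val, exc_tb)

project = Project().__enter__()
wrapper = NestedCleanup(project, lambda: events.append("program released"))
wrapper.__exit__(None, None, None)
print(events)  # ['program released', 'project closed']
```

Releasing in the wrong order would close the project while the program object still holds a consumer reference, which is why the wrapper exists instead of nesting two `with` blocks inside `get_extractor`.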


@@ -55,6 +55,7 @@ from capa.loader import (
BACKEND_VMRAY,
BACKEND_DOTNET,
BACKEND_FREEZE,
BACKEND_GHIDRA,
BACKEND_PEFILE,
BACKEND_DRAKVUF,
BACKEND_BINEXPORT2,
@@ -298,6 +299,7 @@ def install_common_args(parser, wanted=None):
(BACKEND_BINJA, "Binary Ninja"),
(BACKEND_DOTNET, ".NET"),
(BACKEND_BINEXPORT2, "BinExport2"),
(BACKEND_GHIDRA, "Ghidra"),
(BACKEND_FREEZE, "capa freeze"),
(BACKEND_CAPE, "CAPE"),
(BACKEND_DRAKVUF, "DRAKVUF"),
@@ -392,6 +394,7 @@ class ShouldExitError(Exception):
"""raised when a main-related routine indicates the program should exit."""
def __init__(self, status_code: int):
super().__init__(status_code)
self.status_code = status_code
@@ -658,7 +661,9 @@ def get_rules_from_cli(args) -> RuleSet:
raises:
ShouldExitError: if the program is invoked incorrectly and should exit.
"""
enable_cache: bool = True
enable_cache: bool = getattr(args, "enable_cache", True)
# this allows calling functions to easily disable rule caching, e.g., the rule linter disables it to avoid caching rules that are still being edited
try:
if capa.helpers.is_running_standalone() and args.is_default_rules:
cache_dir = get_default_root() / "cache"
@@ -940,8 +945,7 @@ def main(argv: Optional[list[str]] = None):
argv = sys.argv[1:]
desc = "The FLARE team's open-source tool to identify capabilities in executable files."
epilog = textwrap.dedent(
"""
epilog = textwrap.dedent("""
By default, capa uses a default set of embedded rules.
You can see the rule set here:
https://github.com/mandiant/capa-rules
@@ -968,8 +972,7 @@ def main(argv: Optional[list[str]] = None):
filter rules by meta fields, e.g. rule name or namespace
capa -t "create TCP socket" suspicious.exe
"""
)
""")
parser = argparse.ArgumentParser(
description=desc, epilog=epilog, formatter_class=argparse.RawDescriptionHelpFormatter
@@ -1104,14 +1107,26 @@ def ida_main():
def ghidra_main():
from ghidra.program.flatapi import FlatProgramAPI
import capa.rules
import capa.ghidra.helpers
import capa.render.default
import capa.features.extractors.ghidra.context
import capa.features.extractors.ghidra.extractor
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
# These are provided by the Ghidra scripting environment
# but are not available when running standard python
# so we have to ignore the linting errors
program = currentProgram # type: ignore [name-defined] # noqa: F821
monitor_ = monitor # type: ignore [name-defined] # noqa: F821
flat_api = FlatProgramAPI(program)
capa.features.extractors.ghidra.context.set_context(program, flat_api, monitor_)
logger.debug("-" * 80)
logger.debug(" Using default embedded rules.")
logger.debug(" ")

View File

@@ -31,6 +31,7 @@ $ protoc.exe --python_out=. --mypy_out=. <path_to_proto> (e.g. capa/render/proto
Alternatively, --pyi_out=. can be used to generate a Python Interface file that supports development
"""
import datetime
from typing import Any, Union

View File

@@ -17,6 +17,7 @@ import io
from typing import Union, Iterator, Optional
import rich.console
from rich.markup import escape
from rich.progress import Text
import capa.render.result_document as rd
@@ -24,21 +25,21 @@ import capa.render.result_document as rd
def bold(s: str) -> Text:
"""draw attention to the given string"""
return Text.from_markup(f"[cyan]{s}")
return Text.from_markup(f"[cyan]{escape(s)}")
def bold2(s: str) -> Text:
"""draw attention to the given string, within a `bold` section"""
return Text.from_markup(f"[green]{s}")
return Text.from_markup(f"[green]{escape(s)}")
def mute(s: str) -> Text:
"""draw attention away from the given string"""
return Text.from_markup(f"[dim]{s}")
return Text.from_markup(f"[dim]{escape(s)}")
def warn(s: str) -> Text:
return Text.from_markup(f"[yellow]{s}")
return Text.from_markup(f"[yellow]{escape(s)}")
def format_parts_id(data: Union[rd.AttackSpec, rd.MBCSpec]):

View File

@@ -159,9 +159,8 @@ def render_call(layout: rd.DynamicLayout, addr: frz.Address) -> str:
s.append(f"){rest}")
newline = "\n"
return (
f"{pname}{{pid:{call.thread.process.pid},tid:{call.thread.tid},call:{call.id}}}\n{rutils.mute(newline.join(s))}"
)
# Use default (non-dim) styling for API details so they remain readable in -vv output
return f"{pname}{{pid:{call.thread.process.pid},tid:{call.thread.tid},call:{call.id}}}\n{newline.join(s)}"
def render_short_call(layout: rd.DynamicLayout, addr: frz.Address) -> str:
@@ -180,7 +179,8 @@ def render_short_call(layout: rd.DynamicLayout, addr: frz.Address) -> str:
s.append(f"){rest}")
newline = "\n"
return f"call:{call.id}\n{rutils.mute(newline.join(s))}"
# Use default (non-dim) styling for API details so they remain readable in -vv output
return f"call:{call.id}\n{newline.join(s)}"
def render_static_meta(console: Console, meta: rd.StaticMetadata):

View File

@@ -274,12 +274,8 @@ SUPPORTED_FEATURES[Scope.FUNCTION].update(SUPPORTED_FEATURES[Scope.BASIC_BLOCK])
class InvalidRule(ValueError):
def __init__(self, msg):
super().__init__()
self.msg = msg
def __str__(self):
return f"invalid rule: {self.msg}"
return f"invalid rule: {super().__str__()}"
def __repr__(self):
return str(self)
@@ -289,20 +285,15 @@ class InvalidRuleWithPath(InvalidRule):
def __init__(self, path, msg):
super().__init__(msg)
self.path = path
self.msg = msg
self.__cause__ = None
def __str__(self):
return f"invalid rule: {self.path}: {self.msg}"
return f"invalid rule: {self.path}: {super(InvalidRule, self).__str__()}"
class InvalidRuleSet(ValueError):
def __init__(self, msg):
super().__init__()
self.msg = msg
def __str__(self):
return f"invalid rule set: {self.msg}"
return f"invalid rule set: {super().__str__()}"
def __repr__(self):
return str(self)
@@ -1102,15 +1093,15 @@ class Rule:
@lru_cache()
def _get_yaml_loader():
try:
# prefer to use CLoader to be fast, see #306
# prefer to use CSafeLoader to be fast, see #306; it matches CLoader but restricts construction to safe, standard YAML types
# on Linux, make sure you install libyaml-dev or similar
# on Windows, get WHLs from pyyaml.org/pypi
logger.debug("using libyaml CLoader.")
return yaml.CLoader
logger.debug("using libyaml CSafeLoader.")
return yaml.CSafeLoader
except Exception:
logger.debug("unable to import libyaml CLoader, falling back to Python yaml parser.")
logger.debug("unable to import libyaml CSafeLoader, falling back to Python yaml parser.")
logger.debug("this will be slower to load rules.")
return yaml.Loader
return yaml.SafeLoader
@staticmethod
def _get_ruamel_yaml_parser():
@@ -1152,6 +1143,8 @@ class Rule:
else:
# use pyyaml because it can be much faster than ruamel (pure python)
doc = yaml.load(s, Loader=cls._get_yaml_loader())
if doc is None or not isinstance(doc, dict) or "rule" not in doc:
raise InvalidRule("empty or invalid YAML document")
return cls.from_dict(doc, s)
@classmethod
@@ -1456,6 +1449,13 @@ class RuleSet:
scope: self._index_rules_by_feature(scope, self.rules_by_scope[scope], scores_by_rule) for scope in scopes
}
# Pre-compute the topological index mapping for each scope.
# This avoids rebuilding the dict on every call to _match (which runs once per
# instruction/basic-block/function/file scope, i.e. potentially millions of times).
self._rule_index_by_scope: dict[Scope, dict[str, int]] = {
scope: {rule.name: i for i, rule in enumerate(self.rules_by_scope[scope])} for scope in scopes
}
@property
def file_rules(self):
return self.rules_by_scope[Scope.FILE]
@@ -1885,11 +1885,13 @@ class RuleSet:
"""
done = []
# use a queue of rules, because we'll be modifying the list (appending new items) as we go.
while rules:
rule = rules.pop(0)
# use a list as a stack: append new items and pop() from the end, both O(1).
# order doesn't matter here since every rule in the queue is processed eventually.
rules_stack = list(rules)
while rules_stack:
rule = rules_stack.pop()
for subscope_rule in rule.extract_subscope_rules():
rules.append(subscope_rule)
rules_stack.append(subscope_rule)
done.append(rule)
return done
@@ -1938,11 +1940,11 @@ class RuleSet:
"""
feature_index: RuleSet._RuleFeatureIndex = self._feature_indexes_by_scopes[scope]
rules: list[Rule] = self.rules_by_scope[scope]
# Topologic location of rule given its name.
# That is, rules with a lower index should be evaluated first, since their dependencies
# will be evaluated later.
rule_index_by_rule_name = {rule.name: i for i, rule in enumerate(rules)}
# Pre-computed in __init__ to avoid rebuilding on every _match call.
rule_index_by_rule_name = self._rule_index_by_scope[scope]
# This algorithm is optimized to evaluate as few rules as possible,
# because the less work we do, the faster capa can run.
@@ -2038,7 +2040,9 @@ class RuleSet:
candidate_rules = [self.rules[name] for name in candidate_rule_names]
# Order rules topologically, so that rules with dependencies work correctly.
# Sort descending so pop() from the end yields the topologically-first rule in O(1).
RuleSet._sort_rules_by_index(rule_index_by_rule_name, candidate_rules)
candidate_rules.reverse()
#
# The following is derived from ceng.match
@@ -2053,7 +2057,7 @@ class RuleSet:
augmented_features = features
while candidate_rules:
rule = candidate_rules.pop(0)
rule = candidate_rules.pop()
res = rule.evaluate(augmented_features, short_circuit=True)
if res:
# we first matched the rule with short circuiting enabled.
@@ -2092,6 +2096,7 @@ class RuleSet:
candidate_rule_names.update(new_candidates)
candidate_rules.extend([self.rules[rule_name] for rule_name in new_candidates])
RuleSet._sort_rules_by_index(rule_index_by_rule_name, candidate_rules)
candidate_rules.reverse()
return (augmented_features, results)
@@ -2228,7 +2233,10 @@ def get_rules(
try:
rule = Rule.from_yaml(content.decode("utf-8"))
except InvalidRule:
except InvalidRule as e:
if e.args and e.args[0] == "empty or invalid YAML document":
logger.warning("skipping %s: %s", path, e)
continue
raise
else:
rule.meta["capa/path"] = path.as_posix()

View File

@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
__version__ = "9.1.0"
__version__ = "9.3.1"
def get_major_version():

View File

@@ -0,0 +1,34 @@
# mapa html-map demo
*2026-03-16T17:05:38Z by Showboat 0.6.1*
<!-- showboat-id: 1cf46a16-c3d7-459d-8593-c23080cb12f6 -->
Generate a standalone HTML report for a sample binary and summarize the report contents.
```bash
tmp=$(mktemp /tmp/mapa-html-map-XXXXXX.html)
PYTHONWARNINGS=ignore ./.venv/bin/python -m mapa binaries/01/16/mpbindump.exe --output html-map --quiet > "$tmp"
PYTHONWARNINGS=ignore /usr/bin/python3 - "$tmp" <<"PY"
import json
import re
import sys
from pathlib import Path
text = Path(sys.argv[1]).read_text()
match = re.search(r"<script type=\"application/json\" id=\"mapa-data\">(.*?)</script>", text, re.S)
data = json.loads(match.group(1))
print("doctype", text.splitlines()[0])
print("functions", len(data["functions"]))
print("tags", len(data["tags"]))
print("strings", len(data["strings"]))
PY
rm "$tmp"
```
```output
doctype <!doctype html>
functions 1406
tags 12
strings 81
```
To open the report directly in your browser, use `python -m mapa <sample> --output html-map --open`.

Binary files not shown (4 images removed: 210 KiB, 108 KiB, 110 KiB, 79 KiB)

View File

@@ -0,0 +1,105 @@
# Compilation Unit Boundary Evaluation
Ground truth data from assemblage: ~500K binaries with PDB-derived source file mappings, ~1000 functions each. This document describes how to turn that data into a repeatable evaluation pipeline for CU boundary detection.
## Ground truth definition
Each function in a binary has a source file path from debug info. Source files are either **impl files** (`.c`, `.cpp`, `.cc`, `.cxx`) or **headers** (`.h`, `.hpp`, extensionless STL headers like `vector`, `xmemory`, etc.). A compilation unit is defined by its impl file — header-sourced functions are template instantiations or inline expansions compiled into whichever CU includes them.
To assign each function to a CU:
1. Sort functions by start address.
2. For each function, if its source is an impl file, that's its CU.
3. Otherwise, forward-fill: inherit the CU of the most recent preceding impl function. If none exists (header functions at the start of the binary), backward-fill from the first impl file.
4. Ground truth boundaries = the set of gap indices where `CU[i] != CU[i-1]`.
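The four steps above can be sketched in a few lines of Python. The `(start_address, source_file)` input shape is a simplification for illustration; real input carries full function records.

```python
# Sketch of the forward/backward-fill CU assignment described above.
IMPL_EXTS = (".c", ".cpp", ".cc", ".cxx")

def assign_cus(functions):
    ordered = sorted(functions, key=lambda f: f[0])          # 1. sort by start address
    cus = []
    current = None
    for _, src in ordered:
        if src.lower().endswith(IMPL_EXTS):                  # 2. impl file defines its own CU
            current = src
        cus.append(current)                                  # 3. forward-fill header functions
    first_impl = next((c for c in cus if c is not None), None)
    cus = [c if c is not None else first_impl for c in cus]  # 3b. backward-fill the prefix
    boundaries = {i for i in range(1, len(cus)) if cus[i] != cus[i - 1]}  # 4. gap indices
    return cus, boundaries
```

Note that extensionless STL headers like `vector` never match the impl-extension tuple, so they fall through to the fill steps as intended.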
This assignment was validated on 10 binaries (21K functions). Every ground truth boundary has an impl file on at least one side — there are zero boundaries where both flanking functions come from headers. Forward-fill produces CU fragmentation ratios of 1.0-1.2x (nearly perfectly contiguous CUs).
## What the data looks like
Source file paths fall into classifiable categories based on filesystem structure. User code lives under `c:\assemblage\builds\HASH\PROJECT\*.cpp` (not under `external/`). Vendored libraries are at `...\PROJECT\external\LIBNAME\*`. MSVC STL headers are under `C:\Program Files\...\MSVC\...\include\` (e.g., `vector`, `xmemory`). CRT sources come from `D:\a\_work\...\vctools\crt\...` or `d:\th\minkernel\crts\...`.
Library identity is derivable from the path prefix. The assemblage build system makes external libraries explicit via the `external/LIBNAME/` convention.
### Template/STL interleaving
Functions from STL headers appear heavily interspersed within CUs. In SmxDecompiler.exe, 72% of functions come from MSVC STL headers, appearing in runs of up to 258 functions within a single CU. These are per-CU template instantiation copies (the same `std::_Verify_range` appears in 4 different CUs), not independent entities.
This means "source file changed" is a terrible boundary signal because it fires on every header-to-impl or header-to-header transition within a CU. On 3DSIFT.exe (heavy Eigen template use), source-file-change achieves only 3% precision. The actual CU boundary rate is low: only 5.9% of all inter-function gaps are real CU boundaries.
### CRT fragmentation
CRT code (from the ucrt static library) shows heavy fragmentation: dozens of tiny 1-3 function CUs interleaved by the linker. These are the hardest boundaries to detect and arguably the least useful to distinguish (separating `delete_scalar.cpp` from `new_array.cpp` has limited practical value). Consider collapsing all CRT functions into a single "CRT" meta-CU for scoring, or reporting CRT and non-CRT scores separately.
### Multi-fragment CUs
Some impl files appear in non-contiguous blocks (e.g., `inflate.c` interrupted by `infback.c`/`inffast.c` in zlib). This is uncommon in application code but frequent in CRT. For boundary scoring, treat each contiguous fragment as its own segment — the metric measures where transitions occur regardless of whether distant segments share a label.
## Metrics
### Primary: Boundary F1
For a predicted set of boundary gap indices vs the ground truth set, compute precision (fraction of predicted boundaries that are real), recall (fraction of real boundaries that are predicted), and F1 (harmonic mean). This directly measures the "where are the seams?" question. It's strict: a boundary predicted one function off from the true position scores as one FP + one FN.
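A minimal sketch of this scoring, operating on sets of gap indices:

```python
def boundary_f1(predicted: set, truth: set) -> tuple:
    # strict set-overlap scoring: an off-by-one prediction counts as one FP plus one FN
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```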
### Secondary: WindowDiff
Standard text segmentation metric, more forgiving of near-misses. Slides a window of size `k` across the sequence and counts positions where the number of boundaries within the window differs between prediction and ground truth. Lower is better. Use `k = avg_segment_length / 2`.
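A sketch of the standard formulation, over 0/1 boundary-indicator sequences (one indicator per inter-function gap):

```python
def window_diff(ref: list, hyp: list, k: int) -> float:
    # ref/hyp: 0/1 boundary indicators per gap; slide a width-k window and count
    # positions where the number of boundaries inside the window disagrees
    n = len(ref)
    errors = sum(
        1
        for i in range(n - k + 1)
        if sum(ref[i:i + k]) != sum(hyp[i:i + k])
    )
    return errors / (n - k + 1)
```

Unlike boundary F1, a prediction one gap off still overlaps most windows, so it incurs only a small penalty.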
### Library-level metrics
| Problem | Ground truth | Metric |
|---------|-------------|--------|
| CU boundaries | Impl-file assignment (above) | Boundary F1 + WindowDiff |
| Library clustering | Path-derived category (user-code, external:zlib, crt, msvc-stdlib, ...) | Adjusted Rand Index |
| User vs library | Binary classification from category | Per-function accuracy |
These are separate evaluations that can be run independently and solved in order: CU boundaries first, then library grouping, then user/library classification.
## Baseline results
Tested on the 10-binary sample (21K functions, 1238 CU boundaries):
| Heuristic | Precision | Recall | F1 range |
|-----------|-----------|--------|----------|
| No boundaries | - | 0.00 | 0.00 |
| Every source-file change | 0.03-0.79 | 1.00 | 0.06-0.88 |
| Category change | 0.11-0.50 | 0.02-0.28 | 0.04-0.29 |
| Impl-file stream tracking | 1.00 | 1.00 | 1.00 |
The impl-file oracle (skip headers, flag when the .c/.cpp changes) achieves perfect F1. This is the ceiling; it confirms that CU boundaries are exactly the impl-file transitions in the source data. The gap between the naive baselines and this oracle is the space we're trying to close using binary-level features.
## Evaluation pipeline
### Step 1: Ground truth extraction
From debug info, produce a normalized table per binary:
```
binary_hash | func_index | func_start | func_end | cu_id | library_category
```
`cu_id` is the impl file path (or a hash of it). `library_category` is derived from the path prefix. Store as Parquet or SQLite since the CSV format won't scale to 500K binaries.
Pre-compute per-binary boundary sets as arrays of gap indices.
### Step 2: Predictor interface
A predictor takes a binary's function list (start addresses and sizes only, no source info, no debug info) and returns a set of predicted boundary indices. The predictor can use any features extractable from the raw binary: inter-function gap sizes and padding patterns, call graph edge density across each gap, data reference patterns, string attribution, function naming (if symbols present), alignment byte patterns, etc.
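As an illustration of that interface, a naive gap-size baseline over the listed features; the 64-byte threshold is an arbitrary placeholder, not a tuned value:

```python
def gap_size_predictor(starts: list, sizes: list, threshold: int = 64) -> set:
    # flag gap i (between functions i-1 and i) when the inter-function padding is
    # large; threshold is illustrative only
    return {
        i
        for i in range(1, len(starts))
        if starts[i] - (starts[i - 1] + sizes[i - 1]) >= threshold
    }
```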
### Step 3: Scoring
For each binary, compute boundary F1 between prediction and ground truth. Aggregate across the dataset:
Report both macro-average (mean F1 across binaries, treating each binary equally) and micro-average (pool all gaps, compute F1 once, biased toward larger binaries). Add per-category breakdowns (user code only, CRT only, etc.) to understand where the predictor succeeds and fails.
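The two aggregations can be sketched as follows, with each binary's boundaries mapped into a disjoint global index space for the pooled (micro) score:

```python
def aggregate_f1(per_binary: list) -> tuple:
    # per_binary: list of (predicted_set, truth_set) boundary-index pairs
    def f1(pred, truth):
        tp = len(pred & truth)
        p = tp / len(pred) if pred else 0.0
        r = tp / len(truth) if truth else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    # macro: each binary weighted equally
    macro = sum(f1(p, t) for p, t in per_binary) / len(per_binary)

    # micro: pool every gap into one global index space, then score once
    pooled_pred, pooled_truth, offset = set(), set(), 0
    for pred, truth in per_binary:
        pooled_pred |= {i + offset for i in pred}
        pooled_truth |= {i + offset for i in truth}
        offset += max(pred | truth, default=0) + 1
    micro = f1(pooled_pred, pooled_truth)
    return macro, micro
```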
### Step 4: Iterate
Try different feature combinations and scoring strategies, compare F1. Per mapa-cu-ideas.md, the boundary scorer is the spine of the pipeline and every algorithm becomes a feature generator for gap scoring.
## Design decisions
Boundary F1 was chosen over clustering metrics (ARI, NMI) because those conflate boundary detection with label assignment. We want to decouple the two: find boundaries first, then label segments.
Forward-fill for CU assignment matches how compilation works. The compiler processes the impl file's own functions first, then template instantiations from included headers. Header functions belong to the most recent impl file. The 1.0-1.2x fragmentation ratio validates this model.
Header-to-header transitions are never counted as boundaries because the data shows zero CU boundaries where both flanking functions come from headers. Every CU has at least one impl file as its root. Template instantiations from different headers within the same CU are not separate CUs.

doc/plans/design.md Normal file
View File

@@ -0,0 +1,100 @@
# mapa design
## Architecture
Seven layers, each in its own module under the `mapa/` package:
| Module | Responsibility |
|---|---|
| `mapa/model.py` | Backend-neutral dataclasses: `MapaReport`, `MapaMeta`, `MapaSection`, `MapaLibrary`, `MapaFunction`, `MapaCall`, `MapaCaller`, `MapaString`, `MapaProgramString`, `AssemblageRecord` |
| `mapa/assemblage.py` | Assemblage CSV loading, column validation, SHA-256 filtering, RVA-to-VA conversion, exact-row deduplication |
| `mapa/ida_db.py` | IDA database lifecycle: `resolve_database()`, `open_database_session()`, SHA-256 caching, flock-based concurrency guard |
| `mapa/collector.py` | Populates `MapaReport` from an open `ida_domain.Database`. All IDA queries live here. |
| `mapa/renderer.py` | Rich-based text rendering and plain single-function summary formatting from `MapaReport`. No IDA dependency. |
| `mapa/html_renderer.py` | Self-contained `html-map` rendering from `MapaReport`. No IDA dependency. |
| `mapa/cli.py` | Argument parsing, capa/assemblage loading, output-mode selection, `--open` temp-file/browser handling, orchestration |
`scripts/mapa.py` is a thin entry point that delegates to `mapa.cli.main()`.
The CLI validates output-mode combinations before analysis. For `--output html-map --open`, it renders the HTML once, writes it to a temporary `.html` file via `NamedTemporaryFile(delete=False)`, and opens the browser with `webbrowser.open(file://...)`.
## Database lifecycle
Modeled on `idals.py` from idawilli:
1. If input is `.i64`/`.idb`, use directly.
2. Otherwise, hash the file (MD5 + SHA-256), check `~/.cache/mandiant/mapa/<sha256>.i64`.
3. On cache miss: acquire advisory flock, create database via `Database.open()` with `IdaCommandOptions(auto_analysis=True, new_database=True, output_database=..., load_resources=True)`, wait for `ida_auto.auto_wait()`.
4. On cache hit or after creation: open read-only with `new_database=False, save_on_close=False`.
5. Concurrency guard: poll for `.nam` file disappearance + `fcntl.flock` on `<db>.lock` + TOCTOU re-check.
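A minimal sketch of the cache-miss path; the `build` callback stands in for the `Database.open(...)` plus `auto_wait()` sequence, and the `.nam` polling step is omitted:

```python
import fcntl
import hashlib
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "mandiant" / "mapa"

def cached_db_path(sample: Path) -> Path:
    sha256 = hashlib.sha256(sample.read_bytes()).hexdigest()
    return CACHE_DIR / f"{sha256}.i64"

def ensure_database(db_path: Path, build) -> Path:
    # build(db_path) is a placeholder for database creation via IDA
    if db_path.exists():
        return db_path
    db_path.parent.mkdir(parents=True, exist_ok=True)
    lock_path = db_path.with_suffix(".lock")
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # advisory lock against concurrent builders
        if not db_path.exists():           # TOCTOU re-check after acquiring the lock
            build(db_path)
    return db_path
```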
## Assemblage loading
Assemblage loading is deferred until after mapa opens the IDA database, because the effective sample SHA-256 may come from either the raw input file or the database metadata.
`mapa.assemblage.load_assemblage_records()`:
- reads the CSV with `csv.DictReader`
- requires `hash`, `name`, `start`, `end`, and `source_file`
- filters rows by sample SHA-256, case-insensitively
- converts `start` and `end` RVAs to VAs by adding `db.base_address`
- strips the trailing provenance suffix from `source_file` only at render time, via `AssemblageRecord.source_path`
- deduplicates exact duplicate rows while preserving CSV order for distinct ambiguous matches
The result is `dict[int, list[AssemblageRecord]]`, keyed by function start VA.
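A sketch of the loader's shape; the signature is assumed from the description above, and real records are `AssemblageRecord` dataclasses rather than raw dicts:

```python
import csv
from collections import defaultdict

REQUIRED_COLUMNS = {"hash", "name", "start", "end", "source_file"}

def load_assemblage_records(csv_path, sample_sha256, base_address):
    by_start = defaultdict(list)
    seen_rows = set()
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"assemblage CSV missing columns: {sorted(missing)}")
        for row in reader:
            if row["hash"].lower() != sample_sha256.lower():   # case-insensitive SHA filter
                continue
            key = tuple(sorted(row.items()))
            if key in seen_rows:                               # drop exact duplicates only
                continue
            seen_rows.add(key)
            start_va = int(row["start"]) + base_address        # RVA -> VA
            by_start[start_va].append(row)
    return dict(by_start)
```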
## Collector design
The collector builds several indexes before the main function loop:
- import_index: `dict[int, (module, name)]` from `db.imports.get_all_imports()`
- extern_addrs: `set[int]` from functions in XTRN segments
- thunk_targets: `dict[int, int]` via `_resolve_thunk_target()` — follows code refs then data refs, max depth 5, single-target chains only
- resolved_callers/callees: built by walking all non-thunk function flowcharts, resolving call targets through thunk chains, classifying as internal vs API
String extraction follows single data-reference chains from each instruction up to depth 10. The collector returns both the discovered string VA and the raw string value for each hit.
The collector stores string data in two shapes:
- `MapaFunction.strings` for the text report and tooltip summaries. These stay function-local and deduplicate by trimmed display value.
- `MapaReport.program_strings` for `html-map`. These are keyed by string VA, preserve duplicate display values at different addresses, merge tags across repeated references, and track the set of referencing function addresses.
Assemblage data is attached per function during collection. `MapaFunction.assemblage_records` carries zero or more `AssemblageRecord` values for the function start address. The collector does not use Assemblage to rename functions, callers, or callees.
## ida-domain API usage
Primary queries used:
- `db.functions` — iteration, `get_at()`, `get_name()`, `get_flags()`, `get_flowchart()`
- `db.segments.get_all()` — section enumeration
- `db.imports.get_all_modules()`, `get_all_imports()` — library/import enumeration
- `db.xrefs.code_refs_from_ea()`, `data_refs_from_ea()`, `calls_from_ea()` — call/thunk resolution
- `db.instructions.is_call_instruction()`, `get_mnemonic()` — instruction classification
- `db.heads.size()` — instruction byte size
- `FlowChart` with `FlowChartFlags.NOEXT | FlowChartFlags.PREDS` — CFG traversal
- `FunctionFlags.THUNK`, `FunctionFlags.LIB` — function classification
No legacy `ida_*` module calls are used. All queries go through `ida-domain`.
## Rendering
`mapa/renderer.py` prints the text report in function address order. For each function, it prints the IDA-derived header first and then any Assemblage annotations as `assemblage name:` and `assemblage file:` lines. When multiple distinct Assemblage rows map to one function start address, the renderer prints all of them in order.
The text renderer also exposes a plain single-function summary formatter used by `html-map` tooltips. The row order matches text mode: Assemblage lines, xrefs, CFG stats, capa matches, internal calls, APIs, and strings.
For source-file separators, mapa uses the first Assemblage record's normalized source path as the function's primary source path. The text renderer tracks the last seen non-empty primary path across the function list. Missing Assemblage data does not trigger a separator and does not reset that state. When a later function introduces a different primary path, the renderer prints a muted horizontal rule with `[ <path> ]` immediately before that function.
`mapa/html_renderer.py` renders a single self-contained HTML document. It emits a split view: a left function pane and a right string pane, both with independent scrolling. The panes are separated by a draggable vertical divider implemented with a small inline pointer-event handler. The renderer emits:
- one square per function in address order
- one program-string row per string VA in address order
- tag controls with visible function counts
- direction and depth controls for neighborhood traversal
- a small legend for heat/seed/dim semantics
- right-aligned visible tags in each string row
- inline JSON data for function summaries, direct tag memberships, direct string memberships, and caller/callee adjacency
- a single floating tooltip and a small inline script
That script resolves the active seed source from a hovered or locked function, tag, or string; chooses caller-only, callee-only, or undirected traversal; runs a bounded breadth-first search from each seed; sums geometric-decay contributions using per-seed shortest distance; and renders the result as a heat overlay with a distinct seed outline.
## String tagging
Vendored Quantum Strand string databases live under `mapa/string_tags/data/` in five families: OSS/CRT libraries (gzipped JSONL), expert rules (plain JSONL), Windows API names (gzipped text), global prevalence (gzipped JSONL + binary hash files), and junk-code strings (gzipped JSONL).
The `mapa/string_tags/` package has three modules:
- `model.py``StringTagMatch` and `StringTagResult` dataclasses
- `loaders.py` — file-format readers using `msgspec`, `gzip`, `hashlib`, and `importlib.resources`
- `tagger.py``StringTagger` class with `tag_string(raw) -> StringTagResult`, plus `load_default_tagger()` which lazily loads and caches all databases process-wide
The collector tags raw strings before `rstrip()` trimming. When two raw strings collapse to the same display value, their tags and match metadata are merged. `MapaString` carries `tags: tuple[str, ...]` and `tag_matches: tuple[StringTagMatch, ...]`.
The text renderer uses a Rich `Text`-based helper to right-align the visible tag column on `string:` rows. The HTML renderer reuses the same visible-tag policy, builds its top tag controls from those visible tags only, shows the distinct-function count for each visible tag, and renders the visible tags right-aligned in each program-string row. The visible tag policy suppresses `#common` when a more-specific tag is also present.

doc/plans/mapa-cu-ideas.md Normal file
View File

@@ -0,0 +1,256 @@
> read problem-transcript.md and digest it, summarize it, restructure it and explain it.
>
> ok. now ultrathink. consider the problem deeply. try to give some insights, some novel perspective, ideas for techniques and algorithms. what strikes you?
I've been sitting with this for a while. Here's what strikes me, organized from the structural observations that feel most fundamental down to concrete algorithmic ideas.
------
## The Deep Structure: A Graph Pinned to a Line
The most mathematically distinctive thing about this problem is that you have a *graph* (the call graph, the data xref graph) whose nodes are *pinned to a one-dimensional manifold* (the address space). This is a very specific structure that most generic graph algorithms ignore entirely. It means you're not just doing community detection — you're doing community detection where the communities are *intervals*. That's a massive constraint that dramatically shrinks the search space.
Think about it: if you have 10,000 functions, the number of arbitrary partitions is astronomically large. But the number of ways to partition 10,000 linearly-ordered nodes into *contiguous groups* is only 2^9,999 (one cut-or-not choice at each of the 9,999 gaps), and if you're looking for, say, 20 boundaries, it's just C(10000, 20) — large, but tractable with dynamic programming. The contiguity assumption turns a combinatorially explosive problem into something closer to 1D segmentation, which is a much better-understood and more tractable class of problems.
## The Call Edge Length Distribution — A Fingerprint of Structure
Here's an idea I find really compelling. For every edge in the call graph, compute the *address distance* between caller and callee: `|addr(A) - addr(B)|`. Now plot the histogram of these edge lengths across the whole binary.
I'd expect this histogram to be multimodal. There should be a peak near zero representing intra-compilation-unit calls (functions calling their neighbors). There should be a broader hump at medium range representing intra-library cross-CU calls. And there should be long-range outliers representing cross-boundary calls — where user code calls into a library at a distant address.
The valleys between these modes are telling you the *characteristic scales* of the hierarchy. The first valley gives you the typical size of a compilation unit. The second gives you the typical size of a library. This is essentially a multi-scale analysis of the binary's structure, and you get it almost for free before running any clustering algorithm. You could use those characteristic distances as thresholds for later steps.
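As a sketch of the computation, with log2 bucketing (an illustrative choice) so the modes separate visibly:

```python
import math
from collections import Counter

def edge_length_histogram(call_edges):
    # call_edges: iterable of (caller_addr, callee_addr) pairs.
    # log2 buckets expose the intra-CU / intra-library / cross-library modes.
    hist = Counter()
    for caller, callee in call_edges:
        distance = abs(caller - callee)
        bucket = int(math.log2(distance)) if distance else 0
        hist[bucket] += 1
    return dict(sorted(hist.items()))
```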
## Min-Cut / Max-Flow for Boundary Detection
The transcript describes a "tension" along paths between known first-party anchors (like `main`) and known library anchors (like FLIRT-matched or string-anchored functions). This maps almost perfectly to a **minimum cut problem**.
Set it up like this: create a super-source *S* connected to all known first-party nodes, and a super-sink *T* connected to all known library nodes. The edges in the call graph get capacities. What should the capacity be? This is where you encode your priors — you could weight by *inverse address distance* (so cutting a long-range edge is cheap, cutting a short-range edge between neighbors is expensive), or by the number of shared data references, or both. Then the min-cut between S and T partitions the graph into first-party and third-party, and the *cut edges themselves are literally the API boundary* — the places where user code transitions into library code.
This is elegant because it naturally handles ambiguity. You don't have to hand-define a propagation rule; the flow algorithm figures out where the cheapest place to draw the line is. And you can add new anchors (from FLIRT, strings, human input) simply by connecting them to S or T, and the cut re-adjusts.
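A self-contained sketch of the setup; the `min_cut_partition` name and the Edmonds-Karp implementation are illustrative (a graph library's max-flow routine would do the same job), and the anchor sets are assumed disjoint:

```python
from collections import defaultdict, deque

def min_cut_partition(call_edges, first_party, library, capacity):
    # Max-flow / min-cut between super-source "S" (first-party anchors) and
    # super-sink "T" (library anchors). capacity(a, b) encodes the prior,
    # e.g. inverse address distance.
    cap = defaultdict(lambda: defaultdict(float))
    for a, b in call_edges:
        cap[a][b] += capacity(a, b)
    for f in first_party:
        cap["S"][f] = float("inf")
    for l in library:
        cap[l]["T"] = float("inf")

    def bfs_path():
        parent = {"S": None}
        queue = deque(["S"])
        while queue:
            u = queue.popleft()
            for v, c in list(cap[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    if v == "T":
                        return parent
                    queue.append(v)
        return None

    while (parent := bfs_path()) is not None:
        path, v = [], "T"
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][w] for u, w in path)
        for u, w in path:                 # push flow, maintain residual capacities
            cap[u][w] -= flow
            cap[w][u] += flow

    # nodes still reachable from S in the residual graph form the first-party side
    reachable, queue = {"S"}, deque(["S"])
    while queue:
        u = queue.popleft()
        for v, c in list(cap[u].items()):
            if c > 0 and v not in reachable:
                reachable.add(v)
                queue.append(v)
    # the saturated edges crossing the partition are the API boundary
    boundary = [(a, b) for a, b in call_edges if a in reachable and b not in reachable]
    return reachable - {"S"}, boundary
```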
## The Genome Annotation Analogy
This problem is strikingly similar to **genome annotation** in bioinformatics. You have a long linear sequence (the binary). It contains functional regions (genes = functions). These regions cluster into operons and gene families (compilation units, libraries). You have some known markers (homolog matches ≈ FLIRT signatures). You want to segment the sequence into regions of different character, exploiting the fact that related elements are co-located.
The classic tool for this is the **Hidden Markov Model**. Imagine an HMM where the hidden states represent "this function belongs to first-party code," "this function belongs to library A, CU 1," etc. The observations at each state are the features of that function — its string references, its call graph characteristics, its instruction patterns. The transition probabilities encode the contiguity assumption: staying in the same CU is high probability, transitioning between CUs within a library is medium, jumping between libraries is low.
You'd train this in a semi-supervised way, anchoring some states with your ground truth, and letting Viterbi or forward-backward propagate through the rest. The HMM framework is natural here because the binary *is* a sequence, and the Markov property (what matters is what group you're currently in, not the full history) matches the linker's behavior well.
## Personalized PageRank as a Soft Classification
Here's a technique that handles the graph propagation cleanly. Run **personalized PageRank** (PPR) seeded from each anchor set. One PPR vector seeded from `main` gives you a "reachability-from-first-party" score for every node. Another seeded from known zlib functions gives a "zlib affinity" score. A third from OpenSSL anchors gives an "OpenSSL affinity" score.
Each function is then softly classified by which PPR vector dominates at that node. The beauty is that PPR naturally decays with graph distance and respects the connectivity structure — a tightly-knit library will "trap" the random walk, giving high PPR scores internally and low scores outside. The boundary between two regions is where the competing PPR scores cross over.
You could even visualize this as a heat map over the address space, and the boundaries would pop out visually.
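A power-iteration sketch; the restart probability `alpha` is an assumed default, and `adj` must list every node as a key:

```python
def personalized_pagerank(adj, seeds, alpha=0.15, iterations=100):
    # adj: dict[node, list[node]] call-graph adjacency; seeds: anchor set that
    # defines the restart distribution
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in adj}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: alpha * restart[n] for n in adj}
        for n, outs in adj.items():
            if outs:
                share = (1 - alpha) * score[n] / len(outs)
                for m in outs:
                    nxt[m] += share
            else:
                # dangling node: return its mass to the restart distribution
                for m, r in restart.items():
                    nxt[m] += (1 - alpha) * score[n] * r
        score = nxt
    return score
```

Running this once per anchor set yields one affinity vector per library; each function is softly classified by the dominant vector.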
## Exploiting the Bipartite Structure: Code ↔ Data
The transcript mentions that data locality mirrors code locality, but I think this signal is underexploited if you treat it as a separate, independent confirmation channel. Consider the *bipartite graph* between functions and the data items they reference. Two functions that reference overlapping or nearby data are likely from the same compilation unit. This gives you a similarity measure between functions that's *completely independent* of the call graph.
You could build a function-to-function similarity matrix where similarity = how much their referenced data overlaps or how close their referenced data is in the data section. Then cluster on this matrix. The resulting clusters should correspond to compilation units, and they give you a second opinion that you can reconcile with the call-graph-derived clusters.
When both signals agree, you have high confidence. Where they disagree, something interesting is happening — maybe a utility function shared across CUs, or a data section that got reordered.
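The data-overlap similarity is simple enough to sketch directly; the `data_refs` mapping (function address to referenced data addresses) is invented for illustration:

```python
# Sketch: function-to-function similarity from shared data references.
data_refs = {
    0x401000: {0x500010, 0x500020},   # first-party CU
    0x401080: {0x500020, 0x500030},   # first-party CU
    0x402000: {0x580000, 0x580040},   # zlib CU
    0x402100: {0x580040, 0x580080},   # zlib CU
}

def data_similarity(f, g):
    """Jaccard overlap of the data items two functions reference."""
    a, b = data_refs[f], data_refs[g]
    return len(a & b) / len(a | b)

# Functions from the same CU share data; cross-CU pairs share nothing.
same_cu = data_similarity(0x401000, 0x401080)
cross_cu = data_similarity(0x401000, 0x402000)
print(same_cu, cross_cu)
```

A real version would also reward *nearby* (not just identical) data addresses, but even plain overlap gives a call-graph-independent similarity matrix to cluster on.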
## Directed Graph Asymmetry as a Library Detector
Here's a subtlety the transcript alludes to but doesn't fully develop. Call graphs are *directed*, and the directionality matters. Library code has a specific shape: it gets **called from far away** (user code invokes the library's public API) but **makes calls to nearby functions** (internal library routines). So a library entry point has high *in-degree from distant addresses* but low *out-degree to distant addresses*.
You could compute, for each function, two metrics: the average address distance of its callers, and the average address distance of its callees. Plot every function on this 2D plane. I'd expect clusters to emerge — first-party code has short-range callers and a mix of callees (some local, some long-range calls into libraries), while library internals have short-range everything. Library *entry points* are the distinctive ones: long-range callers, short-range callees. They're the "necks" between the communities.
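The two metrics are cheap to compute; the call edges and addresses below are invented for illustration:

```python
# Sketch: per-function mean caller/callee address distances.
calls = [  # (caller_address, callee_address)
    (0x401000, 0x401050),  # short-range first-party call
    (0x401050, 0x480000),  # long-range call into a library entry point
    (0x402000, 0x480000),  # another distant caller of the same entry point
    (0x480000, 0x480040),  # library internals: short-range
    (0x480040, 0x480080),
]

def caller_callee_distances(func):
    caller_d = [abs(c - func) for c, t in calls if t == func]
    callee_d = [abs(t - func) for c, t in calls if c == func]
    mean = lambda xs: sum(xs) / len(xs) if xs else None
    return mean(caller_d), mean(callee_d)

# The library entry point: long-range callers, short-range callees.
entry_metrics = caller_callee_distances(0x480000)
print(entry_metrics)
```

Plotting every function as a point (mean caller distance, mean callee distance) is then a one-liner in any plotting library, and the entry points should sit in the long-caller/short-callee corner.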
## The Bisection Oracle Strategy
The "git bisect" idea from the transcript is really interesting and I think it deserves formal treatment. If you have an expensive oracle (heavy analysis, LLM query, human analyst) and you want to minimize queries, this is an **active learning** problem.
Concretely: you have a sequence of functions ordered by address, and you know that somewhere in this sequence there's a boundary between first-party and library code. You believe the sequence is monotonic (once you cross from first-party to library, you don't come back — at least within a region). Binary search finds the boundary in O(log n) queries. But real binaries might have multiple interleaved regions, so you'd need a slightly more sophisticated approach — maybe something like the **multi-armed bandit** formulation, where each candidate boundary location has an uncertainty estimate, and you query the point that maximally reduces global uncertainty.
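For the single-boundary monotonic case, the O(log n) search is standard binary search over the address-ordered function list. A sketch with a simulated oracle (in practice the oracle would be a heavy analysis, an LLM query, or an analyst; the numbers are invented):

```python
functions = list(range(100))   # functions in address order
TRUE_BOUNDARY = 63             # index of the first library function (ground truth)
queries = 0

def oracle_is_library(i):
    """Stand-in for the expensive oracle; counts how often it is consulted."""
    global queries
    queries += 1
    return i >= TRUE_BOUNDARY

def find_boundary(n):
    """Smallest index the oracle labels 'library' (binary search)."""
    lo, hi = 0, n              # invariant: boundary lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if oracle_is_library(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

boundary = find_boundary(len(functions))
print(boundary, queries)       # boundary found in at most ceil(log2 n) queries
```

The multi-region generalization replaces this with uncertainty-driven query selection, but the monotonic case already shows why oracle cost drops from O(n) to O(log n).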
## Strongly Connected Components as Ground Truth CU Detectors
One more concrete trick: compute the **strongly connected components** of the call graph. Mutual recursion (A calls B, B calls A) almost always happens within a single compilation unit. So SCCs give you "for free" sets of functions that definitely belong together. These become seeds for clustering — merge each SCC into a single super-node, then run your boundary detection on the reduced graph.
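A self-contained sketch using Kosaraju's algorithm (a standard SCC method, standing in for whatever graph library MAPA would actually use); the toy call graph is invented, with A/B and C/D mutually recursive:

```python
graph = {
    "A": ["B", "E"],
    "B": ["A"],
    "C": ["D"],
    "D": ["C"],
    "E": [],
}

def sccs(graph):
    """Strongly connected components via Kosaraju's two-pass algorithm."""
    order, seen = [], set()

    def dfs(u):                       # first pass: record finish order
        seen.add(u)
        for v in graph[u]:
            if v not in seen:
                dfs(v)
        order.append(u)

    for u in graph:
        if u not in seen:
            dfs(u)

    rev = {u: [] for u in graph}      # reversed graph for the second pass
    for u, vs in graph.items():
        for v in vs:
            rev[v].append(u)

    comps, assigned = [], set()
    for u in reversed(order):         # second pass: flood-fill on reversed edges
        if u in assigned:
            continue
        comp, stack = set(), [u]
        while stack:
            x = stack.pop()
            if x in assigned:
                continue
            assigned.add(x)
            comp.add(x)
            stack.extend(rev[x])
        comps.append(comp)
    return comps

components = sorted(sccs(graph), key=min)
print(components)
```

Each non-trivial component ({A, B} and {C, D} here) becomes a super-node that is guaranteed same-CU before any clustering runs.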

---
## Synthesis
If I were designing the pipeline, I'd layer these techniques:
Start with **cheap deterministic anchors** — FLIRT matches, string attribution, `main` identification. Then use the **edge-length histogram** to estimate the characteristic scales. Run **SCC detection** to collapse definite-same-CU functions. Build edge weights from address proximity and shared data references, then use **min-cut** from known-first-party to known-library to find the primary boundary. Within each side, use **agglomerative clustering** on the address axis, breaking only at points where the call graph edge density drops off, to find compilation unit boundaries. Use **personalized PageRank** from each library's anchors for soft assignment. Finally, reconcile against the **data xref bipartite clustering** as independent validation.
The result would be a hierarchical segmentation of the address space with confidence scores, and the boundaries would be exactly the API transition points that a reverse engineer cares about.

---
These are good, and the interesting thing is where they overlap with what I already proposed — and more importantly, where the *gaps between them* reveal something deeper. Let me think through what these trigger.

---
> here are some additional ideas. do they cause you to have any new insights or ideas?
>
> To transition MAPA from a good heuristic tool to a cutting-edge analytical engine, here are four algorithms that fit perfectly into their architecture:
>
> Algorithm 1: Spatial-Graph Label Propagation (Solving the "Tension")
>
> Instead of manually traversing graphs, use Label Propagation.
>
> 1. Build an adjacency matrix of the functions.
> 2. Weight the edges by both call frequency AND physical memory proximity. (e.g., Function A and B are strongly linked if A calls B and they are 100 bytes apart).
> 3. Lock the "Anchor" nodes (Ground Truth).
> 4. Run the propagation algorithm. It will mathematically resolve the "tension" the speaker mentioned, assigning a probability score to every single function (e.g., "Function 0x40100 is 88% likely to be zlib").
>
> Algorithm 2: Spectral Clustering for "Compilation Units"
>
> The speaker wants to identify individual compilation units (source files). Because linkers lay out source files contiguously, compilation units form ultra-dense "cliques" in the graph.
>
> - Use Spectral Graph Partitioning. By calculating the Laplacian of the call graph, the algorithm will naturally shatter the binary into discrete chunks.
> - Because compilation units are sequential in memory, you can penalize the clustering algorithm for grouping functions that are physically far apart. The output will be distinct, colored blocks perfectly mapping to the original .c or .cpp files.
>
> Algorithm 3: Minimum Cut / Maximum Flow for API Boundaries
>
> To find the exact boundary between user code and library code, treat the graph like a pipe network carrying water from main to the library strings.
> - Run a Min-Cut/Max-Flow algorithm. The algorithm will identify the "bottleneck" edges where water struggles to pass from the malware to the library.
> - Those bottlenecks are, almost by definition, the public API functions of the statically linked library.
## The Three Algorithms Are Solving Three Different Sub-Problems
The most important realization is that these three aren't competing approaches — they're a **layered pipeline** attacking different levels of the hierarchy. Min-cut finds the macro boundary (first-party vs. library). Spectral clustering finds the meso-structure (compilation units). Label propagation assigns soft identities (which library, what confidence). That ordering matters, and each stage's output feeds the next.
But there's something missing from this pipeline: none of them address **how to discover the right number of groups**. Min-cut gives you a bisection, but a real binary might have first-party code plus six libraries. Spectral clustering needs you to pick *k*. Label propagation needs seed labels. So the first real question is: **how many libraries are we even looking for?**
## The Eigengap Tells You How Many Libraries Exist
This is where the spectral approach becomes more powerful than the proposal suggests. Don't just use spectral clustering to *partition* — first use the **eigenvalue spectrum of the graph Laplacian** to *discover* the structure. Specifically, compute the first, say, 50 eigenvalues of the address-weighted call graph Laplacian and look for the **eigengap** — the largest jump between consecutive eigenvalues. If the first three eigenvalues are near zero and then the fourth jumps up, that's telling you the graph has roughly three natural communities. You've learned the number of libraries (plus first-party code) before doing any clustering at all.
Even better: the **eigenvectors** themselves, when plotted against function address, should look like step functions. Within a compilation unit, all functions have similar eigenvector values (because they're tightly coupled). At a boundary, the values jump. So you can literally do **change-point detection on the Fiedler vector** (the eigenvector corresponding to the second-smallest eigenvalue) plotted over the address axis, and the change points are your compilation unit boundaries. This fuses the spectral and spatial approaches in a way that's more principled than penalizing distance in the clustering objective.
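Both ideas are easy to demonstrate on a planted example. A numpy sketch on an invented address-ordered graph with three dense communities (functions 0-4, 5-9, 10-14) joined by weak links:

```python
import numpy as np

n = 15
A = np.zeros((n, n))
for block in (range(0, 5), range(5, 10), range(10, 15)):
    for i in block:
        for j in block:
            if i < j:
                A[i, j] = A[j, i] = 1.0   # dense intra-community edges
A[4, 5] = A[5, 4] = 0.1                   # weak links across boundaries
A[9, 10] = A[10, 9] = 0.1

L = np.diag(A.sum(axis=1)) - A            # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)      # ascending eigenvalues

# Three near-zero eigenvalues, then a jump: the eigengap estimates k = 3.
k = int(np.argmax(np.diff(eigvals[:8]))) + 1
print("estimated communities:", k)

# The Fiedler vector is a near-step function over the address axis; its
# two largest jumps sit exactly at the planted boundaries (gaps 4 and 9).
fiedler = eigvecs[:, 1]
jumps = np.abs(np.diff(fiedler))
boundaries = sorted(np.argsort(jumps)[-2:].tolist())
print("boundary gaps:", boundaries)
```

Change-point detection on the Fiedler vector here degenerates to "take the largest jumps"; on real binaries a proper change-point method would replace the top-k step.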
## The Weighting Scheme Deserves More Thought
Algorithm 1 proposes weighting edges by call frequency AND memory proximity. This sounds right but the *combination function* matters enormously. Additive weighting (`w = α·calls + β·proximity`) treats them as interchangeable signals. Multiplicative weighting (`w = calls · f(proximity)`) treats proximity as a **prior** that modulates graph evidence. I think the multiplicative form is more correct.
Here's why: a call edge between two functions 50 bytes apart is *qualitatively different* from a call edge between two functions 500KB apart. The first is almost certainly intra-CU. The second is almost certainly cross-boundary. The call exists in both cases, but its *meaning* changes based on distance. A multiplicative weight captures this — a long-range call gets a small weight, so cutting it is cheap (it's probably a cross-boundary call), while cutting a short-range call is expensive.
Concretely, something like `w(e) = calls(e) · exp(-distance(e) / λ)` where λ is the characteristic CU size (which you estimated from the edge-length histogram I described earlier). This exponential decay naturally encodes "nearby calls bind tightly, distant calls bind weakly."
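As code, the multiplicative weight is a one-liner; the λ value below (4 KiB) is an invented stand-in for the estimated characteristic CU size:

```python
import math

def edge_weight(calls, distance, lam=0x1000):
    """Multiplicative weight: w(e) = calls(e) * exp(-distance(e) / lambda)."""
    return calls * math.exp(-distance / lam)

# Same call count, very different binding strength:
w_near = edge_weight(calls=3, distance=0x50)     # intra-CU: binds tightly
w_far = edge_weight(calls=3, distance=0x80000)   # cross-boundary: cheap to cut
print(w_near, w_far)
```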
## A Critique of Min-Cut That Leads Somewhere Interesting
The min-cut formulation as described has a subtle problem: it's **binary**. It partitions into first-party and library. But what if there are three libraries? You'd need to run it multiple times, or use multi-terminal cut, which gets expensive.
But here's what the min-cut framing *does* give you that's uniquely valuable: the **cut edges are the API boundary**, and the **capacity of the cut is a confidence measure**. A large min-cut value means the boundary is well-defined (lots of edges had to be severed). A small min-cut means the separation is tenuous — maybe you've drawn the line in the wrong place, or maybe the "library" is actually tightly integrated with user code.
This leads to a practical idea: **iterate**. Run min-cut to get a first-party/library partition. Then *within the library side*, run min-cut again between the two most distant anchor sets (say, zlib anchors vs. OpenSSL anchors). This recursive bisection via min-cut naturally produces the hierarchy. And at each level, the cut capacity tells you how confident you should be.
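One bisection step can be sketched with a plain Edmonds-Karp max-flow (a standard algorithm, standing in for whatever solver would actually be used); the toy graph, capacities, and function names are invented, with heavier capacities on short-range calls:

```python
from collections import deque

def min_cut(capacity, source, sink):
    """Return (cut_value, source_side_nodes) via Edmonds-Karp max-flow."""
    nodes = {x for e in capacity for x in e}
    res = {u: {} for u in nodes}
    for (u, v), c in capacity.items():
        res[u][v] = res[u].get(v, 0) + c
        res[v].setdefault(u, 0)           # residual back-edge
    flow = 0
    while True:
        parent, q = {source: None}, deque([source])
        while q and sink not in parent:   # BFS for a shortest augmenting path
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if sink not in parent:            # no path left: reachable set = s-side
            return flow, set(parent)
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:      # push flow along the path
            res[parent[v]][v] -= bottleneck
            res[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck

capacity = {                              # heavier = shorter-range call
    ("main", "helper"): 4,
    ("helper", "zlib_api"): 1,            # the long-range bottleneck edge
    ("main", "zlib_api"): 1,
    ("zlib_api", "inflate"): 5,
    ("inflate", "inflate_fast"): 5,
}
value, first_party_side = min_cut(capacity, "main", "inflate_fast")
print(value, sorted(first_party_side))
```

The cut value (here the two weak long-range edges) is the confidence measure, and `zlib_api` lands on the library side: the severed edges are the API boundary.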
## A New Idea These Three Together Trigger: Diffusion Distance as a Universal Metric
All three algorithms are really measuring the same underlying thing from different angles: **how easily does information flow between two nodes?** Label propagation is a diffusion process. Spectral clustering uses the eigenvectors of the diffusion operator. Min-cut measures the bottleneck in flow.
This suggests a unifying abstraction: compute the **diffusion distance** between every pair of functions. Diffusion distance is the L2 distance between the rows of the matrix `exp(-t·L)` where L is the graph Laplacian and t is a time parameter. At small t, only very close (tightly connected, nearby) functions are similar. At large t, entire libraries blur into single points.
The beauty is that **different values of t reveal different levels of the hierarchy**. Small t resolves compilation units. Medium t resolves libraries. Large t resolves the first-party/library split. You get the entire hierarchical segmentation from a single mathematical object just by tuning a parameter.
You could visualize this as a dendrogram: at what "diffusion time" do two functions merge into the same cluster? Functions in the same CU merge almost immediately. Functions in the same library merge at medium time. Everything merges eventually. The dendrogram *is* the hierarchical classification you're after.
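A numpy sketch of diffusion distance at two scales, computing `exp(-t·L)` through the Laplacian's eigendecomposition; the 4-function toy graph (two tight CUs joined by one weak edge) is invented for illustration:

```python
import numpy as np

def diffusion_distance(A, t):
    """Pairwise L2 distance between rows of the heat kernel exp(-t L)."""
    L = np.diag(A.sum(axis=1)) - A
    w, V = np.linalg.eigh(L)
    H = V @ np.diag(np.exp(-t * w)) @ V.T   # exp(-t L) via eigendecomposition
    return np.linalg.norm(H[:, None, :] - H[None, :, :], axis=2)

A = np.array([
    [0.0, 1.0, 0.00, 0.0],    # functions 0,1: one CU
    [1.0, 0.0, 0.05, 0.0],    # weak 1-2 edge joins the two CUs
    [0.0, 0.05, 0.0, 1.0],    # functions 2,3: another CU
    [0.0, 0.0, 1.0, 0.0],
])

D_small = diffusion_distance(A, t=0.5)    # still resolves intra-CU structure
D_large = diffusion_distance(A, t=50.0)   # intra-CU pairs have fused
print(D_small[0, 1], D_small[0, 3])
print(D_large[0, 1], D_large[0, 3])
```

At large t the intra-CU distance has collapsed to nearly zero while the cross-CU distance is still visible: sweeping t and recording merge times yields exactly the dendrogram described above.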
## One Final Practical Insight: Robustness to Bad Disassembly
All of these techniques assume the call graph is accurate, but in real binaries — especially malware — disassembly is imperfect. Indirect calls get missed. Tail calls look like jumps rather than calls. Obfuscated code deliberately breaks analysis.
This is where the **spatial signal acts as a safety net**. Even when the graph is incomplete or noisy, the address layout is always there and always reliable. If your spectral clustering says two functions belong in different groups but they're 20 bytes apart with no other functions between them, the spatial evidence should override. The address axis is your highest-confidence signal; the graph is the informative but noisy supplement. Any algorithm you build should be designed to degrade gracefully — if the graph were *completely* absent, you should still be able to do a passable job purely on address-range heuristics and data xref locality. The graph makes it better, but the 1D structure is the backbone.
---
> here are some further considerations. how should we integrate them into our list of strategies?
>
> 23. Features that seem underexploited.
>
> - 23a. Shared constant pools and tables.
> - 23b. Exception/unwind metadata neighborhoods.
> - 23c. Import-family fingerprints.
> - 23d. Repeated alignment and gap patterns between code blocks.
> - 23e. Data-section locality matching code-section locality.
> - 23f. Ego-network shape: not just who a function calls, but what its 1-hop and 2-hop neighborhood looks like.
> - 23g. Call-direction asymmetry across a candidate boundary.
>
> 24. A specific novel tactic: score gaps, not just nodes.
>
> - 24a. For each gap between adjacent functions in address order, compute “boundary likelihood.”
> - 24b. Features could include:
> - sudden change in seed evidence
> - reduced internal xrefs across the gap
> - padding/alignment pattern
> - change in referenced data neighborhood
> - change in import/API usage style
> - change in graph embedding neighborhood
> - 24c. Then solve for a sparse set of boundaries.
> - 24d. This is more direct than asking “what label is each function?”
>
> 25. Another promising tactic: recover “data islands” first.
>
> - 25a. Cluster strings/tables/rodata/data by address locality and xref neighborhoods.
> - 25b. Label those islands with higher confidence than code.
> - 25c. Then attach functions to the islands they reference.
> - 25d. Libraries often expose themselves through data long before their code is obvious.
>
> 26. A third tactic: active refinement near high-entropy frontiers.
>
> - 26a. Most of the binary may be easy.
> - 26b. The cost is concentrated near a few uncertain boundaries.
> - 26c. So only there should MAPA use expensive methods:
> - deeper semantic analysis
> - more complex embeddings
> - decompiler-derived features
> - analyst interaction
> - 26d. This is much better than spending equal effort on every function.
These four ideas aren't just additions to the list — they restructure how I think the whole pipeline should work. Let me explain why.
## The Paradigm Shift: Score Boundaries, Not Functions
Item 24 is the single most important idea in this entire conversation so far, and I think it should become the **primary framing** of the problem rather than a tactic alongside others.
Every algorithm we've discussed — label propagation, spectral clustering, min-cut, PPR, HMMs — has been asking "what label does each function get?" But that's the indirect question. The direct question is "where are the seams?" And those seams live *between* functions, in the gaps. This reframing matters practically because the number of gaps (N-1 for N functions) is fixed, the features per gap are rich and computable, and the output you actually want — a segmentation — is literally a sparse subset of those gaps marked as boundaries.
This turns the whole problem into **supervised or semi-supervised boundary classification on a 1D sequence of gaps**, which is vastly simpler and more tractable than graph community detection. You compute a feature vector for each gap, you have some gaps where you know a boundary exists (between a FLIRT-matched library function and an adjacent non-library function, for instance), and you train a classifier. Or you don't even need training — you define a scoring function and then solve for the *sparsest* set of boundaries that explains your evidence, which is essentially LASSO or L1-penalized segmentation.
The gap features listed in 24b are excellent, and I'd add a few more that fall out of the earlier discussion: the diffusion distance between the functions flanking the gap, the min-cut capacity if you sever the graph at that point, and the eigenvector discontinuity at that gap in the Fiedler vector. Every algorithm we previously discussed becomes a **feature generator for the gap scorer** rather than a standalone solution. That's the architectural insight — the boundary scorer is the spine, and everything else feeds into it.
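The skeleton of the gap scorer is tiny. A sketch where the features, weights, and threshold are all invented for illustration (the threshold stands in for the L1-penalized solve):

```python
# Score gaps, not functions: each gap between address-adjacent functions
# gets a feature vector; boundaries are the sparse set of high-scoring gaps.
gaps = [
    # (padding_bytes, xrefs_across_gap, data_island_changed)
    (2,  9, False),   # gap 0: inside a CU
    (2,  7, False),   # gap 1: inside a CU
    (14, 1, True),    # gap 2: CU/library boundary
    (2,  8, False),   # gap 3: inside a CU
    (12, 0, True),    # gap 4: library/library boundary
]

def gap_score(padding, xrefs, island_changed):
    """Higher = more boundary-like. Weights are illustrative, not tuned."""
    return 0.1 * padding - 0.5 * xrefs + 3.0 * island_changed

scores = [gap_score(*g) for g in gaps]
boundaries = [i for i, s in enumerate(scores) if s > 0.0]
print(boundaries)
```

In the full design the three hand-weighted features become the rich vector from 24b plus the graph-derived features, and the threshold becomes a learned or L1-regularized segmentation, but the output shape is the same: a sparse list of boundary gaps.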
## Data Islands as the Foundation Layer
Item 25 is the second most important idea, and it reorders the pipeline in a way I didn't previously consider. The argument is subtle and correct: data is *more reliably clusterable than code* for several reasons. Strings are human-readable and often uniquely attributable to a library. Data items don't have the ambiguity of indirect calls. The data section layout reflects compilation unit structure just as faithfully as the code section, but with less noise. And critically, data doesn't get obfuscated nearly as often as code does in malware — the strings still need to be there at runtime.
So the pipeline should be: cluster the data section first into "islands" using address locality and cross-reference structure. Label those islands using string content, constant patterns, and any other attributable features. Then project those labels back onto functions via data references. A function that references three strings in the "zlib island" is almost certainly zlib code, even if you've never seen its bytes before and FLIRT doesn't match it.
This is powerful because it provides anchors that are **denser and more confident** than what you get from FLIRT or code-side analysis alone. It also means the data-island clustering becomes the *first* stage of the pipeline, before any code-side graph analysis. You start from data, anchor the code, then refine.
## The Underexploited Features Suggest a Richer Feature Space for Gaps
Items 23a through 23g aren't just a wishlist — they fill specific blind spots in the gap scoring framework:
**23a (shared constant pools/tables)** and **23e (data-section locality)** feed directly into the data-island strategy. But they also provide a gap feature: if the functions on either side of a gap reference data from the *same* data island, the gap is unlikely to be a boundary. If they reference data from *different* islands, it probably is.
**23b (exception/unwind metadata)** is a particularly clever signal I hadn't considered. Exception handling tables are generated per-compilation-unit by most compilers, and the unwinder metadata groups functions together in a way that directly reflects CU boundaries. This is almost ground truth when it's present — you should parse `.pdata`/`.xdata` (on Windows) or `.eh_frame` (on Linux) and treat co-membership in the same unwind group as a hard constraint. Functions sharing an unwind info record are in the same CU, period.
**23c (import-family fingerprints)** is interesting because it captures *style*. First-party code that does file I/O will call a different set of APIs than zlib's compression internals. You're not just looking at individual API calls — you're looking at the *distribution* of API families a function touches. Two functions that both use memory allocation + string manipulation + crypto APIs probably belong to the same logical unit (maybe an OpenSSL subsystem), while a function that only touches file I/O is probably a different module. This becomes a gap feature: does the API usage style change across this gap?
**23d (alignment and padding patterns)** is a beautiful low-level signal. Compilers emit characteristic alignment between functions (16-byte alignment, `int3` padding on MSVC, `nop` sleds on GCC). But *linkers* insert different padding between compilation units — often larger gaps or different padding bytes. So the gap's physical characteristics (how many bytes of padding, what the padding looks like) are direct evidence of whether it's an intra-CU gap or an inter-CU gap. This is a nearly free feature that requires no graph analysis at all.
**23f (ego-network shape)** is where things get really interesting algorithmically. Two functions might both have degree 5 in the call graph, but one's ego network is a tree (it calls 5 leaf functions that call nothing else — probably a dispatcher or API facade) while the other's ego network is a dense clique (its 5 callees all call each other — probably deep library internals). The *shape* of the local neighborhood, not just its size, is a fingerprint of where you are in the program's structure. You could compute small graph motif counts or local clustering coefficients and use those as features.
**23g (call-direction asymmetry)** completes the picture for boundary detection specifically. At a real API boundary, you expect calls going *into* the library (from the user-code side) but not calls going *out* of the library back into user code (libraries don't call their callers). So at a candidate boundary gap, you count directed edges crossing it in each direction. A strong asymmetry — many calls going deeper, few or none coming back — is a strong boundary signal. A gap where calls flow freely in both directions is probably internal to a single module.
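The asymmetry count reduces to two comparisons per edge. A sketch with invented call edges and an invented candidate boundary address:

```python
calls = [  # (caller_address, callee_address)
    (0x401000, 0x480000),   # user code -> library API
    (0x401200, 0x480000),
    (0x402000, 0x480100),
    (0x480000, 0x480200),   # library internals
    (0x480200, 0x401800),   # rare callback out of the library
]

def boundary_asymmetry(boundary):
    """Directed edges crossing a candidate boundary, in each direction."""
    inward = sum(1 for c, t in calls if c < boundary <= t)
    outward = sum(1 for c, t in calls if t < boundary <= c)
    return inward, outward

inward, outward = boundary_asymmetry(0x480000)
print(inward, outward)   # strong asymmetry -> likely a real API boundary
```

Many inward edges against few outward ones (3 vs. 1 here) is the signature of a genuine API boundary; symmetric flow suggests the gap is internal to one module.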
## Active Refinement as the Final Stage, Not a Separate Strategy
Item 26 should be understood as the **execution strategy** for the whole pipeline, not a separate algorithm. The core insight is that uncertainty is spatially concentrated. After running the cheap, fast stages (data island clustering, gap feature computation, label propagation from anchors), most of the binary is probably classified with high confidence. What remains is a handful of ambiguous frontier zones — maybe 5-10% of the address space where two competing labels are close in score.
This is where you focus your expensive techniques: run the decompiler only on the ambiguous functions. Compute embeddings only for the frontier. Query the analyst only about the uncertain gaps. And do it in the "bisect" order — query the gap in the middle of the largest remaining uncertain region first, because its answer splits the ambiguity in half.
This means the pipeline architecture should explicitly support a **confidence map** over the address space. Every stage produces not just labels but uncertainty estimates. The uncertainty map guides where to spend further effort. And there's a natural stopping criterion: stop when the maximum remaining uncertainty drops below a threshold, or when the analyst says "good enough."
## The Revised Pipeline
Integrating everything, here's how I'd now structure it:
**Stage 0 — Feature extraction.** Parse the binary. Extract functions, call graph, data xrefs, strings, exception metadata, alignment patterns, import calls. Cheap and mechanical.
**Stage 1 — Data island clustering.** Cluster the data section by address locality and xref structure. Label islands using string content and constant pool attribution. This gives you the densest, most confident set of anchors.
**Stage 2 — Anchor projection.** Project data-island labels onto functions via xrefs. Combine with FLIRT matches and known entry points (`main`, etc.). Every function now has either a confident label, a tentative label, or no label.
**Stage 3 — Gap scoring.** For every inter-function gap, compute the full feature vector: padding pattern, data-island change, API style change, call-graph edge density across the gap, call-direction asymmetry, ego-network similarity of flanking functions, exception metadata boundary, eigenvector discontinuity. Score each gap's boundary likelihood.
**Stage 4 — Segmentation.** Solve for the sparse set of boundaries that best explains the gap scores and the anchor constraints. This could be L1-penalized changepoint detection, or an HMM where the gap scores inform transition probabilities.
**Stage 5 — Label assignment.** Within each segment, assign a library/CU label using the anchors, propagated labels, and graph community structure. This is where label propagation or spectral methods work *within* the already-established boundaries rather than trying to find them.
**Stage 6 — Active refinement.** Identify the highest-uncertainty boundaries. Apply expensive techniques — decompilation, semantic analysis, analyst queries — specifically at those points. Update the segmentation. Repeat until confidence is sufficient.
The key design principle is that boundary detection (stages 3-4) and label assignment (stage 5) are **decoupled**. You find where the seams are first, then figure out what's on each side. And the most reliable signals (data, metadata, physical layout) come first, with the noisier signals (graph topology, semantic analysis) used for refinement rather than discovery.

# mapa plan: html call-graph neighborhood view
This plan explores a larger-neighborhood view for the HTML map. Today the page shows only direct relationships. A tag marks functions that reference strings with that tag, and a string marks functions that reference that exact string. The proposed experiment is to push that signal backward through the call graph so the page can show the code region around those direct hits.
The real question is whether the larger neighborhood stays localized enough to be useful. If the highlight seeded by a `#zlib` string grows into one compact region when we include one, two, or three caller hops, that supports the idea that tagged strings can anchor a broader library region. If the overlay quickly spreads into generic wrappers and dispatchers, the direct-hit view is probably carrying most of the useful signal already.
## Current behavior
The collector already has most of the data needed for this experiment. `mapa/collector.py` resolves thunk targets, builds `resolved_callers` and `resolved_callees`, and stores those relationships on each `MapaFunction` as `callers` and `calls`. The report model therefore already contains a usable reverse call graph.
The HTML renderer does not use that graph today. `mapa/html_renderer.py` emits only the function tooltip text, a tag-to-function index, and string rows with their direct function indices. The browser script then applies three binary states: tag border on, string fill on, or dim. There is no traversal, no score, and no way to distinguish direct evidence from nearby code.
One detail matters for later weighting. `_collect_tag_entries()` collapses each tag to a set of functions. That is enough for the current binary view, but it drops multiplicity. If one function references several `#zlib` strings, the current page still treats that as one direct hit. That simplification is acceptable for a first neighborhood experiment, but it becomes relevant if we later want repeated string evidence to count more strongly.
## Proposed model
The first experiment should use caller-depth propagation only. For a selected tag or string, define the directly matching functions as depth 0. Then walk backward through callers. Depth 1 is the callers of direct-match functions, depth 2 is the callers of depth-1 functions, and so on. A bounded depth of 0 to 3, or at most 4, is enough for the experiment.
The overlay should weaken with distance. Direct matches stay strongest. Indirect matches fade with depth. A simple additive model is enough. Each direct-match function contributes a seed weight of 1.0. A caller at depth `d` receives `seed_weight * decay(d)`. If several matching paths reach the same function, add the contributions together. That gives the effect we want. A function that sits above several tagged descendants should stand out more than a function that reaches only one.
Two decay families are worth trying. Harmonic decay follows the original intuition of 1.0, 1/2, 1/3, 1/4. Geometric decay uses 1.0, 1/2, 1/4, 1/8. Harmonic keeps more mass at larger depths. Geometric is more conservative and is less likely to smear across the whole map through generic caller chains. The page should probably expose both and default to geometric.
For the first pass, the seed should be per direct-match function rather than per direct-match string. That keeps the experiment focused on graph expansion instead of mixing graph expansion with direct evidence density. If the idea looks useful, then a second pass can try per-string seeds or rarity-weighted seeds.
## What the page should show
The current page already has a useful split between tag selection and string selection. Tag selection uses the border and string selection uses the fill. That can stay. The propagation engine can compute two separate score arrays, one for the border channel and one for the fill channel. If both a tag and a string are active, both overlays remain visible.
The main challenge is interpretation. A single cumulative heat map can look compelling while hiding the reason a function is lit up. The user should be able to tell the difference between a direct match, a one-hop caller, and a function that receives several weak contributions through different paths. For that reason, the page should support both cumulative and exact-hop views. In cumulative mode, the page shows the total score up to the chosen depth. In exact-hop mode, the page shows only depth 0, or only depth 1, or only depth 2. If there is room, small multiples would be even better. Four aligned copies of the same function grid for depths 0 through 3 would make locality much easier to judge.
Direct matches should remain visually distinct from indirect ones. A direct seed should not disappear into the same faint channel as a weak propagated score. Tooltips should also report the exact score for the active overlay. Otherwise the view will be hard to debug.
## Recommended first implementation
The narrow version of this feature can stay almost entirely inside `mapa/html_renderer.py`. The collector already records the caller relationships needed for bounded backward traversal.
The renderer should emit one more structure in its inline JSON: `callersByIndex`, a reverse adjacency list keyed by function index. It can keep the existing direct seed data for tags and strings. The browser script can then compute scores on demand for the active selection. The data volume is small enough that this should stay cheap on typical samples.
The computation can stay simple. Initialize the direct-match functions as the depth-0 frontier. For each depth from 1 to `maxDepth`, push the previous frontier to its callers, multiply the step by the chosen decay factor, and add the contributions into a total score array. This counts bounded walks rather than strictly simple paths. With shallow depths, that is a reasonable approximation.
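The propagation loop is small enough to sketch in Python rather than the browser-side script; the `callers_by_index` shape mirrors the proposed `callersByIndex` structure, and the toy reverse adjacency is invented:

```python
def propagate(callers_by_index, seeds, max_depth=2, decay=0.5):
    """Additive bounded-walk scores: each seed contributes 1.0 at depth 0
    and decay**d to every caller reached in d hops (geometric decay)."""
    scores = {i: 1.0 for i in seeds}
    frontier = {i: 1.0 for i in seeds}        # contribution arriving per node
    for _depth in range(1, max_depth + 1):
        nxt = {}
        for func, contrib in frontier.items():
            for caller in callers_by_index.get(func, []):
                nxt[caller] = nxt.get(caller, 0.0) + contrib * decay
        for func, contrib in nxt.items():     # add into the running totals
            scores[func] = scores.get(func, 0.0) + contrib
        frontier = nxt
    return scores

# Toy reverse adjacency: functions 0 and 1 both call 2; 2 calls 3; 3 is the
# direct string match (the seed).
callers_by_index = {3: [2], 2: [0, 1]}
scores = propagate(callers_by_index, seeds=[3])
print(scores)
```

Because contributions from multiple matching paths are added rather than maxed, a function sitting above several tagged descendants accumulates a higher score, which is the behavior the plan asks for. Harmonic decay would replace `contrib * decay` with a per-depth factor of `1 / (d + 1)`.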
The first UI pass only needs a few controls: maximum caller depth, decay family, and view mode. The existing function grid and string list can remain in place.
A practical first cut would keep these defaults:
- caller-only propagation
- default depth of 2
- geometric decay
- additive scoring
- cumulative view, with exact-hop available as a switch
## Rendering considerations
The current 10-pixel square is large enough for a binary on/off signal, but it may be too small for subtle border intensity changes. Border alpha alone may be hard to read. A better approach is to keep direct tag hits as a solid border and render propagated tag scores with either a stronger border color ramp or a small glow. For strings, direct matches can keep the current solid fill while propagated scores use a fill alpha ramp. The exact styling can stay simple, but direct and indirect states should be separable at a glance.
The page should also keep dimming non-matches when any overlay is active. Otherwise weak propagated scores will be visually lost in large samples.
## Risks
The main risk is graph bleed. Generic wrappers, initialization code, dispatchers, shared error handlers, and utility helpers often sit one or two caller hops above many unrelated regions. Those functions can make the map look more connected than the underlying library region really is.
Cycles are another risk. Recursive or mutually recursive groups can accumulate score in ways that are mathematically consistent under a bounded-walk model but visually misleading. Strongly connected component collapsing is a possible future refinement, but it should not be part of the first pass.
These risks argue for conservative defaults. Caller-only propagation is easier to reason about than a bidirectional neighborhood. Depth should stay shallow. Geometric decay is a safer default than harmonic. Exact-hop inspection should be available so the user can see whether the first one or two shells are still localized.
## Variations worth testing
If the first pass looks promising, there are several obvious follow-ons.
One variation is seed definition. Compare per-function seeds, per-string seeds, and rarity-weighted seeds. The last option is appealing because `StringTagMatch` already preserves `global_count`, and rare strings are usually more diagnostic than common ones.
Another variation is degree normalization. Raw additive scoring favors functions that sit above many matching descendants. That is partly what we want, but it also rewards generic coordinator functions. A normalized variant could divide contributions by a degree term and ask a different question: how concentrated is the evidence in this function's neighborhood?
A thresholded view is also worth trying. Instead of showing a continuous score ramp, let the user set a minimum score and mark only functions above that threshold. That could make contiguous regions easier to spot.
A comparison mode would be useful as well. Showing direct-only and propagated views side by side would make it easy to see whether the larger neighborhood adds a coherent region or just noise.
## How to evaluate the idea
The function grid is address-ordered, so this experiment is really about locality in address space. The first thing to look for is whether shallow propagation expands a direct-hit cluster into a still-coherent region. Good test cases are binaries where we already expect a compact static-library region, such as zlib, OpenSSL, or sqlite3.
It would also help to add a few quantitative summaries. The page could report the smallest contiguous address span containing most of the score mass, how many separate spans remain above a threshold, and how those numbers change as depth increases. That would make the result less subjective.
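The first of these summaries can be computed with a sliding window over the address-ordered grid. A sketch, assuming per-function `(address, score)` pairs already sorted by address:

```python
def smallest_mass_span(items, fraction=0.9):
    """Smallest contiguous window of address-sorted (address, score)
    pairs whose scores cover `fraction` of the total score mass.
    Returns (start_addr, end_addr), or None if nothing qualifies."""
    total = sum(s for _, s in items)
    target = fraction * total
    best = None
    acc = 0.0
    lo = 0
    for hi, (addr_hi, s) in enumerate(items):
        acc += s
        # shrink from the left while the window still covers the target
        while acc - items[lo][1] >= target:
            acc -= items[lo][1]
            lo += 1
        if acc >= target:
            span = (items[lo][0], addr_hi)
            if best is None or span[1] - span[0] < best[1] - best[0]:
                best = span
    return best
```

Counting how many separate spans remain above a threshold, and re-running both at increasing depths, would complete the summary set described above.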
The concept is worth implementing as an experiment. The hard data is already present in the report model, and a conservative first pass can stay mostly inside the HTML renderer. If shallow caller propagation still yields compact regions for known libraries, then richer weighting models are worth exploring. If it smears immediately, that is still a useful result and tells us that the direct-hit view is already close to the limit of the available signal.


@@ -0,0 +1,502 @@
# mapa IDA/IDALib port plan
Goal: preserve the current `scripts/mapa.py` report while replacing the Lancelot/BinExport2 backend with IDALib. Use the `ida-domain` API for normal analysis where it cleanly maps to the needed data. Reuse the existing capa IDA backend as the semantic reference for imports, thunks, string resolution, function naming, and database bootstrap.
This revision adds concrete guidance from capa's existing IDA backend.
## Sources used
Primary sources:
- `scripts/mapa.py`
- https://github.com/HexRaysSA/ida-claude-plugins/blob/main/plugins/ida-plugin-development/skills/ida-domain-api/SKILL.md
- `capa/features/extractors/ida/idalib.py`
- `capa/features/extractors/ida/extractor.py`
- `capa/features/extractors/ida/helpers.py`
- `capa/features/extractors/ida/file.py`
- `capa/features/extractors/ida/function.py`
- `capa/features/extractors/ida/insn.py`
- `capa/features/extractors/ida/basicblock.py`
- `capa/ida/helpers.py`
- `capa/loader.py`
- `tests/fixtures.py`
- `tests/test_idalib_features.py`
- `capa/features/common.py`
- `idals.py` from https://github.com/williballenthin/idawilli/tree/master/idals
Domain API references:
- Overview: https://ida-domain.docs.hex-rays.com/llms.txt
- Getting started: https://ida-domain.docs.hex-rays.com/getting_started/index.md
- Examples: https://ida-domain.docs.hex-rays.com/examples/index.md
- Database: https://ida-domain.docs.hex-rays.com/ref/database/index.md
- Functions: https://ida-domain.docs.hex-rays.com/ref/functions/index.md
- Flowchart: https://ida-domain.docs.hex-rays.com/ref/flowchart/index.md
- Instructions: https://ida-domain.docs.hex-rays.com/ref/instructions/index.md
- Xrefs: https://ida-domain.docs.hex-rays.com/ref/xrefs/index.md
- Strings: https://ida-domain.docs.hex-rays.com/ref/strings/index.md
- Segments: https://ida-domain.docs.hex-rays.com/ref/segments/index.md
- Names: https://ida-domain.docs.hex-rays.com/ref/names/index.md
- Entries: https://ida-domain.docs.hex-rays.com/ref/entries/index.md
## Key correction after reviewing capa
The current `capa/features/extractors/ida/` backend is IDALib-capable, but it is not written against `ida-domain` today. It uses the classic IDA Python surface: `idapro`, `idaapi`, `idautils`, `idc`, `ida_bytes`, `ida_funcs`, `ida_segment`, and related modules.
That means the correct migration strategy is not "invent a fresh IDA collector from scratch". The correct strategy is:
- use capa's existing IDA backend as the behavioral spec and a source of proven heuristics
- implement the new collector against `ida-domain` wherever the needed API exists cleanly
- treat the existing legacy helpers as reference material, not as the default implementation path
- only introduce lower-level `ida_*` calls if the implementer can point to a concrete `ida-domain` gap and document it
This is especially important for:
- IDALib database bootstrap
- import and extern enumeration
- thunk-chain resolution
- string/data-reference chasing
- alternative function names from comments
- known IDA version caveats
## Current mapa output that must remain stable
The current script renders these sections, in this order:
- `meta`
- `modules`
- `sections`
- `libraries`
- `functions`
Accepted intentional change for the port: remove `modules` entirely.
Inside `functions`, it currently:
- iterates functions in address order
- prints `thunk ...` for thunk functions
- prints `function ...` for normal functions
- inserts source-file separators when the primary Assemblage source path changes, ignoring gaps with missing Assemblage data
- annotates functions with Assemblage source name and file when available
- forwards callers through thunk targets so callers of a thunk appear on the real target
- prints `B/E/I` as basic blocks / CFG edges / instructions plus total instruction bytes
- prints capa rule names attached to the function
- prints `calls:` for internal non-library callees
- prints `api:` for import/external/library callees
- prints `string:` for referenced strings
That output contract should stay stable unless a deliberate change is accepted and documented.
## What capa already gives you
There are three reusable assets.
The first is a proven `ida-domain` database-resolution and session-opening path in `../idawilli/idals/idals.py`. That code already does the part mapa needs most: accept either a raw sample or an existing `.i64` / `.idb`, hash raw inputs, cache analyzed databases by SHA-256, and guard concurrent access.
The second is capa's proven IDALib bootstrap path:
- `capa/features/extractors/ida/idalib.py`
- `capa/loader.py`
- `tests/fixtures.py`
The third is capa's proven semantic definition of the data mapa cares about:
- function enumeration: `capa/features/extractors/ida/extractor.py`
- segments, imports, externs: `capa/features/extractors/ida/file.py`, `helpers.py`
- callers and function names: `capa/features/extractors/ida/function.py`
- API calls, call targets, strings, mnemonics, offsets: `capa/features/extractors/ida/insn.py`
- CFG/basic blocks: `capa/features/extractors/ida/basicblock.py`, `helpers.py`
- hashes, architecture, imagebase, file-type helpers: `capa/ida/helpers.py`
The practical split is simple. Use `idals.py` as the model for database resolution, caching, and guarded open/close. Use capa's IDA backend as the model for analysis semantics and parity behavior. Use `ida-domain` as the primary query surface inside the collector. Do not depend on Lancelot anywhere in the new implementation.
## Important behavioral facts from capa's backend
1. IDALib bootstrap in capa is not a bare `Database.open(...)` call.
- It uses `capa.features.extractors.ida.idalib.has_idalib()` and `load_idalib()`.
- It then calls `idapro.open_database(..., run_auto_analysis=True, args="-Olumina:host=0.0.0.0 -Osecondary_lumina:host=0.0.0.0 -R")`.
- It disables console chatter with `idapro.enable_console_messages(False)`.
- It waits for analysis completion with `ida_auto.auto_wait()`.
2. Capa explicitly disables Lumina during IDALib analysis.
- Reason documented in `capa/loader.py`: Lumina can inject bad names or overwrite debug-info names.
- mapa should do the same unless there is a deliberate decision to trust Lumina.
3. Capa requests resource loading with `-R`.
- This matters for some file-scope extraction.
- `tests/test_idalib_features.py` notes that IDA 9.0 had resource-loading limitations under IDALib.
4. The existing `IdaFeatureExtractor.get_functions()` is not a direct drop-in for mapa.
- It calls `helpers.get_functions(skip_thunks=True, skip_libs=True)`.
- mapa must render thunk functions, so mapa needs its own full function inventory.
5. Capa already encodes the thunk semantics mapa needs.
- `THUNK_CHAIN_DEPTH_DELTA` is defined in `capa/features/common.py` as `5`.
- `capa/features/extractors/ida/insn.py:check_for_api_call()` follows code refs, then data refs, through thunk chains to resolve imports/externs.
- `capa/features/extractors/binexport2/__init__.py:BinExport2Analysis._compute_thunks()` shows the intended "single-target thunk chain" rule: only resolve through chains with exactly one callee per thunk hop.
6. Capa already encodes mapa-relevant string semantics.
- `helpers.find_data_reference_from_insn(insn, max_depth=10)` follows single data-reference chains.
- `helpers.find_string_at(ea)` looks for C strings and works around an IDA Unicode-decoding quirk.
- `insn.extract_insn_string_features()` and `extract_insn_bytes_features()` use that behavior.
7. Capa already has the import and extern logic mapa needs.
- `helpers.get_file_imports()` enumerates import modules and normalizes names.
- `helpers.get_file_externs()` enumerates functions from `SEG_XTRN` segments.
- `file.extract_file_import_names()` shows how capa treats name-vs-ordinal imports.
8. Capa already has alternative-name logic.
- `helpers.get_function_alternative_names()` parses comments that look like `Alternative name is 'foo'`.
- `function.extract_function_alternative_names()` exposes them as `FunctionName` features.
9. Capa already has the CFG behavior mapa should match.
- `helpers.get_function_blocks()` uses `idaapi.FlowChart(f, flags=(idaapi.FC_PREDS | idaapi.FC_NOEXT))`.
- The `NOEXT` part matters: it avoids useless external blocks contaminating B/E/I counts.
10. The test suite documents real version caveats.
- IDA 9.0 and 9.1 had some ELF symbol issues.
- IDA 9.0 under IDALib had resource-loading limitations.
- mapa validation should account for those when comparing outputs.
## Database resolution and caching pattern to copy from idals
`../idawilli/idals/idals.py` is the best starting point for the "raw file or existing database" problem. It already solves the user-visible behavior mapa needs.
Its pattern is:
- if the input suffix is `.i64` or `.idb`, use that database directly
- otherwise compute hashes for the raw file with `compute_file_hashes()` and use the SHA-256 as the cache key
- store the generated database in a common cache directory, currently `~/.cache/hex-rays/idals/<sha256>.i64`
- serialize access with `database_access_guard()`
- detect an already-open or unpacked database by watching for the companion `.nam` file
- use an advisory `flock` on `<db>.lock` to avoid concurrent writers
- after acquiring the lock, re-check `.nam` to close the TOCTOU hole
- on a cache miss, analyze the raw sample with `Database.open(..., IdaCommandOptions(auto_analysis=True, new_database=True, output_database=..., load_resources=True), save_on_close=True)`
- after the cached database exists, open it read-only with `open_database_session(..., auto_analysis=False)` and `save_on_close=False`
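The resolution rule itself is small. A sketch of just the pass-through and cache-key behavior (the `.nam`/flock guard and the `Database.open` analysis step from `idals.py` are omitted):

```python
import hashlib
from pathlib import Path

def resolve_database(sample: Path, cache_root: Path) -> Path:
    """Existing IDA databases pass through unchanged; raw samples map
    to a SHA-256-keyed entry under the cache directory."""
    if sample.suffix.lower() in (".i64", ".idb"):
        return sample
    sha256 = hashlib.sha256(sample.read_bytes()).hexdigest()
    return cache_root / f"{sha256}.i64"
```

The SHA-256 computed here is the same value the `meta` section needs, which is why the plan says to reuse it instead of hashing the sample twice.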
mapa should adopt that pattern with only minor changes:
- use the same SHA-256-keyed cache strategy
- keep the same locking protocol
- put the cache in a mapa-specific directory, or intentionally share the idals directory if reuse is desired
- expose the cache location as a small helper or constant so it can be documented and tested
- reuse the computed SHA-256 for the `meta` section instead of hashing the sample twice
There is one deliberate integration check to make here. `idals.py` uses `ida-domain`'s `Database.open(...)`, while capa's bootstrap path uses `idapro.open_database(...)` and disables Lumina explicitly. For mapa, prefer the `idals.py` open-and-cache pattern because it already handles the database lifecycle correctly. Then verify whether the `ida-domain` open path offers an equivalent way to suppress Lumina. If it does, use it. If it does not, decide whether that matters for mapa output or whether database creation should fall back to capa's `idapro.open_database(...)` path while cached-session opens keep the `idals.py` pattern.
## Recommended architecture
Do not port `scripts/mapa.py` by replacing each Lancelot query inline. Split it into four layers:
- CLI and argument parsing
- IDA bootstrap and environment setup
- report collection
- rendering
Use backend-neutral dataclasses for the report model:
- `MapaReport`
- `MapaMeta`
- `MapaSection`
- `MapaLibrary`
- `MapaFunction`
- `MapaCall`
- `MapaString`
- `AssemblageRecord`
The collector should have one primary data-access layer: `ida-domain` for functions, flowcharts, instructions, strings, names, segments, xrefs, and database lifecycle. Existing capa helpers remain useful as semantic references and regression oracles.
## Best practical strategy
The implementation target is an IDALib-only collector with `ida-domain` as the primary API surface.
Concretely:
- use `ida-domain` for function inventory, instruction iteration, CFG stats, name lookup, segment listing, xref walking, and cached database open/create
- use the existing capa IDA code to understand the intended semantics for imports, externs, thunk resolution, data-reference chasing, and alternative names
- if the implementer discovers a real `ida-domain` gap, document the gap explicitly before introducing lower-level `ida_*` calls
That gives the next implementer a clear target: no Lancelot, no default hybrid backend, and no legacy helper dependency unless a concrete gap forces it.
## Concrete mapping from mapa fields to capa/backend logic
| mapa field/behavior | First source to consult | Recommended implementation |
|---|---|---|
| IDALib discovery | `capa/features/extractors/ida/idalib.py` | Reuse `has_idalib()` / `load_idalib()` logic if mapa needs to bootstrap `idapro` availability itself. |
| resolve/open DB | `../idawilli/idals/idals.py` | Use `resolve_database()` and `open_database_session()` as the primary pattern. |
| cache key and cache DB path | `../idawilli/idals/idals.py` | Hash raw inputs once and key cached databases by SHA-256. |
| Lumina suppression policy | `capa/loader.py`, `tests/fixtures.py` | Carry forward capa's disable-Lumina behavior if the chosen open path supports it. |
| sample hashes | `../idawilli/idals/idals.py`, `capa/ida/helpers.py`, `extractor.py` | Reuse the SHA-256 computed for cache lookup; prefer IDA-provided hashes when opening an existing database. |
| image base | `capa/ida/helpers.py` | Prefer IDA imagebase helper; use Domain API only if it exposes the same value clearly. |
| sections | `helpers.get_segments()`, `file.extract_file_section_names()` | Use `db.segments`; match capa's header-segment filtering rules if needed. |
| import modules/functions | `helpers.get_file_imports()` | Implement with `ida-domain` if the needed import data is exposed cleanly; otherwise use this helper as the semantic reference for normalization. |
| externs | `helpers.get_file_externs()` | Match this behavior with `ida-domain` if possible; if not, document the missing API and then fall back deliberately. |
| function inventory | `extractor.py`, `helpers.get_functions()` | Do not use extractor's default function list because it skips thunks/libs. Build a mapa-specific inventory with `ida-domain`. |
| callers | `function.extract_function_calls_to()` | Reproduce the same behavior with domain xrefs and compare against this helper during validation. |
| call targets | `insn.extract_function_calls_from()` | Reproduce the same behavior with domain xrefs and compare against this helper during validation. |
| API calls | `insn.extract_insn_api_features()` | Match the import/extern/thunk resolution semantics exposed by this function. |
| string refs | `helpers.find_data_reference_from_insn()`, `find_string_at()` | Match the same single-ref-chain behavior and max depth `10`. |
| function names | `function.extract_function_name()`, alternative-name helpers | Use normal name, demangled name, alternative names, and render Assemblage annotations separately without renaming the IDA function. |
| B/E/I stats | `helpers.get_function_blocks()` | Match `PREDS | NOEXT` semantics; use domain flowchart if possible. |
| function ordering | current `scripts/mapa.py` | Keep address order for deltas and rendering stability. |
## Step-by-step implementation plan
### 1. Freeze the current mapa output
Before editing code, save golden outputs from the current `scripts/mapa.py` for:
- a sample with normal internal calls and imports
- a sample with thunk-heavy call patterns
- a sample with capa and Assemblage overlays
These are the parity targets.
### 2. Add `resolve_database()` and `open_database_session()` helpers
Base these directly on `../idawilli/idals/idals.py`.
`resolve_database()` should:
- accept either a raw sample or an existing `.i64` / `.idb`
- return existing databases unchanged
- hash raw inputs once and use SHA-256 as the cache key
- place cached databases under the XDG cache root in `mandiant/mapa/`, i.e. `$XDG_CACHE_HOME/mandiant/mapa/` when set, else `~/.cache/mandiant/mapa/`
- guard cache creation with the same `.nam` + `flock` protocol from `database_access_guard()`
- analyze cache misses with `Database.open(..., IdaCommandOptions(auto_analysis=True, new_database=True, output_database=..., load_resources=True), save_on_close=True)`
- keep cache creation transparent in normal mode and only log cache details in verbose/debug mode
`open_database_session()` should:
- use the same guard before opening the database
- open cached or user-supplied databases with `new_database=False`
- default to `save_on_close=False`
- optionally run `ida_auto.auto_wait()` when `auto_analysis=True`
This should become mapa's primary database lifecycle.
Then add one capa-derived check on top: if the chosen open path can suppress Lumina, do so. If the `ida-domain` path cannot, verify whether that difference affects naming enough to justify a fallback to capa's `idapro.open_database(...)` path during cache creation.
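The XDG cache rule from this step is a few lines. A sketch:

```python
import os
from pathlib import Path

def mapa_cache_dir() -> Path:
    """$XDG_CACHE_HOME/mandiant/mapa when XDG_CACHE_HOME is set,
    else ~/.cache/mandiant/mapa."""
    xdg = os.environ.get("XDG_CACHE_HOME")
    base = Path(xdg) if xdg else Path.home() / ".cache"
    return base / "mandiant" / "mapa"
```

Exposing this as a helper (rather than inlining the path) satisfies the "documented and tested" requirement from the idals-pattern section.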
### 3. Introduce a backend-neutral report model
Before touching the collector logic, split `scripts/mapa.py` into:
- CLI
- collector
- renderer
- input-overlay parsing for capa JSON and Assemblage CSV
Keep the renderer stable. The collector should return value objects only.
### 4. Build a mapa-specific function inventory
Do not use `IdaFeatureExtractor.get_functions()` as-is, because it skips thunks and library functions.
Instead:
- enumerate all functions in address order with `ida-domain` if possible
- keep flags for `is_thunk`, `is_library`, and `is_external`
- retain enough metadata to render thunks, skip imports from the function list, and compute deltas
For parity, compare your inventory against:
- `helpers.get_functions(skip_thunks=False, skip_libs=False)`
- IDA function flags such as `FUNC_THUNK` and `FUNC_LIB`
### 5. Recreate import and extern logic using capa's semantics
For the `libraries` section and for `api:` classification, start from the behavior encoded in:
- `helpers.get_file_imports()`
- `helpers.get_file_externs()`
That behavior already handles:
- PE imports with `__imp_` prefixes
- ELF imports with `@@version` suffixes
- ordinal imports
- extern functions in `SEG_XTRN`
The implementation target remains `ida-domain`. The next implementer should reproduce this behavior there if the API surface is available. If a real gap appears, document the gap before introducing any fallback.
### 6. Implement thunk resolution with capa's exact semantics
Build one cached helper, for example `resolve_thunk_target(ea)`, and use it everywhere.
Behavior should match capa's existing semantics:
- maximum thunk-chain depth: `THUNK_CHAIN_DEPTH_DELTA == 5`
- follow code refs first, then data refs if needed
- only resolve through single-target chains
- stop on cycles, zero-target, or multi-target cases
- allow the final resolved target to be an import or extern
Use two existing code paths as references:
- `capa/features/extractors/ida/insn.py:check_for_api_call()`
- `capa/features/extractors/binexport2/__init__.py:BinExport2Analysis._compute_thunks()`
This helper must drive:
- caller forwarding
- `calls:` lines
- `api:` lines
- capa match attachment when a match lands in a thunk
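A sketch of the resolver under those rules; `is_thunk(ea)` and `single_target(ea)` are hypothetical callbacks standing in for the ida-domain queries (the latter returns the sole code/data ref target of a thunk hop, or `None` for zero- or multi-target cases):

```python
THUNK_CHAIN_DEPTH_DELTA = 5  # mirrors capa/features/common.py

def resolve_thunk_target(ea, is_thunk, single_target,
                         max_depth=THUNK_CHAIN_DEPTH_DELTA):
    """Follow a single-target thunk chain from ea, stopping on cycles,
    ambiguous hops, or the depth limit. The final resolved target may
    be an import or extern."""
    seen = {ea}
    for _ in range(max_depth):
        if not is_thunk(ea):
            break  # reached a real function, import, or extern
        nxt = single_target(ea)
        if nxt is None or nxt in seen:
            break  # zero/multi-target hop or cycle: stop resolving
        seen.add(nxt)
        ea = nxt
    return ea
```

Caching this per-EA (for example with `functools.lru_cache` over a bound method) keeps it cheap enough to drive all four call sites listed above.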
### 7. Use capa features as references, not as the collector
Do not build mapa by instantiating `IdaFeatureExtractor()` and aggregating capa features into the final report. That would create a hidden second backend and blur the migration target.
Instead, query IDA directly through `ida-domain` and use the capa feature-extraction code as a reference when the intended semantics are unclear. The implementer should compare specific results against:
- `Characteristic("calls to")`
- `Characteristic("calls from")`
- `API`
- `String`
- `FunctionName`
- `Mnemonic`
This keeps the delivered collector IDALib-only while still giving the implementer a precise oracle for parity checks.
### 8. Recreate callers and callees
Use a precomputed normalized call graph. Do not compute callers ad hoc during rendering.
For each non-import function:
- walk its instructions
- identify call or jump-to-import patterns using the same logic as `extract_insn_api_features()`
- resolve thunk chains
- classify the resolved target as internal or API/import/extern
- record caller and callee relationships on resolved targets
For parity, verify against these capa semantics:
- function callers: `function.extract_function_calls_to()`
- outgoing calls: `insn.extract_function_calls_from()`
- API calls: `insn.extract_insn_api_features()`
Important detail: the existing helper treats both `call` and `jmp` as API-bearing instructions in some thunk/import cases. Do not assume `call` only.
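The indexing step can be sketched as below; `resolve` is the thunk resolver from step 6, and `api_eas` is a hypothetical set of import/extern EAs built during collection:

```python
from collections import defaultdict

def build_call_graph(calls, resolve, api_eas):
    """calls: (caller_ea, raw_target_ea) pairs gathered while walking
    instructions. Produces the caller, internal-callee, and API-callee
    indexes the renderer reads; callers are recorded on resolved
    targets, so callers of a thunk land on the real target."""
    callers = defaultdict(set)
    internal_callees = defaultdict(set)
    api_callees = defaultdict(set)
    for src, raw in calls:
        dst = resolve(raw)
        if dst in api_eas:
            api_callees[src].add(dst)   # rendered as api: lines
        else:
            internal_callees[src].add(dst)  # rendered as calls: lines
            callers[dst].add(src)
    return callers, internal_callees, api_callees
```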
### 9. Recreate B/E/I with capa's CFG semantics
For each rendered function:
- basic blocks: count basic blocks using the equivalent of `helpers.get_function_blocks()`
- edges: sum successors across those blocks
- instructions: count instructions across those blocks
- bytes: sum instruction sizes
The important parity rule is the CFG construction mode:
- match `idaapi.FlowChart(f, flags=(idaapi.FC_PREDS | idaapi.FC_NOEXT))`
If the Domain API flowchart differs, use it only if it can match the no-external-block behavior. Otherwise use a tiny legacy helper for block enumeration and keep everything else in the Domain API.
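Once blocks are enumerated with the right flags, the counting itself is mechanical. A sketch over hypothetical block objects carrying `successors` (list of blocks) and `insns` (list of `(ea, size)` pairs), with external blocks already excluded per the `FC_NOEXT` rule:

```python
def bei_stats(blocks):
    """Compute the B/E/I line plus total instruction bytes for one
    function from its (non-external) basic blocks."""
    n_blocks = len(blocks)
    n_edges = sum(len(b.successors) for b in blocks)
    n_insns = sum(len(b.insns) for b in blocks)
    n_bytes = sum(size for b in blocks for _, size in b.insns)
    return n_blocks, n_edges, n_insns, n_bytes
```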
### 10. Recreate string extraction with capa's data-ref chasing
Do not just test `db.strings.get_at(xref.to_ea)` and stop. That will miss the semantics capa already uses.
Start from capa's behavior:
- follow a single data-reference chain from the instruction, up to depth `10`
- if the final target is a string, emit it
- otherwise it may be bytes, not a string
For mapa specifically:
- only render strings, not raw bytes
- deduplicate by rendered string value, matching the current script
- trim trailing whitespace the same way the current script does
Reference implementation:
- `helpers.find_data_reference_from_insn()`
- `helpers.find_string_at()`
- `insn.extract_insn_string_features()`
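A sketch of the mapa-side rendering rules; `chase_data_ref` (the depth-10 single-ref chain) and `string_at` (C string at an address, or `None`) are hypothetical callbacks standing in for the capa helpers referenced above:

```python
def render_referenced_strings(ref_eas, chase_data_ref, string_at):
    """Produce the string: values for one function: chase each data
    reference, skip non-string targets, trim trailing whitespace, and
    deduplicate by rendered value while preserving first-seen order."""
    seen = set()
    rendered = []
    for ea in ref_eas:
        s = string_at(chase_data_ref(ea))
        if s is None:
            continue  # final target is bytes, not a string
        s = s.rstrip()
        if s and s not in seen:
            seen.add(s)
            rendered.append(s)
    return rendered
```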
### 11. Reuse capa's name and alternative-name semantics
For the function display name, use this order:
- demangled name
- IDA function name
- alternative names from comments if they help and the main name is poor
- final fallback such as `sub_{ea:x}`
Render Assemblage source name and source file as annotations beneath the function header. Do not mutate the database just to apply Assemblage data.
Reference points:
- `function.extract_function_name()`
- `helpers.get_function_alternative_names()`
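A sketch of that ordering; treating a `sub_`-prefixed name as "poor" is an assumption for illustration, not capa's exact rule:

```python
def display_name(ea, demangled, ida_name, alt_names):
    """Pick the rendered function name: demangled first, then the IDA
    name, then a comment-derived alternative when the main name is an
    auto-generated placeholder, then the sub_{ea:x} fallback."""
    if demangled:
        return demangled
    if ida_name and not ida_name.startswith("sub_"):
        return ida_name
    if alt_names:
        return alt_names[0]
    return ida_name or f"sub_{ea:x}"
```

Assemblage source name and file stay out of this function entirely; they are annotations rendered beneath the header, never a rename.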
### 12. Reattach capa matches by containing function
Keep the current capa JSON input format, but simplify the mapping logic.
Recommended algorithm:
- parse the capa JSON as today
- for each absolute match address, ask IDA for the containing function
- if that function is a thunk, resolve it through the thunk resolver
- attach the rule name to the resolved function start EA
- warn when no containing function exists
This is simpler than the current BinExport-specific mapping and aligns better with IDA's data model.
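The algorithm is a short loop; `containing_function` (absolute address to function start EA, or `None`), `resolve_thunk`, and `is_thunk` are hypothetical callbacks over the open database:

```python
def attach_capa_matches(matches, containing_function, resolve_thunk, is_thunk):
    """matches: (rule_name, va) pairs parsed from the capa JSON.
    Returns {function_start_ea: set(rule_names)} plus the matches with
    no containing function, which the caller should warn about."""
    attached = {}
    orphans = []
    for rule, va in matches:
        start = containing_function(va)
        if start is None:
            orphans.append((rule, va))
            continue
        if is_thunk(start):
            start = resolve_thunk(start)  # land the rule on the real target
        attached.setdefault(start, set()).add(rule)
    return attached, orphans
```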
### 13. Rebuild top-level sections using capa-backed semantics
For `meta`:
- sample name: input path or IDA metadata
- hashes: prefer IDA-provided hash helpers in `capa/ida/helpers.py`
- architecture: reuse the logic in `capa/features/extractors/ida/global_.py`
- timestamp: define explicitly, because BinExport's old field is gone
For `sections`:
- use `ida-domain` segments if possible
- match capa's `skip_header_segments` behavior if needed
For `libraries`:
- use `helpers.get_file_imports()` and group/display import modules accordingly
For `modules`:
- remove the section entirely as an intentional interface change
- document the removal in the spec so future ports do not try to reintroduce BinExport-specific `module` semantics accidentally
### 14. Add tests using capa's existing IDALib pattern
Pure tests should cover:
- Assemblage parsing and RVA-to-VA mapping
- thunk-chain resolution
- import/extern normalization
- string de-duplication and trimming
- final rendering from a prebuilt `MapaReport`
Integration tests should reuse the same lifecycle mapa will use in production:
- resolve the input to an existing or cached database
- open it through the guarded session helper
- collect the mapa report
- compare key functions and sections against golden outputs
Use `tests/test_idalib_features.py` as the reference for version-specific skips and expectations, and use `../idawilli/idals/idals.py` as the reference for database resolution and guarded open/close behavior.
### 15. Validate parity and document deliberate differences
Compare the new output against the frozen Lancelot output on the supplied samples.
Verify specifically:
- function ordering
- thunk rendering
- thunk-forwarded callers
- internal vs API call classification
- libraries/imports section contents
- string extraction
- B/E/I counts
- Assemblage annotations and source-file separators
- capa attachment
Document every known delta. The likely ones are:
- function discovery differences between IDA and Lancelot
- the intentional removal of the `modules` section
- symbol differences across IDA versions, especially ELF on older 9.x
- resource-dependent differences on older IDALib versions
## Minimal implementation checklist
A good order of work is:
1. freeze current mapa outputs
2. add backend-neutral report dataclasses
3. add `resolve_database()` and `open_database_session()` helpers modeled on `idals.py`
4. implement the XDG cache path and quiet-by-default cache creation behavior
5. build a full mapa function inventory that includes thunks
6. port sections and metadata
7. implement import/extern classification to match capa semantics
8. implement the thunk resolver using capa's existing semantics
9. build normalized caller/callee/API indexes
10. port B/E/I using `PREDS | NOEXT`-equivalent CFG traversal
11. port string extraction using capa's data-ref-chain semantics
12. port Assemblage overlay handling
13. port capa JSON address-to-function attachment
14. remove the `modules` section and document the interface change
15. compare outputs against golden references
16. document any proven `ida-domain` gaps and any intentional differences in spec/design during implementation
## Resolved decisions for the implementation handoff
Record these in `spec.md` or `design.md` during implementation so the behavior stays stable.
- accepted inputs: raw binary and existing IDA databases
- cached databases live under the XDG cache root in `mandiant/mapa/`
- mapa may create and persist cached IDA databases automatically
- cache creation stays quiet in normal mode and only surfaces in verbose/debug logging
- Lumina stays disabled for now
- `meta.ts` becomes `datetime.now()`
- remove the `modules` section from the report
- the implementation target is IDALib only and all Lancelot dependencies should be removed
- assume `ida-domain` is sufficient unless the implementer can demonstrate a specific missing API; any lower-level fallback must be justified and documented


@@ -0,0 +1,169 @@
# mapa plan: vendor Quantum Strand string tags
This plan describes how to extend mapa so every rendered `string:` line can carry right-aligned tags from Quantum Strand's string databases. The implementation target is broader than the earlier draft. It should include the full set of useful database-backed tags now: open-source library tags, CRT tags, expert tags, winapi tags, global-prevalence tags, and junk-code tags. The feature is still strictly limited to database matches. It must not import Quantum Strand or FLOSS as a runtime library, and it must not pull in Quantum Strand's whole-file layout analysis, structure tagging, file offsets, encoding columns, or any other non-database context.
The implementer should work only in `mapa/`, `tests/`, and packaging/docs as needed. Nothing belongs under `capa/`. The sibling checkout at `../quantumstrand/` is only a research source and a place to copy vendored resources from once.
## What mapa should do when this lands
When mapa emits a string referenced by a function, the left side should stay in the current mapa style and the right side should show database-derived tags. The renderer should use Rich width calculations so the tag column stays visible and the string text is clipped first if necessary. The output should continue to be function-centric and concise.
A representative result looks like this:
```text
string: "invalid distance code" #zlib
string: "GetProcAddress" #winapi
string: "!This program cannot be run in DOS mode." #common
string: "CurrencyDispenser1" #capa
string: "_initterm" #winapi #code-junk
```
The model should retain richer match metadata than the text renderer shows. The renderer only needs visible tags. The report model should still remember which database family matched and any associated metadata that might matter later.
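A sketch of what such a match record might look like; only `global_count` is named elsewhere in these plans, so the other field names here are assumptions for illustration:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class StringTagMatch:
    tag: str         # rendered tag, e.g. "#zlib"
    family: str      # database family: oss, crt, expert, winapi, gp
    global_count: Optional[int] = None    # prevalence, when the db provides it
    metadata: dict = field(default_factory=dict)  # e.g. an expert rule's action
```

The renderer would read only `tag`; the rest rides along in the report model for later consumers.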
## Non-goals
This feature is not a Quantum Strand port. Do not bring over its file-layout tree, structure labels like `import table`, section box rendering, code-vs-data analysis, duplicate-string tagging, relocation tagging, xor-decoding tags, or hidden-string filtering. Those features solve a different problem. mapa already knows which function references a string and only needs database-backed context for that string.
mapa should not start suppressing strings based on database matches. Even if an upstream expert rule has `action: hide`, mapa should still render the string. Store the action in metadata if it exists, but do not use it to drop rows.
## Upstream resources to copy
All of the following resources should be vendored into mapa under a new package such as `mapa/string_tags/data/`.
Library and CRT databases:
```bash
mkdir -p mapa/string_tags/data/oss mapa/string_tags/data/crt
cp ../quantumstrand/floss/qs/db/data/oss/*.jsonl.gz mapa/string_tags/data/oss/
cp ../quantumstrand/floss/qs/db/data/crt/msvc_v143.jsonl.gz mapa/string_tags/data/crt/
```
This copies these library databases:
`brotli.jsonl.gz`, `bzip2.jsonl.gz`, `cryptopp.jsonl.gz`, `curl.jsonl.gz`, `detours.jsonl.gz`, `jemalloc.jsonl.gz`, `jsoncpp.jsonl.gz`, `kcp.jsonl.gz`, `liblzma.jsonl.gz`, `libpcap.jsonl.gz`, `libsodium.jsonl.gz`, `mbedtls.jsonl.gz`, `openssl.jsonl.gz`, `sqlite3.jsonl.gz`, `tomcrypt.jsonl.gz`, `wolfssl.jsonl.gz`, `zlib.jsonl.gz`, plus the CRT database `msvc_v143.jsonl.gz`.
Expert, winapi, prevalence, and junk-code databases:
```bash
mkdir -p mapa/string_tags/data/expert mapa/string_tags/data/winapi mapa/string_tags/data/gp
cp ../quantumstrand/floss/qs/db/data/expert/capa.jsonl mapa/string_tags/data/expert/
cp ../quantumstrand/floss/qs/db/data/winapi/apis.txt.gz mapa/string_tags/data/winapi/
cp ../quantumstrand/floss/qs/db/data/winapi/dlls.txt.gz mapa/string_tags/data/winapi/
cp ../quantumstrand/floss/qs/db/data/gp/gp.jsonl.gz mapa/string_tags/data/gp/
cp ../quantumstrand/floss/qs/db/data/gp/cwindb-native.jsonl.gz mapa/string_tags/data/gp/
cp ../quantumstrand/floss/qs/db/data/gp/cwindb-dotnet.jsonl.gz mapa/string_tags/data/gp/
cp ../quantumstrand/floss/qs/db/data/gp/junk-code.jsonl.gz mapa/string_tags/data/gp/
cp ../quantumstrand/floss/qs/db/data/gp/xaa-hashes.bin mapa/string_tags/data/gp/
cp ../quantumstrand/floss/qs/db/data/gp/yaa-hashes.bin mapa/string_tags/data/gp/
```
The implementor should also create `mapa/string_tags/SOURCES.md` and record the upstream repo path, upstream commit, copied files, and any code copied or rewritten from upstream. The research for this plan used upstream commit `73eb1541e896c065fc694ba7b01067f56871631b`.
## Upstream code to read before implementing
The useful Quantum Strand code is small. Before writing anything, read `../quantumstrand/floss/qs/db/oss.py`, `expert.py`, `gp.py`, `winapi.py`, `../quantumstrand/floss/qs/main.py`, and the tests `../quantumstrand/tests/test_oss_db.py`, `test_winapi_db.py`, `test_gp_db.py`, `test_qs.py`, and `test_qs_pma0101.py`.
The only part of `floss/qs/main.py` that should influence mapa design is the small tagging and Rich rendering logic. Leave the rest of that file behind.
## Behavior that must be preserved from Quantum Strand
Quantum Strand's database lookups are simple and should be preserved exactly.
The OSS and CRT databases are gzip-compressed JSONL files. Each line contains one `OpenSourceString` record with fields such as `string`, `library_name`, `library_version`, `file_path`, `function_name`, and `line_number`. Lookup is exact by `string`. A match emits tag `#<library_name>`. The CRT file uses `library_name: "msvc"`, so it emits `#msvc`.
The expert database file is plain `capa.jsonl`, not gzip-compressed despite what the readme says. Each record is an `ExpertRule` with `type`, `value`, `tag`, `action`, and descriptive metadata. Matching behavior follows `floss/qs/db/expert.py`: exact string match for `type == "string"`, substring search for `type == "substring"`, and `re.compile(rule.value).search(...)` for `type == "regex"`. A match emits `rule.tag`, which in the current vendored file is typically `#capa`.
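The three expert rule types reduce to a small dispatch. A sketch, with a simplified `ExpertRule` shape (the real record carries more descriptive metadata):

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertRule:
    type: str         # "string" | "substring" | "regex"
    value: str
    tag: str          # e.g. "#capa"
    action: str = ""  # stored for metadata; mapa never uses it to hide rows


def expert_matches(rules: list[ExpertRule], raw: str) -> list[ExpertRule]:
    """Return every rule that matches, mirroring floss/qs/db/expert.py."""
    hits = []
    for rule in rules:
        if rule.type == "string" and raw == rule.value:
            hits.append(rule)
        elif rule.type == "substring" and rule.value in raw:
            hits.append(rule)
        elif rule.type == "regex" and re.search(rule.value, raw):
            hits.append(rule)
    return hits
```

Upstream precompiles with `re.compile(rule.value).search(...)`; `re.search` is behaviorally equivalent and the real loader can precompile for speed.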
The winapi database is two gzip-compressed text files. `dlls.txt.gz` is loaded into a lowercase set and matched against `string.lower()`. `apis.txt.gz` is loaded into a case-sensitive set and matched against the string verbatim. A match from either source emits `#winapi`.
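The asymmetric case rules are easy to get wrong, so a sketch is worth having (function names hypothetical; the real loader reads the two `.txt.gz` files line by line):

```python
def load_winapi_sets(dll_lines, api_lines):
    """dlls: lowercase set, matched case-insensitively.
    apis: case-sensitive set, matched verbatim."""
    dlls = {line.strip().lower() for line in dll_lines if line.strip()}
    apis = {line.strip() for line in api_lines if line.strip()}
    return dlls, apis


def is_winapi(dlls: set, apis: set, raw: str) -> bool:
    return raw.lower() in dlls or raw in apis
```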
The global-prevalence JSONL databases are `gp.jsonl.gz`, `cwindb-native.jsonl.gz`, and `cwindb-dotnet.jsonl.gz`. Quantum Strand loads them as `StringGlobalPrevalenceDatabase` and does exact string lookup. Any hit in any of those databases emits `#common`.
The junk-code JSONL database is `junk-code.jsonl.gz`. It has the same file format as the prevalence JSONL databases, but Quantum Strand treats it separately. Any hit emits `#code-junk`.
The hash databases are `xaa-hashes.bin` and `yaa-hashes.bin`. Each file is a flat sequence of 8-byte truncated MD5 digests. Quantum Strand computes `md5(string.encode("utf-8")).digest()[:8]` and checks membership in the set. A hit emits `#common`.
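The hash-database format and lookup can be captured directly from the description above; a minimal sketch:

```python
import hashlib


def load_hash_db(data: bytes) -> set[bytes]:
    """The file is a flat sequence of 8-byte truncated MD5 digests."""
    assert len(data) % 8 == 0
    return {data[i : i + 8] for i in range(0, len(data), 8)}


def hash_db_contains(db: set[bytes], raw: str) -> bool:
    # Quantum Strand's rule: md5 of the UTF-8 string, truncated to 8 bytes.
    return hashlib.md5(raw.encode("utf-8")).digest()[:8] in db
```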
These match rules are the core of the feature. They are much more important than matching Quantum Strand's internal class names.
## Recommended mapa package layout
Add a dedicated package under `mapa/`. A good layout is `mapa/string_tags/__init__.py`, `model.py`, `loaders.py`, `tagger.py`, a `data/` subtree, and `SOURCES.md`.
Do not copy upstream modules verbatim unless necessary. A mapa-local rewrite is cleaner because the code is short and mapa needs a narrower API than Quantum Strand.
`model.py` should define two small dataclasses. `StringTagMatch` should capture one concrete match with fields like `tag`, `source_family`, `source_name`, `kind`, and optional metadata such as `library_name`, `library_version`, `file_path`, `function_name`, `line_number`, `note`, `description`, `action`, `global_count`, `encoding`, and `location`. `StringTagResult` should hold the final sorted tag tuple plus the tuple of `StringTagMatch` entries.
`loaders.py` should own the file-format readers. It should use `gzip`, `hashlib`, `msgspec`, and `importlib.resources`. There is no reason to invent a new parser. This project already depends on `msgspec`, which is also what Quantum Strand uses for the JSONL formats.
`tagger.py` should own the process-wide cached tagger. A simple shape is `load_default_tagger()` plus an object with `tag_string(raw: str) -> StringTagResult`. The tagger should lazily load and cache the vendored databases once per process.
## Report-model changes
`mapa/model.py` should be extended so a rendered mapa string can carry tags and match metadata. The minimal change is to add `tags` and `tag_matches` to `MapaString`. The existing `value` field should remain the display string. If the implementor wants to preserve the exact raw string too, add a `raw_value` field. That is worthwhile because mapa currently trims trailing whitespace before storing the string, and exact-match databases should run against the untrimmed value.
The most important collector rule is this: match against the raw extracted string first, derive the display string second, and deduplicate on the display string only after the database matches have been computed. If two raw strings collapse to the same display value after `rstrip()`, their tags and metadata should be merged onto the single rendered `MapaString` entry.
## Collector guidance
The collector should keep its existing string-discovery behavior. This plan does not ask the implementor to revisit how mapa follows data references or how it discovers a string in IDA. Once `collect_report()` recovers a raw string, the new tagging pipeline begins.
A good implementation sequence inside `mapa/collector.py` is: recover `raw_value`, call the vendored tagger on `raw_value`, compute `display_value = raw_value.rstrip()`, skip empty display values, and then either create or update the `MapaString` entry for that display value. The update path should union tag names and append only unique `StringTagMatch` values. The final `MapaString.tags` should be sorted for stable rendering and stable tests.
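That sequence can be sketched as follows; `MapaString` is simplified here to just the fields this step touches, and `merge_string` is a hypothetical helper name:

```python
from dataclasses import dataclass


@dataclass
class MapaString:  # simplified from the plan's model sketch
    value: str
    tags: tuple[str, ...] = ()


def merge_string(strings: dict, raw_value: str, tagger) -> None:
    result = tagger(raw_value)           # 1. tag the raw string first
    display_value = raw_value.rstrip()   # 2. derive the display string second
    if not display_value:                # 3. skip empty display values
        return
    existing = strings.get(display_value)
    if existing is None:
        strings[display_value] = MapaString(display_value, tuple(sorted(result)))
    else:
        # 4. raw strings that collapse to one display value merge their tags
        merged = sorted(set(existing.tags) | set(result))
        strings[display_value] = MapaString(display_value, tuple(merged))
```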
This is the one place where the current mapa behavior is most likely to cause silent misses. If the implementor tags only the trimmed string, exact-match results from Quantum Strand can be lost.
## Tag aggregation rules
The model should preserve all concrete matches, even when multiple databases emit the same visible tag. This matters most for `#common`, because a string may hit several prevalence databases and one or both hash databases. The visible tag list should deduplicate tag names, but the metadata should preserve every source that contributed.
The tagger should produce tags in a deterministic order. A simple stable order is alphabetical order on the tag name after aggregation. The metadata order should also be deterministic, for example by `(tag, source_family, source_name, library_name, note, value)`.
## Rendering guidance
Replace the current plain markup string for `string:` rows with a dedicated Rich `Text` builder. The implementor should read `render_string()` and related helpers in `../quantumstrand/floss/qs/main.py` and copy only the layout idea. The left side is the existing `string: "..."` text. The right side is the space-joined visible tag list. Width should come from Rich's own measurement.
A helper such as `Renderer.render_string_line(value: str, tags: Sequence[str]) -> Text` is sufficient. It should use `self.console.size.width - (self.indent * 2)` as the available width for the line content, build a `Text` object for the left side and another for the right side, reserve at least one separating space, and then align or truncate the left side so the right side stays visible. If the terminal is too narrow for that layout, fall back to a single-column form that still shows the tags.
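The alignment math behind that helper looks like this. This is a Rich-free illustration using `len()`; the real helper should build `rich.text.Text` objects and use Rich's cell-width measurement, as the plan requires:

```python
def layout_string_line(value: str, tags: list[str], width: int) -> str:
    """Left side: string text; right side: space-joined tags, kept visible."""
    left = f'string: "{value}"'
    if not tags:
        return left
    right = " ".join(tags)
    available = width - len(right) - 1  # reserve one separating space
    if available < 8:                   # too narrow: single-column fallback
        return f"{left} {right}"
    if len(left) > available:
        left = left[: available - 1] + "\u2026"  # clip the string side first
    return left + " " * (width - len(left) - len(right)) + right
```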
mapa should adopt one Quantum Strand display rule because it reduces noise without hiding information: when a string has `#common` plus one or more more-specific tags, omit `#common` from the visible tag column but keep it in `tag_matches` and `MapaString.tags`. That is a rendering choice only. The underlying data should stay intact.
No string row should be hidden by tag policy. `#common` and `#code-junk` may be styled in a muted color. `#capa` may be highlighted. `#winapi` and library tags can use the default string-tag style unless the implementor finds a better minimal palette. The important behavior is visibility and stable alignment. Decorative styling is secondary.
## Recommended visible-tag policy
The rendered tag column should follow these rules.
Show all tags except `#common` when a more-specific tag is also present. Keep `#common` visible only if it is the only tag. Show `#code-junk` even when other tags are present because it communicates a different kind of context than `#common`. Show `#winapi`, `#capa`, and library tags directly. Do not invent mapa-specific aliases or rename the upstream tags.
This yields readable outputs such as `#winapi #code-junk`, `#capa`, `#zlib`, or `#common`. It avoids noisy combinations like `#common #winapi` on every common API name.
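The whole policy reduces to one small function (name hypothetical):

```python
def visible_tags(tags: set[str]) -> list[str]:
    """Rendering-only policy: suppress #common when any more-specific tag
    is present; never suppress anything else. The model keeps all tags."""
    specific = sorted(t for t in tags if t != "#common")
    if specific:
        return specific
    return ["#common"] if "#common" in tags else []
```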
## Packaging guidance
If mapa needs to work from an installed package, `pyproject.toml` will need changes because it currently only packages `capa*`. The implementor should include `mapa*` packages and package data under `mapa/string_tags/data/`. The loader should use `importlib.resources.files()` so it works both from a source checkout and an installed wheel.
Even if packaging is deferred, the code should still use `importlib.resources` because it centralizes the resource lookup and avoids hard-coded repository-relative paths.
## Implementation steps for the handoff
The implementor should start by copying the resources, writing `mapa/string_tags/SOURCES.md`, and adding pure loader tests before touching mapa's collector or renderer. Then they should implement the small loader layer for the upstream database families: OSS/CRT, expert, winapi, the prevalence and junk-code JSONL files, and the prevalence hash files. After that they should implement the aggregated tagger and add pure tagger tests using known literals from the vendored datasets.
Once the tagger is stable, they should extend `MapaString`, thread tagging through `mapa/collector.py`, and finally switch `mapa/renderer.py` to the Rich `Text`-based string-row helper. Only after all of that is working should they update packaging and installed-resource handling, because those changes are easier to verify when the core behavior already exists.
During implementation they should update `doc/plans/spec.md` and `doc/plans/design.md` to record the final user-visible behavior and the final module layout. The spec should say that `string:` rows may carry right-aligned database tags and should document the visible-tag policy. The design doc should say where the vendored databases live, how the loader is structured, and how the collector merges raw-string matches into deduplicated display strings.
## Concrete test plan
Most tests should avoid IDA. Start with pure loader and tagger tests. Known-good assertions from the upstream data include `"invalid distance code" -> #zlib`, `"IsolationAware function called after IsolationAwareCleanup" -> #msvc`, `"CurrencyDispenser1" -> #capa`, `"kernel32.dll" -> #winapi`, `"CreateFileA" -> #winapi`, and `"!This program cannot be run in DOS mode." -> #common`. `"_initterm"` is a useful mixed case because Quantum Strand's own tests show it as both `#winapi` and `#code-junk`.
Cover the expert database's three rule types: exact, substring, and regex. Add a hash-database test that emits `#common` even when the string is absent from the JSONL prevalence files. Add another case where several databases contribute the same visible tag and the metadata still records every contributing match.
Add renderer tests using a fixed-width Rich console. One test should show that an untagged string row still matches the old mapa format. Another should show that a tagged row keeps the tag column at the right edge. A narrow-width test should show that the string side is clipped first. Another should check that `#common` disappears from the visible tag list when a more-specific tag exists while remaining present in the underlying model.
Finally, add report tests in `tests/test_mapa.py` that build a small `MapaReport` directly. At least one string should carry a library tag, at least one should carry `#common`, and at least one should carry a multi-tag combination like `#winapi #code-junk`. None of these tests should require IDA.
## Performance and memory notes
Vendoring every requested database is still practical, but loading them all eagerly may have a noticeable startup cost. The tagger should therefore be cached process-wide and built lazily. Hash files should be read once into memory as sets of 8-byte digests. The string databases should be decoded once into in-memory maps. This is a good place to keep the code simple first and optimize only if startup becomes a measured problem.
The current compressed data footprint is modest for OSS, CRT, expert, and winapi. The prevalence family is the largest part of the set, especially the hash files. That is another reason to centralize loading and avoid repeated per-function or per-string initialization.
## Notes the implementor should not miss
`floss/qs/db/data/expert/readme.md` says the expert database is gzip-compressed, but the shipped file is plain `capa.jsonl`. Follow the code and the actual file on disk. `floss/qs/db/oss.py` includes the CRT file in `DEFAULT_PATHS`, so treat `#msvc` as part of the library tagging feature. Quantum Strand's `remove_false_positive_lib_strings()` should not be copied because its five-hit threshold is tuned for whole-file triage and fits mapa's per-function presentation poorly.
The main risk in this work is not the file formats. It is silent semantic drift during integration. The implementor should preserve Quantum Strand's exact query rules, tag against the raw string before trimming, keep all concrete matches in metadata, and only simplify at the renderer boundary.

File diff suppressed because one or more lines are too long

doc/plans/spec.md Normal file

@@ -0,0 +1,92 @@
# mapa specification
mapa renders a binary's function map as either a structured text report or a self-contained HTML page: metadata, sections, import libraries, and a per-function breakdown of callers, callees, API calls, strings, CFG stats, capa rule matches, and optional Assemblage ground truth.
## Invocation
```
python scripts/mapa.py <input_file> [--capa <capa.json>] [--assemblage <functions.csv>] [--output {text,html-map}] [--open] [--verbose] [--quiet]
```
`input_file` accepts raw binaries (PE, ELF), existing IDA databases (`.i64`, `.idb`), or any file IDA can analyze. For raw files, mapa automatically creates and caches an analyzed IDA database under the XDG cache root (`$XDG_CACHE_HOME/mandiant/mapa/` or `~/.cache/mandiant/mapa/`) keyed by the file's SHA-256 hash.
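The cache-key derivation can be sketched as follows. The directory layout matches the spec; the `<sha256>.i64` filename is an assumption for illustration:

```python
import hashlib
import os
from pathlib import Path


def cache_path_for(input_file: Path) -> Path:
    """XDG-cache location for the analyzed IDA database, keyed by SHA-256."""
    sha256 = hashlib.sha256(input_file.read_bytes()).hexdigest()
    xdg = os.environ.get("XDG_CACHE_HOME")
    root = Path(xdg) if xdg else Path.home() / ".cache"
    # filename scheme is hypothetical; the key is the content hash
    return root / "mandiant" / "mapa" / f"{sha256}.i64"
```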
## Backend
IDALib only. All analysis uses `ida-domain` as the primary query API. The Lancelot/BinExport2 backend has been removed.
## Output modes
`text` is the default. It renders the existing structured terminal report to stdout.
`html-map` renders a single standalone HTML document to stdout. The page inlines all HTML, CSS, JavaScript, and data. It has a compact metadata summary, a tag control strip, and neighborhood controls at the top, then a split view below. The left pane contains the function grid and the right pane contains the program-string list.
The two panes scroll independently. A draggable vertical divider lets the user resize the panes horizontally. Function squares stay in function-address order and still use the naive left-to-right wrapping layout, but they now wrap within the current width of the left pane rather than the full page width.
Function squares are fixed small blocks laid out left-to-right and wrapped responsively within the left pane. Seed selection works the same way for functions, tags, and strings:
- Hovering a function square previews that function's neighborhood; clicking it locks or unlocks that function as the active seed.
- Hovering a tag previews the neighborhood seeded by every function that directly references a string with that tag; clicking it locks or unlocks that tag seed set.
- Hovering a string row previews the neighborhood seeded by every function that directly references that string; clicking it locks or unlocks that string seed set.
Only one active seed source is used at a time: any locked function, tag, or string overrides hover previews from the other sources until it is cleared.
The neighborhood controls choose graph direction and traversal depth. Direction can be `callers`, `callees`, or `both`, with `both` as the default. Depth can be 1 through 4, with 3 as the default. For the active seed set, mapa runs a bounded breadth-first search from each seed over the selected direction mode, uses shortest distance from that seed, applies geometric decay `0.5 ** distance`, and adds the per-seed contributions together. Direct seed functions get an outline. Reached functions get a fill intensity scaled within the current view. Functions outside the active neighborhood dim.
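The heat computation described above, sketched in Python for clarity (the page itself implements it in inlined JavaScript); `adjacency` is assumed to already reflect the selected direction mode:

```python
from collections import deque


def neighborhood_heat(adjacency: dict, seeds: list, depth: int) -> dict:
    """Additive geometric-decay heat: per seed, a depth-bounded BFS gives
    shortest distance d, contributing 0.5 ** d; seed contributions sum."""
    heat: dict = {}
    for seed in seeds:
        dist = {seed: 0}
        queue = deque([seed])
        while queue:
            node = queue.popleft()
            if dist[node] == depth:
                continue  # bound the traversal at the selected depth
            for neighbor in adjacency.get(node, ()):
                if neighbor not in dist:  # BFS guarantees shortest distance
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        for node, d in dist.items():
            heat[node] = heat.get(node, 0.0) + 0.5 ** d
    return heat
```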
The tag strip is sorted by descending distinct-function count, then tag name, and each control shows that count. The page also shows a small legend for heat, seed outline, and dim states. The string list shows each string's virtual address explicitly, preserves duplicate display values at different addresses, and shows visible tags right-aligned in each row. Function hover shows a tooltip containing the same single-function mapa summary content as text mode, plus the current heat, seed state, and shortest distance when a neighborhood is active. Top-level tag controls use only string tags. Capa rule names are not included there.
`--open` is only valid with `--output html-map`. In that mode, mapa writes the HTML to a temporary `.html` file, opens the user's local web browser on the corresponding `file://` URL, and does not write the HTML document to stdout.
The visible-tag policy is the same in both modes: hide `#common` when a more-specific tag is present, but keep it visible when it is the only tag.
## Report sections
The text report renders these sections in order:
1. meta — file name, SHA-256, architecture, timestamp
2. sections — memory segments with address, permissions (rwx), and size
3. libraries — import modules
4. functions — per-function detail in address order
### Functions section
Each function renders as either `thunk <name> @ <address>` or `function <name> @ <address>` followed by:
- source-file separator — a horizontal rule inserted before a function when its primary Assemblage source path differs from the last seen non-empty source path
- `assemblage name:` — source function name from Assemblage, when available
- `assemblage file:` — source file path from Assemblage, when available
- `xref:` — callers with direction arrow and function-order delta
- `B/E/I:` — basic blocks / CFG edges / instructions (total bytes)
- `capa:` — matched capa rule names
- `calls:` — internal non-library callees with direction and delta
- `api:` — import/external/library callees
- `string:` — referenced strings (deduplicated, whitespace-trimmed), with optional right-aligned database tags
Thunk functions show only the header plus any Assemblage lines.
### Assemblage overlay
When `--assemblage` is provided, mapa reads a CSV file and requires these columns: `hash`, `name`, `start`, `end`, and `source_file`.
Assemblage matching works like this:
- mapa resolves the sample SHA-256 from the input file or the opened IDA database.
- mapa keeps only CSV rows whose `hash` matches that SHA-256, case-insensitively.
- mapa treats `start` and `end` as RVAs and adds the IDA database base address to map them to function VAs.
- mapa does not rename functions, callers, or callees from Assemblage data. The displayed function header stays IDA-derived.
- mapa strips the trailing provenance suffix from `source_file` before rendering, for example `C:\src\foo.c (MD5: ...)` renders as `C:\src\foo.c`.
- Exact duplicate CSV rows are deduplicated. If multiple distinct Assemblage rows map to the same function address, mapa renders all of them in CSV order.
- For source-file separators, mapa uses the first Assemblage record's normalized `source_file` path as the function's primary source path.
- Missing Assemblage data does not start or end a source-file run. It does not trigger a separator and does not reset the last seen non-empty source path.
- When a later function has a different primary source path from the last seen non-empty source path, mapa inserts a separator immediately before that function.
## Deliberate interface changes from the Lancelot/BinExport2 version
- The `modules` section has been removed. BinExport2's module concept has no IDA equivalent.
## Decisions
- 2026-03-16: Lumina disabled during database creation via `IdaCommandOptions(plugin_options="lumina:host=0.0.0.0 -Osecondary_lumina:host=0.0.0.0")`, matching capa's `loader.py`. The `plugin_options` field maps to IDA's `-O` switch; embedding `-O` in the value for the second option works because `build_args()` concatenates it verbatim. Resource loading enabled via `load_resources=True` (maps to `-R`).
- 2026-03-16: Cache directory is `$XDG_CACHE_HOME/mandiant/mapa/` (or `~/.cache/mandiant/mapa/`). Separate from idals cache.
- 2026-03-16: `meta.ts` is `datetime.now(UTC).isoformat()` — no longer sourced from BinExport2.
- 2026-03-16: Thunk chain depth limit is 5 (matches capa's `THUNK_CHAIN_DEPTH_DELTA`).
- 2026-03-16: CFG stats use `FlowChartFlags.NOEXT | FlowChartFlags.PREDS` to match capa's block enumeration semantics.
- 2026-03-16: String extraction follows single data-reference chains up to depth 10, matching capa's `find_data_reference_from_insn`.
- 2026-03-16: String rows may carry right-aligned database tags derived from vendored Quantum Strand string databases. Tags include `#<library>` (e.g. `#zlib`, `#openssl`), `#msvc`, `#capa`, `#winapi`, `#common`, and `#code-junk`. Visible tag policy: `#common` is hidden when a more-specific tag is present; `#code-junk` is always shown. Tags are matched against the raw (untrimmed) string value. The underlying model preserves all match metadata even when the renderer suppresses a visible tag.
- 2026-03-16: Assemblage input is a CSV keyed by sample SHA-256. mapa matches rows by `hash`, converts `start`/`end` RVAs to VAs using the database base address, annotates functions with `assemblage name:` and `assemblage file:` lines, and does not override IDA-derived function names.
- 2026-03-16: `--output html-map` uses only string tags in the top control strip, sorts them by descending distinct-function count then name, shows those counts in the controls, and lists program strings by string VA with explicit addresses.
- 2026-03-16: `--output html-map` uses a split view with independently scrolling function and string panes, a draggable vertical divider, right-aligned visible tags in each string row, function/tag/string seed selection, caller/callee/both direction controls, depth controls from 1 to 4, and additive geometric-decay neighborhood heat based on bounded shortest-distance traversal.
- 2026-03-16: `--open` is valid only with `--output html-map`. It writes the HTML report to a temporary `.html` file and opens the local browser on that file instead of writing the HTML to stdout.


@@ -7,6 +7,7 @@
- [ ] Review changes
- capa https://github.com/mandiant/capa/compare/\<last-release\>...master
- capa-rules https://github.com/mandiant/capa-rules/compare/\<last-release\>...master
- [ ] Run `$ bump-my-version bump {patch/minor/major} [--allow-dirty]` to update [capa/version.py](https://github.com/mandiant/capa/blob/master/capa/version.py) and other version files
- [ ] Update [CHANGELOG.md](https://github.com/mandiant/capa/blob/master/CHANGELOG.md)
- Do not forget to add a nice introduction thanking contributors
- Remember that we need a major release if we introduce breaking changes
@@ -36,7 +37,6 @@
- [capa <release>...master](https://github.com/mandiant/capa/compare/<release>...master)
- [capa-rules <release>...master](https://github.com/mandiant/capa-rules/compare/<release>...master)
```
- [ ] Update [capa/version.py](https://github.com/mandiant/capa/blob/master/capa/version.py)
- [ ] Create a PR with the updated [CHANGELOG.md](https://github.com/mandiant/capa/blob/master/CHANGELOG.md) and [capa/version.py](https://github.com/mandiant/capa/blob/master/capa/version.py). Copy this checklist in the PR description.
- [ ] Update the [homepage](https://github.com/mandiant/capa/blob/master/web/public/index.html) (i.e. What's New section)
- [ ] After PR review, merge the PR and [create the release in GH](https://github.com/mandiant/capa/releases/new) using text from the [CHANGELOG.md](https://github.com/mandiant/capa/blob/master/CHANGELOG.md).


@@ -2,6 +2,21 @@
See `capa -h` for all supported arguments and usage examples.
## Ways to consume capa output
| Method | Output / interface | Typical use |
|--------|--------------------|-------------|
| **CLI** | Text (default, `-v`, `-vv`), JSON (`-j`), or other formats | Scripting, CI, one-off analysis |
| [**IDA Pro**](https://github.com/mandiant/capa/tree/master/capa/ida/plugin) | capa Explorer plugin inside IDA | Interactive analysis with jump-to-address |
| [**Ghidra**](https://github.com/mandiant/capa/tree/master/capa/ghidra/plugin) | capa Explorer plugin inside Ghidra | Interactive analysis with Ghidra integration |
| [**Binary Ninja**](https://github.com/mandiant/capa/tree/master/capa/features/extractors/binja) | capa run using Binary Ninja as the analysis backend | Interactive analysis with Binary Ninja integration |
| [**Dynamic (Sandbox)**](https://www.mandiant.com/resources/blog/dynamic-capa-executable-behavior-cape-sandbox) | capa run on dynamic sandbox report (CAPE, VMRay, etc.) | Dynamic analysis of sandbox output |
| [**Web (capa Explorer)**](https://mandiant.github.io/capa/explorer/) | Web UI (upload JSON or load from URL) | Sharing results, viewing from VirusTotal or similar |
## Default vs verbose output
By default, capa shows only *top-level* rule matches: capabilities that are not already implied by another displayed rule. For example, if a rule "persist via Run registry key" matches and it *contains* a match for "set registry value", the default output lists only "persist via Run registry key". This keeps the default output short while still reflecting all detected capabilities at the top level. Use **`-v`** to see all rule matches, including nested ones. Use **`-vv`** for an even more detailed view that shows how each rule matched.
## tips and tricks
### only run selected rules
@@ -11,7 +26,7 @@ For example, `capa -t william.ballenthin@mandiant.com` runs rules that reference
### only analyze selected functions
Use the `--restrict-to-functions` option to extract capabilities from only a selected set of functions. This is useful for analyzing
large functions and figuring out their capabilities and their address of occurrence; for example: PEB access, RC4 encryption, etc.
To use this, you can copy the virtual addresses from your favorite disassembler and pass them to capa as follows:
`capa sample.exe --restrict-to-functions 0x4019C0,0x401CD0`. If you add the `-v` option then capa will extract the interesting parts of a function for you.

mapa/__init__.py Normal file

mapa/__main__.py Normal file

@@ -0,0 +1,5 @@
import sys
from mapa.cli import main
sys.exit(main())

mapa/assemblage.py Normal file

@@ -0,0 +1,57 @@
from __future__ import annotations

import csv
import logging
from collections import defaultdict
from pathlib import Path

from mapa.model import AssemblageRecord

logger = logging.getLogger(__name__)

REQUIRED_COLUMNS = frozenset({"hash", "name", "start", "end", "source_file"})


def validate_assemblage_columns(fieldnames: list[str] | None) -> None:
    columns = set(fieldnames or [])
    missing = sorted(REQUIRED_COLUMNS - columns)
    if missing:
        raise ValueError(
            f"assemblage CSV is missing required columns: {', '.join(missing)}"
        )


def load_assemblage_records(
    assemblage_path: Path,
    sample_sha256: str,
    base_address: int,
) -> dict[int, list[AssemblageRecord]]:
    if not sample_sha256:
        raise ValueError("sample sha256 is required to load assemblage data")
    normalized_sha256 = sample_sha256.lower()
    records_by_address: defaultdict[int, list[AssemblageRecord]] = defaultdict(list)
    seen_by_address: defaultdict[int, set[AssemblageRecord]] = defaultdict(set)
    with assemblage_path.open("rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle)
        validate_assemblage_columns(reader.fieldnames)
        for row in reader:
            row_hash = (row.get("hash") or "").strip().lower()
            if row_hash != normalized_sha256:
                continue
            record = AssemblageRecord.from_csv_row(row, base_address=base_address)
            seen = seen_by_address[record.address]
            if record in seen:
                continue
            seen.add(record)
            records_by_address[record.address].append(record)
    logger.debug(
        "loaded %d assemblage records for %s from %s",
        sum(len(records) for records in records_by_address.values()),
        normalized_sha256,
        assemblage_path,
    )
    return dict(records_by_address)

mapa/cli.py Normal file

@@ -0,0 +1,235 @@
from __future__ import annotations

import argparse
import json
import logging
import sys
import time
import webbrowser
from collections import defaultdict
from pathlib import Path
from tempfile import NamedTemporaryFile
from typing import Callable

from rich.console import Console
from rich.logging import RichHandler
from rich.theme import Theme

from mapa.assemblage import load_assemblage_records
from mapa.model import AssemblageRecord

logger = logging.getLogger("mapa")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="mapa: binary function map")
    parser.add_argument(
        "input_file", type=Path, help="path to input file (binary, .i64, or .idb)"
    )
    parser.add_argument("--capa", type=Path, help="path to capa JSON results file")
    parser.add_argument("--assemblage", type=Path, help="path to Assemblage CSV file")
    parser.add_argument(
        "--output",
        choices=("text", "html-map"),
        default="text",
        help="output format",
    )
    parser.add_argument(
        "--open",
        action="store_true",
        help="when used with --output html-map, write to a temp file and open it in a browser",
    )
    parser.add_argument("--verbose", action="store_true", help="enable verbose logging")
    parser.add_argument(
        "--quiet", action="store_true", help="disable all output but errors"
    )
    return parser


def validate_output_options(output: str, open_report: bool) -> None:
    if open_report and output != "html-map":
        raise ValueError("--open requires --output html-map")


def write_temp_html_report(content: str, directory: Path | None = None) -> Path:
    temp_dir = None if directory is None else str(directory)
    with NamedTemporaryFile(
        mode="w",
        encoding="utf-8",
        suffix=".html",
        prefix="mapa-",
        delete=False,
        dir=temp_dir,
    ) as handle:
        handle.write(content)
    return Path(handle.name)


def open_html_report(
    content: str,
    opener: Callable[[str], bool] | None = None,
    directory: Path | None = None,
) -> Path:
    report_path = write_temp_html_report(content, directory=directory)
    browser_opener = webbrowser.open if opener is None else opener
    opened = browser_opener(report_path.as_uri())
    if not opened:
        raise RuntimeError(f"failed to open browser for {report_path}")
    return report_path


def _load_capa_matches(
    capa_path: Path,
    thunk_targets: dict[int, int],
) -> dict[int, set[str]]:
    """Load capa JSON and map matches to function addresses."""
    doc = json.loads(capa_path.read_text())
    functions_by_basic_block: dict[int, int] = {}
    for function in doc["meta"]["analysis"]["layout"]["functions"]:
        for basic_block in function["matched_basic_blocks"]:
            functions_by_basic_block[basic_block["address"]["value"]] = function[
                "address"
            ]["value"]
    matches_by_address: defaultdict[int, set[str]] = defaultdict(set)
    for rule_name, results in doc["rules"].items():
        for location, _ in results["matches"]:
            if location["type"] != "absolute":
                continue
            matches_by_address[location["value"]].add(rule_name)
matches_by_function: defaultdict[int, set[str]] = defaultdict(set)
for address, matches in matches_by_address.items():
func_addr = functions_by_basic_block.get(address, address)
if func_addr in thunk_targets:
logger.debug(
"forwarding capa matches from thunk 0x%x to 0x%x",
func_addr,
thunk_targets[func_addr],
)
func_addr = thunk_targets[func_addr]
matches_by_function[func_addr].update(matches)
for match in matches:
logger.info("capa: 0x%x: %s", func_addr, match)
return dict(matches_by_function)
def main(argv: list[str] | None = None) -> int:
if argv is None:
argv = sys.argv[1:]
parser = build_parser()
args = parser.parse_args(args=argv)
try:
validate_output_options(args.output, args.open)
except ValueError as error:
parser.error(str(error))
stderr_console = Console(stderr=True)
logging.basicConfig(
level=logging.DEBUG
if args.verbose
else (logging.ERROR if args.quiet else logging.INFO),
format="%(message)s",
handlers=[
RichHandler(
console=stderr_console, show_path=False, rich_tracebacks=args.verbose
)
],
)
from mapa.collector import collect_report
from mapa.html_renderer import render_html_map
from mapa.ida_db import open_database_session, resolve_database
from mapa.renderer import render_report
t0 = time.time()
db_path, md5, sha256 = resolve_database(args.input_file)
logger.debug("perf: resolve_database: %0.2fs", time.time() - t0)
theme = Theme(
{
"decoration": "grey54",
"title": "yellow",
"key": "blue",
"value": "blue",
"default": "blue",
},
inherit=False,
)
console = Console(theme=theme, markup=False, emoji=False)
t0 = time.time()
with open_database_session(db_path) as db:
logger.debug("perf: open_database: %0.2fs", time.time() - t0)
base_address = db.base_address or 0
effective_sha256 = sha256 or db.sha256 or ""
assemblage_records_by_address: dict[int, list[AssemblageRecord]] = {}
if args.assemblage:
assemblage_records_by_address = load_assemblage_records(
args.assemblage,
sample_sha256=effective_sha256,
base_address=base_address,
)
matches_by_function: dict[int, set[str]] = {}
if args.capa:
from ida_domain.functions import FunctionFlags
from mapa.collector import (
_build_extern_index,
_build_import_index,
_resolve_thunk_target,
)
import_index = _build_import_index(db)
extern_addrs = _build_extern_index(db)
thunk_targets: dict[int, int] = {}
for func in db.functions:
flags = db.functions.get_flags(func)
if flags and FunctionFlags.THUNK in flags:
target = _resolve_thunk_target(
db, int(func.start_ea), import_index, extern_addrs
)
if target is not None:
thunk_targets[int(func.start_ea)] = target
matches_by_function = _load_capa_matches(
args.capa,
thunk_targets,
)
t0 = time.time()
report = collect_report(
db,
md5=md5,
sha256=effective_sha256,
matches_by_function=matches_by_function,
assemblage_records_by_address=assemblage_records_by_address,
)
logger.debug("perf: collect_report: %0.2fs", time.time() - t0)
t0 = time.time()
if args.output == "html-map":
html = render_html_map(report)
if args.open:
report_path = open_html_report(html)
logger.info("opened html map: %s", report_path)
else:
sys.stdout.write(html)
else:
render_report(report, console)
logger.debug("perf: render_report: %0.2fs", time.time() - t0)
return 0
if __name__ == "__main__":
sys.exit(main())
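The core of `_load_capa_matches` is a two-step address resolution: a match reported on a basic block is first mapped to its containing function, then forwarded through the thunk table to the real target. A minimal sketch with synthetic inputs (the dict names mirror the locals above; the addresses are made up):

```python
from collections import defaultdict

def map_matches_to_functions(
    matches_by_address: dict[int, set[str]],
    functions_by_basic_block: dict[int, int],
    thunk_targets: dict[int, int],
) -> dict[int, set[str]]:
    """Roll up per-address capa matches onto their containing functions,
    forwarding matches reported on thunks to the thunk's resolved target."""
    matches_by_function: defaultdict[int, set[str]] = defaultdict(set)
    for address, matches in matches_by_address.items():
        # a match may land on a basic block; fall back to the address itself
        func_addr = functions_by_basic_block.get(address, address)
        # a match on a thunk belongs to the thunk's target function
        func_addr = thunk_targets.get(func_addr, func_addr)
        matches_by_function[func_addr].update(matches)
    return dict(matches_by_function)
```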

mapa/collector.py Normal file

@@ -0,0 +1,492 @@
from __future__ import annotations
import logging
from datetime import datetime, timezone
from ida_domain.database import Database
from ida_domain.flowchart import FlowChartFlags
from ida_domain.functions import FunctionFlags
from mapa.model import (
AssemblageRecord,
MapaCall,
MapaCaller,
MapaFunction,
MapaLibrary,
MapaMeta,
MapaProgramString,
MapaReport,
MapaSection,
MapaString,
)
from mapa.strings import (
MAX_STRING_READ,
extract_ascii_from_buf,
extract_utf16le_from_buf,
)
from mapa.string_tags.tagger import StringTagger, load_default_tagger
logger = logging.getLogger(__name__)
THUNK_CHAIN_DEPTH_DELTA = 5
def _get_permissions_string(perm: int) -> str:
return (
("r" if perm & 4 else "-")
+ ("w" if perm & 2 else "-")
+ ("x" if perm & 1 else "-")
)
def _collect_meta(db: Database, md5: str, sha256: str) -> MapaMeta:
name = db.path or ""
if not md5 and db.md5:
md5 = db.md5
if not sha256 and db.sha256:
sha256 = db.sha256
arch = db.architecture or ""
base_address = db.base_address or 0
timestamp = datetime.now(timezone.utc).isoformat()
return MapaMeta(
name=name,
sha256=sha256,
md5=md5,
arch=arch,
timestamp=timestamp,
base_address=base_address,
)
def _collect_sections(db: Database) -> list[MapaSection]:
sections: list[MapaSection] = []
for seg in db.segments.get_all():
sections.append(
MapaSection(
address=int(seg.start_ea),
size=int(seg.end_ea) - int(seg.start_ea),
perms=_get_permissions_string(int(seg.perm)),
name=db.segments.get_name(seg) or "",
)
)
return sections
def _normalize_module_name(name: str) -> str:
"""Normalize an import module name to include extension.
IDA strips .dll from PE import module names (e.g. 'KERNEL32' instead of
'KERNEL32.dll'). Add it back when the name has no extension.
"""
if "." not in name:
return f"{name}.dll".lower()
return name.lower()
def _collect_libraries(db: Database) -> list[MapaLibrary]:
libraries: list[MapaLibrary] = []
for module in db.imports.get_all_modules():
libraries.append(MapaLibrary(name=_normalize_module_name(module.name)))
return libraries
def _build_import_index(db: Database) -> dict[int, tuple[str, str]]:
"""Build address -> (module, function_name) for all imports."""
imports: dict[int, tuple[str, str]] = {}
for imp in db.imports.get_all_imports():
name = imp.name or f"ord{imp.ordinal}"
imports[int(imp.address)] = (_normalize_module_name(imp.module_name), name)
return imports
def _build_extern_index(db: Database) -> set[int]:
"""Collect addresses in XTRN segments."""
externs: set[int] = set()
for seg in db.segments.get_all():
seg_class = db.segments.get_class(seg)
if seg_class and seg_class.upper() == "XTRN":
for func in db.functions.get_between(int(seg.start_ea), int(seg.end_ea)):
externs.add(int(func.start_ea))
return externs
def _resolve_thunk_target(
db: Database,
ea: int,
import_index: dict[int, tuple[str, str]],
extern_addrs: set[int],
) -> int | None:
"""Follow thunk chains up to THUNK_CHAIN_DEPTH_DELTA hops.
Returns the final resolved address, or None if resolution fails.
"""
current = ea
for _ in range(THUNK_CHAIN_DEPTH_DELTA):
code_refs = list(db.xrefs.code_refs_from_ea(current, flow=False))
if len(code_refs) == 1:
target = int(code_refs[0])
if target in import_index or target in extern_addrs:
return target
target_func = db.functions.get_at(target)
if target_func:
flags = db.functions.get_flags(target_func)
if flags and FunctionFlags.THUNK in flags:
current = target
continue
return target
data_refs = list(db.xrefs.data_refs_from_ea(current))
if len(data_refs) == 1:
target = int(data_refs[0])
if target in import_index or target in extern_addrs:
return target
target_func = db.functions.get_at(target)
if target_func:
flags = db.functions.get_flags(target_func)
if flags and FunctionFlags.THUNK in flags:
current = target
continue
return target
break
return None
def _find_string_at(db: Database, ea: int) -> str | None:
"""Read bytes at the given address and check for ASCII or UTF-16 LE string."""
try:
buf = db.bytes.get_bytes_at(ea, MAX_STRING_READ)
except Exception:
return None
if not buf:
return None
result = extract_ascii_from_buf(buf)
if result is not None:
return result
return extract_utf16le_from_buf(buf)
def _find_data_reference_string(
db: Database, insn_ea: int, max_depth: int = 10
) -> tuple[int, str] | None:
"""Follow single data-reference chains from an instruction to find a string."""
current = insn_ea
for _ in range(max_depth):
try:
data_refs = list(db.xrefs.data_refs_from_ea(current))
except Exception:
break
if len(data_refs) != 1:
break
target = int(data_refs[0])
if not db.is_valid_ea(target):
break
result = _find_string_at(db, target)
if result is not None:
return target, result
current = target
return None
def _merge_string_metadata(
tags: tuple[str, ...],
tag_matches: tuple,
new_tags: tuple[str, ...],
new_tag_matches: tuple,
) -> tuple[tuple[str, ...], tuple]:
merged_tags = tuple(sorted(set(tags) | set(new_tags)))
seen_match_keys = {match.sort_key for match in tag_matches}
unique_new = tuple(
match for match in new_tag_matches if match.sort_key not in seen_match_keys
)
return merged_tags, tag_matches + unique_new
def collect_report(
db: Database,
md5: str = "",
sha256: str = "",
matches_by_function: dict[int, set[str]] | None = None,
assemblage_records_by_address: dict[int, list[AssemblageRecord]] | None = None,
tagger: StringTagger | None = None,
) -> MapaReport:
"""Collect a complete mapa report from an open IDA database."""
if matches_by_function is None:
matches_by_function = {}
if assemblage_records_by_address is None:
assemblage_records_by_address = {}
if tagger is None:
tagger = load_default_tagger()
meta = _collect_meta(db, md5, sha256)
sections = _collect_sections(db)
libraries = _collect_libraries(db)
import_index = _build_import_index(db)
extern_addrs = _build_extern_index(db)
all_functions: list[tuple[int, object, bool, bool]] = []
for func in db.functions:
ea = int(func.start_ea)
flags = db.functions.get_flags(func)
is_thunk = flags is not None and FunctionFlags.THUNK in flags
is_lib = flags is not None and FunctionFlags.LIB in flags
all_functions.append((ea, func, is_thunk, is_lib))
all_functions.sort(key=lambda x: x[0])
func_address_to_order: dict[int, int] = {}
for i, (ea, _, _, _) in enumerate(all_functions):
func_address_to_order[ea] = i
thunk_targets: dict[int, int] = {}
for ea, func, is_thunk, _ in all_functions:
if is_thunk:
target = _resolve_thunk_target(db, ea, import_index, extern_addrs)
if target is not None:
thunk_targets[ea] = target
resolved_callers: dict[int, set[int]] = {}
resolved_callees: dict[int, list[tuple[int, bool]]] = {}
for ea, func, is_thunk, is_lib in all_functions:
if is_thunk or ea in import_index or ea in extern_addrs:
continue
fc = db.functions.get_flowchart(
func, flags=FlowChartFlags.NOEXT | FlowChartFlags.PREDS
)
if fc is None:
continue
seen_callees: set[int] = set()
callees: list[tuple[int, bool]] = []
for block in fc:
insns = block.get_instructions()
if insns is None:
continue
for insn in insns:
if not db.instructions.is_call_instruction(insn):
# also check for jumps to imports (thunk pattern)
mnem = db.instructions.get_mnemonic(insn)
if mnem and mnem.lower().startswith("jmp"):
call_targets = list(
db.xrefs.code_refs_from_ea(int(insn.ea), flow=False)
)
else:
continue
else:
call_targets = list(db.xrefs.calls_from_ea(int(insn.ea)))
if not call_targets:
call_targets = list(
db.xrefs.code_refs_from_ea(int(insn.ea), flow=False)
)
for target_ea in call_targets:
target_ea = int(target_ea)
resolved_target = target_ea
if target_ea in thunk_targets:
resolved_target = thunk_targets[target_ea]
if resolved_target in seen_callees:
continue
seen_callees.add(resolved_target)
is_api = (
resolved_target in import_index
or resolved_target in extern_addrs
)
callees.append((resolved_target, is_api))
if resolved_target not in resolved_callers:
resolved_callers[resolved_target] = set()
resolved_callers[resolved_target].add(ea)
resolved_callees[ea] = callees
mapa_functions: list[MapaFunction] = []
program_strings_by_address: dict[int, MapaProgramString] = {}
for ea, func, is_thunk, is_lib in all_functions:
if ea in import_index or ea in extern_addrs:
continue
name = db.functions.get_name(func) or f"sub_{ea:x}"
order = func_address_to_order[ea]
mf = MapaFunction(
address=ea,
name=name,
is_thunk=is_thunk,
is_library=is_lib,
assemblage_records=list(assemblage_records_by_address.get(ea, [])),
)
if is_thunk:
mapa_functions.append(mf)
continue
fc = db.functions.get_flowchart(
func, flags=FlowChartFlags.NOEXT | FlowChartFlags.PREDS
)
if fc is not None:
num_blocks = 0
num_edges = 0
num_insns = 0
total_bytes = 0
for block in fc:
num_blocks += 1
num_edges += block.count_successors()
insns = block.get_instructions()
if insns is None:
continue
for insn in insns:
num_insns += 1
insn_size = db.heads.size(int(insn.ea))
total_bytes += insn_size
mf.num_basic_blocks = num_blocks
mf.num_edges = num_edges
mf.num_instructions = num_insns
mf.total_instruction_bytes = total_bytes
for caller_ea in sorted(resolved_callers.get(ea, set())):
if caller_ea not in func_address_to_order:
continue
caller_order = func_address_to_order[caller_ea]
delta = caller_order - order
direction = "↑" if delta < 0 else "↓"
caller_func = db.functions.get_at(caller_ea)
caller_name = (
db.functions.get_name(caller_func)
if caller_func
else f"sub_{caller_ea:x}"
)
mf.callers.append(
MapaCaller(
name=caller_name or f"sub_{caller_ea:x}",
address=caller_ea,
delta=delta,
direction=direction,
)
)
for target_ea, is_api in resolved_callees.get(ea, []):
if is_api:
if target_ea in import_index:
module_name, func_name = import_index[target_ea]
api_name = f"{module_name}!{func_name}"
else:
target_func = db.functions.get_at(target_ea)
api_name = (
db.functions.get_name(target_func)
if target_func
else f"sub_{target_ea:x}"
)
api_name = api_name or f"sub_{target_ea:x}"
mf.apis.append(
MapaCall(
name=api_name,
address=target_ea,
is_api=True,
)
)
else:
if target_ea not in func_address_to_order:
continue
target_order = func_address_to_order[target_ea]
delta = target_order - order
direction = "↑" if delta < 0 else "↓"
target_func = db.functions.get_at(target_ea)
target_name = (
db.functions.get_name(target_func)
if target_func
else f"sub_{target_ea:x}"
)
mf.calls.append(
MapaCall(
name=target_name or f"sub_{target_ea:x}",
address=target_ea,
is_api=False,
delta=delta,
direction=direction,
)
)
if fc is not None:
seen_strings: dict[str, MapaString] = {}
# re-fetch the flowchart for a second pass over instructions;
# the iterator above was already walked for the block stats
fc2 = db.functions.get_flowchart(
func, flags=FlowChartFlags.NOEXT | FlowChartFlags.PREDS
)
if fc2 is not None:
for block in fc2:
insns = block.get_instructions()
if insns is None:
continue
for insn in insns:
string_result = _find_data_reference_string(db, int(insn.ea))
if string_result is None:
continue
string_ea, raw = string_result
tag_result = tagger.tag_string(raw)
display = raw.rstrip()
if not display:
continue
if display in seen_strings:
existing = seen_strings[display]
existing.tags, existing.tag_matches = _merge_string_metadata(
existing.tags,
existing.tag_matches,
tag_result.tags,
tag_result.matches,
)
existing.address = min(existing.address, string_ea)
else:
ms = MapaString(
value=display,
address=string_ea,
tags=tuple(sorted(set(tag_result.tags))),
tag_matches=tag_result.matches,
)
seen_strings[display] = ms
mf.strings.append(ms)
if string_ea in program_strings_by_address:
existing_program_string = program_strings_by_address[string_ea]
existing_program_string.tags, existing_program_string.tag_matches = _merge_string_metadata(
existing_program_string.tags,
existing_program_string.tag_matches,
tag_result.tags,
tag_result.matches,
)
existing_program_string.function_addresses = tuple(
sorted(
set(existing_program_string.function_addresses)
| {ea}
)
)
else:
program_strings_by_address[string_ea] = MapaProgramString(
value=display,
address=string_ea,
tags=tuple(sorted(set(tag_result.tags))),
tag_matches=tag_result.matches,
function_addresses=(ea,),
)
mf.capa_matches = sorted(matches_by_function.get(ea, set()))
mapa_functions.append(mf)
return MapaReport(
meta=meta,
sections=sections,
libraries=libraries,
functions=mapa_functions,
program_strings=sorted(
program_strings_by_address.values(),
key=lambda string: string.address,
),
)

mapa/html_renderer.py Normal file

@@ -0,0 +1,553 @@
from __future__ import annotations
import json
from html import escape
from mapa.model import MapaProgramString, MapaReport
from mapa.renderer import _visible_tags, render_function_summary_text
def _to_json(value: object) -> str:
return json.dumps(value, separators=(",", ":")).replace("</", "<\\/")
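`_to_json` escapes `</` so that a string value containing `</script>` cannot terminate the inline `<script>` element early, while the output remains valid JSON (`\/` is a legal JSON escape for `/`). A quick check of the behavior, using a copy of the same one-liner:

```python
import json

def to_json_for_script(value: object) -> str:
    # compact JSON, with "</" escaped so an embedded "</script>" cannot
    # close the surrounding <script> element when inlined into HTML
    return json.dumps(value, separators=(",", ":")).replace("</", "<\\/")
```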
def _collect_tag_entries(report: MapaReport) -> list[tuple[str, list[int]]]:
function_index_by_address = {
function.address: index for index, function in enumerate(report.functions)
}
tag_to_functions: dict[str, set[int]] = {}
for program_string in report.program_strings:
visible_tags = _visible_tags(program_string.tags)
if not visible_tags:
continue
function_indices = {
function_index_by_address[address]
for address in program_string.function_addresses
if address in function_index_by_address
}
if not function_indices:
continue
for tag in visible_tags:
tag_to_functions.setdefault(tag, set()).update(function_indices)
return sorted(
(
(tag, sorted(function_indices))
for tag, function_indices in tag_to_functions.items()
),
key=lambda item: (-len(item[1]), item[0]),
)
def _collect_function_graph(
report: MapaReport, function_index_by_address: dict[int, int]
) -> tuple[list[list[int]], list[list[int]]]:
callers_by_index: list[list[int]] = []
callees_by_index: list[list[int]] = []
for function in report.functions:
caller_indices = sorted(
{
function_index_by_address[caller.address]
for caller in function.callers
if caller.address in function_index_by_address
}
)
callee_indices = sorted(
{
function_index_by_address[call.address]
for call in function.calls
if call.address in function_index_by_address
}
)
callers_by_index.append(caller_indices)
callees_by_index.append(callee_indices)
return callers_by_index, callees_by_index
def _render_string_row(program_string: MapaProgramString, index: int) -> str:
visible_tags = _visible_tags(program_string.tags)
tag_text = " ".join(visible_tags)
tag_span = ""
if tag_text:
tag_span = f'<span class="string-tags">{escape(tag_text)}</span>'
return (
f'<button type="button" class="string-row" data-string-index="{index}" '
f'data-string-address="{escape(hex(program_string.address), quote=True)}" '
f'data-string-value="{escape(program_string.value, quote=True)}" '
f'data-string-tags="{escape(tag_text, quote=True)}">'
f'<span class="string-address">{escape(hex(program_string.address))}</span>'
f'<span class="string-value">{escape(program_string.value)}</span>'
f"{tag_span}"
"</button>"
)
def render_html_map(report: MapaReport) -> str:
tag_entries = _collect_tag_entries(report)
function_index_by_address = {
function.address: index for index, function in enumerate(report.functions)
}
callers_by_index, callees_by_index = _collect_function_graph(
report, function_index_by_address
)
program_strings = sorted(report.program_strings, key=lambda string: string.address)
data = {
"functions": [
{
"name": function.name,
"address": hex(function.address),
"label": f"{function.name} @ {hex(function.address)}",
"summary": render_function_summary_text(function),
}
for function in report.functions
],
"callersByIndex": callers_by_index,
"calleesByIndex": callees_by_index,
"tags": {tag: function_indices for tag, function_indices in tag_entries},
"strings": [
{
"address": hex(program_string.address),
"value": program_string.value,
"functionIndices": [
function_index_by_address[address]
for address in program_string.function_addresses
if address in function_index_by_address
],
}
for program_string in program_strings
],
}
style = """
*,*::before,*::after{box-sizing:border-box;margin:0;padding:0}
html,body{height:100%}
:root{--bg:#fff;--fg:#111;--muted:#666;--line:#cfcfcf;--fill:#d9d9d9;--heat:#2563eb;--seed:#f59e0b;--square:10px}
body{height:100vh;overflow:hidden;background:var(--bg);color:var(--fg);font:13px/1.4 ui-monospace,SFMono-Regular,Menlo,Consolas,monospace;padding:16px}
body.is-resizing{cursor:col-resize}
body.is-resizing *{user-select:none}
main{height:100%;min-height:0;display:flex;flex-direction:column;gap:16px}
section{display:flex;flex-direction:column;gap:8px}
.meta,.legend,.pane-header{color:var(--muted)}
.controls{display:flex;flex-wrap:wrap;gap:6px}
.control-row{display:flex;flex-wrap:wrap;align-items:center;gap:8px}
.control-label{color:var(--muted)}
.control{border:1px solid var(--line);background:transparent;color:inherit;padding:2px 6px;font:inherit;cursor:pointer}
.control.is-active{border-color:var(--heat);color:var(--heat)}
.control-count{color:var(--muted)}
.split-view{flex:1;min-height:0;display:flex;align-items:stretch}
.pane{min-height:0;display:flex;flex-direction:column;overflow:auto;border:1px solid var(--line);background:var(--bg)}
.pane-functions{flex:0 0 50%;min-width:18rem}
.pane-strings{flex:1 1 auto;min-width:18rem}
.pane-header{position:sticky;top:0;z-index:1;background:var(--bg);border-bottom:1px solid var(--line);padding:8px 10px}
.pane-body{display:flex;flex-direction:column;gap:8px;padding:8px 10px}
.splitter{position:relative;flex:0 0 12px;cursor:col-resize;touch-action:none}
.splitter::before{content:'';position:absolute;top:0;bottom:0;left:50%;width:1px;background:var(--line);transform:translateX(-50%)}
.splitter::after{content:'';position:absolute;top:50%;left:50%;width:3px;height:40px;border-left:1px solid var(--line);border-right:1px solid var(--line);transform:translate(-50%,-50%)}
.function-grid{display:flex;flex-wrap:wrap;gap:1px;align-content:flex-start}
.function-box{position:relative;z-index:0;width:var(--square);height:var(--square);border:1px solid var(--line);background:var(--fill);overflow:visible}
.function-box::before{content:'';position:absolute;inset:0;background:var(--heat);opacity:var(--heat-opacity,0);pointer-events:none}
.function-box::after{content:'';position:absolute;inset:-2px;border:1px solid var(--seed);opacity:0;pointer-events:none}
.function-box.is-seed::after{opacity:1}
.function-box.is-dim{opacity:.28}
.string-list{display:flex;flex-direction:column;gap:2px}
.string-row{display:flex;align-items:flex-start;gap:8px;width:100%;border:1px solid transparent;background:transparent;color:inherit;padding:3px 4px;font:inherit;text-align:left;cursor:pointer}
.string-row:hover,.string-row.is-active{border-color:var(--line)}
.string-address{color:var(--muted);white-space:nowrap;flex:0 0 auto}
.string-value{min-width:0;flex:1 1 auto;white-space:pre-wrap;word-break:break-word}
.string-tags{margin-left:auto;flex:0 0 auto;padding-left:8px;color:var(--muted);white-space:nowrap}
.tooltip{position:fixed;z-index:10;display:none;width:min(42rem,calc(100vw - 24px));max-height:calc(100vh - 24px);overflow:auto;border:1px solid var(--line);background:#fff;padding:8px;pointer-events:none;white-space:pre-wrap;box-shadow:0 2px 8px rgba(0,0,0,.08)}
.tooltip.is-visible{display:block}
h1,h2{font-size:inherit}
""".strip()
script = """
const data=JSON.parse(document.getElementById('mapa-data').textContent);
const splitView=document.getElementById('split-view');
const functionsPane=document.getElementById('functions-pane');
const splitter=document.getElementById('splitter');
const functionBoxes=[...document.querySelectorAll('.function-box')];
const tagControls=[...document.querySelectorAll('.tag-control')];
const stringRows=[...document.querySelectorAll('.string-row')];
const directionControls=[...document.querySelectorAll('.direction-control')];
const depthControls=[...document.querySelectorAll('.depth-control')];
const neighborhoodStatus=document.getElementById('neighborhood-status');
const tooltip=document.getElementById('tooltip');
const bothByIndex=data.callersByIndex.map((callers,index)=>[...new Set([...callers,...data.calleesByIndex[index]])]);
const functionCount=data.functions.length;
let hoveredFunction=null;
let lockedFunction=null;
let hoveredTag=null;
let lockedTag=null;
let hoveredString=null;
let lockedString=null;
let directionMode='both';
let maxDepth=3;
let activePointerId=null;
let tooltipFunctionIndex=null;
let currentNeighborhood=null;
const getDecayScore=(distance)=>0.5**distance;
const hasLockedSeed=()=>lockedFunction!==null||lockedTag!==null||lockedString!==null;
const getUniqueSeedIndices=(seedIndices)=>[...new Set(seedIndices)];
const clearHoveredSeeds=()=>{
hoveredFunction=null;
hoveredTag=null;
hoveredString=null;
};
const clearLockedSeeds=()=>{
lockedFunction=null;
lockedTag=null;
lockedString=null;
};
const buildSeedSource=(kind,key,label,seedIndices)=>{
const uniqueSeedIndices=getUniqueSeedIndices(seedIndices);
if(uniqueSeedIndices.length===0){return null;}
return {kind,key,label,seedIndices:uniqueSeedIndices};
};
const getFunctionSeedSource=(index)=>buildSeedSource('function',String(index),data.functions[index].label,[index]);
const getTagSeedSource=(tag)=>buildSeedSource('tag',tag,tag,data.tags[tag]||[]);
const getStringSeedSource=(index)=>{
const stringData=data.strings[index];
return buildSeedSource('string',String(index),`${stringData.value} @ ${stringData.address}`,stringData.functionIndices);
};
const getLockedSeedSource=()=>{
if(lockedFunction!==null){return getFunctionSeedSource(lockedFunction);}
if(lockedTag!==null){return getTagSeedSource(lockedTag);}
if(lockedString!==null){return getStringSeedSource(lockedString);}
return null;
};
const getHoveredSeedSource=()=>{
if(hoveredFunction!==null){return getFunctionSeedSource(hoveredFunction);}
if(hoveredTag!==null){return getTagSeedSource(hoveredTag);}
if(hoveredString!==null){return getStringSeedSource(hoveredString);}
return null;
};
const getActiveSeedSource=()=>getLockedSeedSource()??getHoveredSeedSource();
const getAdjacency=()=>{
if(directionMode==='callers'){return data.callersByIndex;}
if(directionMode==='callees'){return data.calleesByIndex;}
return bothByIndex;
};
const computeNeighborhoodState=(seedIndices)=>{
const adjacency=getAdjacency();
const uniqueSeedIndices=getUniqueSeedIndices(seedIndices);
const scores=new Float32Array(functionCount);
const bestDistances=new Array(functionCount).fill(null);
for(const seedIndex of uniqueSeedIndices){
const distances=new Array(functionCount).fill(-1);
const queue=[seedIndex];
distances[seedIndex]=0;
for(let queueIndex=0;queueIndex<queue.length;queueIndex++){
const functionIndex=queue[queueIndex];
const distance=distances[functionIndex];
if(distance===maxDepth){continue;}
for(const neighborIndex of adjacency[functionIndex]){
if(distances[neighborIndex]!==-1){continue;}
distances[neighborIndex]=distance+1;
queue.push(neighborIndex);
}
}
distances.forEach((distance,functionIndex)=>{
if(distance===-1){return;}
scores[functionIndex]+=getDecayScore(distance);
const bestDistance=bestDistances[functionIndex];
if(bestDistance===null||distance<bestDistance){bestDistances[functionIndex]=distance;}
});
}
let maxScore=0;
scores.forEach((score)=>{
if(score>maxScore){maxScore=score;}
});
return {
scores:Array.from(scores),
bestDistances,
seedIndices:uniqueSeedIndices,
seedSet:new Set(uniqueSeedIndices),
maxScore,
};
};
const renderTooltipText=(index)=>{
const summary=data.functions[index].summary;
if(currentNeighborhood===null){return summary;}
const lines=[`heat: ${currentNeighborhood.scores[index].toFixed(2)}`,`seed: ${currentNeighborhood.seedSet.has(index)?'yes':'no'}`];
const distance=currentNeighborhood.bestDistances[index];
if(distance!==null){lines.push(`distance: ${distance}`);}
return `${lines.join('\\n')}\\n\\n${summary}`;
};
const updateTooltip=()=>{
if(tooltipFunctionIndex===null){return;}
tooltip.textContent=renderTooltipText(tooltipFunctionIndex);
};
const updateStatus=()=>{
const activeSeedSource=getActiveSeedSource();
if(activeSeedSource===null){
neighborhoodStatus.textContent='hover or click a function, tag, or string';
return;
}
const seedCount=currentNeighborhood===null?0:currentNeighborhood.seedIndices.length;
neighborhoodStatus.textContent=`${activeSeedSource.kind} ${activeSeedSource.label} · direction ${directionMode} · depth ${maxDepth} · ${seedCount} seed${seedCount===1?'':'s'}`;
};
const updateView=()=>{
const activeSeedSource=getActiveSeedSource();
currentNeighborhood=activeSeedSource===null?null:computeNeighborhoodState(activeSeedSource.seedIndices);
const hasActive=currentNeighborhood!==null;
functionBoxes.forEach((box,index)=>{
const score=currentNeighborhood===null?0:currentNeighborhood.scores[index];
const heatOpacity=currentNeighborhood===null||currentNeighborhood.maxScore===0?0:score/currentNeighborhood.maxScore;
box.style.setProperty('--heat-opacity',heatOpacity.toFixed(3));
box.classList.toggle('is-seed',currentNeighborhood!==null&&currentNeighborhood.seedSet.has(index));
box.classList.toggle('is-dim',hasActive&&score===0);
});
tagControls.forEach((control)=>{
control.classList.toggle('is-active',activeSeedSource!==null&&activeSeedSource.kind==='tag'&&control.dataset.tag===activeSeedSource.key);
});
stringRows.forEach((row)=>{
row.classList.toggle('is-active',activeSeedSource!==null&&activeSeedSource.kind==='string'&&row.dataset.stringIndex===activeSeedSource.key);
});
directionControls.forEach((control)=>{
control.classList.toggle('is-active',control.dataset.direction===directionMode);
});
depthControls.forEach((control)=>{
control.classList.toggle('is-active',Number(control.dataset.depth)===maxDepth);
});
updateStatus();
updateTooltip();
};
const placeTooltip=(event)=>{
const offset=12;
let left=event.clientX+offset;
let top=event.clientY+offset;
const rect=tooltip.getBoundingClientRect();
if(left+rect.width>window.innerWidth-8){left=Math.max(8,window.innerWidth-rect.width-8);}
if(top+rect.height>window.innerHeight-8){top=Math.max(8,window.innerHeight-rect.height-8);}
tooltip.style.left=`${left}px`;
tooltip.style.top=`${top}px`;
};
const getPaneMinWidth=()=>parseFloat(getComputedStyle(document.documentElement).fontSize)*18;
const resizePanes=(clientX)=>{
const rect=splitView.getBoundingClientRect();
const splitterWidth=splitter.getBoundingClientRect().width;
const paneMinWidth=getPaneMinWidth();
const minLeft=rect.left+paneMinWidth;
const maxLeft=rect.right-paneMinWidth-splitterWidth;
if(maxLeft<=minLeft){functionsPane.style.flexBasis='50%';return;}
const clampedLeft=Math.min(maxLeft,Math.max(minLeft,clientX));
functionsPane.style.flexBasis=`${clampedLeft-rect.left}px`;
};
const clampPaneSize=()=>{
const basis=parseFloat(functionsPane.style.flexBasis);
if(Number.isFinite(basis)){resizePanes(splitView.getBoundingClientRect().left+basis);}
};
const stopResizing=(event)=>{
if(activePointerId===null||event.pointerId!==activePointerId){return;}
if(splitter.hasPointerCapture(event.pointerId)){splitter.releasePointerCapture(event.pointerId);}
activePointerId=null;
document.body.classList.remove('is-resizing');
};
functionBoxes.forEach((box,index)=>{
box.addEventListener('mouseenter',(event)=>{
tooltipFunctionIndex=index;
tooltip.classList.add('is-visible');
if(!hasLockedSeed()){hoveredFunction=index;}
updateView();
placeTooltip(event);
});
box.addEventListener('mousemove',placeTooltip);
box.addEventListener('mouseleave',()=>{
tooltip.classList.remove('is-visible');
tooltipFunctionIndex=null;
if(!hasLockedSeed()&&hoveredFunction===index){
hoveredFunction=null;
updateView();
}
});
box.addEventListener('click',()=>{
const functionIndex=Number(box.dataset.functionIndex);
if(lockedFunction===functionIndex){
lockedFunction=null;
}else{
clearLockedSeeds();
lockedFunction=functionIndex;
}
clearHoveredSeeds();
updateView();
});
});
tagControls.forEach((control)=>{
control.addEventListener('mouseenter',()=>{
if(hasLockedSeed()){return;}
hoveredTag=control.dataset.tag;
updateView();
});
control.addEventListener('mouseleave',()=>{
if(hasLockedSeed()||hoveredTag!==control.dataset.tag){return;}
hoveredTag=null;
updateView();
});
control.addEventListener('click',()=>{
const tag=control.dataset.tag;
if(lockedTag===tag){
lockedTag=null;
}else{
clearLockedSeeds();
lockedTag=tag;
}
clearHoveredSeeds();
updateView();
});
});
stringRows.forEach((row)=>{
row.addEventListener('mouseenter',()=>{
if(hasLockedSeed()){return;}
hoveredString=Number(row.dataset.stringIndex);
updateView();
});
row.addEventListener('mouseleave',()=>{
if(hasLockedSeed()||hoveredString!==Number(row.dataset.stringIndex)){return;}
hoveredString=null;
updateView();
});
row.addEventListener('click',()=>{
const stringIndex=Number(row.dataset.stringIndex);
if(lockedString===stringIndex){
lockedString=null;
}else{
clearLockedSeeds();
lockedString=stringIndex;
}
clearHoveredSeeds();
updateView();
});
});
directionControls.forEach((control)=>{
control.addEventListener('click',()=>{
directionMode=control.dataset.direction;
updateView();
});
});
depthControls.forEach((control)=>{
control.addEventListener('click',()=>{
maxDepth=Number(control.dataset.depth);
updateView();
});
});
splitter.addEventListener('pointerdown',(event)=>{
activePointerId=event.pointerId;
splitter.setPointerCapture(event.pointerId);
document.body.classList.add('is-resizing');
resizePanes(event.clientX);
event.preventDefault();
});
splitter.addEventListener('pointermove',(event)=>{if(activePointerId===event.pointerId){resizePanes(event.clientX);}});
splitter.addEventListener('pointerup',stopResizing);
splitter.addEventListener('pointercancel',stopResizing);
window.addEventListener('resize',clampPaneSize);
updateView();
""".strip()
parts: list[str] = [
"<!doctype html>",
'<html lang="en">',
"<head>",
'<meta charset="utf-8">',
f"<title>{escape(report.meta.name)} - mapa html map</title>",
"<style>",
style,
"</style>",
"</head>",
"<body>",
"<main>",
"<section>",
f"<div>{escape(report.meta.name)}</div>",
(
f'<div class="meta">sha256 {escape(report.meta.sha256)} · arch {escape(report.meta.arch)}'
f" · functions {len(report.functions)} · strings {len(program_strings)} · tags {len(tag_entries)}</div>"
),
"<h1>tags</h1>",
'<div class="controls" id="tag-controls">',
]
for tag, function_indices in tag_entries:
parts.append(
(
f'<button type="button" class="control tag-control" data-tag="{escape(tag, quote=True)}" '
f'data-count="{len(function_indices)}">{escape(tag)} '
f'<span class="control-count">({len(function_indices)})</span></button>'
)
)
parts.extend(
[
"</div>",
'<div class="control-row">',
'<span class="control-label">direction</span>',
'<div class="controls" id="direction-controls">',
'<button type="button" class="control direction-control" data-direction="callers">callers</button>',
'<button type="button" class="control direction-control" data-direction="callees">callees</button>',
'<button type="button" class="control direction-control is-active" data-direction="both">both</button>',
"</div>",
'<span class="control-label">depth</span>',
'<div class="controls" id="depth-controls">',
'<button type="button" class="control depth-control" data-depth="1">1</button>',
'<button type="button" class="control depth-control" data-depth="2">2</button>',
'<button type="button" class="control depth-control is-active" data-depth="3">3</button>',
'<button type="button" class="control depth-control" data-depth="4">4</button>',
"</div>",
"</div>",
'<div class="meta" id="neighborhood-status"></div>',
'<div class="legend">fill = neighborhood heat · outline = seed · dim = outside neighborhood</div>',
"</section>",
'<div class="split-view" id="split-view">',
'<section class="pane pane-functions" id="functions-pane">',
f'<div class="pane-header">functions ({len(report.functions)})</div>',
'<div class="pane-body">',
'<div class="function-grid" id="function-grid">',
]
)
for index, function in enumerate(report.functions):
parts.append(
(
f'<div class="function-box" data-function-index="{index}" '
f'data-function-address="{escape(hex(function.address), quote=True)}" '
f'aria-label="{escape(function.name, quote=True)}"></div>'
)
)
parts.extend(
[
"</div>",
"</div>",
"</section>",
'<div class="splitter" id="splitter" role="separator" aria-orientation="vertical" aria-label="resize panes"></div>',
'<section class="pane pane-strings" id="strings-pane">',
f'<div class="pane-header">strings ({len(program_strings)})</div>',
'<div class="pane-body">',
'<div class="string-list" id="string-list">',
]
)
for index, program_string in enumerate(program_strings):
parts.append(_render_string_row(program_string, index))
parts.extend(
[
"</div>",
"</div>",
"</section>",
"</div>",
"</main>",
'<div class="tooltip" id="tooltip"></div>',
f'<script type="application/json" id="mapa-data">{_to_json(data)}</script>',
"<script>",
script,
"</script>",
"</body>",
"</html>",
]
)
return "\n".join(parts)

mapa/ida_db.py Normal file

@@ -0,0 +1,165 @@
from __future__ import annotations
import contextlib
import fcntl
import hashlib
import logging
import os
import time
from pathlib import Path
from typing import Iterator
import idapro # must be first: mutates sys.path so ida_auto and ida_domain are importable
import ida_auto
from ida_domain.database import Database, IdaCommandOptions
logger = logging.getLogger(__name__)
DATABASE_ACCESS_TIMEOUT = 5.0
DATABASE_ANALYSIS_TIMEOUT = 120.0
DATABASE_POLL_INTERVAL = 0.25
def get_cache_dir() -> Path:
xdg = os.environ.get("XDG_CACHE_HOME")
if xdg:
base = Path(xdg)
else:
base = Path.home() / ".cache"
return base / "mandiant" / "mapa"
def compute_file_hashes(file_path: Path) -> tuple[str, str]:
"""Compute (md5, sha256) for a file.
Raises:
OSError: If the file cannot be read.
"""
md5_digest = hashlib.md5()
sha256_digest = hashlib.sha256()
with file_path.open("rb") as fh:
for chunk in iter(lambda: fh.read(65536), b""):
md5_digest.update(chunk)
sha256_digest.update(chunk)
return md5_digest.hexdigest(), sha256_digest.hexdigest()
def _wait_for_repack(db_path: Path, timeout: float) -> None:
nam_path = db_path.with_suffix(".nam")
deadline = time.monotonic() + timeout
while nam_path.exists():
if time.monotonic() >= deadline:
raise RuntimeError(
f"Database {db_path} appears to be open in another program "
f"({nam_path} still exists after {timeout:.0f}s)."
)
time.sleep(DATABASE_POLL_INTERVAL)
@contextlib.contextmanager
def database_access_guard(db_path: Path, timeout: float) -> Iterator[None]:
"""Advisory guard that serialises access to an IDA database.
Uses .nam polling + flock on <db>.lock with TOCTOU re-check.
Raises:
RuntimeError: On timeout waiting for the database.
"""
_wait_for_repack(db_path, timeout)
lock_path = Path(str(db_path) + ".lock")
lock_fd = lock_path.open("w")
deadline = time.monotonic() + timeout
try:
while True:
try:
fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
break
except OSError:
if time.monotonic() >= deadline:
raise RuntimeError(
f"Timed out waiting for lock on {db_path} after {timeout:.0f}s."
)
time.sleep(DATABASE_POLL_INTERVAL)
_wait_for_repack(db_path, max(0, deadline - time.monotonic()))
yield
finally:
fcntl.flock(lock_fd, fcntl.LOCK_UN)
lock_fd.close()
def resolve_database(file_path: Path) -> tuple[Path, str, str]:
"""Resolve an input path to an .i64/.idb database path.
Returns (db_path, md5, sha256). For existing databases, hashes are empty
strings (they'll be read from IDA metadata instead).
Raises:
RuntimeError: If analysis or caching fails.
"""
suffix = file_path.suffix.lower()
if suffix in {".i64", ".idb"}:
logger.debug("Using existing database: %s", file_path)
return file_path, "", ""
cache_dir = get_cache_dir()
cache_dir.mkdir(parents=True, exist_ok=True)
md5, sha256 = compute_file_hashes(file_path)
cache_path = cache_dir / f"{sha256}.i64"
if cache_path.exists():
logger.debug("Cache hit for %s -> %s", file_path, cache_path)
return cache_path, md5, sha256
logger.debug("Cache miss for %s; analyzing to %s", file_path, cache_path)
with database_access_guard(cache_path, timeout=DATABASE_ANALYSIS_TIMEOUT):
if cache_path.exists():
logger.debug("Cache populated while waiting for lock: %s", cache_path)
return cache_path, md5, sha256
logger.info("Analyzing %s (this may take a moment)...", file_path.name)
idapro.enable_console_messages(False)
ida_options = IdaCommandOptions(
auto_analysis=True,
new_database=True,
output_database=str(cache_path),
load_resources=True,
plugin_options="lumina:host=0.0.0.0 -Osecondary_lumina:host=0.0.0.0",
)
try:
with Database.open(str(file_path), ida_options, save_on_close=True):
ida_auto.auto_wait()
except Exception as exc:
raise RuntimeError(f"Analysis failed for {file_path}: {exc}") from exc
if not cache_path.exists():
raise RuntimeError(f"Analysis produced no database for {file_path}")
logger.debug("Analysis completed: %s", cache_path)
return cache_path, md5, sha256
@contextlib.contextmanager
def open_database_session(db_path: Path, auto_analysis: bool = False) -> Iterator[Database]:
"""Open a database session with advisory locking.
Raises:
RuntimeError: If opening fails or the database is locked.
"""
with database_access_guard(db_path, timeout=DATABASE_ACCESS_TIMEOUT):
ida_options = IdaCommandOptions(auto_analysis=auto_analysis, new_database=False)
logger.debug("Opening database session: %s (auto_analysis=%s)", db_path, auto_analysis)
idapro.enable_console_messages(False)
try:
database = Database.open(str(db_path), ida_options, save_on_close=False)
except Exception as exc:
raise RuntimeError(f"Failed to open {db_path}: {exc}") from exc
with database:
if auto_analysis:
ida_auto.auto_wait()
yield database
logger.debug("Closed database session: %s", db_path)
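The caching in `resolve_database` above keys analysis artifacts by the input file's SHA-256. A minimal stdlib-only sketch of that keying, with `cache_path_for` as a hypothetical helper name (the real module combines this with advisory locking and IDA analysis):

```python
import hashlib
from pathlib import Path


def cache_path_for(file_path: Path, cache_dir: Path) -> Path:
    """Derive the cache location for a file: <cache_dir>/<sha256>.i64."""
    sha256 = hashlib.sha256()
    with file_path.open("rb") as fh:
        # stream in chunks so large binaries don't load fully into memory
        for chunk in iter(lambda: fh.read(65536), b""):
            sha256.update(chunk)
    return cache_dir / f"{sha256.hexdigest()}.i64"
```

Because the key is content-derived, renaming or moving the input binary still hits the same cached database.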

mapa/model.py Normal file

@@ -0,0 +1,126 @@
from __future__ import annotations
from collections.abc import Mapping
from dataclasses import dataclass, field
@dataclass(frozen=True)
class AssemblageRecord:
sha256: str
name: str
start_rva: int
end_rva: int
address: int
end_address: int
source_file: str
@property
def source_path(self) -> str:
if not self.source_file.endswith(")"):
return self.source_file
head, separator, _ = self.source_file.rpartition(" (")
if separator:
return head
return self.source_file
@classmethod
def from_csv_row(
cls, row: Mapping[str, str], base_address: int
) -> AssemblageRecord:
start_rva = int(row["start"], 0)
end_rva = int(row["end"], 0)
return cls(
sha256=row["hash"].strip().lower(),
name=row["name"].strip(),
start_rva=start_rva,
end_rva=end_rva,
address=base_address + start_rva,
end_address=base_address + end_rva,
source_file=row["source_file"].strip(),
)
@dataclass
class MapaString:
value: str
address: int
tags: tuple[str, ...] = ()
tag_matches: tuple = ()
@dataclass
class MapaProgramString:
value: str
address: int
tags: tuple[str, ...] = ()
tag_matches: tuple = ()
function_addresses: tuple[int, ...] = ()
@dataclass
class MapaCall:
name: str
address: int
is_api: bool
delta: int = 0
direction: str = ""
@dataclass
class MapaCaller:
name: str
address: int
delta: int = 0
direction: str = ""
@dataclass
class MapaFunction:
address: int
name: str
is_thunk: bool = False
is_library: bool = False
num_basic_blocks: int = 0
num_edges: int = 0
num_instructions: int = 0
total_instruction_bytes: int = 0
callers: list[MapaCaller] = field(default_factory=list)
calls: list[MapaCall] = field(default_factory=list)
apis: list[MapaCall] = field(default_factory=list)
strings: list[MapaString] = field(default_factory=list)
capa_matches: list[str] = field(default_factory=list)
assemblage_records: list[AssemblageRecord] = field(default_factory=list)
@dataclass
class MapaSection:
address: int
size: int
perms: str
name: str = ""
@dataclass
class MapaLibrary:
name: str
is_static: bool = False
load_address: int | None = None
@dataclass
class MapaMeta:
name: str
sha256: str
md5: str = ""
arch: str = ""
timestamp: str = ""
base_address: int = 0
@dataclass
class MapaReport:
meta: MapaMeta
sections: list[MapaSection] = field(default_factory=list)
libraries: list[MapaLibrary] = field(default_factory=list)
functions: list[MapaFunction] = field(default_factory=list)
program_strings: list[MapaProgramString] = field(default_factory=list)

mapa/renderer.py Normal file

@@ -0,0 +1,273 @@
from __future__ import annotations
import contextlib
import rich.padding
from rich.console import Console
from rich.markup import escape
from rich.text import Text
from mapa.model import MapaFunction, MapaReport, MapaString
class Renderer:
def __init__(self, console: Console):
self.console: Console = console
self.indent: int = 0
@contextlib.contextmanager
def indenting(self):
self.indent += 1
try:
yield
finally:
self.indent -= 1
@staticmethod
def markup(s: str, **kwargs) -> Text:
escaped_args = {
k: (escape(v) if isinstance(v, str) else v) for k, v in kwargs.items()
}
return Text.from_markup(s.format(**escaped_args))
def print(self, renderable, **kwargs):
if not kwargs:
return self.console.print(
rich.padding.Padding(renderable, (0, 0, 0, self.indent * 2))
)
assert isinstance(renderable, str)
return self.print(self.markup(renderable, **kwargs))
def writeln(self, s: str):
self.print(s)
@contextlib.contextmanager
def section(self, name):
if isinstance(name, str):
self.print("[title]{name}", name=name)
elif isinstance(name, Text):
name = name.copy()
name.stylize_before(self.console.get_style("title"))
self.print(name)
else:
raise ValueError("unexpected section name")
with self.indenting():
yield
def _visible_tags(tags: tuple[str, ...]) -> list[str]:
tag_set = set(tags)
has_specific = any(t != "#common" for t in tag_set)
result = []
for t in tags:
if t == "#common" and has_specific:
continue
result.append(t)
return result
def _render_string_line(o: Renderer, value: str, tags: list[str]) -> Text:
left = Text.from_markup(
'string: [decoration]"[/]{string}[decoration]"[/]'.format(
string=escape(value)
)
)
right = Text(" ".join(tags), style="dim")
available = o.console.size.width - (o.indent * 2)
min_gap = 1
right_len = right.cell_len
max_left = available - right_len - min_gap
if max_left < 12:
combined = left.copy()
combined.append(" ")
combined.append(right)
return combined
if left.cell_len > max_left:
left.truncate(max_left - 1, overflow="ellipsis")
padding = available - left.cell_len - right_len
combined = left.copy()
combined.append(" " * padding)
combined.append(right)
return combined
def _get_primary_source_path(func) -> str | None:
if not func.assemblage_records:
return None
source_path = func.assemblage_records[0].source_path
if not source_path:
return None
return source_path
def _render_source_path_separator(o: Renderer, source_path: str) -> Text:
label = f"[ {source_path} ]"
available = max(0, o.console.size.width - (o.indent * 2))
if available <= len(label) + 2:
return Text(label, style="decoration")
rule_len = available - len(label) - 2
left_len = rule_len // 2
right_len = rule_len - left_len
rendered = Text("-" * left_len, style="decoration")
rendered.append(" ")
rendered.append(label, style="decoration")
rendered.append(" ")
rendered.append("-" * right_len, style="decoration")
return rendered
def _format_function_heading(func: MapaFunction) -> str:
kind = "thunk" if func.is_thunk else "function"
return f"{kind} {func.name} @ {hex(func.address)}"
def _iter_function_rows(func: MapaFunction):
for record in func.assemblage_records:
yield "assemblage", record
if func.is_thunk:
return
for caller in func.callers:
yield "caller", caller
yield "metrics", None
for match in func.capa_matches:
yield "capa", match
for call in func.calls:
yield "call", call
for api in func.apis:
yield "api", api
for string in func.strings:
yield "string", string
def _render_plain_string_line(string: MapaString) -> str:
visible_tags = _visible_tags(string.tags)
line = f'string: "{string.value}"'
if visible_tags:
line += f" {' '.join(visible_tags)}"
return line
def render_function_summary_text(func: MapaFunction) -> str:
lines = [_format_function_heading(func)]
for kind, value in _iter_function_rows(func):
if kind == "assemblage":
lines.append(f"assemblage name: {value.name}")
lines.append(f"assemblage file: {value.source_path}")
elif kind == "caller":
lines.append(
f"xref: {value.direction} {value.name} ({value.delta:+})"
)
elif kind == "metrics":
lines.append(
f"B/E/I: {func.num_basic_blocks} / {func.num_edges} / {func.num_instructions} ({func.total_instruction_bytes} bytes)"
)
elif kind == "capa":
lines.append(f"capa: {value}")
elif kind == "call":
lines.append(
f"calls: {value.direction} {value.name} ({value.delta:+})"
)
elif kind == "api":
lines.append(f"api: {value.name}")
elif kind == "string":
lines.append(_render_plain_string_line(value))
else:
raise ValueError(f"unexpected function row kind: {kind}")
return "\n".join(lines)
def render_report(report: MapaReport, console: Console) -> None:
o = Renderer(console)
with o.section("meta"):
o.writeln(f"name: {report.meta.name}")
o.writeln(f"sha256: {report.meta.sha256}")
o.writeln(f"arch: {report.meta.arch}")
o.writeln(f"ts: {report.meta.timestamp}")
with o.section("sections"):
for section in report.sections:
o.writeln(f"- {hex(section.address)} {section.perms} {hex(section.size)}")
with o.section("libraries"):
for lib in report.libraries:
static = " (static)" if lib.is_static else ""
addr = (
f" at {hex(lib.load_address)}" if lib.load_address is not None else ""
)
o.writeln(f"- {lib.name:<12s}{static}{addr}")
if not report.libraries:
o.writeln("(none)")
with o.section("functions"):
last_source_path: str | None = None
for func in report.functions:
source_path = _get_primary_source_path(func)
if source_path is not None:
if last_source_path is not None and source_path != last_source_path:
o.print(_render_source_path_separator(o, source_path))
last_source_path = source_path
with o.section(
o.markup(
"{function_kind} [default]{function_name}[/] [decoration]@ {function_address}[/]",
function_kind="thunk" if func.is_thunk else "function",
function_name=func.name,
function_address=hex(func.address),
)
):
for kind, value in _iter_function_rows(func):
if kind == "assemblage":
o.writeln(f"assemblage name: {value.name}")
o.writeln(f"assemblage file: {value.source_path}")
elif kind == "caller":
o.print(
"xref: [decoration]{direction}[/] {name} [decoration]({delta:+})[/]",
direction=value.direction,
name=value.name,
delta=value.delta,
)
elif kind == "metrics":
o.writeln(
f"B/E/I: {func.num_basic_blocks} / {func.num_edges} / {func.num_instructions} ({func.total_instruction_bytes} bytes)"
)
elif kind == "capa":
o.writeln(f"capa: {value}")
elif kind == "call":
o.print(
"calls: [decoration]{direction}[/] {name} [decoration]({delta:+})[/]",
direction=value.direction,
name=value.name,
delta=value.delta,
)
elif kind == "api":
o.print(
"api: {name}",
name=value.name,
)
elif kind == "string":
visible_tags = _visible_tags(value.tags)
if visible_tags:
o.print(_render_string_line(o, value.value, visible_tags))
else:
o.print(
'string: [decoration]"[/]{string}[decoration]"[/]',
string=value.value,
)
else:
raise ValueError(f"unexpected function row kind: {kind}")
if not func.is_thunk:
o.print("")


@@ -0,0 +1,25 @@
# Vendored Quantum Strand string databases
Upstream: `../quantumstrand/` (https://github.com/mandiant/flare-floss, Quantum Strand branch)
Upstream commit: `73eb1541e896c065fc694ba7b01067f56871631b`
## Copied data files
- `data/oss/*.jsonl.gz` — open-source library string databases
- `data/crt/msvc_v143.jsonl.gz` — MSVC CRT string database
- `data/expert/capa.jsonl` — expert tagging rules (plain JSONL, not gzipped)
- `data/winapi/apis.txt.gz` — Windows API function names
- `data/winapi/dlls.txt.gz` — Windows DLL names
- `data/gp/gp.jsonl.gz` — global prevalence strings
- `data/gp/cwindb-native.jsonl.gz` — CWinDB native prevalence strings
- `data/gp/cwindb-dotnet.jsonl.gz` — CWinDB .NET prevalence strings
- `data/gp/junk-code.jsonl.gz` — junk/compiler-generated code strings
- `data/gp/xaa-hashes.bin` — truncated MD5 hash set (8 bytes per entry)
- `data/gp/yaa-hashes.bin` — truncated MD5 hash set (8 bytes per entry)
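A minimal sketch of how a truncated-MD5 hash set like the `.bin` files above can be queried, assuming only the stated format (a flat sequence of 8-byte MD5 prefixes); the helper names are illustrative and the upstream loader may instead binary-search the sorted file on disk:

```python
import hashlib


def load_hash_set(data: bytes) -> set[bytes]:
    """Split a flat buffer of 8-byte truncated MD5 digests into a set."""
    return {data[i : i + 8] for i in range(0, len(data), 8)}


def contains(hash_set: set[bytes], s: str) -> bool:
    """Check whether a string's truncated MD5 appears in the set."""
    return hashlib.md5(s.encode("utf-8")).digest()[:8] in hash_set
```

Truncating to 8 bytes keeps the databases small at the cost of a tiny false-positive probability, which is acceptable for a tagging hint.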
## Code
The loader, tagger, and model code in this package are mapa-local rewrites
inspired by upstream modules `floss/qs/db/oss.py`, `expert.py`, `gp.py`,
`winapi.py`, and the tagging logic in `floss/qs/main.py`. No upstream code
was copied verbatim.


@@ -0,0 +1 @@
from __future__ import annotations

Binary file not shown.

File diff suppressed because it is too large.

Some files were not shown because too many files have changed in this diff.