mapa: add QS string tags
Vendor QS string databases and tag each string row with right-aligned
database-derived tags (#zlib, #winapi, #capa, #common, #code-junk,
etc.).
Tags are matched against raw strings before display trimming. The visible
tag policy suppresses #common when a more-specific tag is present.
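A minimal sketch of the visibility policy described above. The function name
and the assumption that every tag other than #common counts as "more
specific" are illustrative, not taken from the mapa source:

```python
def visible_tags(tags):
    """Return the tags to display for one string row.

    Assumption: any tag other than "#common" is treated as more specific.
    """
    specific = [t for t in tags if t != "#common"]
    # Keep "#common" only when it is the sole information available.
    return specific if specific else list(tags)
```

With this policy a row tagged ["#common", "#zlib"] shows only #zlib, while a
row tagged only ["#common"] keeps its tag.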
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the Lancelot/BinExport2 backend with an IDALib-only implementation
using ida-domain as the primary query surface.
New mapa/ package with five modules:
- model.py: backend-neutral dataclasses (MapaReport, MapaFunction, etc.)
- ida_db.py: database lifecycle with SHA-256 caching and flock guards
- collector.py: populates MapaReport from an open ida_domain.Database
- renderer.py: Rich-based text output from MapaReport
- cli.py: argument parsing, capa/assemblage loading, orchestration
Key behaviors preserved from the original:
- Report sections: meta, sections, libraries, functions (modules removed)
- Thunk chain resolution (depth 5, matching capa's THUNK_CHAIN_DEPTH_DELTA)
- Caller forwarding through thunks
- CFG stats with NOEXT|PREDS flags
- String extraction via data-reference chains (depth 10)
- Assemblage overlay and capa match attachment
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mapa: suppress Lumina via IdaCommandOptions.plugin_options
Match capa's loader.py behavior: disable primary and secondary Lumina
servers by passing plugin_options through IdaCommandOptions, which maps
to IDA's -O switch. load_resources=True already provides -R.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mapa: add __main__.py for python -m mapa invocation
scripts/mapa.py shadows the mapa package when run directly because
Python adds scripts/ to sys.path. The canonical invocation is now:
python -m mapa <input_file> [options]
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mapa: import idapro before ida_auto
idapro must be imported first because it mutates sys.path to make
ida_auto and other IDA modules available.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mapa: guard against InvalidEAError in string/xref lookups
ida-domain raises InvalidEAError for unmapped addresses instead of
returning None. Guard data_refs_from_ea and strings.get_at calls
so the collector handles broken reference chains gracefully.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mapa: change default/key theme color from black to blue
Black text is invisible on dark terminals. Use blue for function names,
keys, and values.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mapa: use module.dll!func format for APIs and libraries
IDA strips .dll from PE import module names. Add it back so libraries
render as 'KERNEL32.dll' and API entries as 'KERNEL32.dll!CreateFileW'.
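A rough sketch of the formatting rule, assuming hypothetical helper names
(format_module, format_api) and a case-insensitive suffix check so that a
name that already ends in .dll is left alone:

```python
def format_module(name: str) -> str:
    # Re-append the ".dll" suffix only when it is missing.
    return name if name.lower().endswith(".dll") else name + ".dll"


def format_api(module: str, func: str) -> str:
    # Render an import as module.dll!func.
    return f"{format_module(module)}!{func}"
```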
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mapa: lowercase module names in libraries and API entries
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mapa: use FLOSS/capa regex-based string extraction instead of IDA string list
IDA's built-in string list has a minimum length threshold (~5 chars)
that silently drops short strings like "exec". Replace db.strings and
ida_bytes.get_strlit_contents with regex-based extraction from FLOSS/capa
that scans raw segment bytes for ASCII and UTF-16 LE strings (min 4 chars).
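A sketch of this style of regex scan over a raw byte buffer. The character
class (space through tilde, plus tab) and the names below are assumptions
for illustration; the exact FLOSS/capa patterns may differ:

```python
import re

MIN_LEN = 4  # minimum string length, as stated above

# Assumed printable set: 0x20-0x7e plus tab.
ASCII_RE = re.compile(rb"[\x20-\x7e\t]{%d,}" % MIN_LEN)
# UTF-16 LE: each printable byte is followed by a zero byte.
UTF16LE_RE = re.compile(rb"(?:[\x20-\x7e\t]\x00){%d,}" % MIN_LEN)


def extract_strings(buf: bytes):
    """Yield (offset, text) for ASCII strings, then UTF-16 LE strings."""
    for m in ASCII_RE.finditer(buf):
        yield m.start(), m.group().decode("ascii")
    for m in UTF16LE_RE.finditer(buf):
        yield m.start(), m.group().decode("utf-16-le")
```

A four-character string such as "exec" is found by this scan, which is the
case the IDA string list was dropping.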
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mapa: simplify string extraction to on-demand via get_cstring_at
Replace upfront segment-scanning index with on-demand reads using
db.bytes.get_cstring_at, validated against FLOSS/capa printable ASCII
charset. The index approach missed mid-string references and did
unnecessary work scanning entire segments.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mapa: add UTF-16 LE wide string extraction
Read raw bytes at data reference targets and check for both ASCII and
UTF-16 LE strings using FLOSS/capa printability heuristics. Neither
ida_domain's get_cstring_at nor get_string_at handles wide strings, so
we parse the byte patterns directly.
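A sketch of the on-demand check at a single reference target, operating on a
plain bytes buffer rather than an IDA database. The function name, the
printable character class, and the four-character minimum are assumptions
for illustration:

```python
import re

MIN_LEN = 4  # assumed minimum length

ASCII_AT = re.compile(rb"[\x20-\x7e\t]{%d,}" % MIN_LEN)
WIDE_AT = re.compile(rb"(?:[\x20-\x7e\t]\x00){%d,}" % MIN_LEN)


def read_string_at(buf: bytes, offset: int):
    """Return the string starting exactly at offset, or None."""
    # Pattern.match(buf, pos) anchors the match at pos, so a reference
    # into the middle of a longer string still yields its tail.
    m = ASCII_AT.match(buf, offset)
    if m:
        return m.group().decode("ascii")
    m = WIDE_AT.match(buf, offset)
    if m:
        return m.group().decode("utf-16-le")
    return None
```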
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* webui: include feature type in global search (match, regex, etc.)
Searching for "match" or "regex" in the capa Explorer web UI produced
no results because PrimeVue's globalFilterFields only included the
name field, while the feature kind (e.g. "match", "regex", "api") is
stored in the separate typeValue field.
Add 'typeValue' to globalFilterFields so that the global search box
matches nodes by both their value (name) and their kind (typeValue).
No change to rendering or data structure; only the set of fields
consulted during filtering is widened.
Fixes #2349.
* changelog: add entry for #2349 webui global search fix
* rules: handle empty or invalid YAML documents in Rule.from_yaml
Empty or whitespace-only .yml files caused a cryptic TypeError in
Rule.from_dict (NoneType not subscriptable) when yaml.load returned None.
This made lint.py abort with a stack trace instead of a clear message.
Add an early guard in Rule.from_yaml that raises InvalidRule with a
descriptive message when the parsed document is None or structurally
invalid. get_rules() now logs a warning and skips such files so that
scripts/lint.py completes cleanly even when placeholder .yml files
exist in the rules/ or rules/nursery/ directories.
Fixes #2900.
* changelog: add entry for #2900 empty YAML handling
* rules: fix exception check and add get_rules skip test
- Use e.args[0] instead of str(e) to check the error message.
InvalidRule.__str__ prepends "invalid rule: " so str(e) never
matched the bare message, causing every InvalidRule to be re-raised.
- Add test_get_rules_skips_empty_yaml to cover the get_rules skip path,
confirming that an empty file is warned-and-skipped while a valid
sibling rule is still loaded.
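The str(e) versus e.args[0] pitfall can be reproduced with a stand-in
exception class. The class body and the message constant below are
assumptions for illustration and are not copied from capa:

```python
class InvalidRule(ValueError):
    """Stand-in whose __str__ adds a prefix, as described above."""

    def __init__(self, msg):
        super().__init__(msg)
        self.msg = msg

    def __str__(self):
        return f"invalid rule: {self.msg}"


EMPTY_MSG = "empty or invalid YAML document"  # hypothetical message text


def is_empty_document_error(e: InvalidRule) -> bool:
    # Compare against the bare message; str(e) carries the prefix and
    # would never be equal to EMPTY_MSG.
    return bool(e.args) and e.args[0] == EMPTY_MSG
```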
* fix: correct isort import ordering in tests/test_rules.py
Move capa.engine import before capa.rules.cache to satisfy
isort --length-sort ordering.
* loader: skip PE files with unrealistically large section virtual sizes
Some malformed PE samples declare section virtual sizes orders of
magnitude larger than the file itself (e.g. a ~400 KB file with a
900 MB section). vivisect attempts to map these regions, causing
unbounded CPU and memory consumption (see #1989).
Add _is_probably_corrupt_pe() which uses pefile (fast_load=True) to
check whether any section's Misc_VirtualSize exceeds
max(file_size * 128, 512 MB). If the check fires, get_workspace()
raises CorruptFile before vivisect is invoked, keeping the existing
exception handling path consistent.
Thresholds are intentionally conservative to avoid false positives on
large but legitimate binaries. When pefile is unavailable the helper
returns False and behaviour is unchanged.
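The threshold arithmetic can be sketched without pefile by passing in the
file size and the declared section virtual sizes. The function name is
hypothetical; the two constants use the ratio and cap stated above:

```python
_VSIZE_FILE_RATIO = 128
_MAX_REASONABLE_VSIZE = 512 * 1024 * 1024  # 512 MB


def has_unrealistic_section(file_size: int, section_vsizes) -> bool:
    """True if any declared virtual size exceeds the limit."""
    limit = max(file_size * _VSIZE_FILE_RATIO, _MAX_REASONABLE_VSIZE)
    # Treat a missing size (None) as 0 rather than failing.
    return any((vsize or 0) > limit for vsize in section_vsizes)
```

For the ~400 KB file with a 900 MB section, file_size * 128 is about 50 MB,
so the 512 MB cap sets the limit and the check fires. For an 8 MB file the
limit is 1 GB, so the same section size passes.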
Fixes #1989.
* changelog: add entry for #1989 corrupt PE large sections
* loader: apply Gemini review improvements
- Extend corrupt-PE check to FORMAT_AUTO so malformed PE files
cannot bypass the guard when format is auto-detected (the helper
returns False for non-PE files so there is no false-positive risk).
- Replace magic literals 128 and 512*1024*1024 with named constants
_VSIZE_FILE_RATIO and _MAX_REASONABLE_VSIZE for clarity.
- Remove redundant int() cast around getattr(Misc_VirtualSize); keep
the `or 0` guard for corrupt files where pefile may return None.
- Extend test to cover FORMAT_AUTO path alongside FORMAT_PE.
* tests: remove mock-only corrupt PE test per maintainer request
williballenthin noted the test doesn't add real value since it only
exercises the mock, not the actual heuristic. Removing it per feedback.
* fix: resolve flake8 NIC002 implicit string concat and add missing test
Fix the implicit string concatenation across multiple lines that caused
code_style CI to fail. Also add the test_corrupt_pe_with_unrealistic_section_size_short_circuits
test that was described in the PR body but not committed.
* perf: eliminate O(n²) tuple growth and reduce per-match overhead
Four data-driven performance improvements identified by profiling
the hot paths in capa's rule-matching and capability-finding pipeline:
1. find_static_capabilities / find_dynamic_capabilities (O(n²) → O(n))
Tuple concatenation with `t += (item,)` copies the entire tuple on
every iteration. For a binary with N functions this copies O(N²)
elements in total. Replace with list accumulation and a single
`tuple(list)` conversion at the end.
2. RuleSet._match: pre-compute rule_index_by_rule_name (O(n) → O(1))
`_match` is called once per instruction / basic-block / function scope
(potentially millions of times). Previously it rebuilt the name→index
dict on every call. The dict is now computed once in `__init__` and
stored as `_rule_index_by_scope`, reducing each call to a dict lookup.
3. RuleSet._match: candidate_rules.pop(0) → deque.popleft() (O(n) → O(1))
`list.pop(0)` is O(n) because it shifts every remaining element.
Switch to `collections.deque` for O(1) left-side consumption.
4. RuleSet._extract_subscope_rules: list.pop(0) → deque.popleft() (O(n²) → O(n))
Same issue: BFS over rules used list.pop(0), making the whole loop
quadratic. Changed to a deque queue for linear-time processing.
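The two patterns behind items 1, 3, and 4 can be sketched in isolation. The
function names and the children mapping are illustrative and do not mirror
capa's actual code:

```python
from collections import deque


def collect(items):
    """Accumulate into a list, then convert once (linear time)."""
    acc = []
    for item in items:
        acc.append(item)  # amortized O(1), unlike t += (item,)
    return tuple(acc)


def bfs_order(children, root):
    """Breadth-first traversal using deque.popleft(), which is O(1)."""
    queue = deque([root])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, ()))
    return order
```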
Fixes #2880
* perf: use sorted merge instead of full re-sort for new rule candidates
When a rule matches and introduces new dependent candidates into
_match's work queue, the previous approach converted the deque to a
list, extended it with the new items, and re-sorted the whole
collection — O((k+m) log(k+m)).
Because the existing deque is already topologically sorted, we only
need to sort the new additions — O(m log m) — and then merge the two
sorted sequences in O(k+m) using heapq.merge.
Also adds a CHANGELOG entry for the performance improvements in #2890.
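The sorted-merge idea, sketched on plain comparable values instead of rule
objects (merge_candidates is a hypothetical name):

```python
import heapq


def merge_candidates(existing, new):
    """Merge an already-sorted sequence with unsorted new items.

    Only the new items are sorted (O(m log m)); heapq.merge then
    combines the two sorted inputs in O(k + m).
    """
    return list(heapq.merge(existing, sorted(new)))
```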
* perf: simplify candidate_rules to LIFO list, revert heapq.merge
Address reviewer feedback:
- Replace deque+popleft with list+pop (LIFO stack) in _extract_subscope_rules;
processing order doesn't affect correctness, and list.pop() is O(1).
- Replace deque+popleft with list+pop (LIFO stack) in _match; sort candidate
rules descending so pop() from the end yields the topologically-first rule.
- Revert heapq.merge back to the simpler extend+re-sort pattern; the added
complexity wasn't justified given the typically small candidate set.
- Remove now-unused `import heapq`.
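The descending-sort stack can be illustrated on plain integers. This is a
sketch of the pattern only, with a hypothetical function name:

```python
def drain_in_order(candidates):
    """Consume items smallest-first using list.pop() from the end."""
    stack = sorted(candidates, reverse=True)  # smallest item ends up last
    out = []
    while stack:
        out.append(stack.pop())  # O(1), no element shifting
    return out
```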
* loader: handle struct.error from dnfile and show clear CorruptFile message
When .NET metadata is truncated or invalid, dnfile can raise struct.error.
Catch it in DnfileFeatureExtractor and DotnetFileFeatureExtractor and
re-raise as CorruptFile with a user-friendly message.
Fixes #2442
* dnfile: centralize dnPE() loading in load_dotnet_image helper
Add load_dotnet_image() to dnfile/helpers.py that calls dnfile.dnPE()
and catches struct.error, raising CorruptFile with the original error
message included (f"Invalid or truncated .NET metadata: {e}").
Both DnfileFeatureExtractor and DotnetFileFeatureExtractor now call the
helper instead of duplicating the try/except block, and their direct
import of struct is removed.
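The shape of such a helper, sketched with stand-ins: CorruptFile here is a
local placeholder class, and the parse callable replaces dnfile.dnPE() so
the example runs with the standard library only:

```python
import struct


class CorruptFile(ValueError):
    """Placeholder for capa's CorruptFile exception."""


def parse_u32(data: bytes) -> int:
    # Stand-in parser; raises struct.error on truncated input.
    return struct.unpack("<I", data)[0]


def load_image(parse, data: bytes):
    """Call the parser and translate struct.error into CorruptFile."""
    try:
        return parse(data)
    except struct.error as e:
        raise CorruptFile(f"Invalid or truncated .NET metadata: {e}") from e
```

Keeping the try/except in one helper means both extractors share the same
error message and neither needs to import struct.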
Addresses review feedback on #2872.
* style: reformat dnfile files with black --line-length=120
Fixes CI code_style failure by applying the project's configured
line length (120) instead of black's default (88).
---------
Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
* ci: add black auto-format workflow (#2827)
Signed-off-by: priyank <priyank8445@gmail.com>
* ci: use pre-commit to run black and isort (#2827)
* ci: fix install dependencies to include dev extras
---------
Signed-off-by: priyank <priyank8445@gmail.com>
* doc: add table comparing ways to consume capa output
Add a short table to usage.md for CLI, IDA, Ghidra, CAPE, and web.
Fixes #2273
* doc: add links to each option in the ways-to-consume table
Addresses reviewer feedback to provide a link to learn more for each
consumption method (IDA Pro, Ghidra, CAPE, Web/capa Explorer).
Refs #2273
* doc: add Binary Ninja to ways-to-consume table
Fixes #2273
* webui: show error when JSON does not follow expected schema
Validate result document has required fields (meta, meta.version,
meta.analysis, meta.analysis.layout, rules) after parse. Show
user-friendly error; for URL loads suggest reanalyzing (e.g. VT).
Fixes #2363
* webui: fix array validation bug and deduplicate VT suggestion string
- introduce isInvalidObject() helper (checks !v || typeof v !== "object" || Array.isArray(v))
so that arrays are correctly rejected in schema validation
- extract VT_REANALYZE_SUGGESTION constant to eliminate the duplicated string
in loadRdoc()
Addresses review feedback on #2871
* webui: address review - validate feature_counts, hoist VT_REANALYZE_SUGGESTION
- Add validation for meta.analysis.feature_counts in validateRdocSchema()
so parseFunctionCapabilities and other consumers do not hit missing/invalid
feature_counts at runtime.
- Require feature_counts to have either 'functions' or 'processes' array
(static vs dynamic result documents).
- Move VT_REANALYZE_SUGGESTION to module top level to avoid redefining
on every loadRdoc call.
* webui: allow file-scoped-only result documents in schema validation
- Validation: allow feature_counts without functions/processes arrays; if
present they must be arrays.
- rdocParser: default feature_counts.functions to [] when missing so
file-scoped-only docs do not throw.
* webui: remove leading space from VT_REANALYZE_SUGGESTION constant
Per review feedback: the concatenation at call sites handles spacing,
so the constant should not carry a leading space.
* ida-explorer: fix TypeError when sorting mixed address types
When a feature has multiple locations and those locations contain a mix
of integer-based addresses (e.g. AbsoluteVirtualAddress) and non-integer
addresses (e.g. _NoAddress), calling sorted() raises a TypeError because
Python falls back to the reflected comparison (__gt__) which is not
defined on _NoAddress.
Add a sort key to sorted() that places integer-based addresses first
(sorted by value) and non-integer addresses last, avoiding the
cross-type comparison.
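A sketch of such a sort key, assuming integer-based addresses are int
subclasses (sort_locations is a hypothetical name):

```python
def sort_locations(locations):
    """Integer-based addresses first (by value), everything else last."""
    return sorted(
        locations,
        # The leading 0/1 decides the group, so an integer is never
        # compared directly against a non-integer address.
        key=lambda a: (0, int(a)) if isinstance(a, int) else (1, 0),
    )
```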
Fixes #2195
* ida-explorer: fix comparison at source so sorted(locations) works everywhere
Implement the gt solution per review: fix comparison for all addresses
so we can use sorted(locations) / sorted(addrs) consistently without
per-call-site sort keys.
- Add _NoAddress.__gt__ so mixed-type comparison works: (real_address <
NO_ADDRESS) invokes it and NoAddress sorts last. Avoids TypeError
when sorting AbsoluteVirtualAddress with _NoAddress.
- In ida/plugin/model.py, use sorted(locations) instead of a custom
key. view.py (lines 1054, 1077) already use sorted(); they now work
with mixed address types without change.
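Why defining __gt__ is enough can be shown with two stand-in classes. These
are simplified placeholders, not capa's real address types:

```python
class AbsoluteVirtualAddress(int):
    """Placeholder integer-based address."""


class NoAddress:
    """Placeholder for an address-less location that sorts last."""

    def __lt__(self, other):
        return False  # never smaller than anything

    def __gt__(self, other):
        # For `real_address < NO_ADDRESS`, int.__lt__ returns
        # NotImplemented, so Python falls back to this reflected method.
        return not isinstance(other, NoAddress)


NO_ADDRESS = NoAddress()
ordered = sorted([NO_ADDRESS, AbsoluteVirtualAddress(2), AbsoluteVirtualAddress(1)])
```

Without __gt__ on the placeholder class, the sorted() call above raises
TypeError, which matches the failure described in the previous commit.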
Fixes #2195
* changelog: move address sort fix to Bug Fixes section
Per maintainer feedback: fix applies beyond ida-explorer.
* webui: fix 404 for "View rule in capa-rules" links
The createCapaRulesUrl function was constructing URLs by lowercasing
the rule name and replacing spaces with hyphens, which produced URLs
like /rules/packaged-as-single-file-.net-application/ (404).
The capa-rules website uses the original rule name with URL encoding
(e.g. /rules/packaged%20as%20single-file%20.NET%20application/).
Use encodeURIComponent() on the rule name to produce correct URLs.
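The actual fix is in JavaScript. As a Python analogue of the same encoding,
urllib.parse.quote with safe="" produces the same result for the example
above (rule_path is a hypothetical name, and only the path is built here):

```python
from urllib.parse import quote


def rule_path(rule_name: str) -> str:
    """Build the /rules/<name>/ path with the name percent-encoded."""
    # safe="" also encodes "/", so the name stays a single path segment.
    return f"/rules/{quote(rule_name, safe='')}/"
```

Note that quote() and encodeURIComponent() differ on a few characters such
as ! * ' ( ), so this is only an approximation of the web UI behavior.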
Fixes #2482
* refactor: extract baseUrl constant in createCapaRulesUrl per code review
* main: suggest --os flag when OS detection fails for ELF files
When capa cannot detect the target OS of an ELF file, it exits with an
error. Some ELF files lack the standard metadata capa uses for OS
detection (GNU ABI tag, OSABI field, library dependencies, etc.) even
though they do target a valid OS (e.g. a stripped Linux binary using
only raw syscalls).
Add a hint to the unsupported-OS error message telling users they can
specify the OS explicitly with the --os flag, matching the workaround
recommended in the issue.
Fixes #2577
Strings extracted from analyzed samples may contain bracket characters
that Rich interprets as markup (e.g. [/tag]). When these are embedded
directly in markup templates like f"[dim]{s}", Rich raises a
MarkupError if the brackets form an invalid tag.
Use rich.markup.escape() to sanitize all user-controlled strings before
embedding them in Rich markup templates in bold(), bold2(), mute(), and
warn().
Fixes #2699