* webui: include feature type in global search (match, regex, etc.)
Searching for "match" or "regex" in the capa Explorer web UI produced
no results because PrimeVue's globalFilterFields only included the
name field, while the feature kind (e.g. "match", "regex", "api") is
stored in the separate typeValue field.
Add 'typeValue' to globalFilterFields so that the global search box
matches nodes by both their value (name) and their kind (typeValue).
No change to rendering or data structure; only the set of fields
consulted during filtering is widened.
Fixes #2349.
* changelog: add entry for #2349 webui global search fix
* rules: handle empty or invalid YAML documents in Rule.from_yaml
Empty or whitespace-only .yml files caused a cryptic TypeError in
Rule.from_dict (NoneType not subscriptable) when yaml.load returned None.
This made lint.py abort with a stack trace instead of a clear message.
Add an early guard in Rule.from_yaml that raises InvalidRule with a
descriptive message when the parsed document is None or structurally
invalid. get_rules() now logs a warning and skips such files so that
scripts/lint.py completes cleanly even when placeholder .yml files
exist in the rules/ or rules/nursery/ directories.
Fixes #2900.
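A minimal sketch of the described guard. `InvalidRule` and the `"rule"` key layout are simplified stand-ins here, not capa's exact API:

```python
class InvalidRule(ValueError):
    pass

def rule_from_doc(doc):
    # yaml.load returns None for an empty or whitespace-only document;
    # guard before subscripting to avoid "'NoneType' object is not subscriptable"
    if doc is None or not isinstance(doc, dict) or "rule" not in doc:
        raise InvalidRule("rule document is empty or structurally invalid")
    return doc["rule"]
```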
* changelog: add entry for #2900 empty YAML handling
* rules: fix exception check and add get_rules skip test
- Use e.args[0] instead of str(e) to check the error message.
InvalidRule.__str__ prepends "invalid rule: " so str(e) never
matched the bare message, causing every InvalidRule to be re-raised.
- Add test_get_rules_skips_empty_yaml to cover the get_rules skip path,
confirming that an empty file is warned-and-skipped while a valid
sibling rule is still loaded.
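Why `str(e)` never matched, sketched with an exception whose `__str__` adds the prefix described in the commit message:

```python
class InvalidRule(ValueError):
    def __str__(self):
        # prepends a prefix, so str(e) differs from the bare message
        return f"invalid rule: {self.args[0]}"

e = InvalidRule("document is empty")
# str(e) carries the prefix; e.args[0] is the bare message
```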
* fix: correct isort import ordering in tests/test_rules.py
Move capa.engine import before capa.rules.cache to satisfy
isort --length-sort ordering.
* loader: skip PE files with unrealistically large section virtual sizes
Some malformed PE samples declare section virtual sizes orders of
magnitude larger than the file itself (e.g. a ~400 KB file with a
900 MB section). vivisect attempts to map these regions, causing
unbounded CPU and memory consumption (see #1989).
Add _is_probably_corrupt_pe() which uses pefile (fast_load=True) to
check whether any section's Misc_VirtualSize exceeds
max(file_size * 128, 512 MB). If the check fires, get_workspace()
raises CorruptFile before vivisect is invoked, keeping the existing
exception handling path consistent.
Thresholds are intentionally conservative to avoid false positives on
large but legitimate binaries. When pefile is unavailable the helper
returns False and behaviour is unchanged.
Fixes #1989.
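The threshold logic can be sketched as follows; the names mirror those in the commit but the function signature is a simplification (the real helper reads `Misc_VirtualSize` from sections via pefile with `fast_load=True`):

```python
_VSIZE_FILE_RATIO = 128
_MAX_REASONABLE_VSIZE = 512 * 1024 * 1024  # 512 MB

def is_probably_corrupt_pe(file_size: int, section_vsizes: list) -> bool:
    threshold = max(file_size * _VSIZE_FILE_RATIO, _MAX_REASONABLE_VSIZE)
    # `or 0` guards against a virtual size of None in corrupt headers
    return any((vsize or 0) > threshold for vsize in section_vsizes)
```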
* changelog: add entry for #1989 corrupt PE large sections
* loader: apply Gemini review improvements
- Extend corrupt-PE check to FORMAT_AUTO so malformed PE files
cannot bypass the guard when format is auto-detected (the helper
returns False for non-PE files so there is no false-positive risk).
- Replace magic literals 128 and 512*1024*1024 with named constants
_VSIZE_FILE_RATIO and _MAX_REASONABLE_VSIZE for clarity.
- Remove redundant int() cast around getattr(Misc_VirtualSize); keep
the `or 0` guard for corrupt files where pefile may return None.
- Extend test to cover FORMAT_AUTO path alongside FORMAT_PE.
* tests: remove mock-only corrupt PE test per maintainer request
williballenthin noted the test doesn't add real value since it only
exercises the mock, not the actual heuristic. Removing it per feedback.
* fix: resolve flake8 NIC002 implicit string concat and add missing test
Fix the implicit string concatenation across multiple lines that caused
code_style CI to fail. Also add the test_corrupt_pe_with_unrealistic_section_size_short_circuits
test that was described in the PR body but not committed.
* perf: eliminate O(n²) tuple growth and reduce per-match overhead
Four data-driven performance improvements identified by profiling
the hot paths in capa's rule-matching and capability-finding pipeline:
1. find_static_capabilities / find_dynamic_capabilities (O(n²) → O(n))
Tuple concatenation with `t += (item,)` copies the entire tuple on
every iteration. For a binary with N functions this allocates O(N²)
total objects. Replace with list accumulation and a single
`tuple(list)` conversion at the end.
2. RuleSet._match: pre-compute rule_index_by_rule_name (O(n) → O(1))
`_match` is called once per instruction / basic-block / function scope
(potentially millions of times). Previously it rebuilt the name→index
dict on every call. The dict is now computed once in `__init__` and
stored as `_rule_index_by_scope`, reducing each call to a dict lookup.
3. RuleSet._match: candidate_rules.pop(0) → deque.popleft() (O(n) → O(1))
`list.pop(0)` is O(n) because it shifts every remaining element.
Switch to `collections.deque` for O(1) left-side consumption.
4. RuleSet._extract_subscope_rules: list.pop(0) → deque.popleft() (O(n²) → O(n))
Same issue: BFS over rules used list.pop(0), making the whole loop
quadratic. Changed to a deque queue for linear-time processing.
Fixes #2880
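The patterns behind items 1, 3, and 4 can be sketched generically (placeholder data, not capa's actual structures):

```python
from collections import deque

# item 1: `t += (item,)` copies the whole tuple each iteration (O(n^2) total);
# accumulating in a list and converting once at the end is O(n)
def collect(items):
    out = []
    for item in items:
        out.append(item)
    return tuple(out)

# items 3 and 4: list.pop(0) shifts every remaining element (O(n));
# deque.popleft() consumes from the left in O(1)
queue = deque(["a", "b", "c"])
first = queue.popleft()
```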
* perf: use sorted merge instead of full re-sort for new rule candidates
When a rule matches and introduces new dependent candidates into
_match's work queue, the previous approach converted the deque to a
list, extended it with the new items, and re-sorted the whole
collection — O((k+m) log(k+m)).
Because the existing deque is already topologically sorted, we only
need to sort the new additions — O(m log m) — and then merge the two
sorted sequences in O(k+m) using heapq.merge.
Also adds a CHANGELOG entry for the performance improvements in #2890.
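The merge pattern looks roughly like this (integer keys stand in for the rules' topological ordering; note the follow-up commit below reverts this in favour of a simpler approach):

```python
import heapq

# existing work queue, already in sorted (topological) order
existing = [1, 4, 9]
# new candidates introduced by a match: sort only these (O(m log m)),
# then merge the two sorted sequences in O(k + m)
new_candidates = [7, 2]
merged = list(heapq.merge(existing, sorted(new_candidates)))
```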
* perf: simplify candidate_rules to LIFO list, revert heapq.merge
Address reviewer feedback:
- Replace deque+popleft with list+pop (LIFO stack) in _extract_subscope_rules;
processing order doesn't affect correctness, and list.pop() is O(1).
- Replace deque+popleft with list+pop (LIFO stack) in _match; sort candidate
rules descending so pop() from the end yields the topologically-first rule.
- Revert heapq.merge back to the simpler extend+re-sort pattern; the added
complexity wasn't justified given the typically small candidate set.
- Remove now-unused `import heapq`.
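The reviewer-preferred stack pattern, sketched with placeholder keys: sort descending once so that `pop()` from the end (O(1)) yields the topologically-first item:

```python
candidate_rules = sorted([3, 1, 2], reverse=True)  # [3, 2, 1]
first = candidate_rules.pop()  # pops from the end: the smallest key
```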
* loader: handle struct.error from dnfile and show clear CorruptFile message
When .NET metadata is truncated or invalid, dnfile can raise struct.error.
Catch it in DnfileFeatureExtractor and DotnetFileFeatureExtractor and
re-raise as CorruptFile with a user-friendly message.
Fixes #2442
* dnfile: centralize dnPE() loading in load_dotnet_image helper
Add load_dotnet_image() to dnfile/helpers.py that calls dnfile.dnPE()
and catches struct.error, raising CorruptFile with the original error
message included (f"Invalid or truncated .NET metadata: {e}").
Both DnfileFeatureExtractor and DotnetFileFeatureExtractor now call the
helper instead of duplicating the try/except block, and their direct
import of struct is removed.
Addresses review feedback on #2872.
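A sketch of the helper's shape; `CorruptFile` stands in for capa's exception and the `load` parameter for `dnfile.dnPE`:

```python
import struct

class CorruptFile(Exception):
    pass

def load_dotnet_image(load, path):
    try:
        return load(path)
    except struct.error as e:
        # surface truncated/invalid .NET metadata as a clear user-facing error
        raise CorruptFile(f"Invalid or truncated .NET metadata: {e}")
```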
* style: reformat dnfile files with black --line-length=120
Fixes CI code_style failure by applying the project's configured
line length (120) instead of black's default (88).
---------
Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
* ci: add black auto-format workflow (#2827)
Signed-off-by: priyank <priyank8445@gmail.com>
* ci: use pre-commit to run black and isort (#2827)
* ci: fix install dependencies to include dev extras
---------
Signed-off-by: priyank <priyank8445@gmail.com>
* doc: add table comparing ways to consume capa output
Add a short table to usage.md for CLI, IDA, Ghidra, CAPE, and web.
Fixes #2273
* doc: add links to each option in the ways-to-consume table
Addresses reviewer feedback to provide a link to learn more for each
consumption method (IDA Pro, Ghidra, CAPE, Web/capa Explorer).
Refs #2273
* doc: add Binary Ninja to ways-to-consume table
Fixes #2273
* webui: show error when JSON does not follow expected schema
Validate result document has required fields (meta, meta.version,
meta.analysis, meta.analysis.layout, rules) after parse. Show
user-friendly error; for URL loads suggest reanalyzing (e.g. VT).
Fixes #2363
* webui: fix array validation bug and deduplicate VT suggestion string
- introduce isInvalidObject() helper (checks !v || typeof v !== "object" || Array.isArray(v))
  so that arrays are correctly rejected in schema validation
- extract VT_REANALYZE_SUGGESTION constant to eliminate the duplicated string
in loadRdoc()
Addresses review feedback on #2871
* webui: address review - validate feature_counts, hoist VT_REANALYZE_SUGGESTION
- Add validation for meta.analysis.feature_counts in validateRdocSchema()
so parseFunctionCapabilities and other consumers do not hit missing/invalid
feature_counts at runtime.
- Require feature_counts to have either 'functions' or 'processes' array
(static vs dynamic result documents).
- Move VT_REANALYZE_SUGGESTION to module top level to avoid redefining
on every loadRdoc call.
* webui: allow file-scoped-only result documents in schema validation
- Validation: allow feature_counts without functions/processes arrays; if
present they must be arrays.
- rdocParser: default feature_counts.functions to [] when missing so
file-scoped-only docs do not throw.
* webui: remove leading space from VT_REANALYZE_SUGGESTION constant
Per review feedback: the concatenation at call sites handles spacing,
so the constant should not carry a leading space.
* ida-explorer: fix TypeError when sorting mixed address types
When a feature has multiple locations and those locations contain a mix
of integer-based addresses (e.g. AbsoluteVirtualAddress) and non-integer
addresses (e.g. _NoAddress), calling sorted() raises a TypeError because
Python falls back to the reflected comparison (__gt__) which is not
defined on _NoAddress.
Add a sort key to sorted() that places integer-based addresses first
(sorted by value) and non-integer addresses last, avoiding the
cross-type comparison.
Fixes #2195
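The sort-key approach can be sketched like this; the string sentinel stands in for capa's _NoAddress:

```python
# integer-based addresses first (ordered by value), everything else last
def addr_sort_key(addr):
    if isinstance(addr, int):
        return (0, addr)
    return (1, 0)

mixed = ["<no address>", 0x401000, 0x400000]
ordered = sorted(mixed, key=addr_sort_key)
```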
* ida-explorer: fix comparison at source so sorted(locations) works everywhere
Implement the __gt__ solution per review: fix the comparison for all addresses
so we can use sorted(locations) / sorted(addrs) consistently without
per-call-site sort keys.
- Add _NoAddress.__gt__ so mixed-type comparison works: (real_address <
NO_ADDRESS) invokes it and NoAddress sorts last. Avoids TypeError
when sorting AbsoluteVirtualAddress with _NoAddress.
- In ida/plugin/model.py, use sorted(locations) instead of a custom
key. view.py (lines 1054, 1077) already use sorted(); they now work
with mixed address types without change.
Fixes #2195
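The mechanism in miniature: the sentinel claims to be greater than everything, so Python's reflected-comparison fallback makes plain sorted() work on mixed lists. The class and names here are stand-ins for capa's _NoAddress/NO_ADDRESS, not the actual implementation:

```python
class NoAddress:
    def __lt__(self, other):
        return False  # never smaller than anything: sorts last
    def __gt__(self, other):
        return True   # int < NO_ADDRESS falls back to this reflected call

NO_ADDRESS = NoAddress()
locations = [NO_ADDRESS, 0x401000, 0x400000]
ordered = sorted(locations)  # no per-call-site key needed
```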
* changelog: move address sort fix to Bug Fixes section
Per maintainer feedback: fix applies beyond ida-explorer.
* webui: fix 404 for "View rule in capa-rules" links
The createCapaRulesUrl function was constructing URLs by lowercasing
the rule name and replacing spaces with hyphens, which produced URLs
like /rules/packaged-as-single-file-.net-application/ (404).
The capa-rules website uses the original rule name with URL encoding
(e.g. /rules/packaged%20as%20single-file%20.NET%20application/).
Use encodeURIComponent() on the rule name to produce correct URLs.
Fixes #2482
* refactor: extract baseUrl constant in createCapaRulesUrl per code review
* main: suggest --os flag when OS detection fails for ELF files
When capa cannot detect the target OS of an ELF file, it exits with an
error. Some ELF files lack the standard metadata capa uses for OS
detection (GNU ABI tag, OSABI field, library dependencies, etc.) even
though they do target a valid OS (e.g. a stripped Linux binary using
only raw syscalls).
Add a hint to the unsupported-OS error message telling users they can
specify the OS explicitly with the --os flag, matching the workaround
recommended in the issue.
Fixes #2577
* render: escape Rich markup in strings extracted from samples
Strings extracted from analyzed samples may contain bracket characters
that Rich interprets as markup (e.g. [/tag]). When these are embedded
directly in markup templates like f"[dim]{s}", Rich raises a
MarkupError if the brackets form an invalid tag.
Use rich.markup.escape() to sanitize all user-controlled strings before
embedding them in Rich markup templates in bold(), bold2(), mute(), and
warn().
Fixes #2699
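The idea behind rich.markup.escape, in a simplified stand-in (the real function handles more cases; `mute` here only sketches the described pattern):

```python
# prefix '[' with a backslash so Rich renders it literally
# instead of parsing it as a markup tag
def escape_markup(text: str) -> str:
    return text.replace("[", "\\[")

def mute(s: str) -> str:
    # escape user-controlled strings before embedding them in markup
    return f"[dim]{escape_markup(s)}[/dim]"
```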
* vivisect: convert SegmentationViolation on malformed ELF to CorruptFile
Catch envi.exc.SegmentationViolation raised by vivisect when processing
malformed ELF files with invalid relocations and convert it to a
CorruptFile exception with a descriptive message.
Closes #2794
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>