* Change capa-rules version in installation guide
Updated the installation instructions to reflect the newest version of capa-rules.
* add md files from /doc to bumpversion.toml
* adjust rule installation command
* bump to 9.4.0
- remove bytes_rules from _RuleFeatureIndex; bytes_prefix_index is the
only structure needed for candidate selection
- build bytes_prefix_index directly in _index_rules_by_feature() instead
of building bytes_rules then converting, removing one full pass
- add an `if -1 in bytes_prefix_index` guard to avoid temporary object
creation for the short-pattern fallback (almost never taken)
- remove assert isinstance(feature.value, bytes) checks in _match();
add Bytes.value: bytes class-level annotation so mypy narrows the
type without the runtime check
- remove cache structure compatibility block from cache.py per reviewer
request to handle in a separate PR
- update test assertions from bytes_rules to bytes_prefix_index
- Change _match() guard from bytes_rules to bytes_prefix_index
so the guard references the field actually used for candidate selection.
- Update stale comment to describe the prefix-bucket strategy.
- Clarify bytes_rules dataclass comment (retained for logging only).
- Add test_bytes_prefix_index_mixed_short_and_long_patterns covering
rules with both short (<4B) and long (>=4B) patterns exercised together.
When _RuleFeatureIndex gains a new field, pickle.loads() on an older
cached ruleset succeeds but the resulting objects silently lack the new
field — causing an AttributeError deep in _match() at runtime.
Extend load_cached_ruleset() to walk every _RuleFeatureIndex in the
loaded ruleset and verify each dataclass field is present on the
instance. On mismatch, delete the stale cache and return None so the
caller rebuilds from scratch. Production users are unaffected (the
version hash in the cache key already invalidates caches across
releases); this guard covers developers switching between branches.
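
A minimal sketch of the field-presence check described above, not the code in cache.py. `FeatureIndex` and `has_all_fields` are hypothetical names, and the stale object is simulated by hand: default unpickling restores the instance `__dict__` without calling `__init__`, so an object saved before a field existed comes back without that attribute.

```python
import dataclasses


@dataclasses.dataclass
class FeatureIndex:
    # Hypothetical stand-in for _RuleFeatureIndex.
    rules_by_feature: dict
    bytes_prefix_index: dict  # pretend this field was added in a newer branch


def has_all_fields(obj) -> bool:
    """Return True if every declared dataclass field exists on the instance."""
    return all(hasattr(obj, f.name) for f in dataclasses.fields(obj))


# Simulate what unpickling an older cache produces: the instance is created
# without __init__ and only the attributes saved at the time are restored.
stale = object.__new__(FeatureIndex)
stale.rules_by_feature = {}

fresh = FeatureIndex(rules_by_feature={}, bytes_prefix_index={})
```

A loader following this pattern would treat `has_all_fields(...) == False` as a cache miss, delete the file, and return None.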
The previous implementation rebuilt a `defaultdict` mapping byte prefixes
to extracted feature values inside `_match()`, which is called per
function/basic-block/instruction. Moving the rule-side index build to
`_index_rules_by_feature()` (called once at RuleSet construction) eliminates
this per-call allocation and O(R) rule iteration from the hot path.
`_match()` now looks up candidate rules via the pre-built `bytes_prefix_index`
stored in `_RuleFeatureIndex`, iterating only extracted byte features to
compute their prefixes.
Instead of iterating all extracted Bytes features for every bytes-based rule,
build a prefix index keyed by fixed bucket sizes (4, 8, 16, 32, 64, 128, 256)
once per scope evaluation. Each bytes pattern is looked up in the largest
bucket that fits its length, then only candidates sharing that prefix are
compared, replacing the previous O(n) linear scan with an O(1) hash lookup.
Patterns shorter than the minimum bucket still fall back to the full scan.
Adds a test to verify correctness for exact match, startswith match, mismatch,
and short-bytes cases.
Closes: https://github.com/mandiant/capa/issues/2128
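
The bucketed prefix lookup described above can be sketched roughly as follows. This is an illustration under assumptions, not capa's implementation: `build_prefix_index` and `find_matches` are hypothetical names, and extracted features are modeled as plain `bytes` values.

```python
from collections import defaultdict

BUCKET_SIZES = (4, 8, 16, 32, 64, 128, 256)


def build_prefix_index(extracted):
    """Index each extracted value under every bucket size it is long enough for."""
    index = defaultdict(list)
    for value in extracted:
        for size in BUCKET_SIZES:
            if len(value) < size:
                break
            index[(size, value[:size])].append(value)
    return index


def find_matches(pattern, extracted, index):
    """Return extracted values that start with pattern."""
    # Largest bucket that fits the pattern, or None for patterns under 4 bytes.
    size = max((s for s in BUCKET_SIZES if s <= len(pattern)), default=None)
    if size is None:
        candidates = extracted  # short pattern: fall back to the full scan
    else:
        candidates = index.get((size, pattern[:size]), [])
    return [v for v in candidates if v.startswith(pattern)]
```

Only values that share the pattern's bucket prefix reach the `startswith` comparison, which is where the saving over a scan of every extracted value comes from.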
* webui: include feature type in global search (match, regex, etc.)
Searching for "match" or "regex" in the capa Explorer web UI produced
no results because PrimeVue's globalFilterFields only included the
name field, while the feature kind (e.g. "match", "regex", "api") is
stored in the separate typeValue field.
Add 'typeValue' to globalFilterFields so that the global search box
matches nodes by both their value (name) and their kind (typeValue).
No change to rendering or data structure; only the set of fields
consulted during filtering is widened.
Fixes #2349.
* changelog: add entry for #2349 webui global search fix
* rules: handle empty or invalid YAML documents in Rule.from_yaml
Empty or whitespace-only .yml files caused a cryptic TypeError in
Rule.from_dict (NoneType not subscriptable) when yaml.load returned None.
This made lint.py abort with a stack trace instead of a clear message.
Add an early guard in Rule.from_yaml that raises InvalidRule with a
descriptive message when the parsed document is None or structurally
invalid. get_rules() now logs a warning and skips such files so that
scripts/lint.py completes cleanly even when placeholder .yml files
exist in the rules/ or rules/nursery/ directories.
Fixes #2900.
* changelog: add entry for #2900 empty YAML handling
* rules: fix exception check and add get_rules skip test
- Use e.args[0] instead of str(e) to check the error message.
InvalidRule.__str__ prepends "invalid rule: " so str(e) never
matched the bare message, causing every InvalidRule to be re-raised.
- Add test_get_rules_skips_empty_yaml to cover the get_rules skip path,
confirming that an empty file is warned-and-skipped while a valid
sibling rule is still loaded.
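
To illustrate why the `str(e)` comparison never matched, here is a small sketch. The `InvalidRule` class below is a simplified stand-in, and `EMPTY_MSG` and `is_empty_document_error` are hypothetical names; only the "invalid rule: " prefix comes from the description above.

```python
class InvalidRule(ValueError):
    # Simplified stand-in: __str__ decorates the message, args keeps it bare.
    def __init__(self, msg):
        super().__init__(msg)
        self.msg = msg

    def __str__(self):
        return f"invalid rule: {self.msg}"


EMPTY_MSG = "empty or invalid YAML document"  # hypothetical message text


def is_empty_document_error(e: Exception) -> bool:
    """Compare against the bare message, not the decorated str(e)."""
    return bool(e.args) and e.args[0] == EMPTY_MSG


err = InvalidRule(EMPTY_MSG)
```

Here `str(err)` carries the prefix and so differs from `EMPTY_MSG`, while `err.args[0]` is the message exactly as it was passed in.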
* fix: correct isort import ordering in tests/test_rules.py
Move capa.engine import before capa.rules.cache to satisfy
isort --length-sort ordering.
* loader: skip PE files with unrealistically large section virtual sizes
Some malformed PE samples declare section virtual sizes orders of
magnitude larger than the file itself (e.g. a ~400 KB file with a
900 MB section). vivisect attempts to map these regions, causing
unbounded CPU and memory consumption (see #1989).
Add _is_probably_corrupt_pe() which uses pefile (fast_load=True) to
check whether any section's Misc_VirtualSize exceeds
max(file_size * 128, 512 MB). If the check fires, get_workspace()
raises CorruptFile before vivisect is invoked, keeping the existing
exception handling path consistent.
Thresholds are intentionally conservative to avoid false positives on
large but legitimate binaries. When pefile is unavailable the helper
returns False and behaviour is unchanged.
Fixes #1989.
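
A worked sketch of the threshold arithmetic only, assuming section virtual sizes have already been read from the PE headers. `exceeds_vsize_limit` and the constant names are illustrative; the pefile parsing itself is omitted.

```python
VSIZE_FILE_RATIO = 128                    # limit scales with file size
MAX_REASONABLE_VSIZE = 512 * 1024 * 1024  # 512 MB floor for small files


def exceeds_vsize_limit(file_size: int, section_vsizes) -> bool:
    """Return True if any section's virtual size exceeds max(file_size * 128, 512 MB)."""
    limit = max(file_size * VSIZE_FILE_RATIO, MAX_REASONABLE_VSIZE)
    # Treat a missing size (None) as 0 rather than failing.
    return any((vsize or 0) > limit for vsize in section_vsizes)


KB = 1024
MB = 1024 * 1024
```

For the ~400 KB example above, `file_size * 128` is about 50 MB, so the 512 MB floor applies and a 900 MB section trips the check.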
* changelog: add entry for #1989 corrupt PE large sections
* loader: apply Gemini review improvements
- Extend corrupt-PE check to FORMAT_AUTO so malformed PE files
cannot bypass the guard when format is auto-detected (the helper
returns False for non-PE files so there is no false-positive risk).
- Replace magic literals 128 and 512*1024*1024 with named constants
_VSIZE_FILE_RATIO and _MAX_REASONABLE_VSIZE for clarity.
- Remove redundant int() cast around getattr(Misc_VirtualSize); keep
the `or 0` guard for corrupt files where pefile may return None.
- Extend test to cover FORMAT_AUTO path alongside FORMAT_PE.
* tests: remove mock-only corrupt PE test per maintainer request
williballenthin noted the test doesn't add real value since it only
exercises the mock, not the actual heuristic. Removing it per feedback.
* fix: resolve flake8 NIC002 implicit string concat and add missing test
Fix the implicit string concatenation across multiple lines that caused
code_style CI to fail. Also add the test_corrupt_pe_with_unrealistic_section_size_short_circuits
test that was described in the PR body but not committed.
* perf: eliminate O(n²) tuple growth and reduce per-match overhead
Four data-driven performance improvements identified by profiling
the hot paths in capa's rule-matching and capability-finding pipeline:
1. find_static_capabilities / find_dynamic_capabilities (O(n²) → O(n))
Tuple concatenation with `t += (item,)` copies the entire tuple on
every iteration. For a binary with N functions this allocates O(N²)
total objects. Replace with list accumulation and a single
`tuple(list)` conversion at the end.
2. RuleSet._match: pre-compute rule_index_by_rule_name (O(n) → O(1))
`_match` is called once per instruction / basic-block / function scope
(potentially millions of times). Previously it rebuilt the name→index
dict on every call. The dict is now computed once in `__init__` and
stored as `_rule_index_by_scope`, reducing each call to a dict lookup.
3. RuleSet._match: candidate_rules.pop(0) → deque.popleft() (O(n) → O(1))
`list.pop(0)` is O(n) because it shifts every remaining element.
Switch to `collections.deque` for O(1) left-side consumption.
4. RuleSet._extract_subscope_rules: list.pop(0) → deque.popleft() (O(n²) → O(n))
Same issue: BFS over rules used list.pop(0), making the whole loop
quadratic. Changed to a deque queue for linear-time processing.
Fixes #2880
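
Two of the patterns above, sketched in isolation. `collect_matches` and `bfs_order` are hypothetical helpers for illustration, not functions in capa.

```python
from collections import deque


def collect_matches(items):
    """Accumulate in a list, then convert once: O(n) total instead of O(n^2)."""
    acc = []
    for item in items:
        acc.append(item)  # amortized O(1); `t += (item,)` would copy t each time
    return tuple(acc)


def bfs_order(graph, root):
    """Breadth-first traversal using deque.popleft(), which is O(1)."""
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()  # list.pop(0) would shift every remaining element
        order.append(node)
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order
```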
* perf: use sorted merge instead of full re-sort for new rule candidates
When a rule matches and introduces new dependent candidates into
_match's work queue, the previous approach converted the deque to a
list, extended it with the new items, and re-sorted the whole
collection — O((k+m) log(k+m)).
Because the existing deque is already topologically sorted, we only
need to sort the new additions — O(m log m) — and then merge the two
sorted sequences in O(k+m) using heapq.merge.
Also adds a CHANGELOG entry for the performance improvements in #2890.
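
The sort-then-merge step might look like the sketch below, assuming candidates are rule names ordered by a precomputed index. `merge_candidates` and `index_of` are hypothetical names.

```python
import heapq


def merge_candidates(existing_sorted, new_items, key):
    """Sort only the new items, then merge two sorted sequences in linear time."""
    new_sorted = sorted(new_items, key=key)                       # O(m log m)
    return list(heapq.merge(existing_sorted, new_sorted, key=key))  # O(k + m)


# Hypothetical topological positions for four rules.
index_of = {"a": 0, "b": 1, "c": 2, "d": 3}
merged = merge_candidates(["a", "d"], ["c", "b"], key=index_of.__getitem__)
```

`heapq.merge` assumes both inputs are already sorted by the same key, which is why the existing queue must be kept in order for this to be valid.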
* perf: simplify candidate_rules to LIFO list, revert heapq.merge
Address reviewer feedback:
- Replace deque+popleft with list+pop (LIFO stack) in _extract_subscope_rules;
processing order doesn't affect correctness, and list.pop() is O(1).
- Replace deque+popleft with list+pop (LIFO stack) in _match; sort candidate
rules descending so pop() from the end yields the topologically-first rule.
- Revert heapq.merge back to the simpler extend+re-sort pattern; the added
complexity wasn't justified given the typically small candidate set.
- Remove now-unused `import heapq`.
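
A rough sketch of the LIFO approach, under the assumption that candidates are rule names with a precomputed topological index. `process_in_order`, `index_of`, and `dependents` are hypothetical, and in this simplified version every processed rule is treated as a match that enqueues its dependents.

```python
def process_in_order(candidates, index_of, dependents):
    """Pop from the end of a descending-sorted list to visit rules in ascending order."""
    key = index_of.__getitem__
    stack = sorted(candidates, key=key, reverse=True)
    processed = []
    while stack:
        name = stack.pop()  # O(1); yields the topologically-first remaining rule
        processed.append(name)
        new = dependents.get(name, [])
        if new:
            # Simple extend + re-sort; fine when the candidate set is small.
            stack.extend(new)
            stack.sort(key=key, reverse=True)
    return processed


index_of = {"a": 0, "b": 1, "c": 2, "d": 3}
visited = process_in_order(["c", "a"], index_of, {"a": ["b"]})
```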