- remove bytes_rules from _RuleFeatureIndex; bytes_prefix_index is the
only structure needed for candidate selection
- build bytes_prefix_index directly in _index_rules_by_feature() instead
of building bytes_rules then converting, removing one full pass
- add if -1 in bytes_prefix_index guard to avoid temporary object
creation for the short-pattern fallback (almost never taken)
- remove assert isinstance(feature.value, bytes) checks in _match();
add Bytes.value: bytes class-level annotation so mypy narrows the
type without the runtime check
- remove cache structure compatibility block from cache.py per reviewer
request to handle in a separate PR
- update test assertions from bytes_rules to bytes_prefix_index
- Change _match() guard from bytes_rules to bytes_prefix_index
so the guard references the field actually used for candidate selection.
- Update stale comment to describe the prefix-bucket strategy.
- Clarify bytes_rules dataclass comment (retained for logging only).
- Add test_bytes_prefix_index_mixed_short_and_long_patterns covering
rules with both short (<4B) and long (>=4B) patterns exercised together.
Instead of iterating all extracted Bytes features for every bytes-based rule,
build a prefix index keyed by fixed bucket sizes (4, 8, 16, 32, 64, 128, 256)
once per scope evaluation. Each bytes pattern is looked up in the largest
bucket that fits its length, then only candidates sharing that prefix are
compared, replacing the previous O(n) linear scan with an O(1) hash lookup.
Patterns shorter than the minimum bucket still fall back to the full scan.
Adds a test to verify correctness for exact match, startswith match, mismatch,
and short-bytes cases.
Closes: https://github.com/mandiant/capa/issues/2128
Replace the header from source code files using the following script:
```Python
for dir_path, dir_names, file_names in os.walk("capa"):
for file_name in file_names:
# header are only in `.py` and `.toml` files
if file_name[-3:] not in (".py", "oml"):
continue
file_path = f"{dir_path}/{file_name}"
f = open(file_path, "rb+")
content = f.read()
m = re.search(OLD_HEADER, content)
if not m:
continue
print(f"{file_path}: {m.group('year')}")
content = content.replace(m.group(0), NEW_HEADER % m.group("year"))
f.seek(0)
f.write(content)
```
Some files had the copyright headers inside a `"""` comment and needed
manual changes before applying the script. `hook-vivisect.py` and
`pyinstaller.spec` didn't include the license in the header and also
needed manual changes.
The old header had the confusing sentence `All rights reserved`, which
does not make sense for an open source license. Replace the header by
the default Google header that corrects this issue and keep capa
consistent with other Google projects.
Adapt the linter to work with the new header.
Replace also the copyright text in the `web/public/index.html` file for
consistency.
Implement the "tighten rule pre-selection" algorithm described here:
https://github.com/mandiant/capa/issues/2063#issuecomment-2100498720
In summary:
> Rather than indexing all features from all rules,
> we should pick and index the minimal set (ideally, one) of
> features from each rule that must be present for the rule to match.
> When we have multiple candidates, pick the feature that is
> probably most uncommon and therefore "selective".
This seems to work pretty well. Total evaluations when running against
mimikatz drop from 19M to 1.1M (wow!) and capa seems to match around
3x more functions per second (wow wow).
When doing large scale runs, capa is about 25% faster when using the
vivisect backend (analysis heavy) or 3x faster when using the
upcoming BinExport2 backend (minimal analysis).