Implement the "tighten rule pre-selection" algorithm described here:
https://github.com/mandiant/capa/issues/2063#issuecomment-2100498720
In summary:
> Rather than indexing all features from all rules,
> we should pick and index the minimal set (ideally, one) of
> features from each rule that must be present for the rule to match.
> When we have multiple candidates, pick the feature that is
> probably most uncommon and therefore "selective".
This seems to work pretty well. Total evaluations when running against
mimikatz drop from 19M to 1.1M (wow!) and capa seems to match around
3x more functions per second (wow wow).
When doing large scale runs, capa is about 25% faster when using the
vivisect backend (analysis heavy) or 3x faster when using the
upcoming BinExport2 backend (minimal analysis).
* include rule caching in PyInstaller build process
The following commit introduces a new function that caches the capa
rule set, so that users don't have to manually run ./scripts/cache-
ruleset.py, before running pyinstaller.
* ci: omit Cache rule set step from build.yml workflow
* refactor: move cache generation to cache.py
* mkdir cache directory when it does not exist
---------
Co-authored-by: Soufiane Fariss <soufiane.fariss@um5s.net.ma>
Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
* main: split main into a bunch of "main routines"
[wip] since there are a few references to BinExport2
that are in progress elsewhre. Next commit will remove them.
* main: remove references to wip BinExport2 code
* changelog
* main: rename first position argument "input_file"
closes#1946
* main: linters
* main: move rule-related routines to capa.rules
ref #1821
* main: extract routines to capa.loader module
closes#1821
* add loader module
* loader: learn to load freeze format
* freeze: use new cli arg handling
* Update capa/loader.py
Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
* main: remove duplicate documentation
* main: add doc about where some functions live
* scripts: migrate to new main wrapper helper functions
* scripts: port to main routines
* main: better handle auto-detection of backend
* scripts: migrate bulk-process to main wrappers
* scripts: migrate scripts to main wrappers
* main: rename *_from_args to *_from_cli
* changelog
* cache-ruleset: remove duplication
* main: fix tag handling
* cache-ruleset: fix cli args
* cache-ruleset: fix special rule cli handling
* scripts: fix type bytes
* main: remove old TODO message
* loader: fix references to binja extractor
---------
Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
im worried this will interact poorly with our rule cache,
unless we add more handling there, which needs more testing.
so, since the filtering likely has only a small impact on performance,
revert the rule filtering changes for simplicity.