mirror of
https://github.com/mandiant/capa.git
synced 2025-12-07 13:20:36 -08:00
* elf: os: detect Android via clang compiler .ident note
* elf: os: detect Android via dependency on liblog.so
* main: split main into a bunch of "main routines" [wip]
  since there are a few references to BinExport2 that are in progress elsewhere. Next commit will remove them.
* features: add BinExport2 declarations
* BinExport2: initial skeleton of feature extraction
* main: remove references to wip BinExport2 code
* changelog
* main: rename first position argument "input_file" closes #1946
* main: linters
* main: move rule-related routines to capa.rules ref #1821
* main: extract routines to capa.loader module closes #1821
* add loader module
* loader: learn to load freeze format
* freeze: use new cli arg handling
* Update capa/loader.py
  Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
* main: remove duplicate documentation
* main: add doc about where some functions live
* scripts: migrate to new main wrapper helper functions
* scripts: port to main routines
* main: better handle auto-detection of backend
* scripts: migrate bulk-process to main wrappers
* scripts: migrate scripts to main wrappers
* main: rename *_from_args to *_from_cli
* changelog
* cache-ruleset: remove duplication
* main: fix tag handling
* cache-ruleset: fix cli args
* cache-ruleset: fix special rule cli handling
* scripts: fix type bytes
* main: nicely format debug messages
* helpers: ensure log messages aren't very long
* flake8 config
* binexport2: formatting
* loader: learn to load BinExport2 files
* main: debug log the format and backend
* elf: add more arch constants
* binexport: parse global features
* binexport: extract file features
* binexport2: begin to enumerate function/bb/insns
* binexport: pass context to function/bb/insn extractors
* binexport: linters
* binexport: linters
* scripts: add script to inspect binexport2 file
* inspect-binexport: fix xref symbols
* inspect-binexport: factor out the index building
* binexport: move index to binexport extractor module
* binexport: implement ELF/aarch64 GOT/thunk analyzer
* binexport: implement API features
* binexport: record the full vertex for a thunk
* binexport: learn to extract numbers
* binexport: number: skipped mapped numbers
* binexport: fix basic block address indexing
* binexport: rename function
* binexport: extract operand numbers
* binexport: learn to extract calls from characteristics
* binexport: learn to extract mnemonics
* pre-commit: skip protobuf file
* binexport: better search for sample file
* loader: add file extractors for BinExport2
* binexport: remove extra parameter
* new black config
* binexport: index string xrefs
* binexport: learn to extract bytes and strings
* binexport: cache parsed PE/ELF
* binexport: handle Ghidra SYMBOL numbers
* binexport2: handle binexport#78 (Ghidra only uses SYMBOL expressions)
* main: write error output to stderr, not stdout
* scripts: add example detect-binexport2-capabilities.py
* detect-binexport2-capabilities: more documentation/examples
* elffile: recognize more architectures
* binexport: handle read_memory errors
* binexport: index flow graphs by address
* binexport: cleanup logging
* binexport: learn to extract function names
* binexport: learn to extract all function features
* binexport: learn to extract bb tight loops
* elf: don't require vivisect just for type annotations
* main: remove unused imports
* rules: don't eagerly import ruamel until needed
* loader: avoid eager imports of some backend-related code
* changelog
* fmt
* binexport: better render optional fields
* fix merge conflicts
* fix formatting
* remove Ghidra data reference madness
* handle PermissionError when searching sample file for BinExport2 file
* handle PermissionError when searching sample file for BinExport2 file
* add Android as valid OS
* inspect-binexport: strip strings
* inspect-binexport: render operands
* fix lints
* ruff: update config layout
* inspect-binexport: better align comments/xrefs
* use explicit search paths to get sample for BinExport file
* add initial BinExport tests
* add/update BinExport tests and minor fixes
* inspect-binexport: add perf tracking
* inspect-binexport: cache rendered operands
* lints
* do not extract number features for ret instructions
* Fix BinExport's "tight loop" feature extraction.
  `idx.target_edges_by_basic_block_index[basic_block_index]` is of type `List[Edges]`. The index `basic_block_index` was definitely not an element.
* inspect-binexport: better render data section
* linters
* main: accept --format=binexport2
* binexport: insn: add support for parsing bare immediate int operands
* binexport2: bb: fix tight loop detection ref #2050
* binexport: api: generate variations of Win32 APIs
* lints
* binexport: index: don't assume instruction index is 1:1 with address
* be2: index instruction addresses
* be2: temp remove bytes feature processing
* binexport: read memory from an address space extracted from PE/ELF closes #2061
* be2: resolve thunks to imported functions
* be2: check for be2 string reference before bytes/string extraction overhead
* be2: remove unneeded check
* be2: do not process thunks
* be2: insn: polish thunk handling a bit
* be2: pre-compute thunk targets
* parse negative numbers
* update tests to use Ghidra-generated BinExport file
* remove unused import
* black reformat
* run tests always (for now)
* binexport: tests: fix test case
* binexport: extractor: fix insn lint
* binexport: addressspace: use base address recovered from binexport file
* Add nzxor characteristic in BinExport extractor, by referencing vivisect implementation.
* add tests, fix stack cookie detection
* test BinExport feature PRs
* reformat and fix
* complete TODO descriptions
* wip tests
* binexport: add typing where applicable (#2106)
* binexport2: revert import names from BinExport2 proto
  binexport2_pb.BinExport2 isn't a package so we can't import it like: from ...binexport2_pb.BinExport2 import CallGraph
* fix stack offset numbers and disable offset tests
* xfail OperandOffset
* generate symbol variants
* wip: read negative numbers
* update tight loop tests
* binexport: fix function loop feature detection
* binexport: update binexport function loop tests
* binexport: fix lints and imports
* binexport: add back assert statement to thunk calculation
* binexport: update tests to use Ghidra binexport file
* binexport: add additional debug info to thunk calculation assert
* binexport: update unit tests to focus on Ghidra
* binexport: fix lints
* binexport: remove Ghidra symbol madness and fix x86/amd64 stack offset number tests
* binexport: use masking for Number features
* binexport: ignore call/jmp immediates for intel architecture
* binexport: check if immediate is a mapped address
* binexport: emit offset features for immediates likely structure offsets
* binexport: add twos complement wrapper insn.py
* binexport: add support for x86 offset features
* binexport: code refactor
* binexport: init refactor for multi-arch instruction feature parsing
* binexport: intel: emit indirect call characteristic
* binexport: use helper method for instruction mnemonic
* binexport: arm: emit offset features from stp instruction
* binexport: arm: emit indirect call characteristic
* binexport: arm: improve offset feature extraction
* binexport: add workaround for Ghidra bug that results in empty operands (no expressions)
* binexport: skip x86 stack string tests
* binexport: update mimikatz.exe_ feature count tests for Ghidra
* core: loader: update binja import
* core: loader: update binja imports
* binexport: arm: ignore number features for add instruction manipulating stack
* binexport: update unit tests
* binexport: arm: ignore number features for sub instruction manipulating stack
* binexport: arm: emit offset features for add instructions
* binexport: remove TODO from tests workflow
* binexport: update CHANGELOG
* binexport: remove outdated TODOs
* binexport: re-enable support for data references in inspect-binexport2.py
* binexport: skip data references to code
* binexport: remove outdated TODOs
* Update scripts/inspect-binexport2.py
* Update CHANGELOG.md
* Update capa/helpers.py
* Update capa/features/extractors/common.py
* Update capa/features/extractors/binexport2/extractor.py
* Update capa/features/extractors/binexport2/arch/arm/insn.py
  Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
* initial add
* test binexport scripts
* add tests using small ARM ELF
* add method to get instruction by address
* index instructions by address
* adjust and extend tests
* handle operator with no children bug
* binexport: use instruction address index
  ref: https://github.com/mandiant/capa/pull/1950/files#r1728570811
* inspect binexport: handle lsl with no children
  add pruning phase to expression tree building to remove known-bad branches. This might address some of the data we're seeing due to: https://github.com/NationalSecurityAgency/ghidra/issues/6821
  Also introduces a --instruction optional argument to dump the details of a specific instruction.
* binexport: consolidate expression tree logic into helpers
* binexport: index instruction indices by address
* binexport: introduce instruction pattern matching
  Introduce instruction pattern matching to declaratively describe the instructions and operands that we want to extract. While there's a bit more code, it's much more thoroughly tested, and is less brittle than the prior if/else/if/else/if/else implementation.
* binexport: helpers: fix missing comment words
* binexport: update tests to reflect updated test files
* remove testing of feature branch

---------

Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
Co-authored-by: mr-tz <moritz.raabe@mandiant.com>
Co-authored-by: Lin Chen <larch.lin.chen@gmail.com>
244 lines
7.9 KiB
Python
# Copyright (C) 2021 Mandiant, Inc. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at: [package root]/LICENSE.txt
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.

import os
import sys
import logging
import textwrap
import subprocess
from pathlib import Path

import pytest

logger = logging.getLogger(__name__)

CD = Path(__file__).resolve().parent


def get_script_path(s: str):
    return str(CD / ".." / "scripts" / s)


def get_binary_file_path():
    return str(CD / "data" / "9324d1a8ae37a36ae560c37448c9705a.exe_")


def get_cape_report_file_path():
    return str(
        CD
        / "data"
        / "dynamic"
        / "cape"
        / "v2.4"
        / "fb7ade52dc5a1d6128b9c217114a46d0089147610f99f5122face29e429a1e74.json.gz"
    )


def get_binexport2_file_path():
    return str(CD / "data" / "binexport2" / "mimikatz.exe_.ghidra.BinExport")


def get_rules_path():
    return str(CD / ".." / "rules")


def get_rule_path():
    return str(Path(get_rules_path()) / "lib" / "allocate-memory.yml")


@pytest.mark.parametrize(
    "script,args",
    [
        pytest.param("capa2yara.py", [get_rules_path()]),
        pytest.param("capafmt.py", [get_rule_path()]),
        pytest.param(
            "capa2sarif.py",
            [Path(__file__).resolve().parent / "data" / "rd" / "Practical Malware Analysis Lab 01-01.dll_.json"],
        ),
        # testing some variations of linter script
        pytest.param("lint.py", ["-t", "create directory", get_rules_path()]),
        # `create directory` rule has native and .NET example PEs
        pytest.param("lint.py", ["--thorough", "-t", "create directory", get_rules_path()]),
        pytest.param("match-function-id.py", [get_binary_file_path()]),
        pytest.param("show-capabilities-by-function.py", [get_binary_file_path()]),
        pytest.param("show-features.py", [get_binary_file_path()]),
        pytest.param("show-features.py", ["-F", "0x407970", get_binary_file_path()]),
        pytest.param("show-features.py", ["-P", "MicrosoftEdgeUpdate.exe", get_cape_report_file_path()]),
        pytest.param("show-unused-features.py", [get_binary_file_path()]),
        pytest.param("capa-as-library.py", [get_binary_file_path()]),
        # not testing "minimize-vmray-results.py" as we don't currently upload full VMRay analysis archives
    ],
)
def test_scripts(script, args):
    script_path = get_script_path(script)
    p = run_program(script_path, args)
    assert p.returncode == 0


@pytest.mark.parametrize(
    "script,args",
    [
        pytest.param("inspect-binexport2.py", [get_binexport2_file_path()]),
        pytest.param("detect-binexport2-capabilities.py", [get_binexport2_file_path()]),
    ],
)
def test_binexport_scripts(script, args):
    # define sample bytes location
    os.environ["CAPA_SAMPLES_DIR"] = str(Path(CD / "data"))

    script_path = get_script_path(script)
    p = run_program(script_path, args)
    assert p.returncode == 0


def test_bulk_process(tmp_path):
    # create test directory to recursively analyze
    t = tmp_path / "test"
    t.mkdir()

    source_file = Path(__file__).resolve().parent / "data" / "ping_täst.exe_"
    dest_file = t / "test.exe_"

    dest_file.write_bytes(source_file.read_bytes())

    p = run_program(get_script_path("bulk-process.py"), [str(t.parent)])
    assert p.returncode == 0


def run_program(script_path, args):
    args = [sys.executable] + [script_path] + args
    logger.debug("running: %r", args)
    return subprocess.run(args, stdout=subprocess.PIPE)


@pytest.mark.xfail(reason="result document test files haven't been updated yet")
def test_proto_conversion(tmp_path):
    t = tmp_path / "proto-test"
    t.mkdir()
    json_file = Path(__file__).resolve().parent / "data" / "rd" / "Practical Malware Analysis Lab 01-01.dll_.json"

    p = run_program(get_script_path("proto-from-results.py"), [json_file])
    assert p.returncode == 0

    pb_file = t / "pma.pb"
    pb_file.write_bytes(p.stdout)

    p = run_program(get_script_path("proto-to-results.py"), [pb_file])
    assert p.returncode == 0

    assert p.stdout.startswith(b'{\n "meta": ') or p.stdout.startswith(b'{\r\n "meta": ')


def test_detect_duplicate_features(tmpdir):
    TEST_RULE_0 = textwrap.dedent(
        """
        rule:
            meta:
                name: Test Rule 0
                scopes:
                    static: function
                    dynamic: process
            features:
              - and:
                - number: 1
                - not:
                  - string: process
        """
    )

    TEST_RULESET = {
        "rule_1": textwrap.dedent(
            """
            rule:
                meta:
                    name: Test Rule 1
                    scopes:
                        static: function
                        dynamic: process
                features:
                  - or:
                    - string: unique
                    - number: 2
                    - and:
                      - or:
                        - arch: i386
                        - number: 4
                      - not:
                        - count(mnemonic(xor)): 5
                    - not:
                      - os: linux
            """
        ),
        "rule_2": textwrap.dedent(
            """
            rule:
                meta:
                    name: Test Rule 2
                    scopes:
                        static: function
                        dynamic: process
                features:
                  - and:
                    - string: "sites.ini"
                    - basic block:
                      - and:
                        - api: CreateFile
                        - mnemonic: xor
            """
        ),
        "rule_3": textwrap.dedent(
            """
            rule:
                meta:
                    name: Test Rule 3
                    scopes:
                        static: function
                        dynamic: process
                features:
                  - and:
                    - not:
                      - number: 4
                    - basic block:
                      - and:
                        - api: bind
                        - number: 2
            """
        ),
    }

    """
    The rule_overlaps list holds the expected number of overlaps for each rule
    checked against the RULESET, in order: Rule 0 first, then Rules 1 to 3.
    An overlap includes a rule overlapping with itself.
    The script reports the number of overlaps as its exit code.
    For example:
        - Rule 0 has zero overlaps in RULESET
        - Rule 1 overlaps with 3 other rules in RULESET
    These overlap values indicate the number of rules with which
    each rule in RULESET has overlapping features.
    """
    rule_overlaps = [0, 4, 3, 3]

    rule_dir = tmpdir.mkdir("capa_rule_overlap_test")
    rule_paths = []

    rule_file = tmpdir.join("rule_0.yml")
    rule_file.write(TEST_RULE_0)
    rule_paths.append(rule_file.strpath)

    for rule_name, RULE_CONTENT in TEST_RULESET.items():
        rule_file = rule_dir.join("%s.yml" % rule_name)
        rule_file.write(RULE_CONTENT)
        rule_paths.append(rule_file.strpath)

    # check that the number of overlaps found for each rule matches the expected count.
    script_path = get_script_path("detect_duplicate_features.py")
    for expected_overlaps, rule_path in zip(rule_overlaps, rule_paths):
        args = [rule_dir.strpath, rule_path]
        overlaps_found = run_program(script_path, args)
        assert overlaps_found.returncode == expected_overlaps