Files
capa/capa/features/common.py
Willi Ballenthin ee17d75be9 implement BinExport2 backend (#1950)
* elf: os: detect Android via clang compiler .ident note

* elf: os: detect Android via dependency on liblog.so

* main: split main into a bunch of "main routines"

[wip] since there are a few references to BinExport2
that are in progress elsewhre. Next commit will remove them.

* features: add BinExport2 declarations

* BinExport2: initial skeleton of feature extraction

* main: remove references to wip BinExport2 code

* changelog

* main: rename first position argument "input_file"

closes #1946

* main: linters

* main: move rule-related routines to capa.rules

ref #1821

* main: extract routines to capa.loader module

closes #1821

* add loader module

* loader: learn to load freeze format

* freeze: use new cli arg handling

* Update capa/loader.py

Co-authored-by: Moritz <mr-tz@users.noreply.github.com>

* main: remove duplicate documentation

* main: add doc about where some functions live

* scripts: migrate to new main wrapper helper functions

* scripts: port to main routines

* main: better handle auto-detection of backend

* scripts: migrate bulk-process to main wrappers

* scripts: migrate scripts to main wrappers

* main: rename *_from_args to *_from_cli

* changelog

* cache-ruleset: remove duplication

* main: fix tag handling

* cache-ruleset: fix cli args

* cache-ruleset: fix special rule cli handling

* scripts: fix type bytes

* main: nicely format debug messages

* helpers: ensure log messages aren't very long

* flake8 config

* binexport2: formatting

* loader: learn to load BinExport2 files

* main: debug log the format and backend

* elf: add more arch constants

* binexport: parse global features

* binexport: extract file features

* binexport2: begin to enumerate function/bb/insns

* binexport: pass context to function/bb/insn extractors

* binexport: linters

* binexport: linters

* scripts: add script to inspect binexport2 file

* inspect-binexport: fix xref symbols

* inspect-binexport: factor out the index building

* binexport: move index to binexport extractor module

* binexport: implement ELF/aarch64 GOT/thunk analyzer

* binexport: implement API features

* binexport: record the full vertex for a thunk

* binexport: learn to extract numbers

* binexport: number: skipped mapped numbers

* binexport: fix basic block address indexing

* binexport: rename function

* binexport: extract operand numbers

* binexport: learn to extract calls from characteristics

* binexport: learn to extract mnemonics

* pre-commit: skip protobuf file

* binexport: better search for sample file

* loader: add file extractors for BinExport2

* binexport: remove extra parameter

* new black config

* binexport: index string xrefs

* binexport: learn to extract bytes and strings

* binexport: cache parsed PE/ELF

* binexport: handle Ghidra SYMBOL numbers

* binexport2: handle binexport#78 (Ghidra only uses SYMBOL expresssions)

* main: write error output to stderr, not stdout

* scripts: add example detect-binexport2-capabilities.py

* detect-binexport2-capabilities: more documentation/examples

* elffile: recognize more architectures

* binexport: handle read_memory errors

* binexport: index flow graphs by address

* binexport: cleanup logging

* binexport: learn to extract function names

* binexport: learn to extract all function features

* binexport: learn to extract bb tight loops

* elf: don't require vivisect just for type annotations

* main: remove unused imports

* rules: don't eagerly import ruamel until needed

* loader: avoid eager imports of some backend-related code

* changelog

* fmt

* binexport: better render optional fields

* fix merge conflicts

* fix formatting

* remove Ghidra data reference madness

* handle PermissionError when searching sample file for BinExport2 file

* handle PermissionError when searching sample file for BinExport2 file

* add Android as valid OS

* inspect-binexport: strip strings

* inspect-binexport: render operands

* fix lints

* ruff: update config layout

* inspect-binexport: better align comments/xrefs

* use explicit search paths to get sample for BinExport file

* add initial BinExport tests

* add/update BinExport tests and minor fixes

* inspect-binexport: add perf tracking

* inspect-binexport: cache rendered operands

* lints

* do not extract number features for ret instructions

* Fix BinExport's "tight loop" feature extraction.

`idx.target_edges_by_basic_block_index[basic_block_index]` is of type
`List[Edges]`. The index `basic_block_index` was definitely not an
element.

* inspect-binexport: better render data section

* linters

* main: accept --format=binexport2

* binexport: insn: add support for parsing bare immediate int operands

* binexport2: bb: fix tight loop detection

ref #2050

* binexport: api: generate variations of Win32 APIs

* lints

* binexport: index: don't assume instruction index is 1:1 with address

* be2: index instruction addresses

* be2: temp remove bytes feature processing

* binexport: read memory from an address space extracted from PE/ELF

closes #2061

* be2: resolve thunks to imported functions

* be2: check for be2 string reference before bytes/string extraction overhead

* be2: remove unneeded check

* be2: do not process thunks

* be2: insn: polish thunk handling a bit

* be2: pre-compute thunk targets

* parse negative numbers

* update tests to use Ghidra-generated BinExport file

* remove unused import

* black reformat

* run tests always (for now)

* binexport: tests: fix test case

* binexport: extractor: fix insn lint

* binexport: addressspace: use base address recovered from binexport file

* Add nzxor charecteristic in BinExport extractor.

by referencing vivisect implementation.

* add tests, fix stack cookie detection

* test BinExport feature PRs

* reformat and fix

* complete TODO descriptions

* wip tests

* binexport: add typing where applicable (#2106)

* binexport2: revert import names from BinExport2 proto

binexport2_pb.BinExport2 isnt a package so we can't import it like:

    from ...binexport2_pb.BinExport2 import CallGraph

* fix stack offset numbers and disable offset tests

* xfail OperandOffset

* generate symbol variants

* wip: read negative numbers

* update tight loop tests

* binexport: fix function loop feature detection

* binexport: update binexport function loop tests

* binexport: fix lints and imports

* binexport: add back assert statement to thunk calculation

* binexport: update tests to use Ghidra binexport file

* binexport: add additional debug info to thunk calculation assert

* binexport: update unit tests to focus on Ghidra

* binexport: fix lints

* binexport: remove Ghidra symbol madness and fix x86/amd64 stack offset number tests

* binexport: use masking for Number features

* binexport: ignore call/jmp immediates for intel architecture

* binexport: check if immediate is a mapped address

* binexport: emit offset features for immediates likely structure offsets

* binexport: add twos complement wrapper insn.py

* binexport: add support for x86 offset features

* binexport: code refactor

* binexport: init refactor for multi-arch instruction feature parsing

* binexport: intel: emit indirect call characteristic

* binexport: use helper method for instruction mnemonic

* binexport: arm: emit offset features from stp instruction

* binexport: arm: emit indirect call characteristic

* binexport: arm: improve offset feature extraction

* binexport: add workaroud for Ghidra bug that results in empty operands (no expressions)

* binexport: skip x86 stack string tests

* binexport: update mimikatz.exe_ feature count tests for Ghidra

* core: loader: update binja import

* core: loader: update binja imports

* binexport: arm: ignore number features for add instruction manipulating stack

* binexport: update unit tests

* binexport: arm: ignore number features for sub instruction manipulating stack

* binexport: arm: emit offset features for add instructions

* binexport: remove TODO from tests workflow

* binexport: update CHANGELOG

* binexport: remove outdated TODOs

* binexport: re-enable support for data references in inspect-binexport2.py

* binexport: skip data references to code

* binexport: remove outdated TODOs

* Update scripts/inspect-binexport2.py

* Update CHANGELOG.md

* Update capa/helpers.py

* Update capa/features/extractors/common.py

* Update capa/features/extractors/binexport2/extractor.py

* Update capa/features/extractors/binexport2/arch/arm/insn.py

Co-authored-by: Moritz <mr-tz@users.noreply.github.com>

* initial add

* test binexport scripts

* add tests using small ARM ELF

* add method to get instruction by address

* index instructions by address

* adjust and extend tests

* handle operator with no children bug

* binexport: use instruction address index

ref: https://github.com/mandiant/capa/pull/1950/files#r1728570811

* inspect binexport: handle lsl with no children

add pruning phase to expression tree building
to remove known-bad branches. This might address
some of the data we're seeing due to:
https://github.com/NationalSecurityAgency/ghidra/issues/6821

Also introduces a --instruction optional argument
to dump the details of a specific instruction.

* binexport: consolidate expression tree logic into helpers

* binexport: index instruction indices by address

* binexport: introduce instruction pattern matching

Introduce intruction pattern matching to declaratively
describe the instructions and operands that we want to
extract. While there's a bit more code, its much more
thoroughly tested, and is less brittle than the prior
if/else/if/else/if/else implementation.

* binexport: helpers: fix missing comment words

* binexport: update tests to reflect updated test files

* remove testing of feature branch

---------

Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
Co-authored-by: mr-tz <moritz.raabe@mandiant.com>
Co-authored-by: Lin Chen <larch.lin.chen@gmail.com>
2024-09-12 10:09:05 -06:00

502 lines
17 KiB
Python

# Copyright (C) 2021 Mandiant, Inc. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at: [package root]/LICENSE.txt
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.
import re
import abc
import codecs
import typing
import logging
import collections
from typing import TYPE_CHECKING, Set, Dict, List, Union, Optional
if TYPE_CHECKING:
# circular import, otherwise
import capa.engine
import capa.perf
import capa.features
import capa.features.extractors.elf
from capa.features.address import Address
logger = logging.getLogger(__name__)
MAX_BYTES_FEATURE_SIZE = 0x100
# thunks may be chained so we specify a delta to control the depth to which these chains are explored
THUNK_CHAIN_DEPTH_DELTA = 5
class FeatureAccess:
READ = "read"
WRITE = "write"
VALID_FEATURE_ACCESS = (FeatureAccess.READ, FeatureAccess.WRITE)
def bytes_to_str(b: bytes) -> str:
return str(codecs.encode(b, "hex").decode("utf-8"))
def hex_string(h: str) -> str:
"""render hex string e.g. "0a40b1" as "0A 40 B1" """
return " ".join(h[i : i + 2] for i in range(0, len(h), 2)).upper()
def escape_string(s: str) -> str:
"""escape special characters"""
s = repr(s)
if not s.startswith(('"', "'")):
# u'hello\r\nworld' -> hello\\r\\nworld
s = s[2:-1]
else:
# 'hello\r\nworld' -> hello\\r\\nworld
s = s[1:-1]
s = s.replace("\\'", "'") # repr() may escape "'" in some edge cases, remove
s = s.replace('"', '\\"') # repr() does not escape '"', add
return s
class Result:
"""
represents the results of an evaluation of statements against features.
instances of this class should behave like a bool,
e.g. `assert Result(True, ...) == True`
instances track additional metadata about evaluation results.
they contain references to the statement node (e.g. an And statement),
as well as the children Result instances.
we need this so that we can render the tree of expressions and their results.
"""
def __init__(
self,
success: bool,
statement: Union["capa.engine.Statement", "Feature"],
children: List["Result"],
locations: Optional[Set[Address]] = None,
):
super().__init__()
self.success = success
self.statement = statement
self.children = children
self.locations = locations if locations is not None else set()
def __eq__(self, other):
if isinstance(other, bool):
return self.success == other
return False
def __bool__(self):
return self.success
def __nonzero__(self):
return self.success
class Feature(abc.ABC): # noqa: B024
# this is an abstract class, since we don't want anyone to instantiate it directly,
# but it doesn't have any abstract methods.
def __init__(
self,
value: Union[str, int, float, bytes],
description: Optional[str] = None,
):
"""
Args:
value (any): the value of the feature, such as the number or string.
description (str): a human-readable description that explains the feature value.
"""
super().__init__()
self.name = self.__class__.__name__.lower()
self.value = value
self.description = description
def __hash__(self):
return hash((self.name, self.value))
def __eq__(self, other):
return self.name == other.name and self.value == other.value
def __lt__(self, other):
# implementing sorting by serializing to JSON is a huge hack.
# it's slow, inelegant, and probably doesn't work intuitively;
# however, we only use it for deterministic output, so it's good enough for now.
# circular import
# we should fix if this wasn't already a huge hack.
import capa.features.freeze.features
return (
capa.features.freeze.features.feature_from_capa(self).model_dump_json()
< capa.features.freeze.features.feature_from_capa(other).model_dump_json()
)
def get_name_str(self) -> str:
"""
render the name of this feature, for use by `__str__` and friends.
subclasses should override to customize the rendering.
"""
return self.name
def get_value_str(self) -> str:
"""
render the value of this feature, for use by `__str__` and friends.
subclasses should override to customize the rendering.
"""
return str(self.value)
def __str__(self):
if self.value is not None:
if self.description:
return f"{self.get_name_str()}({self.get_value_str()} = {self.description})"
else:
return f"{self.get_name_str()}({self.get_value_str()})"
else:
return f"{self.get_name_str()}"
def __repr__(self):
return str(self)
def evaluate(self, features: "capa.engine.FeatureSet", short_circuit=True) -> Result:
capa.perf.counters["evaluate.feature"] += 1
capa.perf.counters["evaluate.feature." + self.name] += 1
return Result(self in features, self, [], locations=features.get(self, set()))
class MatchedRule(Feature):
def __init__(self, value: str, description=None):
super().__init__(value, description=description)
self.name = "match"
class Characteristic(Feature):
def __init__(self, value: str, description=None):
super().__init__(value, description=description)
class String(Feature):
def __init__(self, value: str, description=None):
super().__init__(value, description=description)
def get_value_str(self) -> str:
assert isinstance(self.value, str)
return escape_string(self.value)
class Class(Feature):
def __init__(self, value: str, description=None):
super().__init__(value, description=description)
class Namespace(Feature):
def __init__(self, value: str, description=None):
super().__init__(value, description=description)
class Substring(String):
def __init__(self, value: str, description=None):
super().__init__(value, description=description)
self.value = value
def evaluate(self, features: "capa.engine.FeatureSet", short_circuit=True):
capa.perf.counters["evaluate.feature"] += 1
capa.perf.counters["evaluate.feature.substring"] += 1
# mapping from string value to list of locations.
# will unique the locations later on.
matches: typing.DefaultDict[str, Set[Address]] = collections.defaultdict(set)
assert isinstance(self.value, str)
for feature, locations in features.items():
if not isinstance(feature, (String,)):
continue
if not isinstance(feature.value, str):
# this is a programming error: String should only contain str
raise ValueError("unexpected feature value type")
if self.value in feature.value:
matches[feature.value].update(locations)
if short_circuit:
# we found one matching string, that's sufficient to match.
# don't collect other matching strings in this mode.
break
if matches:
# collect all locations
locations = set()
for locs in matches.values():
locations.update(locs)
# unlike other features, we cannot return put a reference to `self` directly in a `Result`.
# this is because `self` may match on many strings, so we can't stuff the matched value into it.
# instead, return a new instance that has a reference to both the substring and the matched values.
return Result(True, _MatchedSubstring(self, dict(matches)), [], locations=locations)
else:
return Result(False, _MatchedSubstring(self, {}), [])
def get_value_str(self) -> str:
assert isinstance(self.value, str)
return escape_string(self.value)
def __str__(self):
assert isinstance(self.value, str)
return f"substring({escape_string(self.value)})"
class _MatchedSubstring(Substring):
"""
this represents specific match instances of a substring feature.
treat it the same as a `Substring` except it has the `matches` field that contains the complete strings that matched.
note: this type should only ever be constructed by `Substring.evaluate()`. it is not part of the public API.
"""
def __init__(self, substring: Substring, matches: Dict[str, Set[Address]]):
"""
args:
substring: the substring feature that matches.
match: mapping from matching string to its locations.
"""
super().__init__(str(substring.value), description=substring.description)
# we want this to collide with the name of `Substring` above,
# so that it works nicely with the renderers.
self.name = "substring"
# this may be None if the substring doesn't match
self.matches = matches
def __str__(self):
matches = ", ".join(f'"{s}"' for s in (self.matches or {}).keys())
assert isinstance(self.value, str)
return f'substring("{self.value}", matches = {matches})'
class Regex(String):
def __init__(self, value: str, description=None):
super().__init__(value, description=description)
self.value = value
pat = self.value[len("/") : -len("/")]
flags = re.DOTALL
if value.endswith("/i"):
pat = self.value[len("/") : -len("/i")]
flags |= re.IGNORECASE
try:
self.re = re.compile(pat, flags)
except re.error as exc:
if value.endswith("/i"):
value = value[: -len("i")]
raise ValueError(
f"invalid regular expression: {value} it should use Python syntax, try it at https://pythex.org"
) from exc
def evaluate(self, features: "capa.engine.FeatureSet", short_circuit=True):
capa.perf.counters["evaluate.feature"] += 1
capa.perf.counters["evaluate.feature.regex"] += 1
# mapping from string value to list of locations.
# will unique the locations later on.
matches: typing.DefaultDict[str, Set[Address]] = collections.defaultdict(set)
for feature, locations in features.items():
if not isinstance(feature, (String,)):
continue
if not isinstance(feature.value, str):
# this is a programming error: String should only contain str
raise ValueError("unexpected feature value type")
# `re.search` finds a match anywhere in the given string
# which implies leading and/or trailing whitespace.
# using this mode cleans is more convenient for rule authors,
# so that they don't have to prefix/suffix their terms like: /.*foo.*/.
if self.re.search(feature.value):
matches[feature.value].update(locations)
if short_circuit:
# we found one matching string, that's sufficient to match.
# don't collect other matching strings in this mode.
break
if matches:
# collect all locations
locations = set()
for locs in matches.values():
locations.update(locs)
# unlike other features, we cannot return put a reference to `self` directly in a `Result`.
# this is because `self` may match on many strings, so we can't stuff the matched value into it.
# instead, return a new instance that has a reference to both the regex and the matched values.
# see #262.
return Result(True, _MatchedRegex(self, dict(matches)), [], locations=locations)
else:
return Result(False, _MatchedRegex(self, {}), [])
def __str__(self):
assert isinstance(self.value, str)
return f"regex(string =~ {self.value})"
class _MatchedRegex(Regex):
"""
this represents specific match instances of a regular expression feature.
treat it the same as a `Regex` except it has the `matches` field that contains the complete strings that matched.
note: this type should only ever be constructed by `Regex.evaluate()`. it is not part of the public API.
"""
def __init__(self, regex: Regex, matches: Dict[str, Set[Address]]):
"""
args:
regex: the regex feature that matches.
matches: mapping from matching string to its locations.
"""
super().__init__(str(regex.value), description=regex.description)
# we want this to collide with the name of `Regex` above,
# so that it works nicely with the renderers.
self.name = "regex"
# this may be None if the regex doesn't match
self.matches = matches
def __str__(self):
matches = ", ".join(f'"{s}"' for s in (self.matches or {}).keys())
assert isinstance(self.value, str)
return f"regex(string =~ {self.value}, matches = {matches})"
class StringFactory:
def __new__(cls, value: str, description=None):
if value.startswith("/") and (value.endswith("/") or value.endswith("/i")):
return Regex(value, description=description)
return String(value, description=description)
class Bytes(Feature):
def __init__(self, value: bytes, description=None):
super().__init__(value, description=description)
self.value = value
def evaluate(self, features: "capa.engine.FeatureSet", short_circuit=True):
assert isinstance(self.value, bytes)
capa.perf.counters["evaluate.feature"] += 1
capa.perf.counters["evaluate.feature.bytes"] += 1
capa.perf.counters["evaluate.feature.bytes." + str(len(self.value))] += 1
for feature, locations in features.items():
if not isinstance(feature, (Bytes,)):
continue
assert isinstance(feature.value, bytes)
if feature.value.startswith(self.value):
return Result(True, self, [], locations=locations)
return Result(False, self, [])
def get_value_str(self):
assert isinstance(self.value, bytes)
return hex_string(bytes_to_str(self.value))
# other candidates here: https://docs.microsoft.com/en-us/windows/win32/debug/pe-format#machine-types
ARCH_I386 = "i386"
ARCH_AMD64 = "amd64"
ARCH_AARCH64 = "aarch64"
# dotnet
ARCH_ANY = "any"
VALID_ARCH = (ARCH_I386, ARCH_AMD64, ARCH_AARCH64, ARCH_ANY)
class Arch(Feature):
def __init__(self, value: str, description=None):
super().__init__(value, description=description)
self.name = "arch"
OS_WINDOWS = "windows"
OS_LINUX = "linux"
OS_MACOS = "macos"
OS_ANDROID = "android"
# dotnet
OS_ANY = "any"
VALID_OS = {os.value for os in capa.features.extractors.elf.OS}
VALID_OS.update({OS_WINDOWS, OS_LINUX, OS_MACOS, OS_ANY, OS_ANDROID})
# internal only, not to be used in rules
OS_AUTO = "auto"
class OS(Feature):
def __init__(self, value: str, description=None):
super().__init__(value, description=description)
self.name = "os"
def evaluate(self, features: "capa.engine.FeatureSet", short_circuit=True):
capa.perf.counters["evaluate.feature"] += 1
capa.perf.counters["evaluate.feature." + self.name] += 1
for feature, locations in features.items():
if not isinstance(feature, (OS,)):
continue
assert isinstance(feature.value, str)
if OS_ANY in (self.value, feature.value) or self.value == feature.value:
return Result(True, self, [], locations=locations)
return Result(False, self, [])
FORMAT_PE = "pe"
FORMAT_ELF = "elf"
FORMAT_DOTNET = "dotnet"
VALID_FORMAT = (FORMAT_PE, FORMAT_ELF, FORMAT_DOTNET)
# internal only, not to be used in rules
FORMAT_AUTO = "auto"
FORMAT_SC32 = "sc32"
FORMAT_SC64 = "sc64"
FORMAT_CAPE = "cape"
FORMAT_DRAKVUF = "drakvuf"
FORMAT_VMRAY = "vmray"
FORMAT_BINEXPORT2 = "binexport2"
FORMAT_FREEZE = "freeze"
FORMAT_RESULT = "result"
STATIC_FORMATS = {
FORMAT_SC32,
FORMAT_SC64,
FORMAT_PE,
FORMAT_ELF,
FORMAT_DOTNET,
FORMAT_FREEZE,
FORMAT_RESULT,
FORMAT_BINEXPORT2,
}
DYNAMIC_FORMATS = {
FORMAT_CAPE,
FORMAT_DRAKVUF,
FORMAT_VMRAY,
FORMAT_FREEZE,
FORMAT_RESULT,
}
FORMAT_UNKNOWN = "unknown"
class Format(Feature):
def __init__(self, value: str, description=None):
super().__init__(value, description=description)
self.name = "format"
def is_global_feature(feature):
"""
is this a feature that is extracted at every scope?
today, these are OS, arch, and format features.
"""
return isinstance(feature, (OS, Arch, Format))