capa/scripts/inspect-binexport2.py
Willi Ballenthin ee17d75be9 implement BinExport2 backend (#1950)
* elf: os: detect Android via clang compiler .ident note

* elf: os: detect Android via dependency on liblog.so

* main: split main into a bunch of "main routines"

[wip] since there are a few references to BinExport2
that are in progress elsewhere. Next commit will remove them.

* features: add BinExport2 declarations

* BinExport2: initial skeleton of feature extraction

* main: remove references to wip BinExport2 code

* changelog

* main: rename first position argument "input_file"

closes #1946

* main: linters

* main: move rule-related routines to capa.rules

ref #1821

* main: extract routines to capa.loader module

closes #1821

* add loader module

* loader: learn to load freeze format

* freeze: use new cli arg handling

* Update capa/loader.py

Co-authored-by: Moritz <mr-tz@users.noreply.github.com>

* main: remove duplicate documentation

* main: add doc about where some functions live

* scripts: migrate to new main wrapper helper functions

* scripts: port to main routines

* main: better handle auto-detection of backend

* scripts: migrate bulk-process to main wrappers

* scripts: migrate scripts to main wrappers

* main: rename *_from_args to *_from_cli

* changelog

* cache-ruleset: remove duplication

* main: fix tag handling

* cache-ruleset: fix cli args

* cache-ruleset: fix special rule cli handling

* scripts: fix type bytes

* main: nicely format debug messages

* helpers: ensure log messages aren't very long

* flake8 config

* binexport2: formatting

* loader: learn to load BinExport2 files

* main: debug log the format and backend

* elf: add more arch constants

* binexport: parse global features

* binexport: extract file features

* binexport2: begin to enumerate function/bb/insns

* binexport: pass context to function/bb/insn extractors

* binexport: linters

* binexport: linters

* scripts: add script to inspect binexport2 file

* inspect-binexport: fix xref symbols

* inspect-binexport: factor out the index building

* binexport: move index to binexport extractor module

* binexport: implement ELF/aarch64 GOT/thunk analyzer

* binexport: implement API features

* binexport: record the full vertex for a thunk

* binexport: learn to extract numbers

* binexport: number: skipped mapped numbers

* binexport: fix basic block address indexing

* binexport: rename function

* binexport: extract operand numbers

* binexport: learn to extract calls from characteristics

* binexport: learn to extract mnemonics

* pre-commit: skip protobuf file

* binexport: better search for sample file

* loader: add file extractors for BinExport2

* binexport: remove extra parameter

* new black config

* binexport: index string xrefs

* binexport: learn to extract bytes and strings

* binexport: cache parsed PE/ELF

* binexport: handle Ghidra SYMBOL numbers

* binexport2: handle binexport#78 (Ghidra only uses SYMBOL expressions)

* main: write error output to stderr, not stdout

* scripts: add example detect-binexport2-capabilities.py

* detect-binexport2-capabilities: more documentation/examples

* elffile: recognize more architectures

* binexport: handle read_memory errors

* binexport: index flow graphs by address

* binexport: cleanup logging

* binexport: learn to extract function names

* binexport: learn to extract all function features

* binexport: learn to extract bb tight loops

* elf: don't require vivisect just for type annotations

* main: remove unused imports

* rules: don't eagerly import ruamel until needed

* loader: avoid eager imports of some backend-related code

* changelog

* fmt

* binexport: better render optional fields

* fix merge conflicts

* fix formatting

* remove Ghidra data reference madness

* handle PermissionError when searching sample file for BinExport2 file

* handle PermissionError when searching sample file for BinExport2 file

* add Android as valid OS

* inspect-binexport: strip strings

* inspect-binexport: render operands

* fix lints

* ruff: update config layout

* inspect-binexport: better align comments/xrefs

* use explicit search paths to get sample for BinExport file

* add initial BinExport tests

* add/update BinExport tests and minor fixes

* inspect-binexport: add perf tracking

* inspect-binexport: cache rendered operands

* lints

* do not extract number features for ret instructions

* Fix BinExport's "tight loop" feature extraction.

`idx.target_edges_by_basic_block_index[basic_block_index]` is of type
`List[Edges]`. The index `basic_block_index` was definitely not an
element.
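The mistake described above can be illustrated with a minimal sketch. The `Edge` dataclass and the `is_tight_loop` helper are hypothetical stand-ins, not capa's actual types; only the shape of the index (basic block index mapped to a list of edges) is taken from the message.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Edge:
    # hypothetical stand-in for BinExport2.FlowGraph.Edge
    source_basic_block_index: int
    target_basic_block_index: int


def is_tight_loop(target_edges_by_bb: Dict[int, List[Edge]], bb_index: int) -> bool:
    # A block loops tightly when one of its incoming edges starts at the block itself.
    # The buggy form, `bb_index in target_edges_by_bb[bb_index]`, compares an int
    # against Edge objects, so it can never be true.
    return any(edge.source_basic_block_index == bb_index for edge in target_edges_by_bb.get(bb_index, []))


# block 1 is reached from block 0 and from itself
edges = {
    0: [],
    1: [Edge(0, 1), Edge(1, 1)],
}
```

With this data, `is_tight_loop(edges, 1)` holds while the membership test `1 in edges[1]` does not.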

* inspect-binexport: better render data section

* linters

* main: accept --format=binexport2

* binexport: insn: add support for parsing bare immediate int operands

* binexport2: bb: fix tight loop detection

ref #2050

* binexport: api: generate variations of Win32 APIs

* lints

* binexport: index: don't assume instruction index is 1:1 with address

* be2: index instruction addresses

* be2: temp remove bytes feature processing

* binexport: read memory from an address space extracted from PE/ELF

closes #2061

* be2: resolve thunks to imported functions

* be2: check for be2 string reference before bytes/string extraction overhead

* be2: remove unneeded check

* be2: do not process thunks

* be2: insn: polish thunk handling a bit

* be2: pre-compute thunk targets

* parse negative numbers

* update tests to use Ghidra-generated BinExport file

* remove unused import

* black reformat

* run tests always (for now)

* binexport: tests: fix test case

* binexport: extractor: fix insn lint

* binexport: addressspace: use base address recovered from binexport file

* Add nzxor characteristic in BinExport extractor.

by referencing vivisect implementation.

* add tests, fix stack cookie detection

* test BinExport feature PRs

* reformat and fix

* complete TODO descriptions

* wip tests

* binexport: add typing where applicable (#2106)

* binexport2: revert import names from BinExport2 proto

binexport2_pb.BinExport2 isn't a package so we can't import it like:

    from ...binexport2_pb.BinExport2 import CallGraph
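The reason can be reproduced without protobuf. The sketch below registers a plain module named `demo_pb2` (a hypothetical name) that exposes a class with a nested class, mimicking how generated nested messages appear to hang off their parent message class. Importing "through" the class fails, while attribute access works.

```python
import sys
import types
import importlib


class Outer:
    # hypothetical stand-in for a generated top-level message class
    class Inner:
        # hypothetical stand-in for a nested message
        pass


# register a plain module (no __path__, so not a package) that exposes Outer
demo = types.ModuleType("demo_pb2")
demo.Outer = Outer
sys.modules["demo_pb2"] = demo

try:
    # equivalent to `from demo_pb2.Outer import Inner`: Outer is a class, not a module
    importlib.import_module("demo_pb2.Outer")
    import_failed = False
except ModuleNotFoundError:
    import_failed = True

# importing the class and then reading the attribute works
from demo_pb2 import Outer as ImportedOuter

Inner = ImportedOuter.Inner
```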

* fix stack offset numbers and disable offset tests

* xfail OperandOffset

* generate symbol variants

* wip: read negative numbers

* update tight loop tests

* binexport: fix function loop feature detection

* binexport: update binexport function loop tests

* binexport: fix lints and imports

* binexport: add back assert statement to thunk calculation

* binexport: update tests to use Ghidra binexport file

* binexport: add additional debug info to thunk calculation assert

* binexport: update unit tests to focus on Ghidra

* binexport: fix lints

* binexport: remove Ghidra symbol madness and fix x86/amd64 stack offset number tests

* binexport: use masking for Number features

* binexport: ignore call/jmp immediates for intel architecture

* binexport: check if immediate is a mapped address

* binexport: emit offset features for immediates likely structure offsets

* binexport: add twos complement wrapper insn.py

* binexport: add support for x86 offset features

* binexport: code refactor

* binexport: init refactor for multi-arch instruction feature parsing

* binexport: intel: emit indirect call characteristic

* binexport: use helper method for instruction mnemonic

* binexport: arm: emit offset features from stp instruction

* binexport: arm: emit indirect call characteristic

* binexport: arm: improve offset feature extraction

* binexport: add workaround for Ghidra bug that results in empty operands (no expressions)

* binexport: skip x86 stack string tests

* binexport: update mimikatz.exe_ feature count tests for Ghidra

* core: loader: update binja import

* core: loader: update binja imports

* binexport: arm: ignore number features for add instruction manipulating stack

* binexport: update unit tests

* binexport: arm: ignore number features for sub instruction manipulating stack

* binexport: arm: emit offset features for add instructions

* binexport: remove TODO from tests workflow

* binexport: update CHANGELOG

* binexport: remove outdated TODOs

* binexport: re-enable support for data references in inspect-binexport2.py

* binexport: skip data references to code

* binexport: remove outdated TODOs

* Update scripts/inspect-binexport2.py

* Update CHANGELOG.md

* Update capa/helpers.py

* Update capa/features/extractors/common.py

* Update capa/features/extractors/binexport2/extractor.py

* Update capa/features/extractors/binexport2/arch/arm/insn.py

Co-authored-by: Moritz <mr-tz@users.noreply.github.com>

* initial add

* test binexport scripts

* add tests using small ARM ELF

* add method to get instruction by address

* index instructions by address

* adjust and extend tests

* handle operator with no children bug

* binexport: use instruction address index

ref: https://github.com/mandiant/capa/pull/1950/files#r1728570811

* inspect binexport: handle lsl with no children

add pruning phase to expression tree building
to remove known-bad branches. This might address
some of the data we're seeing due to:
https://github.com/NationalSecurityAgency/ghidra/issues/6821
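A pruning phase of this kind might look like the sketch below. It is not capa's implementation: the rule for what counts as a known-bad branch is assumed here to be "an operator node with no children", and `prune_childless_operators` and the `kinds` labels are hypothetical. The tree uses the same shape as the script, a list where entry `i` holds the child indexes of node `i`.

```python
from typing import List


def prune_childless_operators(tree: List[List[int]], kinds: List[str]) -> List[List[int]]:
    # tree[i] lists the child indexes of node i; kinds[i] is "operator" or another label.
    # Returns a copy in which no node points at an operator that has no children.
    pruned = [list(children) for children in tree]
    changed = True
    while changed:
        # repeat until stable: dropping a child can leave its parent operator childless too
        changed = False
        bad = {i for i, children in enumerate(pruned) if kinds[i] == "operator" and not children}
        for children in pruned:
            kept = [c for c in children if c not in bad]
            if len(kept) != len(children):
                children[:] = kept
                changed = True
    return pruned


# [x1 + <lsl with no children>]: node 3 is the broken operator
tree = [[1], [2, 3], [], []]
kinds = ["dereference", "operator", "register", "operator"]
pruned = prune_childless_operators(tree, kinds)
```

Node indexes are left in place so that they still line up with `operand.expression_index`; only the references to the bad node are dropped.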

Also introduces a --instruction optional argument
to dump the details of a specific instruction.
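The script declares this argument with `type=lambda v: int(v, 0)`, so the address can be given in hex or decimal. A small standalone demonstration, using a made-up address:

```python
import argparse

parser = argparse.ArgumentParser(description="demo of the --instruction argument")
# base 0 lets int() infer the radix from the prefix (0x..., 0o..., 0b..., or plain decimal)
parser.add_argument("--instruction", type=lambda v: int(v, 0))

hex_args = parser.parse_args(["--instruction", "0x401000"])
dec_args = parser.parse_args(["--instruction", "4198400"])
no_args = parser.parse_args([])
```

When the flag is omitted the value is `None`, which is why checking `is not None` is safer than a truthiness test if address 0 should be accepted.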

* binexport: consolidate expression tree logic into helpers

* binexport: index instruction indices by address

* binexport: introduce instruction pattern matching

Introduce instruction pattern matching to declaratively
describe the instructions and operands that we want to
extract. While there's a bit more code, it's much more
thoroughly tested, and is less brittle than the prior
if/else/if/else/if/else implementation.
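To show the general idea of declarative matching, here is a toy sketch. The table format, `classify`, and `match` are hypothetical and much simpler than whatever capa actually uses; the point is only that adding a case means adding a row instead of another branch.

```python
from typing import List, Optional, Tuple

# Each pattern is (mnemonic, operand kinds, index of the operand to capture).
# This table format is hypothetical, not capa's pattern syntax.
PATTERNS: List[Tuple[str, Tuple[str, ...], int]] = [
    ("add", ("reg", "reg", "int"), 2),
    ("sub", ("reg", "reg", "int"), 2),
    ("mov", ("reg", "int"), 1),
]


def classify(operand: str) -> str:
    # crude operand classifier for the sketch: integer literals vs. everything else
    try:
        int(operand, 0)
        return "int"
    except ValueError:
        return "reg"


def match(mnemonic: str, operands: List[str]) -> Optional[int]:
    # return the captured integer of the first pattern that fits, or None
    kinds = tuple(classify(op) for op in operands)
    for pattern_mnemonic, pattern_kinds, capture in PATTERNS:
        if mnemonic == pattern_mnemonic and kinds == pattern_kinds:
            return int(operands[capture], 0)
    return None
```

For example, `match("add", ["x0", "x1", "0x10"])` captures the immediate, while an instruction with no matching row yields `None`.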

* binexport: helpers: fix missing comment words

* binexport: update tests to reflect updated test files

* remove testing of feature branch

---------

Co-authored-by: Moritz <mr-tz@users.noreply.github.com>
Co-authored-by: Mike Hunhoff <mike.hunhoff@gmail.com>
Co-authored-by: mr-tz <moritz.raabe@mandiant.com>
Co-authored-by: Lin Chen <larch.lin.chen@gmail.com>
2024-09-12 10:09:05 -06:00


#!/usr/bin/env python
"""
Copyright (C) 2023 Mandiant, Inc. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at: [package root]/LICENSE.txt
Unless required by applicable law or agreed to in writing, software distributed under the License
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
"""
import io
import sys
import time
import logging
import argparse
import contextlib
from typing import Dict, List, Optional

import capa.main
import capa.features.extractors.binexport2
import capa.features.extractors.binexport2.helpers
from capa.features.extractors.binexport2.binexport2_pb2 import BinExport2

logger = logging.getLogger("inspect-binexport2")


@contextlib.contextmanager
def timing(msg: str):
    t0 = time.time()
    yield
    t1 = time.time()
    logger.debug("perf: %s: %0.2fs", msg, t1 - t0)


class Renderer:
    def __init__(self, o: io.StringIO):
        self.o = o
        self.indent = 0

    @contextlib.contextmanager
    def indenting(self):
        self.indent += 1
        try:
            yield
        finally:
            self.indent -= 1

    def write(self, s):
        self.o.write(s)

    def writeln(self, s):
        self.o.write(" " * self.indent)
        self.o.write(s)
        self.o.write("\n")

    @contextlib.contextmanager
    def section(self, name):
        self.writeln(name)
        with self.indenting():
            yield
        self.writeln("/" + name)
        self.writeln("")

    def getvalue(self):
        return self.o.getvalue()


# internal to `render_operand`
def _render_expression_tree(
    be2: BinExport2,
    operand: BinExport2.Operand,
    expression_tree: List[List[int]],
    tree_index: int,
    o: io.StringIO,
):
    expression_index = operand.expression_index[tree_index]
    expression = be2.expression[expression_index]
    children_tree_indexes: List[int] = expression_tree[tree_index]

    if expression.type == BinExport2.Expression.REGISTER:
        o.write(expression.symbol)
        assert len(children_tree_indexes) == 0
        return

    elif expression.type == BinExport2.Expression.SYMBOL:
        o.write(expression.symbol)
        assert len(children_tree_indexes) <= 1

        if len(children_tree_indexes) == 0:
            return
        elif len(children_tree_indexes) == 1:
            # like: v
            # from: mov v0.D[0x1], x9
            # |
            # 0
            # .
            # |
            # D
            child_index = children_tree_indexes[0]
            _render_expression_tree(be2, operand, expression_tree, child_index, o)
            return
        else:
            raise NotImplementedError(len(children_tree_indexes))

    elif expression.type == BinExport2.Expression.IMMEDIATE_INT:
        o.write(f"0x{expression.immediate:X}")
        assert len(children_tree_indexes) == 0
        return

    elif expression.type == BinExport2.Expression.SIZE_PREFIX:
        # like: b4
        #
        # We might want to use this occasionally, such as to disambiguate the
        # size of MOVs into/out of memory. But I'm not sure when/where we need that yet.
        #
        # IDA spams this size prefix hint *everywhere*, so we can't rely on the exporter
        # to provide it only when necessary.
        assert len(children_tree_indexes) == 1
        child_index = children_tree_indexes[0]
        _render_expression_tree(be2, operand, expression_tree, child_index, o)
        return

    elif expression.type == BinExport2.Expression.OPERATOR:
        if len(children_tree_indexes) == 1:
            # prefix operator, like "ds:"
            if expression.symbol != "!":
                o.write(expression.symbol)

            child_index = children_tree_indexes[0]
            _render_expression_tree(be2, operand, expression_tree, child_index, o)

            # postfix operator, like "!" in aarch operand "[x1, 8]!"
            if expression.symbol == "!":
                o.write(expression.symbol)
            return

        elif len(children_tree_indexes) == 2:
            # infix operator: like "+" in "ebp+10"
            child_a = children_tree_indexes[0]
            child_b = children_tree_indexes[1]
            _render_expression_tree(be2, operand, expression_tree, child_a, o)
            o.write(expression.symbol)
            _render_expression_tree(be2, operand, expression_tree, child_b, o)
            return

        elif len(children_tree_indexes) == 3:
            # infix operator: like "+" in "ebp+ecx+10"
            child_a = children_tree_indexes[0]
            child_b = children_tree_indexes[1]
            child_c = children_tree_indexes[2]
            _render_expression_tree(be2, operand, expression_tree, child_a, o)
            o.write(expression.symbol)
            _render_expression_tree(be2, operand, expression_tree, child_b, o)
            o.write(expression.symbol)
            _render_expression_tree(be2, operand, expression_tree, child_c, o)
            return

        else:
            raise NotImplementedError(len(children_tree_indexes))

    elif expression.type == BinExport2.Expression.DEREFERENCE:
        o.write("[")
        assert len(children_tree_indexes) == 1
        child_index = children_tree_indexes[0]
        _render_expression_tree(be2, operand, expression_tree, child_index, o)
        o.write("]")
        return

    elif expression.type == BinExport2.Expression.IMMEDIATE_FLOAT:
        raise NotImplementedError(expression.type)

    else:
        raise NotImplementedError(expression.type)


_OPERAND_CACHE: Dict[int, str] = {}


def render_operand(be2: BinExport2, operand: BinExport2.Operand, index: Optional[int] = None) -> str:
    # For the mimikatz example file, there are 138k distinct operands.
    # Of those, only 11k are unique, which is less than 10% of the total.
    # The most common operands are seen 37k, 24k, 17k, 15k, 11k, ... times.
    # In other words, the most common five operands account for 100k instances,
    # which is around 75% of operand instances.
    # Therefore, we expect caching to be fruitful, trading memory for CPU time.
    #
    # No caching: 6.045 s ± 0.164 s [User: 5.916 s, System: 0.129 s]
    # With caching: 4.259 s ± 0.161 s [User: 4.141 s, System: 0.117 s]
    #
    # So we can save 30% of CPU time by caching operand rendering.
    #
    # Other measurements:
    #
    # perf: loading BinExport2: 0.06s
    # perf: indexing BinExport2: 0.34s
    # perf: rendering BinExport2: 1.96s
    # perf: writing BinExport2: 1.13s
    # ________________________________________________________
    # Executed in 4.40 secs fish external
    # usr time 4.22 secs 0.00 micros 4.22 secs
    # sys time 0.18 secs 842.00 micros 0.18 secs

    # compare against None so that operand index 0 is cached too
    if index is not None and index in _OPERAND_CACHE:
        return _OPERAND_CACHE[index]

    o = io.StringIO()
    tree = capa.features.extractors.binexport2.helpers._build_expression_tree(be2, operand)
    _render_expression_tree(be2, operand, tree, 0, o)
    s = o.getvalue()

    if index is not None:
        _OPERAND_CACHE[index] = s

    return s


def inspect_operand(be2: BinExport2, operand: BinExport2.Operand):
    expression_tree = capa.features.extractors.binexport2.helpers._build_expression_tree(be2, operand)

    def rec(tree_index, indent=0):
        expression_index = operand.expression_index[tree_index]
        expression = be2.expression[expression_index]
        children_tree_indexes: List[int] = expression_tree[tree_index]

        NEWLINE = "\n"
        print(f" {' ' * indent}expression: {str(expression).replace(NEWLINE, ', ')}")
        for child_index in children_tree_indexes:
            rec(child_index, indent + 1)

    rec(0)


def inspect_instruction(be2: BinExport2, instruction: BinExport2.Instruction, address: int):
    mnemonic = be2.mnemonic[instruction.mnemonic_index]
    print("instruction:")
    print(f" address: {hex(address)}")
    print(f" mnemonic: {mnemonic.name}")

    print(" operands:")
    for i, operand_index in enumerate(instruction.operand_index):
        print(f" - operand {i}: [{operand_index}]")
        operand = be2.operand[operand_index]
        # Ghidra bug where empty operands (no expressions) may
        # exist so we skip those for now (see https://github.com/NationalSecurityAgency/ghidra/issues/6817)
        if len(operand.expression_index) > 0:
            inspect_operand(be2, operand)


def main(argv=None):
    if argv is None:
        argv = sys.argv[1:]

    parser = argparse.ArgumentParser(description="Inspect BinExport2 files")
    capa.main.install_common_args(parser, wanted={"input_file"})
    parser.add_argument("--instruction", type=lambda v: int(v, 0))
    args = parser.parse_args(args=argv)

    try:
        capa.main.handle_common_args(args)
    except capa.main.ShouldExitError as e:
        return e.status_code

    o = Renderer(io.StringIO())
    with timing("loading BinExport2"):
        be2: BinExport2 = capa.features.extractors.binexport2.get_binexport2(args.input_file)

    with timing("indexing BinExport2"):
        idx = capa.features.extractors.binexport2.BinExport2Index(be2)

    t0 = time.time()

    with o.section("meta"):
        o.writeln(f"name: {be2.meta_information.executable_name}")
        o.writeln(f"sha256: {be2.meta_information.executable_id}")
        o.writeln(f"arch: {be2.meta_information.architecture_name}")
        o.writeln(f"ts: {be2.meta_information.timestamp}")

    with o.section("modules"):
        for module in be2.module:
            o.writeln(f"- {module.name}")
        if not be2.module:
            o.writeln("(none)")

    with o.section("sections"):
        for section in be2.section:
            perms = ""
            perms += "r" if section.flag_r else "-"
            perms += "w" if section.flag_w else "-"
            perms += "x" if section.flag_x else "-"
            o.writeln(f"- {hex(section.address)} {perms} {hex(section.size)}")

    with o.section("libraries"):
        for library in be2.library:
            o.writeln(
                f"- {library.name:<12s} {'(static)' if library.is_static else ''}{(' at ' + hex(library.load_address)) if library.HasField('load_address') else ''}"
            )
        if not be2.library:
            o.writeln("(none)")

    with o.section("functions"):
        for vertex_index, vertex in enumerate(be2.call_graph.vertex):
            if not vertex.HasField("address"):
                continue

            with o.section(f"function {idx.get_function_name_by_vertex(vertex_index)} @ {hex(vertex.address)}"):
                o.writeln(f"type: {vertex.Type.Name(vertex.type)}")

                if vertex.HasField("mangled_name"):
                    o.writeln(f"name: {vertex.mangled_name}")

                if vertex.HasField("demangled_name"):
                    o.writeln(f"demangled: {vertex.demangled_name}")

                if vertex.HasField("library_index"):
                    # TODO(williballenthin): this seems to be incorrect for Ghidra exporter
                    # https://github.com/mandiant/capa/issues/1755
                    library = be2.library[vertex.library_index]
                    o.writeln(f"library: [{vertex.library_index}] {library.name}")

                if vertex.HasField("module_index"):
                    module = be2.module[vertex.module_index]
                    o.writeln(f"module: [{vertex.module_index}] {module.name}")

                if idx.callees_by_vertex_index[vertex_index] or idx.callers_by_vertex_index[vertex_index]:
                    o.writeln("xrefs:")

                    for caller_index in idx.callers_by_vertex_index[vertex_index]:
                        o.writeln(f"{idx.get_function_name_by_vertex(caller_index)}")

                    for callee_index in idx.callees_by_vertex_index[vertex_index]:
                        o.writeln(f"{idx.get_function_name_by_vertex(callee_index)}")

                if vertex.address not in idx.flow_graph_index_by_address:
                    o.writeln("(no flow graph)")
                else:
                    flow_graph_index = idx.flow_graph_index_by_address[vertex.address]
                    flow_graph = be2.flow_graph[flow_graph_index]

                    o.writeln("")
                    for basic_block_index in flow_graph.basic_block_index:
                        basic_block = be2.basic_block[basic_block_index]
                        basic_block_address = idx.get_basic_block_address(basic_block_index)

                        with o.section(f"basic block {hex(basic_block_address)}"):
                            for edge in idx.target_edges_by_basic_block_index[basic_block_index]:
                                if edge.type == BinExport2.FlowGraph.Edge.Type.CONDITION_FALSE:
                                    continue

                                source_basic_block_index = edge.source_basic_block_index
                                source_basic_block_address = idx.get_basic_block_address(source_basic_block_index)

                                o.writeln(
                                    f"{BinExport2.FlowGraph.Edge.Type.Name(edge.type)} basic block {hex(source_basic_block_address)}"
                                )

                            for instruction_index, instruction, instruction_address in idx.basic_block_instructions(
                                basic_block
                            ):
                                mnemonic = be2.mnemonic[instruction.mnemonic_index]

                                operands = []
                                for operand_index in instruction.operand_index:
                                    operand = be2.operand[operand_index]
                                    # Ghidra bug where empty operands (no expressions) may
                                    # exist so we skip those for now (see https://github.com/NationalSecurityAgency/ghidra/issues/6817)
                                    if len(operand.expression_index) > 0:
                                        operands.append(render_operand(be2, operand, index=operand_index))

                                call_targets = ""
                                if instruction.call_target:
                                    call_targets = " "
                                    for call_target_address in instruction.call_target:
                                        call_target_name = idx.get_function_name_by_address(call_target_address)
                                        call_targets += f"→ function {call_target_name} @ {hex(call_target_address)} "

                                data_references = ""
                                if instruction_index in idx.data_reference_index_by_source_instruction_index:
                                    data_references = " "
                                    for data_reference_index in idx.data_reference_index_by_source_instruction_index[
                                        instruction_index
                                    ]:
                                        data_reference = be2.data_reference[data_reference_index]
                                        data_reference_address = data_reference.address
                                        data_references += f"⇥ data {hex(data_reference_address)} "

                                string_references = ""
                                if instruction_index in idx.string_reference_index_by_source_instruction_index:
                                    string_references = " "
                                    for (
                                        string_reference_index
                                    ) in idx.string_reference_index_by_source_instruction_index[instruction_index]:
                                        string_reference = be2.string_reference[string_reference_index]
                                        string_index = string_reference.string_table_index
                                        string = be2.string_table[string_index]
                                        string_references += f'⇥ string "{string.rstrip()}" '

                                comments = ""
                                if instruction.comment_index:
                                    comments = " "
                                    for comment_index in instruction.comment_index:
                                        comment = be2.comment[comment_index]
                                        comment_string = be2.string_table[comment.string_table_index]
                                        comments += f"; {BinExport2.Comment.Type.Name(comment.type)} {comment_string} "

                                o.writeln(
                                    f"{hex(instruction_address)} {mnemonic.name:<12s}{', '.join(operands):<14s}{call_targets}{data_references}{string_references}{comments}"
                                )

                            does_fallthrough = False
                            for edge in idx.source_edges_by_basic_block_index[basic_block_index]:
                                if edge.type == BinExport2.FlowGraph.Edge.Type.CONDITION_FALSE:
                                    does_fallthrough = True
                                    continue

                                back_edge = ""
                                if edge.HasField("is_back_edge") and edge.is_back_edge:
                                    back_edge = ""

                                target_basic_block_index = edge.target_basic_block_index
                                target_basic_block_address = idx.get_basic_block_address(target_basic_block_index)
                                o.writeln(
                                    f"{BinExport2.FlowGraph.Edge.Type.Name(edge.type)} basic block {hex(target_basic_block_address)} {back_edge}"
                                )

                            if does_fallthrough:
                                o.writeln("↓ CONDITION_FALSE")

    with o.section("data"):
        for data_address in sorted(idx.data_reference_index_by_target_address.keys()):
            if data_address in idx.insn_address_by_index:
                # appears to be code
                continue

            data_xrefs: List[int] = []
            for data_reference_index in idx.data_reference_index_by_target_address[data_address]:
                data_reference = be2.data_reference[data_reference_index]
                instruction_address = idx.get_insn_address(data_reference.instruction_index)
                data_xrefs.append(instruction_address)

            if not data_xrefs:
                continue

            o.writeln(f"{hex(data_address)}{hex(data_xrefs[0])}")
            for data_xref in data_xrefs[1:]:
                o.writeln(f"{' ' * len(hex(data_address))}{hex(data_xref)}")

    t1 = time.time()
    logger.debug("perf: rendering BinExport2: %0.2fs", t1 - t0)

    with timing("writing to STDOUT"):
        print(o.getvalue())

    # compare against None so that address 0 is still accepted
    if args.instruction is not None:
        insn = idx.insn_by_address[args.instruction]
        inspect_instruction(be2, insn, args.instruction)


if __name__ == "__main__":
    sys.exit(main())