mirror of
https://github.com/mandiant/capa.git
synced 2025-12-06 21:00:57 -08:00
143 lines
4.9 KiB
Python
# Copyright (C) 2021 Mandiant, Inc. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at: [package root]/LICENSE.txt
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.
import io
import re
import logging
import binascii
import contextlib
from typing import Tuple, Iterator

import pefile

import capa.features
import capa.features.extractors.elf
import capa.features.extractors.pefile
import capa.features.extractors.strings
from capa.features.common import (
    OS,
    OS_ANY,
    OS_AUTO,
    ARCH_ANY,
    FORMAT_PE,
    FORMAT_ELF,
    OS_WINDOWS,
    FORMAT_FREEZE,
    FORMAT_RESULT,
    Arch,
    Format,
    String,
    Feature,
)
from capa.features.freeze import is_freeze
from capa.features.address import NO_ADDRESS, Address, FileOffsetAddress

logger = logging.getLogger(__name__)

# match strings for formats
MATCH_PE = b"MZ"
MATCH_ELF = b"\x7fELF"
MATCH_RESULT = b'{"meta":'
MATCH_JSON_OBJECT = b'{"'


def extract_file_strings(buf: bytes, **kwargs) -> Iterator[Tuple[String, Address]]:
    """
    extract ASCII and UTF-16 LE strings from file
    """
    for s in capa.features.extractors.strings.extract_ascii_strings(buf):
        yield String(s.s), FileOffsetAddress(s.offset)

    for s in capa.features.extractors.strings.extract_unicode_strings(buf):
        yield String(s.s), FileOffsetAddress(s.offset)


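# The two passes in extract_file_strings can be illustrated with a minimal,
# standalone sketch. It does not use capa's strings module: the regular
# expressions, the MIN_LEN threshold, and the name iter_strings are all
# assumptions made for illustration, and capa's real extractors may differ.

```python
import re
from typing import Iterator, Tuple

# hypothetical minimum string length, chosen only for this sketch
MIN_LEN = 4

# runs of printable ASCII bytes
ASCII_RE = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN)
# runs of printable ASCII code units encoded as UTF-16 LE (low byte, then 0x00)
UTF16LE_RE = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % MIN_LEN)


def iter_strings(buf: bytes) -> Iterator[Tuple[str, int]]:
    """yield (string, file offset) pairs: all ASCII hits first, then UTF-16 LE hits."""
    for m in ASCII_RE.finditer(buf):
        yield m.group().decode("ascii"), m.start()

    for m in UTF16LE_RE.finditer(buf):
        yield m.group().decode("utf-16-le"), m.start()


sample = b"\x00\x01hello world\x00\xffk\x00e\x00r\x00n\x00e\x00l\x00\x00\x00"
found = list(iter_strings(sample))
```

# As in the function above, the offsets are positions in the raw buffer,
# not virtual addresses.
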
def extract_format(buf: bytes) -> Iterator[Tuple[Feature, Address]]:
    """
    yield the file format recognized from the leading bytes of the buffer, if any
    """
    if buf.startswith(MATCH_PE):
        yield Format(FORMAT_PE), NO_ADDRESS
    elif buf.startswith(MATCH_ELF):
        yield Format(FORMAT_ELF), NO_ADDRESS
    elif is_freeze(buf):
        yield Format(FORMAT_FREEZE), NO_ADDRESS
    elif buf.startswith(MATCH_RESULT):
        yield Format(FORMAT_RESULT), NO_ADDRESS
    elif re.sub(rb"\s", b"", buf[:20]).startswith(MATCH_JSON_OBJECT):
        # potential start of JSON object data, once whitespace is removed.
        # we don't know what it is exactly, but may support it (e.g. a dynamic CAPE sandbox report).
        # skip the verdict here and let subsequent code analyze this further.
        return
    else:
        # we likely end up here when handling an unsupported file format (e.g. macho).
        # this logic will need to be updated as the format is implemented.
        logger.debug("unknown file format: %s", buf[:4].hex())
        return


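# The dispatch order in extract_format can be sketched without capa installed.
# This is a simplified illustration, not the module's implementation: the
# freeze check is omitted because is_freeze is capa-specific, and the function
# name sniff_format and the returned labels are hypothetical.

```python
import re
from typing import Optional

MATCH_PE = b"MZ"
MATCH_ELF = b"\x7fELF"
MATCH_RESULT = b'{"meta":'
MATCH_JSON_OBJECT = b'{"'


def sniff_format(buf: bytes) -> Optional[str]:
    """return a hypothetical format label, or None when the prefix is not recognized."""
    if buf.startswith(MATCH_PE):
        return "pe"
    if buf.startswith(MATCH_ELF):
        return "elf"
    if buf.startswith(MATCH_RESULT):
        return "result"
    # strip whitespace from the first 20 bytes so pretty-printed JSON still matches
    if re.sub(rb"\s", b"", buf[:20]).startswith(MATCH_JSON_OBJECT):
        return "json-unknown"
    return None


labels = [
    sniff_format(b"MZ\x90\x00"),
    sniff_format(b"\x7fELF\x02\x01\x01"),
    sniff_format(b'{"meta": {}}'),
    sniff_format(b'{\n  "behavior": {}}'),
    sniff_format(b"\xcf\xfa\xed\xfe"),
]
```

# Note that the result-document check is an exact prefix match, so a
# pretty-printed document such as b'{ "meta": 1}' falls through to the
# generic JSON branch rather than being labeled as a result document.
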
def extract_arch(buf: bytes) -> Iterator[Tuple[Feature, Address]]:
    """
    yield the architecture of the file in the buffer, when the format is supported
    """
    if buf.startswith(MATCH_PE):
        yield from capa.features.extractors.pefile.extract_file_arch(pe=pefile.PE(data=buf))

    elif buf.startswith(MATCH_RESULT):
        yield Arch(ARCH_ANY), NO_ADDRESS

    elif buf.startswith(MATCH_ELF):
        with contextlib.closing(io.BytesIO(buf)) as f:
            arch = capa.features.extractors.elf.detect_elf_arch(f)

        if arch not in capa.features.common.VALID_ARCH:
            logger.debug("unsupported arch: %s", arch)
            return

        yield Arch(arch), NO_ADDRESS

    else:
        # we likely end up here:
        #  1. handling shellcode, or
        #  2. handling a new file format (e.g. macho)
        #
        # for (1) we can't do much - it's shellcode and all bets are off.
        # we could maybe accept a further CLI argument to specify the arch,
        # but I think this would be rarely used.
        # rules that rely on arch conditions will fail to match on shellcode.
        #
        # for (2), this logic will need to be updated as the format is implemented.
        logger.debug("unsupported file format: %s, will not guess Arch", binascii.hexlify(buf[:4]).decode("ascii"))
        return


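# For the ELF branch, the general idea of reading the architecture from the
# header can be sketched with the standard library alone. This is not capa's
# detect_elf_arch: it only reads the e_machine field defined by the ELF
# specification, and the name sniff_elf_arch, the ARCH_BY_MACHINE table, and
# its labels are assumptions made for illustration.

```python
import struct
from typing import Optional

# e_machine values defined by the ELF specification
EM_386 = 3
EM_X86_64 = 62
EM_AARCH64 = 183

# hypothetical labels for this sketch
ARCH_BY_MACHINE = {EM_386: "i386", EM_X86_64: "amd64", EM_AARCH64: "aarch64"}


def sniff_elf_arch(buf: bytes) -> Optional[str]:
    """return a label for the ELF e_machine field, or None if unknown or not ELF."""
    if not buf.startswith(b"\x7fELF") or len(buf) < 20:
        return None

    # e_ident[EI_DATA] at offset 5: 1 = little endian, 2 = big endian
    ei_data = buf[5]
    if ei_data == 1:
        endian = "<"
    elif ei_data == 2:
        endian = ">"
    else:
        return None

    # e_machine is a 16-bit field at offset 18, after e_ident (16 bytes) and e_type (2 bytes)
    (e_machine,) = struct.unpack_from(endian + "H", buf, 18)
    return ARCH_BY_MACHINE.get(e_machine)


# 64-bit little-endian header fragment with e_type=2 (ET_EXEC) and e_machine=62
le_header = b"\x7fELF" + bytes([2, 1, 1]) + b"\x00" * 9 + struct.pack("<HH", 2, EM_X86_64)
# 32-bit big-endian header fragment with e_machine=183
be_header = b"\x7fELF" + bytes([1, 2, 1]) + b"\x00" * 9 + struct.pack(">HH", 2, EM_AARCH64)
```

# Unknown machine values map to None here, which mirrors how extract_arch
# yields nothing when the detected arch is not in VALID_ARCH.
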
def extract_os(buf: bytes, os=OS_AUTO) -> Iterator[Tuple[Feature, Address]]:
    """
    yield the operating system of the file in the buffer, when the format is supported
    """
    if os != OS_AUTO:
        # note: there is no early return here, so the auto-detection below
        # still runs after the explicitly requested OS is yielded.
        yield OS(os), NO_ADDRESS

    if buf.startswith(MATCH_PE):
        yield OS(OS_WINDOWS), NO_ADDRESS
    elif buf.startswith(MATCH_RESULT):
        yield OS(OS_ANY), NO_ADDRESS
    elif buf.startswith(MATCH_ELF):
        with contextlib.closing(io.BytesIO(buf)) as f:
            os = capa.features.extractors.elf.detect_elf_os(f)

        if os not in capa.features.common.VALID_OS:
            logger.debug("unsupported os: %s", os)
            return

        yield OS(os), NO_ADDRESS

    else:
        # we likely end up here:
        #  1. handling shellcode, or
        #  2. handling a new file format (e.g. macho)
        #
        # for (1) we can't do much - it's shellcode and all bets are off.
        # rules that rely on OS conditions will fail to match on shellcode.
        #
        # for (2), this logic will need to be updated as the format is implemented.
        logger.debug("unsupported file format: %s, will not guess OS", binascii.hexlify(buf[:4]).decode("ascii"))
        return