import source files, forgetting about 938 prior commits

This commit is contained in:
William Ballenthin
2020-06-18 09:13:01 -06:00
parent f2d795090c
commit add3537447
65 changed files with 10322 additions and 0 deletions

.gitignore vendored Normal file

@@ -0,0 +1,111 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.idea/*
*.prof
*.viv
*.idb
*.i64
!rules/lib

README.md Normal file

@@ -0,0 +1,456 @@
# capa
[![Build Status](https://drone.oneteamed.net/api/badges/FLARE/capa/status.svg)](https://drone.oneteamed.net/FLARE/capa)
capa detects capabilities in executable files.
You run it against a .exe or .dll and it tells you what it thinks the program can do.
For example, it might suggest that the file is a backdoor, is capable of installing services, or relies on HTTP to communicate.
```
λ capa.exe suspicious.exe -q
objectives:
  communication
  data manipulation
  machine access control
behaviors:
  communication-via-http
  encrypt data
  load code functionality
techniques:
  send-http-request
  encrypt data using rc4
  load pe
```
# download
Download capa from the [Releases](/releases) page or get the nightly builds here:
- Windows 64bit: TODO
- Windows 32bit: TODO
- Linux: TODO
- OSX: TODO
# contents
- [installation](#installation)
- [example](#example)
- [rule format](#rule-format)
- [meta block](#meta-block)
- [features block](#features-block)
- [extracted features](#extracted-features)
- [function features](#function-features)
- [api](#api)
- [number](#number)
- [string](#string)
- [bytes](#bytes)
- [offset](#offset)
- [mnemonic](#mnemonic)
- [characteristics](#characteristics)
- [file features](#file-features)
- [string](#file-string)
- [export](#export)
- [import](#import)
- [section](#section)
- [counting](#counting)
- [matching prior rule matches](#matching-prior-rule-matches)
  - [limitations](#limitations)
# installation
See [doc/installation.md](doc/installation.md) for information on how to set up the project, including how to use it as a Python library.
For more information about how to use capa, including running it as an IDA script/plugin, see [doc/usage.md](doc/usage.md).
# example
Here we run capa against an unknown binary (`level32.exe`),
and the tool reports that the program can decode data via XOR,
references data in its resource section, writes to a file, and spawns a new process.
Taken together, this makes us think that `level32.exe` could be a dropper.
Therefore, our next analysis step might be to run `level32.exe` in a sandbox and try to recover the payload.
```
λ capa.exe level32.exe -q
disposition: malicious
category: dropper
objectives:
  data manipulation
  machine access control
behaviors:
  encrypt data
  load code functionality
techniques:
  encrypt data using rc4
  load pe
anomalies:
  embedded PE file
```
By passing the `-vv` flag (for Very Verbose), capa reports exactly where it found evidence of these capabilities.
This is useful for at least two reasons:
- it helps explain why we should trust the results, and enables us to verify the conclusions
- it shows which locations within the binary an experienced analyst might study further with IDA Pro
```
λ capa.exe level32.exe -q -vv
rule load PE file:
  - function 0x401c58:
    or:
      and:
        mnemonic(cmp):
          - virtual address: 0x401c58
          - virtual address: 0x401c68
          - virtual address: 0x401c74
          - virtual address: 0x401c7f
          - virtual address: 0x401c8a
        or:
          number(0x4550):
            - virtual address: 0x401c68
        or:
          number(0x5a4d):
            - virtual address: 0x401c58
...
```
# rule format
capa uses a collection of rules to identify capabilities within a program.
These rules are easy to write, even for those new to reverse engineering.
By authoring rules, you can extend the capabilities that capa recognizes.
In some regards, capa rules are a mixture of the OpenIOC, Yara, and YAML formats.
Here's an example rule used by capa:
```
───────┬────────────────────────────────────────────────────────
│ File: rules/calculate-crc32.yml
───────┼────────────────────────────────────────────────────────
 1 │ rule:
 2 │   meta:
 3 │     name: calculate CRC32
 4 │     rule-category: data-manipulation/hash-data/hash-data-using-crc32
 5 │     author: moritz.raabe@fireeye.com
 6 │     scope: function
 7 │     examples:
 8 │       - 2D3EDC218A90F03089CC01715A9F047F:0x403CBD
 9 │   features:
10 │     - and:
11 │       - mnemonic: shr
12 │       - number: 0xEDB88320
13 │       - number: 8
14 │       - characteristic(nzxor): True
───────┴────────────────────────────────────────────────────────
```
Rules are YAML files that follow a certain schema.
The top level element is a dictionary named `rule` with two required children dictionaries:
`meta` and `features`.
## meta block
The meta block contains metadata that identifies the rule, categorizes it into behaviors,
and provides references to additional documentation.
Here are the common fields:
- `name` is required. This string should uniquely identify the rule.
- `rule-category` is required when a rule describes a behavior (as opposed to matching a role or disposition).
The rule category specifies an objective, behavior, and technique matched by this rule,
using a format like `$objective/$behavior/$technique`.
An objective is a high level goal of a program, such as "communication".
A behavior is something that a program may do, such as "communication via socket".
A technique is a way of implementing some behavior, such as "send-data".
- `maec/malware-category` is required when the rule describes a role, such as `dropper` or `backdoor`.
- `maec/analysis-conclusion` is required when the rule describes a disposition, such as `benign` or `malicious`.
- `scope` indicates to which feature set this rule applies.
  It can take the following values:
- **`basic block`:** limits matches to a basic block.
It is used to achieve locality in rules (for example for parameters of a function).
- **`function`:** identify functions.
It doesn't support child functions (see [doc/limitations.md](doc/limitations.md#wrapper-functions-and-matches-in-child-functions)).
It is the default.
- **`file`:** matches file format aspects.
- **`program`:** *matches the matches* of `function` and `file` scopes.
Not yet implemented.
- `author` specifies the name or handle of the rule author.
- `examples` is a list of references to samples that should match the capability.
When the rule scope is `function`, then the reference should be `<sample hash>:<function va>`.
- `reference` lists related information in a book, article, blog post, etc.
Other fields are allowed but not defined in this specification. `description` is probably a good one.
## features block
This section declares logical statements about the features that must exist for the rule to match.
There are five structural expressions that may be nested:
- `and` - all of the children expressions must match
- `or` - match at least one of the children
- `not` - match when the child expression does not
- `N or more` - match at least `N` or more of the children
- `optional` is an alias for `0 or more`, which is useful for documenting related features. See [write-file.yml](/rules/machine-access-control/file-manipulation/write-file.yml) for an example.
For example, consider the following rule:
```
10 │     - and:
11 │       - mnemonic: shr
12 │       - number: 0xEDB88320
13 │       - number: 8
14 │       - characteristic(nzxor): True
```
For this to match, the function must:
- contain an `shr` instruction, and
- reference the immediate constant `0xEDB88320`, which some may recognize as related to the CRC32 checksum, and
- reference the number `8`, and
- have an unusual feature, in this case, contain a non-zeroing XOR instruction
If only one of these features is found in a function, the rule will not match.
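The all-or-nothing evaluation described above can be sketched as plain boolean logic over an extracted feature set. This is a hypothetical minimal model for illustration, not capa's actual engine; the `evaluate_and` helper and tuple-encoded features are assumptions:

```python
def evaluate_and(required_features, extracted_features):
    """Return True only when every required feature is present in the extracted set."""
    return all(f in extracted_features for f in required_features)

# features hypothetically extracted from a CRC32-style routine
extracted = {
    ("mnemonic", "shr"),
    ("number", 0xEDB88320),
    ("number", 8),
    ("characteristic", "nzxor"),
}

required = [
    ("mnemonic", "shr"),
    ("number", 0xEDB88320),
    ("number", 8),
    ("characteristic", "nzxor"),
]

print(evaluate_and(required, extracted))                     # True: all features present
print(evaluate_and(required, extracted - {("number", 8)}))   # False: one feature missing
```

Removing any single feature from the set fails the whole `and`, which is exactly why a lone `shr` instruction is not enough for a match.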
# extracted features
## function features
capa extracts features from the disassembly of a function, such as which API functions are called.
The tool also reasons about the code structure to guess at function-level constructs.
These are the features supported at the function-scope:
- [api](#api)
- [number](#number)
- [string](#string)
- [bytes](#bytes)
- [offset](#offset)
- [mnemonic](#mnemonic)
- [characteristics](#characteristics)
### api
A call to a named function, probably an import,
though possibly a local function (like `malloc`) extracted via FLIRT.
The parameter is a string describing the function name, specified like `module.functionname` or `functionname`.
Example:

    api: kernel32.CreateFileA
    api: CreateFileA
### number
A number used by the logic of the program.
This should not be a stack or structure offset.
For example, a crypto constant.
The parameter is a number; if prefixed with `0x` then in hex format, otherwise, decimal format.
To associate context with a number, e.g. for constant definitions, append an equal sign and the respective name to
the number definition. This helps with documenting rules and provides context in capa's output.
Examples:

    number: 16
    number: 0x10
    number: 0x40 = PAGE_EXECUTE_READWRITE
TODO: signed vs unsigned.
### string
A string referenced by the logic of the program.
This is probably a pointer to an ASCII or Unicode string.
This could also be an obfuscated string, for example a stack string.
The parameter is a string describing the string.
This can be the verbatim value, or a regex matching the string.
Regexes should be surrounded with `/` characters.
By default, capa uses case-sensitive matching and assumes leading and trailing wildcards.
To perform case-insensitive matching append an `i`. To anchor the regex at the start or end of a string, use `^` and/or `$`.
Examples:

    string: This program cannot be run in DOS mode.
    string: Firefox 64.0
    string: /SELECT.*FROM.*WHERE/
    string: /Hardware\\Description\\System\\CentralProcessor/i
Note that regex matching is expensive (`O(features)` rather than `O(1)`) so they should be used sparingly.
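The string matching semantics described above can be sketched with Python's `re` module. This is an assumed model of the behavior (case-sensitive by default, implied leading/trailing wildcards, trailing `i` for case-insensitivity), not capa's implementation; `compile_rule_pattern` is a hypothetical helper:

```python
import re

def compile_rule_pattern(term):
    """Compile a rule string parameter into a regex, per the assumed semantics above."""
    if term.startswith("/") and term.endswith("/i"):
        # /pattern/i : case-insensitive regex
        return re.compile(term[1:-2], re.DOTALL | re.IGNORECASE)
    elif term.startswith("/") and term.endswith("/"):
        # /pattern/ : case-sensitive regex
        return re.compile(term[1:-1], re.DOTALL)
    # verbatim value: escape it so regex metacharacters match literally
    return re.compile(re.escape(term), re.DOTALL)

pat = compile_rule_pattern("/SELECT.*FROM.*WHERE/")
# re.search scans anywhere in the string, giving the implied wildcards
print(bool(pat.search("a SELECT x FROM y WHERE z")))   # True
print(bool(pat.search("select x from y where z")))     # False (case-sensitive)
print(bool(compile_rule_pattern("/vbox/i").search("VBoxService")))  # True
```

Using `re.search` rather than `re.match` is what gives the implied wildcards, so rule authors do not need to write `/.*foo.*/`.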
### bytes
A sequence of bytes referenced by the logic of the program.
The provided sequence must match from the beginning of the referenced bytes and be no more than `0x100` bytes.
The parameter is a sequence of hexadecimal bytes followed by an optional description.
The example below illustrates byte matching given a COM CLSID pushed onto the stack prior to `CoCreateInstance`.
Disassembly:

    push    offset iid_004118d4_IShellLinkA  ; riid
    push    1                                ; dwClsContext
    push    0                                ; pUnkOuter
    push    offset clsid_004118c4_ShellLink  ; rclsid
    call    ds:CoCreateInstance
Example rule elements:

    bytes: 01 14 02 00 00 00 00 00 C0 00 00 00 00 00 00 46 = CLSID_ShellLink
    bytes: EE 14 02 00 00 00 00 00 C0 00 00 00 00 00 00 46 = IID_IShellLink
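The prefix-matching behavior described above ("must match from the beginning of the referenced bytes") can be sketched in a few lines. This is an illustrative model, not capa's implementation; `bytes_match` is a hypothetical helper, and the 0x100 cap mirrors the stated limit:

```python
MAX_BYTES_FEATURE_SIZE = 0x100  # stated upper bound on a bytes feature

def bytes_match(rule_bytes, referenced_bytes):
    """True when the referenced bytes begin with the rule's byte sequence."""
    if len(rule_bytes) > MAX_BYTES_FEATURE_SIZE:
        raise ValueError("bytes feature too long")
    return referenced_bytes.startswith(rule_bytes)

# the CLSID_ShellLink sequence from the example above
CLSID_SHELLLINK = bytes.fromhex("0114020000000000C000000000000046")

print(bytes_match(CLSID_SHELLLINK[:8], CLSID_SHELLLINK))  # True: rule bytes are a prefix
print(bytes_match(b"\xEE\x14", CLSID_SHELLLINK))          # False: wrong leading bytes
```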
### offset
A structure offset referenced by the logic of the program.
This should not be a stack offset.
The parameter is a number; if prefixed with `0x` then in hex format, otherwise, decimal format.
Examples:

    offset: 0xC
    offset: 0x14
### mnemonic
An instruction mnemonic found in the given function.
The parameter is a string containing the mnemonic.
Examples:

    mnemonic: xor
    mnemonic: shl
### characteristics
Characteristics are features that are extracted by the analysis engine.
They are one-off features that seem interesting to the authors.
For example, the `characteristic(nzxor)` feature describes non-zeroing XOR instructions.
capa does not support instruction pattern matching,
so a select set of interesting instructions is pulled out as characteristics.
| characteristic | scope | description |
|--------------------------------------------|-----------------------|-------------|
| `characteristic(embedded pe): true` | file | (XOR encoded) embedded PE files. |
| `characteristic(switch): true` | function | Function contains a switch or jump table. |
| `characteristic(loop): true` | function | Function contains a loop. |
| `characteristic(recursive call): true` | function | Function is recursive. |
| `characteristic(calls from): true` | function | There are unique calls from this function. Best used like: `count(characteristic(calls from)): 3 or more` |
| `characteristic(calls to): true` | function | There are unique calls to this function. Best used like: `count(characteristic(calls to)): 3 or more` |
| `characteristic(nzxor): true` | basic block, function | Non-zeroing XOR instruction |
| `characteristic(peb access): true` | basic block, function | Access to the process environment block (PEB), e.g. via fs:[30h], gs:[60h], or `NtCurrentPeb` |
| `characteristic(fs access): true` | basic block, function | Access to memory via the `fs` segment. |
| `characteristic(gs access): true` | basic block, function | Access to memory via the `gs` segment. |
| `characteristic(cross section flow): true` | basic block, function | Function contains a call/jump to a different section. This is commonly seen in unpacking stubs. |
| `characteristic(tight loop): true` | basic block | A tight loop where a basic block branches to itself. |
| `characteristic(indirect call): true` | basic block, function | Indirect call instruction; for example, `call edx` or `call qword ptr [rsp+78h]`. |
## file features
capa extracts features from the file data.
File features stem from the file structure, e.g. the PE structure, or from the raw file data.
These are the features supported at the file-scope:
- [string](#file-string)
- [export](#export)
- [import](#import)
- [section](#section)
### file string
An ASCII or UTF-16 LE string present in the file.
The parameter is a string describing the string.
This can be the verbatim value, or a regex matching the string.
Regexes should be surrounded with `/` characters. By default, capa uses case-sensitive matching.
To perform case-insensitive matching append an `i`.
Examples:

    string: Z:\Dev\dropper\dropper.pdb
    string: [ENTER]
    string: /.*VBox.*/
    string: /.*Software\Microsoft\Windows\CurrentVersion\Run.*/i
Note that regex matching is expensive (`O(features)` rather than `O(1)`) so they should be used sparingly.
### export
The name of a routine exported from a shared library.
Examples:

    export: InstallA
### import
The name of a routine imported from a shared library.
Examples:

    import: kernel32.WinExec
    import: WinExec          # wildcard module name
    import: kernel32.#22     # by ordinal
### section
The name of a section in a structured file.
Examples:

    section: .rsrc
## counting
Many rules will inspect the feature set for a select combination of features;
however, some rules may consider the number of times a feature was seen in a feature set.
These rules can be expressed like:

    count(characteristic(nzxor)): 2          # exactly match count==2
    count(characteristic(nzxor)): 2 or more  # at least two matches
    count(characteristic(nzxor)): 2 or fewer # at most two matches
    count(characteristic(nzxor)): (2, 10)    # match any value in the range 2<=count<=10
    count(mnemonic(mov)): 3
    count(basic block): 4
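Count-based matching can be sketched over a mapping from feature to the set of addresses where it was seen. This is an assumed minimal model of the semantics above; `count_in_range` and the string-keyed feature map are illustrative, not capa's API:

```python
def count_in_range(locations, minimum, maximum=None):
    """True when the number of occurrences falls within [minimum, maximum].
    maximum=None models the unbounded 'N or more' form."""
    n = len(locations)
    if maximum is None:
        return n >= minimum
    return minimum <= n <= maximum

# hypothetical feature -> set-of-virtual-addresses mapping for one function
features = {
    "characteristic(nzxor)": {0x401000, 0x401200, 0x401500},
}

nzxor_locs = features.get("characteristic(nzxor)", set())
print(count_in_range(nzxor_locs, 2, 2))    # exactly 2      -> False (count is 3)
print(count_in_range(nzxor_locs, 2))       # 2 or more      -> True
print(count_in_range(nzxor_locs, 0, 2))    # 2 or fewer     -> False
print(count_in_range(nzxor_locs, 2, 10))   # range (2, 10)  -> True
```

Tracking locations rather than a bare counter also preserves the addresses needed for verbose output.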
## matching prior rule matches
capa rules can specify logic for matching on other rule matches.
This allows a rule author to refactor common capability patterns into their own reusable components.
You can specify a rule match expression like so:

    - and:
      - match: file creation
      - match: process creation
Rules are uniquely identified by their `rule.meta.name` property;
this is the value that should appear on the right hand side of the `match` expression.
capa will refuse to run if a rule dependency is not present during matching.
Common rule patterns, such as the various ways to implement "writes to a file", can be refactored into "library rules".
These are rules with `rule.meta.lib: True`.
By default, library rules will not be output to the user as a rule match,
but can be matched by other rules.
When no active rules depend on a library rule, it will not be evaluated, maintaining performance.
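The mechanism above can be sketched as follows: when a rule matches, its name is fed back into the feature set so later rules in the topological order can depend on it. This is an illustrative model with hypothetical rule names and tuple-encoded features, not capa's engine:

```python
def match_rules(ordered_rules, extracted):
    """ordered_rules: list of (name, required_features), topologically ordered
    so that dependencies appear before dependents."""
    features = set(extracted)
    matches = []
    for name, required in ordered_rules:
        if all(f in features for f in required):
            matches.append(name)
            features.add(("match", name))  # expose this match as a feature
    return matches

rules = [
    # two hypothetical library-style rules
    ("file creation", [("api", "kernel32.CreateFileA")]),
    ("process creation", [("api", "kernel32.CreateProcessA")]),
    # a rule that matches on the prior rule matches
    ("dropper-like behavior", [("match", "file creation"),
                               ("match", "process creation")]),
]

extracted = {("api", "kernel32.CreateFileA"), ("api", "kernel32.CreateProcessA")}
print(match_rules(rules, extracted))
# ['file creation', 'process creation', 'dropper-like behavior']
```

The ordering matters: if the dependent rule ran first, its `match` features would not yet exist, which is why dependencies must be evaluated before dependents.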
# limitations
To learn more about capa's current limitations see [here](doc/limitations.md).

capa/__init__.py Normal file

capa/engine.py Normal file

@@ -0,0 +1,286 @@
import re
import sys
import copy
import collections
import capa.features
class Statement(object):
'''
superclass for structural nodes, such as and/or/not.
this exists to provide a default impl for `__str__` and `__repr__`,
and to declare the interface method `evaluate`
'''
def __init__(self):
super(Statement, self).__init__()
self.name = self.__class__.__name__
def __str__(self):
return '%s(%s)' % (self.name.lower(), ','.join(map(str, self.get_children())))
def __repr__(self):
return str(self)
def evaluate(self, ctx):
'''
classes that inherit `Statement` must implement `evaluate`
args:
ctx (defaultdict[Feature, set[VA]])
returns:
Result
'''
raise NotImplementedError()
def get_children(self):
if hasattr(self, 'child'):
yield self.child
if hasattr(self, 'children'):
for child in self.children:
yield child
def replace_child(self, existing, new):
if hasattr(self, 'child'):
if self.child is existing:
self.child = new
if hasattr(self, 'children'):
for i, child in enumerate(self.children):
if child is existing:
self.children[i] = new
class Result(object):
'''
represents the results of an evaluation of statements against features.
instances of this class should behave like a bool,
e.g. `assert Result(True, ...) == True`
instances track additional metadata about evaluation results.
they contain references to the statement node (e.g. an And statement),
as well as the children Result instances.
we need this so that we can render the tree of expressions and their results.
'''
def __init__(self, success, statement, children, locations=None):
'''
args:
success (bool)
statement (capa.engine.Statement or capa.features.Feature)
children (list[Result])
locations (iterable[VA])
'''
super(Result, self).__init__()
self.success = success
self.statement = statement
self.children = children
self.locations = locations if locations is not None else ()
def __eq__(self, other):
if isinstance(other, bool):
return self.success == other
return False
def __bool__(self):
return self.success
def __nonzero__(self):
return self.success
class And(Statement):
'''match if all of the children evaluate to True.'''
def __init__(self, *children):
super(And, self).__init__()
self.children = list(children)
def evaluate(self, ctx):
results = [child.evaluate(ctx) for child in self.children]
success = all(results)
return Result(success, self, results)
class Or(Statement):
'''match if any of the children evaluate to True.'''
def __init__(self, *children):
super(Or, self).__init__()
self.children = list(children)
def evaluate(self, ctx):
results = [child.evaluate(ctx) for child in self.children]
success = any(results)
return Result(success, self, results)
class Not(Statement):
'''match only if the child evaluates to False.'''
def __init__(self, child):
super(Not, self).__init__()
self.child = child
def evaluate(self, ctx):
results = [self.child.evaluate(ctx)]
success = not results[0]
return Result(success, self, results)
class Some(Statement):
'''match if at least N of the children evaluate to True.'''
def __init__(self, count, *children):
super(Some, self).__init__()
self.count = count
self.children = list(children)
def evaluate(self, ctx):
results = [child.evaluate(ctx) for child in self.children]
# note that here we cast the child result as a bool
# because we've overridden `__bool__` above.
#
# we can't use `if child is True` because the instance is not True.
success = sum([1 for child in results if bool(child) is True]) >= self.count
return Result(success, self, results)
class Element(Statement):
'''match if the child is contained in the ctx set.'''
def __init__(self, child):
super(Element, self).__init__()
self.child = child
def __hash__(self):
return hash((self.name, self.child))
def __eq__(self, other):
return self.name == other.name and self.child == other.child
def evaluate(self, ctx):
return Result(self.child in ctx, self, [])
class Range(Statement):
'''match if the child is contained in the ctx set with a count in the given range.'''
def __init__(self, child, min=None, max=None):
super(Range, self).__init__()
self.child = child
self.min = min if min is not None else 0
self.max = max if max is not None else ((1 << 64) - 1)
def evaluate(self, ctx):
if self.child not in ctx:
return Result(False, self, [self.child])
count = len(ctx[self.child])
return Result(self.min <= count <= self.max, self, [], locations=ctx[self.child])
def __str__(self):
if self.max == ((1 << 64) - 1):
return 'range(%s, min=%d, max=infinity)' % (str(self.child), self.min)
else:
return 'range(%s, min=%d, max=%d)' % (str(self.child), self.min, self.max)
class Regex(Statement):
'''match if the given pattern matches a String feature.'''
def __init__(self, pattern):
super(Regex, self).__init__()
self.pattern = pattern
pat = self.pattern[len('/'):-len('/')]
flags = re.DOTALL
if pattern.endswith('/i'):
pat = self.pattern[len('/'):-len('/i')]
flags |= re.IGNORECASE
self.re = re.compile(pat, flags)
self.match = ''
def evaluate(self, ctx):
for feature, locations in ctx.items():
if not isinstance(feature, (capa.features.String, )):
continue
# `re.search` finds a match anywhere in the given string,
# which implies leading and/or trailing wildcards.
# this mode is more convenient for rule authors,
# so that they don't have to prefix/suffix their terms like: /.*foo.*/.
if self.re.search(feature.value):
self.match = feature.value
return Result(True, self, [], locations=locations)
return Result(False, self, [])
def __str__(self):
return 'regex(string =~ %s, matched = "%s")' % (self.pattern, self.match)
class Subscope(Statement):
'''
a subscope element is a placeholder in a rule - it should not be evaluated directly.
the engine should preprocess rules to extract subscope statements into their own rules.
'''
def __init__(self, scope, child):
super(Subscope, self).__init__()
self.scope = scope
self.child = child
def evaluate(self, ctx):
raise ValueError('cannot evaluate a subscope directly!')
def topologically_order_rules(rules):
'''
order the given rules such that dependencies show up before dependents.
this means that as we match rules, we can add features, and these
will be matched by subsequent rules if they follow this order.
assumes that the rule dependency graph is a DAG.
'''
rules = {rule.name: rule for rule in rules}
seen = set([])
ret = []
def rec(rule):
if rule.name in seen:
return
for dep in rule.get_dependencies():
rec(rules[dep])
ret.append(rule)
seen.add(rule.name)
for rule in rules.values():
rec(rule)
return ret
def match(rules, features, va):
'''
Args:
rules (List[capa.rules.Rule]): these must already be ordered topologically by dependency.
features (Mapping[capa.features.Feature, int]):
va (int): location of the features
Returns:
Tuple[List[capa.features.Feature], Dict[str, Tuple[int, capa.engine.Result]]]: two-tuple with entries:
- list of features used for matching (which may be greater than argument, due to rule match features), and
- mapping from rule name to (location of match, result object)
'''
results = collections.defaultdict(list)
# copy features so that we can modify it
# without affecting the caller (keep this function pure)
#
# note: copy doesn't notice this is a defaultdict, so we'll recreate that manually.
features = collections.defaultdict(set, copy.copy(features))
for rule in rules:
res = rule.evaluate(features)
if res:
results[rule.name].append((va, res))
features[capa.features.MatchedRule(rule.name)].add(va)
return (features, results)

capa/features/__init__.py Normal file

@@ -0,0 +1,113 @@
import codecs
import logging
import capa.engine
logger = logging.getLogger(__name__)
MAX_BYTES_FEATURE_SIZE = 0x100
class Feature(object):
def __init__(self, args):
super(Feature, self).__init__()
self.name = self.__class__.__name__
self.args = args
def __hash__(self):
return hash((self.name, tuple(self.args)))
def __eq__(self, other):
return self.name == other.name and self.args == other.args
def __str__(self):
return '%s(%s)' % (self.name.lower(), ','.join(map(str, self.args)))
def __repr__(self):
return str(self)
def evaluate(self, ctx):
return capa.engine.Result(self in ctx, self, [], locations=ctx.get(self, []))
def serialize(self):
return self.__dict__
def freeze_serialize(self):
return (self.__class__.__name__,
self.args)
@classmethod
def freeze_deserialize(cls, args):
return cls(*args)
class MatchedRule(Feature):
def __init__(self, rule_name):
super(MatchedRule, self).__init__([rule_name])
self.rule_name = rule_name
def __str__(self):
return 'match(%s)' % (self.rule_name)
class Characteristic(Feature):
def __init__(self, name, value=None):
'''
when `value` is not provided, this serves as descriptor for a class of characteristics.
this is only used internally, such as in `rules.py` when checking if a statement is
supported by a given scope.
'''
super(Characteristic, self).__init__([name, value])
self.name = name
self.value = value
def evaluate(self, ctx):
if self.value is None:
raise ValueError('cannot evaluate characteristic %s with empty value' % (str(self)))
return super(Characteristic, self).evaluate(ctx)
def __str__(self):
if self.value is None:
return 'characteristic(%s)' % (self.name)
else:
return 'characteristic(%s(%s))' % (self.name, self.value)
class String(Feature):
def __init__(self, value):
super(String, self).__init__([value])
self.value = value
def __str__(self):
return 'string("%s")' % (self.value)
class Bytes(Feature):
def __init__(self, value, symbol=None):
super(Bytes, self).__init__([value])
self.value = value
self.symbol = symbol
def evaluate(self, ctx):
for feature, locations in ctx.items():
if not isinstance(feature, (capa.features.Bytes, )):
continue
if feature.value.startswith(self.value):
return capa.engine.Result(True, self, [], locations=locations)
return capa.engine.Result(False, self, [])
def __str__(self):
if self.symbol:
return 'bytes(0x%s = %s)' % (codecs.encode(self.value, 'hex').upper(), self.symbol)
else:
return 'bytes(0x%s)' % (codecs.encode(self.value, 'hex').upper())
def freeze_serialize(self):
return (self.__class__.__name__,
[codecs.encode(x, 'hex') for x in self.args])
@classmethod
def freeze_deserialize(cls, args):
return cls(*map(lambda x: codecs.decode(x, 'hex'), args))


@@ -0,0 +1,9 @@
from capa.features import Feature
class BasicBlock(Feature):
def __init__(self):
super(BasicBlock, self).__init__([])
def __str__(self):
return 'basic block'


@@ -0,0 +1,274 @@
import abc
try:
import ida
except (ImportError, SyntaxError):
pass
try:
import viv
except (ImportError, SyntaxError):
pass
__all__ = ["ida", "viv"]
class FeatureExtractor(object):
'''
FeatureExtractor defines the interface for fetching features from a sample.
There may be multiple backends that support fetching features for capa.
For example, we use vivisect by default, but also want to support saving
and restoring features from a JSON file.
When we restore the features, we'd like to use exactly the same matching logic
to find matching rules.
Therefore, we can define a FeatureExtractor that provides features from the
serialized JSON file and do matching without a binary analysis pass.
Also, this provides a way to hook in an IDA backend.
This class is not instantiated directly; it is the base class for other implementations.
'''
__metaclass__ = abc.ABCMeta
def __init__(self):
#
# note: a subclass should define ctor parameters for its own use.
# for example, the Vivisect feature extract might require the vw and/or path.
# this base class doesn't know what to do with that info, though.
#
super(FeatureExtractor, self).__init__()
@abc.abstractmethod
def extract_file_features(self):
'''
extract file-scope features.
example::
extractor = VivisectFeatureExtractor(vw, path)
for feature, va in extractor.extract_file_features():
print('0x%x: %s', va, feature)
yields:
Tuple[capa.features.Feature, int]: feature and its location
'''
raise NotImplementedError()
@abc.abstractmethod
def get_functions(self):
'''
enumerate the functions and provide opaque values that will
subsequently be provided to `.extract_function_features()`, etc.
by "opaque value", we mean that this can be any object, as long as it
provides enough context to `.extract_function_features()`.
the opaque value should support casting to int (`__int__`) for the function start address.
yields:
any: the opaque function value.
'''
raise NotImplementedError()
@abc.abstractmethod
def extract_function_features(self, f):
'''
extract function-scope features.
the arguments are opaque values previously provided by `.get_functions()`, etc.
example::
extractor = VivisectFeatureExtractor(vw, path)
for function in extractor.get_functions():
for feature, va in extractor.extract_function_features(function):
print('0x%x: %s', va, feature)
args:
f [any]: an opaque value previously fetched from `.get_functions()`.
yields:
Tuple[capa.features.Feature, int]: feature and its location
'''
raise NotImplementedError()
@abc.abstractmethod
def get_basic_blocks(self, f):
'''
enumerate the basic blocks in the given function and provide opaque values that will
subsequently be provided to `.extract_basic_block_features()`, etc.
by "opaque value", we mean that this can be any object, as long as it
provides enough context to `.extract_basic_block_features()`.
the opaque value should support casting to int (`__int__`) for the basic block start address.
yields:
any: the opaque basic block value.
'''
raise NotImplementedError()
@abc.abstractmethod
def extract_basic_block_features(self, f, bb):
'''
extract basic block-scope features.
the arguments are opaque values previously provided by `.get_functions()`, etc.
example::
extractor = VivisectFeatureExtractor(vw, path)
for function in extractor.get_functions():
for bb in extractor.get_basic_blocks(function):
for feature, va in extractor.extract_basic_block_features(function, bb):
print('0x%x: %s', va, feature)
args:
f [any]: an opaque value previously fetched from `.get_functions()`.
bb [any]: an opaque value previously fetched from `.get_basic_blocks()`.
yields:
Tuple[capa.features.Feature, int]: feature and its location
'''
raise NotImplementedError()
@abc.abstractmethod
def get_instructions(self, f, bb):
'''
enumerate the instructions in the given basic block and provide opaque values that will
subsequently be provided to `.extract_insn_features()`, etc.
by "opaque value", we mean that this can be any object, as long as it
provides enough context to `.extract_insn_features()`.
the opaque value should support casting to int (`__int__`) for the instruction address.
yields:
any: the opaque instruction value.
'''
raise NotImplementedError()
@abc.abstractmethod
def extract_insn_features(self, f, bb, insn):
'''
extract instruction-scope features.
the arguments are opaque values previously provided by `.get_functions()`, etc.
example::
extractor = VivisectFeatureExtractor(vw, path)
for function in extractor.get_functions():
for bb in extractor.get_basic_blocks(function):
for insn in extractor.get_instructions(function, bb):
for feature, va in extractor.extract_insn_features(function, bb, insn):
print('0x%x: %s', va, feature)
args:
f [any]: an opaque value previously fetched from `.get_functions()`.
bb [any]: an opaque value previously fetched from `.get_basic_blocks()`.
insn [any]: an opaque value previously fetched from `.get_instructions()`.
yields:
Tuple[capa.features.Feature, int]: feature and its location
'''
        raise NotImplementedError()
class NullFeatureExtractor(FeatureExtractor):
'''
    An extractor that yields a fixed set of user-provided features.
The structure of the single parameter is demonstrated in the example below.
This is useful for testing, as we can provide expected values and see if matching works.
Also, this is how we represent features deserialized from a freeze file.
example::
extractor = NullFeatureExtractor({
'file features': [
(0x402345, capa.features.Characteristic('embedded pe', True)),
],
'functions': {
0x401000: {
'features': [
(0x401000, capa.features.Characteristic('switch', True)),
],
'basic blocks': {
0x401000: {
'features': [
(0x401000, capa.features.Characteristic('tight-loop', True)),
],
'instructions': {
0x401000: {
'features': [
(0x401000, capa.features.Characteristic('nzxor', True)),
],
},
0x401002: ...
}
},
0x401005: ...
}
},
            0x402000: ...
}
)
'''
def __init__(self, features):
super(NullFeatureExtractor, self).__init__()
self.features = features
def extract_file_features(self):
for p in self.features.get('file features', []):
va, feature = p
yield feature, va
def get_functions(self):
for va in sorted(self.features['functions'].keys()):
yield va
def extract_function_features(self, f):
for p in (self.features # noqa: E127 line over-indented
.get('functions', {})
.get(f, {})
.get('features', [])):
va, feature = p
yield feature, va
def get_basic_blocks(self, f):
for va in sorted(self.features # noqa: E127 line over-indented
.get('functions', {})
.get(f, {})
.get('basic blocks', {})
.keys()):
yield va
def extract_basic_block_features(self, f, bb):
for p in (self.features # noqa: E127 line over-indented
.get('functions', {})
.get(f, {})
.get('basic blocks', {})
.get(bb, {})
.get('features', [])):
va, feature = p
yield feature, va
def get_instructions(self, f, bb):
for va in sorted(self.features # noqa: E127 line over-indented
.get('functions', {})
.get(f, {})
.get('basic blocks', {})
.get(bb, {})
.get('instructions', {})
.keys()):
yield va
def extract_insn_features(self, f, bb, insn):
for p in (self.features # noqa: E127 line over-indented
.get('functions', {})
.get(f, {})
.get('basic blocks', {})
.get(bb, {})
.get('instructions', {})
.get(insn, {})
.get('features', [])):
va, feature = p
yield feature, va
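The nested-dictionary traversal that `NullFeatureExtractor` performs can be sketched standalone. This is a minimal illustration, not the real class: plain strings stand in for `capa` Feature objects, and `iter_function_features` is a hypothetical helper mirroring `extract_function_features`.

```python
# standalone sketch: plain strings stand in for capa Feature objects
features = {
    'file features': [(0x402345, 'embedded pe')],
    'functions': {
        0x401000: {
            'features': [(0x401000, 'switch')],
            'basic blocks': {
                0x401000: {
                    'features': [(0x401000, 'tight-loop')],
                    'instructions': {
                        0x401000: {'features': [(0x401000, 'nzxor')]},
                    },
                },
            },
        },
    },
}

def iter_function_features(features, f):
    # mirror NullFeatureExtractor.extract_function_features:
    # stored as (va, feature), yielded as (feature, va)
    for va, feature in features.get('functions', {}).get(f, {}).get('features', []):
        yield feature, va

print(list(iter_function_features(features, 0x401000)))
```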


@@ -0,0 +1,61 @@
import sys
import builtins
from capa.features.insn import API
MIN_STACKSTRING_LEN = 8
def xor_static(data, i):
if sys.version_info >= (3, 0):
return bytes(c ^ i for c in data)
else:
return ''.join(chr(ord(c) ^ i) for c in data)
def is_aw_function(function_name):
'''
is the given function name an A/W function?
these are variants of functions that, on Windows, accept either a narrow or wide string.
'''
if len(function_name) < 2:
return False
# last character should be 'A' or 'W'
if function_name[-1] not in ('A', 'W'):
return False
    # second-to-last character should be a lowercase letter or digit
return 'a' <= function_name[-2] <= 'z' or '0' <= function_name[-2] <= '9'
def generate_api_features(apiname, va):
'''
for a given function name and address, generate API names.
we over-generate features to make matching easier.
these include:
- kernel32.CreateFileA
- kernel32.CreateFile
- CreateFileA
- CreateFile
'''
# (kernel32.CreateFileA, 0x401000)
yield API(apiname), va
if is_aw_function(apiname):
# (kernel32.CreateFile, 0x401000)
yield API(apiname[:-1]), va
if '.' in apiname:
modname, impname = apiname.split('.')
# strip modname to support importname-only matching
# (CreateFileA, 0x401000)
yield API(impname), va
if is_aw_function(impname):
# (CreateFile, 0x401000)
yield API(impname[:-1]), va
def all_zeros(bytez):
return all(b == 0 for b in builtins.bytes(bytez))
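The over-generation in `generate_api_features` can be sketched without the `capa` feature classes. `aw_variants` is a hypothetical standalone helper that returns bare strings instead of `API` objects, and it simplifies `is_aw_function` to a trailing-A/W check.

```python
# standalone sketch of the name over-generation (bare strings, not API objects)
def aw_variants(apiname):
    names = [apiname]
    if len(apiname) >= 2 and apiname[-1] in ('A', 'W'):
        names.append(apiname[:-1])          # kernel32.CreateFile
    if '.' in apiname:
        modname, impname = apiname.split('.')
        names.append(impname)               # CreateFileA
        if len(impname) >= 2 and impname[-1] in ('A', 'W'):
            names.append(impname[:-1])      # CreateFile
    return names

# note: the real is_aw_function also requires a lowercase letter or digit
# before the trailing A/W; that check is omitted here for brevity
print(aw_variants('kernel32.CreateFileA'))
# ['kernel32.CreateFileA', 'kernel32.CreateFile', 'CreateFileA', 'CreateFile']
```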


@@ -0,0 +1,73 @@
import sys
import types
import idaapi
from capa.features.extractors import FeatureExtractor
import capa.features.extractors.ida.file
import capa.features.extractors.ida.insn
import capa.features.extractors.ida.helpers
import capa.features.extractors.ida.function
import capa.features.extractors.ida.basicblock
def get_va(self):
if isinstance(self, idaapi.BasicBlock):
return self.start_ea
if isinstance(self, idaapi.func_t):
return self.start_ea
if isinstance(self, idaapi.insn_t):
return self.ea
raise TypeError
def add_va_int_cast(o):
'''
dynamically add a cast-to-int (`__int__`) method to the given object
that returns the value of the `.va` property.
    this bit of skullduggery lets us cast IDA objects to ints.
    the correct way of doing this is to subclass the objects here.
'''
if sys.version_info >= (3, 0):
setattr(o, '__int__', types.MethodType(get_va, o))
else:
setattr(o, '__int__', types.MethodType(get_va, o, type(o)))
return o
class IdaFeatureExtractor(FeatureExtractor):
def __init__(self):
super(IdaFeatureExtractor, self).__init__()
def extract_file_features(self):
for feature, va in capa.features.extractors.ida.file.extract_features():
yield feature, va
def get_functions(self):
for f in capa.features.extractors.ida.helpers.get_functions(ignore_thunks=True, ignore_libs=True):
yield add_va_int_cast(f)
def extract_function_features(self, f):
for feature, va in capa.features.extractors.ida.function.extract_features(f):
yield feature, va
def get_basic_blocks(self, f):
for bb in idaapi.FlowChart(f, flags=idaapi.FC_PREDS):
yield add_va_int_cast(bb)
def extract_basic_block_features(self, f, bb):
for feature, va in capa.features.extractors.ida.basicblock.extract_features(f, bb):
yield feature, va
def get_instructions(self, f, bb):
for insn in capa.features.extractors.ida.helpers.get_instructions_in_range(bb.start_ea, bb.end_ea):
yield add_va_int_cast(insn)
def extract_insn_features(self, f, bb, insn):
for feature, va in capa.features.extractors.ida.insn.extract_features(f, bb, insn):
yield feature, va


@@ -0,0 +1,170 @@
import sys
import struct
import string
import pprint
import idautils
import idaapi
import idc
from capa.features.extractors.ida import helpers
from capa.features import Characteristic
from capa.features.basicblock import BasicBlock
from capa.features.extractors.helpers import MIN_STACKSTRING_LEN
def _ida_get_printable_len(op):
''' Return string length if all operand bytes are ascii or utf16-le printable
args:
op (IDA op_t)
'''
op_val = helpers.mask_op_val(op)
if op.dtype == idaapi.dt_byte:
chars = struct.pack('<B', op_val)
elif op.dtype == idaapi.dt_word:
chars = struct.pack('<H', op_val)
elif op.dtype == idaapi.dt_dword:
chars = struct.pack('<I', op_val)
elif op.dtype == idaapi.dt_qword:
chars = struct.pack('<Q', op_val)
else:
raise ValueError('Unhandled operand data type 0x%x.' % op.dtype)
def _is_printable_ascii(chars):
if sys.version_info >= (3, 0):
return all(c < 127 and chr(c) in string.printable for c in chars)
else:
return all(ord(c) < 127 and c in string.printable for c in chars)
def _is_printable_utf16le(chars):
if sys.version_info >= (3, 0):
if all(c == 0x00 for c in chars[1::2]):
return _is_printable_ascii(chars[::2])
else:
if all(c == '\x00' for c in chars[1::2]):
return _is_printable_ascii(chars[::2])
if _is_printable_ascii(chars):
return idaapi.get_dtype_size(op.dtype)
if _is_printable_utf16le(chars):
        return idaapi.get_dtype_size(op.dtype) // 2
return 0
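The printability test above can be exercised without IDA: pack an immediate little-endian and check whether it decodes as printable ASCII or as UTF-16LE (every other byte NUL). `printable_len` is a hypothetical standalone re-implementation for illustration, with the operand size passed explicitly instead of read from `op.dtype`.

```python
import struct
import string

def _printable_ascii(bs):
    # same test as _is_printable_ascii: every byte < 127 and printable
    return all(c < 127 and chr(c) in string.printable for c in bs)

def printable_len(value, size):
    # pack the immediate little-endian, like _ida_get_printable_len does
    chars = struct.pack({1: '<B', 2: '<H', 4: '<I', 8: '<Q'}[size], value)
    if _printable_ascii(chars):
        return size
    # UTF-16LE: odd bytes are NUL, even bytes are printable ASCII
    if all(c == 0 for c in chars[1::2]) and _printable_ascii(chars[::2]):
        return size // 2
    return 0

print(printable_len(0x41424344, 4))  # b'DCBA' -> 4
print(printable_len(0x00410042, 4))  # b'B\x00A\x00' -> 2
```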
def _is_mov_imm_to_stack(insn):
''' verify instruction moves immediate onto stack
args:
insn (IDA insn_t)
'''
if insn.Op2.type != idaapi.o_imm:
return False
if not helpers.is_op_stack_var(insn.ea, 0):
return False
if not insn.get_canon_mnem().startswith('mov'):
return False
return True
def _ida_bb_contains_stackstring(f, bb):
''' check basic block for stackstring indicators
true if basic block contains enough moves of constant bytes to the stack
args:
f (IDA func_t)
bb (IDA BasicBlock)
'''
count = 0
for insn in helpers.get_instructions_in_range(bb.start_ea, bb.end_ea):
if _is_mov_imm_to_stack(insn):
count += _ida_get_printable_len(insn.Op2)
if count > MIN_STACKSTRING_LEN:
return True
return False
def extract_bb_stackstring(f, bb):
''' extract stackstring indicators from basic block
args:
f (IDA func_t)
bb (IDA BasicBlock)
'''
if _ida_bb_contains_stackstring(f, bb):
yield Characteristic('stack string', True), bb.start_ea
def _ida_bb_contains_tight_loop(f, bb):
    ''' check basic block for tight loop indicators
true if last instruction in basic block branches to basic block start
args:
f (IDA func_t)
bb (IDA BasicBlock)
'''
bb_end = idc.prev_head(bb.end_ea)
if bb.start_ea < bb_end:
for ref in idautils.CodeRefsFrom(bb_end, True):
if ref == bb.start_ea:
return True
return False
def extract_bb_tight_loop(f, bb):
''' extract tight loop indicators from a basic block
args:
f (IDA func_t)
bb (IDA BasicBlock)
'''
if _ida_bb_contains_tight_loop(f, bb):
yield Characteristic('tight loop', True), bb.start_ea
def extract_features(f, bb):
''' extract basic block features
args:
f (IDA func_t)
bb (IDA BasicBlock)
'''
yield BasicBlock(), bb.start_ea
for bb_handler in BASIC_BLOCK_HANDLERS:
for feature, va in bb_handler(f, bb):
yield feature, va
BASIC_BLOCK_HANDLERS = (
extract_bb_tight_loop,
extract_bb_stackstring,
)
def main():
features = []
for f in helpers.get_functions(ignore_thunks=True, ignore_libs=True):
for bb in idaapi.FlowChart(f, flags=idaapi.FC_PREDS):
features.extend(list(extract_features(f, bb)))
pprint.pprint(features)
if __name__ == '__main__':
main()


@@ -0,0 +1,155 @@
import struct
import pprint
import idautils
import idaapi
import idc
from capa.features import String
from capa.features import Characteristic
from capa.features.file import Section
from capa.features.file import Export
from capa.features.file import Import
import capa.features.extractors.strings
import capa.features.extractors.helpers
import capa.features.extractors.ida.helpers
def _ida_check_segment_for_pe(seg):
''' check segment for embedded PE
adapted for IDA from:
https://github.com/vivisect/vivisect/blob/7be4037b1cecc4551b397f840405a1fc606f9b53/PE/carve.py#L19
args:
seg (IDA segment_t)
'''
seg_max = seg.end_ea
mz_xor = [(capa.features.extractors.helpers.xor_static(b'MZ', i),
capa.features.extractors.helpers.xor_static(b'PE', i),
i)
for i in range(256)]
todo = [(capa.features.extractors.ida.helpers.find_byte_sequence(seg.start_ea, seg.end_ea, mzx), mzx, pex, i) for mzx, pex, i in mz_xor]
todo = [(off, mzx, pex, i) for (off, mzx, pex, i) in todo if off != idaapi.BADADDR]
while len(todo):
off, mzx, pex, i = todo.pop()
        # the MZ header has one field we will check: e_lfanew, at offset 0x3c
e_lfanew = off + 0x3c
if seg_max < (e_lfanew + 4):
continue
newoff = struct.unpack('<I', capa.features.extractors.helpers.xor_static(idc.get_bytes(e_lfanew, 4), i))[0]
peoff = off + newoff
if seg_max < (peoff + 2):
continue
if idc.get_bytes(peoff, 2) == pex:
yield (off, i)
nextres = capa.features.extractors.ida.helpers.find_byte_sequence(off + 1, seg.end_ea, mzx)
        if nextres != idaapi.BADADDR:
todo.append((nextres, mzx, pex, i))
def extract_file_embedded_pe():
''' extract embedded PE features
IDA must load resource sections for this to be complete
- '-R' from console
- Check 'Load resource sections' when opening binary in IDA manually
'''
for seg in capa.features.extractors.ida.helpers.get_segments():
if seg.is_header_segm():
# IDA may load header segments, skip if present
continue
for ea, _ in _ida_check_segment_for_pe(seg):
yield Characteristic('embedded pe', True), ea
def extract_file_export_names():
''' extract function exports '''
for _, _, ea, name in idautils.Entries():
yield Export(name), ea
def extract_file_import_names():
''' extract function imports
1. imports by ordinal:
- modulename.#ordinal
2. imports by name, results in two features to support importname-only
matching:
- modulename.importname
- importname
'''
for ea, imp_info in capa.features.extractors.ida.helpers.get_file_imports().items():
dllname, name, ordi = imp_info
if name:
yield Import('%s.%s' % (dllname, name)), ea
yield Import(name), ea
if ordi:
yield Import('%s.#%s' % (dllname, str(ordi))), ea
def extract_file_section_names():
''' extract section names
IDA must load resource sections for this to be complete
- '-R' from console
- Check 'Load resource sections' when opening binary in IDA manually
'''
for seg in capa.features.extractors.ida.helpers.get_segments():
if seg.is_header_segm():
# IDA may load header segments, skip if present
continue
yield Section(idaapi.get_segm_name(seg)), seg.start_ea
def extract_file_strings():
''' extract ASCII and UTF-16 LE strings
IDA must load resource sections for this to be complete
- '-R' from console
- Check 'Load resource sections' when opening binary in IDA manually
'''
for seg in capa.features.extractors.ida.helpers.get_segments():
seg_buff = capa.features.extractors.ida.helpers.get_segment_buffer(seg)
for s in capa.features.extractors.strings.extract_ascii_strings(seg_buff):
yield String(s.s), (seg.start_ea + s.offset)
for s in capa.features.extractors.strings.extract_unicode_strings(seg_buff):
yield String(s.s), (seg.start_ea + s.offset)
def extract_features():
''' extract file features '''
for file_handler in FILE_HANDLERS:
for feature, va in file_handler():
yield feature, va
FILE_HANDLERS = (
extract_file_export_names,
extract_file_import_names,
extract_file_strings,
extract_file_section_names,
extract_file_embedded_pe,
)
def main():
pprint.pprint(list(extract_features()))
if __name__ == '__main__':
main()


@@ -0,0 +1,100 @@
import pprint
import idautils
import idaapi
from capa.features import Characteristic
from capa.features.extractors import loops
from capa.features.extractors.ida import helpers
def _ida_function_contains_switch(f):
''' check a function for switch statement indicators
adapted from:
https://reverseengineering.stackexchange.com/questions/17548/calc-switch-cases-in-idapython-cant-iterate-over-results?rq=1
arg:
f (IDA func_t)
'''
for start, end in idautils.Chunks(f.start_ea):
for head in idautils.Heads(start, end):
if idaapi.get_switch_info(head):
return True
return False
def extract_function_switch(f):
''' extract switch indicators from a function
arg:
f (IDA func_t)
'''
if _ida_function_contains_switch(f):
yield Characteristic('switch', True), f.start_ea
def extract_function_calls_to(f):
''' extract callers to a function
args:
f (IDA func_t)
'''
for ea in idautils.CodeRefsTo(f.start_ea, True):
yield Characteristic('calls to', True), ea
def extract_function_loop(f):
''' extract loop indicators from a function
args:
f (IDA func_t)
'''
    edges = []
    for bb in idaapi.FlowChart(f):
        # note: map() is lazy in Python 3, so collect edges with an explicit loop
        for s in bb.succs():
            edges.append((bb.start_ea, s.start_ea))
if edges and loops.has_loop(edges):
yield Characteristic('loop', True), f.start_ea
def extract_recursive_call(f):
''' extract recursive function call
args:
f (IDA func_t)
'''
for ref in idautils.CodeRefsTo(f.start_ea, True):
if f.contains(ref):
yield Characteristic('recursive call', True), f.start_ea
break
def extract_features(f):
''' extract function features
arg:
f (IDA func_t)
'''
for func_handler in FUNCTION_HANDLERS:
for feature, va in func_handler(f):
yield feature, va
FUNCTION_HANDLERS = (
extract_function_calls_to,
extract_function_switch,
extract_function_loop,
extract_recursive_call
)
def main():
features = []
for f in helpers.get_functions(ignore_thunks=True, ignore_libs=True):
features.extend(list(extract_features(f)))
pprint.pprint(features)
if __name__ == '__main__':
main()


@@ -0,0 +1,298 @@
import sys
import string
import idautils
import idaapi
import idc
def find_byte_sequence(start, end, seq):
''' find byte sequence
args:
start: min virtual address
end: max virtual address
seq: bytes to search e.g. b'\x01\x03'
'''
if sys.version_info >= (3, 0):
return idaapi.find_binary(start, end, ' '.join(['%02x' % b for b in seq]), 0, idaapi.SEARCH_DOWN)
else:
return idaapi.find_binary(start, end, ' '.join(['%02x' % ord(b) for b in seq]), 0, idaapi.SEARCH_DOWN)
def get_functions(start=None, end=None, ignore_thunks=False, ignore_libs=False):
''' get functions, range optional
args:
start: min virtual address
end: max virtual address
ret:
yield func_t*
'''
for ea in idautils.Functions(start=start, end=end):
f = idaapi.get_func(ea)
if ignore_thunks and f.flags & idaapi.FUNC_THUNK:
continue
if ignore_libs and f.flags & idaapi.FUNC_LIB:
continue
yield f
def get_segments():
''' Get list of segments (sections) in the binary image '''
for n in range(idaapi.get_segm_qty()):
seg = idaapi.getnseg(n)
if seg:
yield seg
def get_segment_buffer(seg):
''' return bytes stored in a given segment
decrease buffer size until IDA is able to read bytes from the segment
'''
buff = b''
sz = seg.end_ea - seg.start_ea
while sz > 0:
buff = idaapi.get_bytes(seg.start_ea, sz)
if buff:
break
sz -= 0x1000
# IDA returns None if get_bytes fails, so convert for consistent return type
return buff if buff else b''
def get_file_imports():
''' get file imports '''
_imports = {}
for idx in range(idaapi.get_import_module_qty()):
dllname = idaapi.get_import_module_name(idx)
if not dllname:
continue
def _inspect_import(ea, name, ordi):
if name and name.startswith('__imp_'):
            # strip the '__imp_' prefix from mangled import names
name = name[len('__imp_'):]
_imports[ea] = (dllname.lower(), name, ordi)
return True
idaapi.enum_import_names(idx, _inspect_import)
return _imports
def get_instructions_in_range(start, end):
''' yield instructions in range
args:
start: virtual address (inclusive)
end: virtual address (exclusive)
yield:
(insn_t*)
'''
for head in idautils.Heads(start, end):
inst = idautils.DecodeInstruction(head)
if inst:
yield inst
def is_operand_equal(op1, op2):
''' compare two IDA op_t '''
if op1.flags != op2.flags:
return False
if op1.dtype != op2.dtype:
return False
if op1.type != op2.type:
return False
if op1.reg != op2.reg:
return False
if op1.phrase != op2.phrase:
return False
if op1.value != op2.value:
return False
if op1.addr != op2.addr:
return False
return True
def is_basic_block_equal(bb1, bb2):
''' compare two IDA BasicBlock '''
return bb1.start_ea == bb2.start_ea \
and bb1.end_ea == bb2.end_ea \
and bb1.type == bb2.type
def basic_block_size(bb):
''' calculate size of basic block '''
return bb.end_ea - bb.start_ea
def read_bytes_at(ea, count):
segm_end = idc.get_segm_end(ea)
if ea + count > segm_end:
return idc.get_bytes(ea, segm_end - ea)
else:
return idc.get_bytes(ea, count)
def find_string_at(ea, min_len=4):
    ''' check if ASCII string exists at a given virtual address '''
    found = idaapi.get_strlit_contents(ea, -1, idaapi.STRTYPE_C)
    if found and len(found) > min_len:
try:
found = found.decode('ascii')
# hacky check for IDA bug; get_strlit_contents also reads Unicode as
# myy__uunniiccoodde when searching in ASCII mode so we check for that here
# and return the fixed up value
if len(found) >= 3 and found[1::2] == found[2::2]:
found = found[0] + found[1::2]
return found
except UnicodeDecodeError:
pass
return None
def get_op_phrase_info(op):
''' parse phrase features from operand
Pretty much dup of sark's implementation:
https://github.com/tmr232/Sark/blob/master/sark/code/instruction.py#L28-L73
'''
if op.type not in (idaapi.o_phrase, idaapi.o_displ):
return
scale = 1 << ((op.specflag2 & 0xC0) >> 6)
offset = op.addr
if op.specflag1 == 0:
index = None
base = op.reg
elif op.specflag1 == 1:
index = (op.specflag2 & 0x38) >> 3
base = (op.specflag2 & 0x07) >> 0
if op.reg == 0xC:
if base & 4:
base += 8
if index & 4:
index += 8
else:
return
if (index == base == idautils.procregs.sp.reg) and (scale == 1):
# HACK: This is a really ugly hack. For some reason, phrases of the form `[esp + ...]` (`sp`, `rsp` as well)
# set both the `index` and the `base` to `esp`. This is not significant, as `esp` cannot be used as an
# index, but it does cause issues with the parsing.
# This is only relevant to Intel architectures.
index = None
return {'base': base, 'index': index, 'scale': scale, 'offset': offset}
def is_op_write(insn, op):
''' Check if an operand is written to (destination operand) '''
return idaapi.has_cf_chg(insn.get_canon_feature(), op.n)
def is_op_read(insn, op):
''' Check if an operand is read from (source operand) '''
return idaapi.has_cf_use(insn.get_canon_feature(), op.n)
def is_sp_modified(insn):
''' determine if instruction modifies SP, ESP, RSP '''
for op in get_insn_ops(insn, op_type=(idaapi.o_reg,)):
if op.reg != idautils.procregs.sp.reg:
continue
if is_op_write(insn, op):
return True
return False
def is_bp_modified(insn):
''' check if instruction modifies BP, EBP, RBP '''
for op in get_insn_ops(insn, op_type=(idaapi.o_reg,)):
if op.reg != idautils.procregs.bp.reg:
continue
if is_op_write(insn, op):
return True
return False
def is_frame_register(reg):
''' check if register is sp or bp '''
return reg in (idautils.procregs.sp.reg, idautils.procregs.bp.reg)
def get_insn_ops(insn, op_type=None):
''' yield op_t for instruction, filter on type if specified '''
for op in insn.ops:
if op.type == idaapi.o_void:
# avoid looping all 6 ops if only subset exists
break
if op_type and op.type not in op_type:
continue
yield op
def ea_flags(ea):
''' retrieve processor flags for a given address '''
return idaapi.get_flags(ea)
def is_op_stack_var(ea, n):
''' check if operand is a stack variable '''
return idaapi.is_stkvar(ea_flags(ea), n)
def mask_op_val(op):
''' mask off a value based on data type
    necessary due to a bug in 64-bit IDA
Example:
.rsrc:0054C12C mov [ebp+var_4], 0FFFFFFFFh
insn.Op2.dtype == idaapi.dt_dword
insn.Op2.value == 0xffffffffffffffff
'''
masks = {
idaapi.dt_byte: 0xFF,
idaapi.dt_word: 0xFFFF,
idaapi.dt_dword: 0xFFFFFFFF,
idaapi.dt_qword: 0xFFFFFFFFFFFFFFFF
}
mask = masks.get(op.dtype, None)
if not mask:
raise ValueError('No support for operand data type 0x%x' % op.dtype)
return mask & op.value
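The masking can be shown in plain Python. `mask_value` is a hypothetical standalone version of `mask_op_val` with the operand size passed explicitly, illustrating how the sign-extended value from the docstring example collapses back to a dword.

```python
# standalone sketch of mask_op_val: mask a value by its operand size in bytes
MASKS = {1: 0xFF, 2: 0xFFFF, 4: 0xFFFFFFFF, 8: 0xFFFFFFFFFFFFFFFF}

def mask_value(value, size):
    try:
        return MASKS[size] & value
    except KeyError:
        raise ValueError('No support for operand size %d' % size)

# the dword immediate 0FFFFFFFFh reported as a sign-extended qword
print(hex(mask_value(0xFFFFFFFFFFFFFFFF, 4)))  # 0xffffffff
```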
def ea_to_offset(ea):
''' convert virtual address to file offset '''
return idaapi.get_fileregion_offset(ea)


@@ -0,0 +1,420 @@
import pprint
import idautils
import idaapi
import idc
from capa.features import String
from capa.features import Bytes
from capa.features import Characteristic
from capa.features import MAX_BYTES_FEATURE_SIZE
from capa.features.insn import Number
from capa.features.insn import Offset
from capa.features.insn import Mnemonic
import capa.features.extractors.helpers
import capa.features.extractors.ida.helpers
_file_imports_cache = None
def get_imports():
global _file_imports_cache
if _file_imports_cache is None:
_file_imports_cache = capa.features.extractors.ida.helpers.get_file_imports()
return _file_imports_cache
def _check_for_api_call(insn):
''' check instruction for API call '''
if not idaapi.is_call_insn(insn):
return
for call_ref in idautils.CodeRefsFrom(insn.ea, False):
imp = get_imports().get(call_ref, None)
if imp:
yield '%s.%s' % (imp[0], imp[1])
else:
f = idaapi.get_func(call_ref)
if f and f.flags & idaapi.FUNC_THUNK:
# check if call to thunk
# TODO: first instruction might not always be the thunk
for thunk_ref in idautils.DataRefsFrom(call_ref):
# TODO: always data ref for thunk??
imp = get_imports().get(thunk_ref, None)
if imp:
yield '%s.%s' % (imp[0], imp[1])
def extract_insn_api_features(f, bb, insn):
''' parse instruction API features
args:
f (IDA func_t)
bb (IDA BasicBlock)
insn (IDA insn_t)
example:
call dword [0x00473038]
'''
for api_name in _check_for_api_call(insn):
for feature, va in capa.features.extractors.helpers.generate_api_features(api_name, insn.ea):
yield feature, va
def extract_insn_number_features(f, bb, insn):
''' parse instruction number features
args:
f (IDA func_t)
bb (IDA BasicBlock)
insn (IDA insn_t)
example:
push 3136B0h ; dwControlCode
'''
if idaapi.is_ret_insn(insn):
# skip things like:
# .text:0042250E retn 8
return
if capa.features.extractors.ida.helpers.is_sp_modified(insn):
# skip things like:
# .text:00401145 add esp, 0Ch
return
for op in capa.features.extractors.ida.helpers.get_insn_ops(insn, op_type=(idaapi.o_imm,)):
op_val = capa.features.extractors.ida.helpers.mask_op_val(op)
if idaapi.is_mapped(op_val):
# assume valid address is not a constant
continue
yield Number(op_val), insn.ea
def extract_insn_bytes_features(f, bb, insn):
''' parse referenced byte sequences
args:
f (IDA func_t)
bb (IDA BasicBlock)
insn (IDA insn_t)
example:
push offset iid_004118d4_IShellLinkA ; riid
'''
if idaapi.is_call_insn(insn):
# ignore call instructions
return
for ref in idautils.DataRefsFrom(insn.ea):
extracted_bytes = capa.features.extractors.ida.helpers.read_bytes_at(ref, MAX_BYTES_FEATURE_SIZE)
if extracted_bytes:
if not capa.features.extractors.helpers.all_zeros(extracted_bytes):
yield Bytes(extracted_bytes), insn.ea
def extract_insn_string_features(f, bb, insn):
''' parse instruction string features
args:
f (IDA func_t)
bb (IDA BasicBlock)
insn (IDA insn_t)
example:
push offset aAcr ; "ACR > "
'''
for ref in idautils.DataRefsFrom(insn.ea):
found = capa.features.extractors.ida.helpers.find_string_at(ref)
if found:
yield String(found), insn.ea
def extract_insn_offset_features(f, bb, insn):
''' parse instruction structure offset features
args:
f (IDA func_t)
bb (IDA BasicBlock)
insn (IDA insn_t)
example:
.text:0040112F cmp [esi+4], ebx
'''
for op in capa.features.extractors.ida.helpers.get_insn_ops(insn, op_type=(idaapi.o_phrase, idaapi.o_displ)):
if capa.features.extractors.ida.helpers.is_op_stack_var(insn.ea, op.n):
# skip stack offsets
continue
p_info = capa.features.extractors.ida.helpers.get_op_phrase_info(op)
if not p_info:
continue
op_off = p_info['offset']
if 0 == op_off:
# TODO: Do we want to record offset of zero?
continue
if idaapi.is_mapped(op_off):
# Ignore:
# mov esi, dword_1005B148[esi]
continue
# TODO: Do we handle two's complement?
yield Offset(op_off), insn.ea
def _contains_stack_cookie_keywords(s):
''' check if string contains stack cookie keywords
Examples:
xor ecx, ebp ; StackCookie
mov eax, ___security_cookie
'''
if not s:
return False
s = s.strip().lower()
if 'cookie' not in s:
return False
return any(keyword in s for keyword in ('stack', 'security'))
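The keyword check is pure Python and easy to exercise against the docstring's own examples. `contains_stack_cookie_keywords` below is a standalone copy of the logic for illustration.

```python
# standalone sketch of _contains_stack_cookie_keywords
def contains_stack_cookie_keywords(s):
    if not s:
        return False
    s = s.strip().lower()
    return 'cookie' in s and any(k in s for k in ('stack', 'security'))

print(contains_stack_cookie_keywords('xor ecx, ebp ; StackCookie'))   # True
print(contains_stack_cookie_keywords('mov eax, ___security_cookie'))  # True
print(contains_stack_cookie_keywords('xor eax, ebp'))                 # False
```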
def _bb_stack_cookie_registers(bb):
''' scan basic block for stack cookie operations
yield registers ids that may have been used for stack cookie operations
assume instruction that sets stack cookie and nzxor exist in same block
and stack cookie register is not modified prior to nzxor
Example:
.text:004062DA mov eax, ___security_cookie <-- stack cookie
.text:004062DF mov ecx, eax
.text:004062E1 mov ebx, [esi]
.text:004062E3 and ecx, 1Fh
.text:004062E6 mov edi, [esi+4]
.text:004062E9 xor ebx, eax
.text:004062EB mov esi, [esi+8]
.text:004062EE xor edi, eax <-- ignore
.text:004062F0 xor esi, eax <-- ignore
.text:004062F2 ror edi, cl
.text:004062F4 ror esi, cl
.text:004062F6 ror ebx, cl
.text:004062F8 cmp edi, esi
.text:004062FA jnz loc_40639D
TODO: this is expensive, but necessary?...
'''
for insn in capa.features.extractors.ida.helpers.get_instructions_in_range(bb.start_ea, bb.end_ea):
if _contains_stack_cookie_keywords(idc.GetDisasm(insn.ea)):
for op in capa.features.extractors.ida.helpers.get_insn_ops(insn, op_type=(idaapi.o_reg,)):
if capa.features.extractors.ida.helpers.is_op_write(insn, op):
# only include modified registers
yield op.reg
def _is_nzxor_stack_cookie(f, bb, insn):
''' check if nzxor is related to stack cookie '''
if _contains_stack_cookie_keywords(idaapi.get_cmt(insn.ea, False)):
# Example:
# xor ecx, ebp ; StackCookie
return True
if any(op_reg in _bb_stack_cookie_registers(bb) for op_reg in (insn.Op1.reg, insn.Op2.reg)):
# Example:
# mov eax, ___security_cookie
# xor eax, ebp
return True
return False
def extract_insn_nzxor_characteristic_features(f, bb, insn):
''' parse instruction non-zeroing XOR instruction
ignore expected non-zeroing XORs, e.g. security cookies
args:
f (IDA func_t)
bb (IDA BasicBlock)
insn (IDA insn_t)
'''
if insn.itype != idaapi.NN_xor:
return
if capa.features.extractors.ida.helpers.is_operand_equal(insn.Op1, insn.Op2):
return
if _is_nzxor_stack_cookie(f, bb, insn):
return
yield Characteristic('nzxor', True), insn.ea
def extract_insn_mnemonic_features(f, bb, insn):
''' parse instruction mnemonic features
args:
f (IDA func_t)
bb (IDA BasicBlock)
insn (IDA insn_t)
'''
yield Mnemonic(insn.get_canon_mnem()), insn.ea
def extract_insn_peb_access_characteristic_features(f, bb, insn):
''' parse instruction peb access
fs:[0x30] on x86, gs:[0x60] on x64
TODO:
IDA should be able to do this..
'''
if insn.itype not in (idaapi.NN_push, idaapi.NN_mov):
return
    if all(op.type != idaapi.o_mem for op in insn.ops):
        # only consider memory references, as an optimization
        return
disasm = idc.GetDisasm(insn.ea)
if ' fs:30h' in disasm or ' gs:60h' in disasm:
# TODO: replace above with proper IDA
yield Characteristic('peb access', True), insn.ea
def extract_insn_segment_access_features(f, bb, insn):
''' parse instruction fs or gs access
TODO:
IDA should be able to do this...
'''
    if all(op.type != idaapi.o_mem for op in insn.ops):
        # only consider memory references, as an optimization
        return
disasm = idc.GetDisasm(insn.ea)
if ' fs:' in disasm:
# TODO: replace above with proper IDA
yield Characteristic('fs access', True), insn.ea
if ' gs:' in disasm:
# TODO: replace above with proper IDA
yield Characteristic('gs access', True), insn.ea
def extract_insn_cross_section_cflow(f, bb, insn):
''' inspect the instruction for a CALL or JMP that crosses section boundaries
args:
f (IDA func_t)
bb (IDA BasicBlock)
insn (IDA insn_t)
'''
for ref in idautils.CodeRefsFrom(insn.ea, False):
if ref in get_imports().keys():
# ignore API calls
continue
if not idaapi.getseg(ref):
# handle IDA API bug
continue
if idaapi.getseg(ref) == idaapi.getseg(insn.ea):
continue
yield Characteristic('cross section flow', True), insn.ea
def extract_function_calls_from(f, bb, insn):
''' extract functions calls from features
    most relevant at the function scope; however, it's most efficient to extract at the instruction scope
args:
f (IDA func_t)
bb (IDA BasicBlock)
insn (IDA insn_t)
'''
if not idaapi.is_call_insn(insn):
# ignore jmp, etc.
return
for ref in idautils.CodeRefsFrom(insn.ea, False):
yield Characteristic('calls from', True), ref
def extract_function_indirect_call_characteristic_features(f, bb, insn):
''' extract indirect function calls (e.g., call eax or call dword ptr [edx+4])
does not include calls like => call ds:dword_ABD4974
most relevant at the function or basic block scope;
    however, it's most efficient to extract at the instruction scope
args:
f (IDA func_t)
bb (IDA BasicBlock)
insn (IDA insn_t)
'''
if not idaapi.is_call_insn(insn):
return
if idc.get_operand_type(insn.ea, 0) in (idc.o_reg, idc.o_phrase, idc.o_displ):
yield Characteristic('indirect call', True), insn.ea
def extract_features(f, bb, insn):
''' extract instruction features
args:
f (IDA func_t)
bb (IDA BasicBlock)
insn (IDA insn_t)
'''
for inst_handler in INSTRUCTION_HANDLERS:
for feature, va in inst_handler(f, bb, insn):
yield feature, va
INSTRUCTION_HANDLERS = (
extract_insn_api_features,
extract_insn_number_features,
extract_insn_bytes_features,
extract_insn_string_features,
extract_insn_offset_features,
extract_insn_nzxor_characteristic_features,
extract_insn_mnemonic_features,
extract_insn_peb_access_characteristic_features,
extract_insn_cross_section_cflow,
extract_insn_segment_access_features,
extract_function_calls_from,
extract_function_indirect_call_characteristic_features
)
def main():
features = []
for f in capa.features.extractors.ida.helpers.get_functions(ignore_thunks=True, ignore_libs=True):
for bb in idaapi.FlowChart(f, flags=idaapi.FC_PREDS):
for insn in capa.features.extractors.ida.helpers.get_instructions_in_range(bb.start_ea, bb.end_ea):
features.extend(list(extract_features(f, bb, insn)))
pprint.pprint(features)
if __name__ == '__main__':
main()


@@ -0,0 +1,17 @@
from networkx.algorithms.components import strongly_connected_components
import networkx as nx
def has_loop(edges, threshold=2):
''' check if a list of edges representing a directed graph contains a loop
args:
edges: list of edge sets representing a directed graph i.e. [(1, 2), (2, 1)]
threshold: min number of nodes contained in loop
returns:
bool
'''
g = nx.DiGraph()
g.add_edges_from(edges)
return any(len(comp) >= threshold for comp in strongly_connected_components(g))
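For the default threshold of 2, the SCC test is equivalent to asking whether some edge `(u, v)` with `u != v` closes a cycle, i.e. `v` can reach `u`. The stdlib-only sketch below (`has_loop_simple`, a hypothetical helper, not the networkx-based implementation above) shows that equivalence; it does not generalize to larger thresholds.

```python
from collections import defaultdict, deque

def has_loop_simple(edges):
    # equivalent to has_loop(edges, threshold=2): some non-self edge (u, v)
    # closes a cycle, i.e. u is reachable from v
    succs = defaultdict(list)
    for u, v in edges:
        succs[u].append(v)

    def reaches(src, dst):
        seen, todo = set(), deque([src])
        while todo:
            n = todo.popleft()
            if n == dst:
                return True
            if n in seen:
                continue
            seen.add(n)
            todo.extend(succs[n])
        return False

    return any(u != v and reaches(v, u) for u, v in edges)

print(has_loop_simple([(1, 2), (2, 3), (3, 1)]))  # True
print(has_loop_simple([(1, 2), (2, 3)]))          # False
```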


@@ -0,0 +1,98 @@
# Copyright (C) 2017 FireEye, Inc. All Rights Reserved.
#
# strings code from FLOSS, https://github.com/fireeye/flare-floss
#
import re
from collections import namedtuple
ASCII_BYTE = r" !\"#\$%&\'\(\)\*\+,-\./0123456789:;<=>\?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\]\^_`abcdefghijklmnopqrstuvwxyz\{\|\}\\\~\t".encode('ascii')
ASCII_RE_4 = re.compile(b"([%s]{%d,})" % (ASCII_BYTE, 4))
UNICODE_RE_4 = re.compile(b"((?:[%s]\x00){%d,})" % (ASCII_BYTE, 4))
REPEATS = [b"A", b"\x00", b"\xfe", b"\xff"]
SLICE_SIZE = 4096
String = namedtuple("String", ["s", "offset"])
def buf_filled_with(buf, character):
dupe_chunk = character * SLICE_SIZE
for offset in range(0, len(buf), SLICE_SIZE):
new_chunk = buf[offset: offset + SLICE_SIZE]
if dupe_chunk[:len(new_chunk)] != new_chunk:
return False
return True
def extract_ascii_strings(buf, n=4):
'''
Extract ASCII strings from the given binary data.
:param buf: A bytestring.
:type buf: str
:param n: The minimum length of strings to extract.
:type n: int
:rtype: Sequence[String]
'''
if not buf:
return
if (buf[0] in REPEATS) and buf_filled_with(buf, buf[0]):
return
r = None
if n == 4:
r = ASCII_RE_4
else:
reg = b"([%s]{%d,})" % (ASCII_BYTE, n)
r = re.compile(reg)
for match in r.finditer(buf):
yield String(match.group().decode("ascii"), match.start())
def extract_unicode_strings(buf, n=4):
'''
Extract naive UTF-16 strings from the given binary data.
:param buf: A bytestring.
:type buf: str
:param n: The minimum length of strings to extract.
:type n: int
:rtype: Sequence[String]
'''
if not buf:
return
if (buf[0] in REPEATS) and buf_filled_with(buf, buf[0]):
return
if n == 4:
r = UNICODE_RE_4
else:
reg = b"((?:[%s]\x00){%d,})" % (ASCII_BYTE, n)
r = re.compile(reg)
for match in r.finditer(buf):
try:
yield String(match.group().decode("utf-16"), match.start())
except UnicodeDecodeError:
pass
def main():
import sys
with open(sys.argv[1], 'rb') as f:
b = f.read()
for s in extract_ascii_strings(b):
print('0x{:x}: {:s}'.format(s.offset, s.s))
for s in extract_unicode_strings(b):
print('0x{:x}: {:s}'.format(s.offset, s.s))
if __name__ == '__main__':
main()
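The regex-based extraction above can be illustrated with a trimmed-down, self-contained sketch; the `[ -~\t]` character class is a simplification of `ASCII_BYTE`, not the exact class used above:

```python
import re
from collections import namedtuple

String = namedtuple('String', ['s', 'offset'])

# simplified sketch of extract_ascii_strings: printable ascii (plus tab)
# runs of at least n bytes, reported with their file offset.
def extract_ascii_strings(buf, n=4):
    pattern = re.compile(b'([ -~\t]{%d,})' % n)
    for match in pattern.finditer(buf):
        yield String(match.group().decode('ascii'), match.start())

found = list(extract_ascii_strings(b'\x00\x01hello\xffworld!\x00ab\x00'))
print(found)  # 'hello' at offset 2, 'world!' at offset 8; 'ab' is too short
```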


@@ -0,0 +1,73 @@
import types
import viv_utils
import capa.features.extractors
import capa.features.extractors.viv.file
import capa.features.extractors.viv.function
import capa.features.extractors.viv.basicblock
import capa.features.extractors.viv.insn
from capa.features.extractors import FeatureExtractor
import file
import function
import basicblock
import insn
__all__ = ["file", "function", "basicblock", "insn"]
def get_va(self):
try:
# vivisect type
return self.va
except AttributeError:
pass
raise TypeError()
def add_va_int_cast(o):
'''
dynamically add a cast-to-int (`__int__`) method to the given object
that returns the value of the `.va` property.
this bit of skullduggery lets us cast viv-utils objects as ints.
the correct way of doing this is to update viv-utils (or subclass the objects here).
'''
setattr(o, '__int__', types.MethodType(get_va, o, type(o)))
return o
class VivisectFeatureExtractor(FeatureExtractor):
def __init__(self, vw, path):
super(VivisectFeatureExtractor, self).__init__()
self.vw = vw
self.path = path
def extract_file_features(self):
for feature, va in capa.features.extractors.viv.file.extract_features(self.vw, self.path):
yield feature, va
def get_functions(self):
for va in sorted(self.vw.getFunctions()):
yield add_va_int_cast(viv_utils.Function(self.vw, va))
def extract_function_features(self, f):
for feature, va in capa.features.extractors.viv.function.extract_features(f):
yield feature, va
def get_basic_blocks(self, f):
for bb in f.basic_blocks:
yield add_va_int_cast(bb)
def extract_basic_block_features(self, f, bb):
for feature, va in capa.features.extractors.viv.basicblock.extract_features(f, bb):
yield feature, va
def get_instructions(self, f, bb):
for insn in bb.instructions:
yield add_va_int_cast(insn)
def extract_insn_features(self, f, bb, insn):
for feature, va in capa.features.extractors.viv.insn.extract_features(f, bb, insn):
yield feature, va


@@ -0,0 +1,147 @@
import struct
import string
import envi
import vivisect.const
from capa.features import Characteristic
from capa.features.basicblock import BasicBlock
from capa.features.extractors.helpers import MIN_STACKSTRING_LEN
def interface_extract_basic_block_XXX(f, bb):
'''
parse features from the given basic block.
args:
f (viv_utils.Function): the function to process.
bb (viv_utils.BasicBlock): the basic block to process.
yields:
(Feature, int): the feature and the address at which it's found.
'''
yield NotImplementedError('feature'), NotImplementedError('virtual address')
def _bb_has_tight_loop(f, bb):
'''
parse tight loops, true if last instruction in basic block branches to bb start
'''
if len(bb.instructions) > 0:
for bva, bflags in bb.instructions[-1].getBranches():
if bflags & vivisect.envi.BR_COND:
if bva == bb.va:
return True
return False
def extract_bb_tight_loop(f, bb):
''' check basic block for tight loop indicators '''
if _bb_has_tight_loop(f, bb):
yield Characteristic('tight loop', True), bb.va
def _bb_has_stackstring(f, bb):
'''
extract potential stackstring creation, using the following heuristics:
- basic block contains enough moves of constant bytes to the stack
'''
count = 0
for instr in bb.instructions:
if is_mov_imm_to_stack(instr):
# add number of operand bytes
src = instr.getOperands()[1]
count += get_printable_len(src)
if count > MIN_STACKSTRING_LEN:
return True
return False
def extract_stackstring(f, bb):
''' check basic block for stackstring indicators '''
if _bb_has_stackstring(f, bb):
yield Characteristic('stack string', True), bb.va
def is_mov_imm_to_stack(instr):
'''
Return if instruction moves immediate onto stack
'''
if not instr.mnem.startswith('mov'):
return False
try:
dst, src = instr.getOperands()
except ValueError:
# not two operands
return False
if not src.isImmed():
return False
# TODO what about 64-bit operands?
if not isinstance(dst, envi.archs.i386.disasm.i386SibOper) and \
not isinstance(dst, envi.archs.i386.disasm.i386RegMemOper):
return False
if not dst.reg:
return False
rname = dst._dis_regctx.getRegisterName(dst.reg)
if rname not in ['ebp', 'rbp', 'esp', 'rsp']:
return False
return True
def get_printable_len(oper):
'''
Return string length if all operand bytes are ascii or utf16-le printable
'''
if oper.tsize == 1:
chars = struct.pack('<B', oper.imm)
elif oper.tsize == 2:
chars = struct.pack('<H', oper.imm)
elif oper.tsize == 4:
chars = struct.pack('<I', oper.imm)
elif oper.tsize == 8:
chars = struct.pack('<Q', oper.imm)
else:
# unexpected operand size
return 0
if is_printable_ascii(chars):
return oper.tsize
if is_printable_utf16le(chars):
return oper.tsize // 2
return 0
def is_printable_ascii(chars):
return all(ord(c) < 127 and c in string.printable for c in chars)
def is_printable_utf16le(chars):
if all(c == '\x00' for c in chars[1::2]):
return is_printable_ascii(chars[::2])
return False
def extract_features(f, bb):
'''
extract features from the given basic block.
args:
f (viv_utils.Function): the function from which to extract features
bb (viv_utils.BasicBlock): the basic block to process.
yields:
Feature, set[VA]: the features and their location found in this basic block.
'''
yield BasicBlock(), bb.va
for bb_handler in BASIC_BLOCK_HANDLERS:
for feature, va in bb_handler(f, bb):
yield feature, va
BASIC_BLOCK_HANDLERS = (
extract_bb_tight_loop,
extract_stackstring,
)
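The stackstring heuristic above counts printable bytes of immediate operands moved to the stack. A standalone Python 3 sketch of that printable-length check (`printable_len` is a hypothetical helper mirroring `get_printable_len`, operating on a raw immediate and operand size rather than a vivisect operand):

```python
import string
import struct

# pack an immediate into its operand-sized byte string, then count how
# many bytes decode as printable ascii or utf-16le (ascii chars with
# interleaved NUL bytes).
def printable_len(imm, size):
    try:
        chars = struct.pack({1: '<B', 2: '<H', 4: '<I', 8: '<Q'}[size], imm)
    except KeyError:
        # unexpected operand size
        return 0
    if all(c < 127 and chr(c) in string.printable for c in chars):
        return size
    if all(c == 0 for c in chars[1::2]) \
            and all(c < 127 and chr(c) in string.printable for c in chars[::2]):
        return size // 2
    return 0

print(printable_len(0x41414141, 4))  # 4: b'AAAA' is printable ascii
print(printable_len(0x0041, 2))      # 1: b'A\x00' is utf-16le 'A'
```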


@@ -0,0 +1,102 @@
import PE.carve as pe_carve # vivisect PE
from capa.features import Characteristic
from capa.features.file import Export
from capa.features.file import Import
from capa.features.file import Section
from capa.features import String
import capa.features.extractors.strings
def extract_file_embedded_pe(vw, file_path):
with open(file_path, 'rb') as f:
fbytes = f.read()
for offset, i in pe_carve.carve(fbytes, 1):
yield Characteristic('embedded pe', True), offset
def extract_file_export_names(vw, file_path):
for va, etype, name, _ in vw.getExports():
yield Export(name), va
def extract_file_import_names(vw, file_path):
'''
extract imported function names
1. imports by ordinal:
- modulename.#ordinal
2. imports by name, results in two features to support importname-only matching:
- modulename.importname
- importname
'''
for va, _, _, tinfo in vw.getImports():
# vivisect source: tinfo = "%s.%s" % (libname, impname)
modname, impname = tinfo.split('.')
if is_viv_ord_impname(impname):
# replace ord prefix with #
impname = '#%s' % impname[len('ord'):]
tinfo = '%s.%s' % (modname, impname)
yield Import(tinfo), va
else:
yield Import(tinfo), va
yield Import(impname), va
def is_viv_ord_impname(impname):
'''
return if import name matches vivisect's ordinal naming scheme `'ord%d' % ord`
'''
if not impname.startswith('ord'):
return False
try:
int(impname[len('ord'):])
except ValueError:
return False
else:
return True
def extract_file_section_names(vw, file_path):
for va, _, segname, _ in vw.getSegments():
yield Section(segname), va
def extract_file_strings(vw, file_path):
'''
extract ASCII and UTF-16 LE strings from file
'''
with open(file_path, 'rb') as f:
b = f.read()
for s in capa.features.extractors.strings.extract_ascii_strings(b):
yield String(s.s), s.offset
for s in capa.features.extractors.strings.extract_unicode_strings(b):
yield String(s.s), s.offset
def extract_features(vw, file_path):
'''
extract file features from given workspace
args:
vw (vivisect.VivWorkspace): the vivisect workspace
file_path: path to the input file
yields:
Tuple[Feature, VA]: a feature and its location.
'''
for file_handler in FILE_HANDLERS:
for feature, va in file_handler(vw, file_path):
yield feature, va
FILE_HANDLERS = (
extract_file_embedded_pe,
extract_file_export_names,
extract_file_import_names,
extract_file_section_names,
extract_file_strings,
)
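The ordinal-import handling in `extract_file_import_names` / `is_viv_ord_impname` can be exercised standalone; `rename_ordinal` is a hypothetical helper for illustration, not a function in the file above:

```python
# vivisect names ordinal imports 'ord%d'; capa rewrites them to '#%d'
# so rules can match 'modulename.#ordinal'.
def is_viv_ord_impname(impname):
    if not impname.startswith('ord'):
        return False
    try:
        int(impname[len('ord'):])
    except ValueError:
        return False
    return True

def rename_ordinal(tinfo):
    modname, _, impname = tinfo.partition('.')
    if is_viv_ord_impname(impname):
        impname = '#%s' % impname[len('ord'):]
    return '%s.%s' % (modname, impname)

print(rename_ordinal('ws2_32.ord1'))     # ws2_32.#1
print(rename_ordinal('kernel32.Sleep'))  # kernel32.Sleep (unchanged)
```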


@@ -0,0 +1,99 @@
import vivisect.const
from capa.features import Characteristic
from capa.features.extractors import loops
def interface_extract_function_XXX(f):
'''
parse features from the given function.
args:
f (viv_utils.Function): the function to process.
yields:
(Feature, int): the feature and the address at which it's found.
'''
yield NotImplementedError('feature'), NotImplementedError('virtual address')
def get_switches(vw):
'''
caching accessor to vivisect workspace switch constructs.
'''
if 'switches' in vw.metadata:
return vw.metadata['switches']
else:
# addresses of switches in the program
switches = set()
for case_va, _ in filter(lambda t: 'case' in t[1], vw.getNames()):
# assume that the xref to a case location is a switch construct
for switch_va, _, _, _ in vw.getXrefsTo(case_va):
switches.add(switch_va)
vw.metadata['switches'] = switches
return switches
def get_functions_with_switch(vw):
if 'functions_with_switch' in vw.metadata:
return vw.metadata['functions_with_switch']
else:
functions = set()
for switch in get_switches(vw):
functions.add(vw.getFunction(switch))
vw.metadata['functions_with_switch'] = functions
return functions
def extract_function_switch(f):
'''
parse if a function contains a switch statement based on location names
method can be optimized
'''
if f.va in get_functions_with_switch(f.vw):
yield Characteristic('switch', True), f.va
def extract_function_calls_to(f):
for src, _, _, _ in f.vw.getXrefsTo(f.va, rtype=vivisect.const.REF_CODE):
yield Characteristic('calls to', True), src
def extract_function_loop(f):
'''
parse if a function has a loop
'''
edges = []
for bb in f.basic_blocks:
if len(bb.instructions) > 0:
for bva, bflags in bb.instructions[-1].getBranches():
if bflags & vivisect.envi.BR_COND or bflags & vivisect.envi.BR_FALL or bflags & vivisect.envi.BR_TABLE:
edges.append((bb.va, bva))
if edges and loops.has_loop(edges):
yield Characteristic('loop', True), f.va
def extract_features(f):
'''
extract features from the given function.
args:
f (viv_utils.Function): the function from which to extract features
yields:
Feature, set[VA]: the features and their location found in this function.
'''
for func_handler in FUNCTION_HANDLERS:
for feature, va in func_handler(f):
yield feature, va
FUNCTION_HANDLERS = (
extract_function_switch,
extract_function_calls_to,
extract_function_loop
)


@@ -0,0 +1,154 @@
import collections
import envi
import envi.archs.i386.disasm
import envi.archs.amd64.disasm
import vivisect.const
# pull out consts for lookup performance
i386RegOper = envi.archs.i386.disasm.i386RegOper
i386ImmOper = envi.archs.i386.disasm.i386ImmOper
i386ImmMemOper = envi.archs.i386.disasm.i386ImmMemOper
Amd64RipRelOper = envi.archs.amd64.disasm.Amd64RipRelOper
LOC_OP = vivisect.const.LOC_OP
IF_NOFALL = envi.IF_NOFALL
REF_CODE = vivisect.const.REF_CODE
FAR_BRANCH_MASK = (envi.BR_PROC | envi.BR_DEREF | envi.BR_ARCH)
DESTRUCTIVE_MNEMONICS = ('mov', 'lea', 'pop', 'xor')
def get_previous_instructions(vw, va):
'''
collect the instructions that flow to the given address, local to the current function.
args:
vw (vivisect.Workspace)
va (int): the virtual address to inspect
returns:
List[int]: the prior instructions, which may fallthrough and/or jump here
'''
ret = []
# find the immediate prior instruction.
# ensure that it falls through to this one.
loc = vw.getPrevLocation(va, adjacent=True)
if loc is not None:
# from vivisect.const:
# location: (L_VA, L_SIZE, L_LTYPE, L_TINFO)
(pva, _, ptype, pinfo) = loc
if ptype == LOC_OP and not (pinfo & IF_NOFALL):
ret.append(pva)
# find any code refs, e.g. jmp, to this location.
# ignore any calls.
#
# from vivisect.const:
# xref: (XR_FROM, XR_TO, XR_RTYPE, XR_RFLAG)
for (xfrom, _, _, xflag) in vw.getXrefsTo(va, REF_CODE):
if (xflag & FAR_BRANCH_MASK) != 0:
continue
ret.append(xfrom)
return ret
class NotFoundError(Exception):
pass
def find_definition(vw, va, reg):
'''
scan backwards from the given address looking for assignments to the given register.
if a constant, return that value.
args:
vw (vivisect.Workspace)
va (int): the virtual address at which to start analysis
reg (int): the vivisect register to study
returns:
(va: int, value?: int|None): the address of the assignment and the value, if a constant.
raises:
NotFoundError: when the definition cannot be found.
'''
q = collections.deque()
seen = set([])
q.extend(get_previous_instructions(vw, va))
while q:
cur = q.popleft()
# skip if we've already processed this location
if cur in seen:
continue
seen.add(cur)
insn = vw.parseOpcode(cur)
if len(insn.opers) == 0:
q.extend(get_previous_instructions(vw, cur))
continue
opnd0 = insn.opers[0]
if not \
(isinstance(opnd0, i386RegOper)
and opnd0.reg == reg
and insn.mnem in DESTRUCTIVE_MNEMONICS):
q.extend(get_previous_instructions(vw, cur))
continue
# if we reach here, the instruction is destructive to our target register.
# we currently only support extracting the constant from something like: `mov $reg, IAT`
# so, any other pattern results in an unknown value, represented by None.
# this is a good place to extend in the future, if we need more robust support.
if insn.mnem != 'mov':
return (cur, None)
else:
opnd1 = insn.opers[1]
if isinstance(opnd1, i386ImmOper):
return (cur, opnd1.getOperValue(opnd1))
elif isinstance(opnd1, i386ImmMemOper):
return (cur, opnd1.getOperAddr(opnd1))
elif isinstance(opnd1, Amd64RipRelOper):
return (cur, opnd1.getOperAddr(insn))
else:
# might be something like: `mov $reg, dword_401000[eax]`
return (cur, None)
raise NotFoundError()
def is_indirect_call(vw, va, insn=None):
if insn is None:
insn = vw.parseOpcode(va)
return (insn.mnem == 'call'
and isinstance(insn.opers[0], envi.archs.i386.disasm.i386RegOper))
def resolve_indirect_call(vw, va, insn=None):
'''
inspect the given indirect call instruction and attempt to resolve the target address.
args:
vw (vivisect.Workspace)
va (int): the virtual address at which to start analysis
returns:
(va: int, value?: int|None): the address of the assignment and the value, if a constant.
raises:
NotFoundError: when the definition cannot be found.
'''
if insn is None:
insn = vw.parseOpcode(va)
assert is_indirect_call(vw, va, insn=insn)
return find_definition(vw, va, insn.opers[0].reg)
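The backward scan in `find_definition` can be sketched on a toy instruction model (a hypothetical `Insn` namedtuple and predecessor map, not vivisect objects): walk predecessors breadth-first until an instruction destructively writes the target register, and report a constant only for a `mov reg, imm`.

```python
import collections

Insn = collections.namedtuple('Insn', ['va', 'mnem', 'dst', 'imm'])
DESTRUCTIVE_MNEMONICS = ('mov', 'lea', 'pop', 'xor')

def find_definition(insns_by_va, preds, va, reg):
    q = collections.deque(preds.get(va, ()))
    seen = set()
    while q:
        cur = q.popleft()
        if cur in seen:
            continue
        seen.add(cur)
        insn = insns_by_va[cur]
        if insn.dst != reg or insn.mnem not in DESTRUCTIVE_MNEMONICS:
            # not a destructive write to our register: keep walking back
            q.extend(preds.get(cur, ()))
            continue
        # destructive write found; only `mov reg, imm` yields a value
        return (cur, insn.imm if insn.mnem == 'mov' else None)
    raise LookupError(va)

insns = {
    0x10: Insn(0x10, 'mov', 'eax', 0x401000),
    0x12: Insn(0x12, 'push', 'ebx', None),
    0x14: Insn(0x14, 'call', None, None),
}
preds = {0x14: [0x12], 0x12: [0x10]}
print(find_definition(insns, preds, 0x14, 'eax'))  # (0x10, 0x401000)
```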


@@ -0,0 +1,465 @@
import envi.memory
import envi.archs.i386.disasm
import envi.archs.amd64.disasm
import vivisect.exc
import vivisect.const
from capa.features import String
from capa.features import Bytes
from capa.features import Characteristic
from capa.features import MAX_BYTES_FEATURE_SIZE
from capa.features.insn import Number
from capa.features.insn import Offset
from capa.features.insn import Mnemonic
import capa.features.extractors.helpers
from capa.features.extractors.viv.indirect_calls import NotFoundError
from capa.features.extractors.viv.indirect_calls import resolve_indirect_call
def interface_extract_instruction_XXX(f, bb, insn):
'''
parse features from the given instruction.
args:
f (viv_utils.Function): the function to process.
bb (viv_utils.BasicBlock): the basic block to process.
insn (vivisect...Instruction): the instruction to process.
yields:
(Feature, int): the feature and the address at which it's found.
'''
yield NotImplementedError('feature'), NotImplementedError('virtual address')
def get_imports(vw):
'''
caching accessor to vivisect workspace imports
avoids performance issues in vivisect when collecting locations
'''
if 'imports' in vw.metadata:
return vw.metadata['imports']
else:
imports = {p[0]: p[3] for p in vw.getImports()}
vw.metadata['imports'] = imports
return imports
def extract_insn_api_features(f, bb, insn):
'''parse API features from the given instruction.'''
# example:
#
# call dword [0x00473038]
if insn.mnem != 'call':
return
# traditional call via IAT
if isinstance(insn.opers[0], envi.archs.i386.disasm.i386ImmMemOper):
oper = insn.opers[0]
target = oper.getOperAddr(insn)
imports = get_imports(f.vw)
if target in imports.keys():
for feature, va in capa.features.extractors.helpers.generate_api_features(imports[target], insn.va):
yield feature, va
# call via thunk on x86,
# see 9324d1a8ae37a36ae560c37448c9705a at 0x407985
#
# this is also how calls to internal functions may be decoded on x64.
# see Lab21-01.exe_:0x140001178
elif isinstance(insn.opers[0], envi.archs.i386.disasm.i386PcRelOper):
target = insn.opers[0].getOperValue(insn)
try:
thunk = f.vw.getFunctionMeta(target, 'Thunk')
except vivisect.exc.InvalidFunction:
return
else:
if thunk:
for feature, va in capa.features.extractors.helpers.generate_api_features(thunk, insn.va):
yield feature, va
# call via import on x64
# see Lab21-01.exe_:0x14000118C
elif isinstance(insn.opers[0], envi.archs.amd64.disasm.Amd64RipRelOper):
op = insn.opers[0]
target = op.getOperAddr(insn)
imports = get_imports(f.vw)
if target in imports.keys():
for feature, va in capa.features.extractors.helpers.generate_api_features(imports[target], insn.va):
yield feature, va
elif isinstance(insn.opers[0], envi.archs.i386.disasm.i386RegOper):
try:
(_, target) = resolve_indirect_call(f.vw, insn.va, insn=insn)
except NotFoundError:
# not able to resolve the indirect call, sorry
return
if target is None:
# not able to resolve the indirect call, sorry
return
imports = get_imports(f.vw)
if target in imports.keys():
for feature, va in capa.features.extractors.helpers.generate_api_features(imports[target], insn.va):
yield feature, va
def extract_insn_number_features(f, bb, insn):
'''parse number features from the given instruction.'''
# example:
#
# push 3136B0h ; dwControlCode
for oper in insn.opers:
# this is for both x32 and x64
if not isinstance(oper, envi.archs.i386.disasm.i386ImmOper):
continue
v = oper.getOperValue(oper)
if f.vw.probeMemory(v, 1, envi.memory.MM_READ):
# this is a valid address
# assume it's not also a constant.
continue
if insn.mnem == 'add' \
and insn.opers[0].isReg() \
and insn.opers[0].reg == envi.archs.i386.disasm.REG_ESP:
# skip things like:
#
# .text:00401140 call sub_407E2B
# .text:00401145 add esp, 0Ch
return
yield Number(v), insn.va
def extract_insn_bytes_features(f, bb, insn):
'''
parse byte sequence features from the given instruction.
example:
# push offset iid_004118d4_IShellLinkA ; riid
'''
for oper in insn.opers:
if insn.mnem == 'call':
# ignore call instructions
continue
if isinstance(oper, envi.archs.i386.disasm.i386ImmOper):
v = oper.getOperValue(oper)
elif isinstance(oper, envi.archs.i386.disasm.i386RegMemOper):
# handle case like:
# movzx ecx, ds:byte_423258[eax]
v = oper.disp
elif isinstance(oper, envi.archs.amd64.disasm.Amd64RipRelOper):
# see: Lab21-01.exe_:0x1400010D3
v = oper.getOperAddr(insn)
else:
continue
segm = f.vw.getSegment(v)
if not segm:
continue
segm_end = segm[0] + segm[1]
try:
# Do not read beyond the end of a segment
if v + MAX_BYTES_FEATURE_SIZE > segm_end:
extracted_bytes = f.vw.readMemory(v, segm_end - v)
else:
extracted_bytes = f.vw.readMemory(v, MAX_BYTES_FEATURE_SIZE)
except envi.SegmentationViolation:
pass
else:
if not capa.features.extractors.helpers.all_zeros(extracted_bytes):
yield Bytes(extracted_bytes), insn.va
def read_string(vw, offset):
try:
alen = vw.detectString(offset)
except envi.SegmentationViolation:
pass
else:
if alen > 0:
return vw.readMemory(offset, alen).decode('utf-8')
try:
ulen = vw.detectUnicode(offset)
except envi.SegmentationViolation:
pass
except IndexError:
# potential vivisect bug detecting Unicode at segment end
pass
else:
if ulen > 0:
if ulen % 2 == 1:
# vivisect seems to mis-detect the end of unicode strings
# off by one, too short
ulen += 1
return vw.readMemory(offset, ulen).decode('utf-16')
raise ValueError('not a string', offset)
def extract_insn_string_features(f, bb, insn):
'''parse string features from the given instruction.'''
# example:
#
# push offset aAcr ; "ACR > "
for oper in insn.opers:
if isinstance(oper, envi.archs.i386.disasm.i386ImmOper):
v = oper.getOperValue(oper)
elif isinstance(oper, envi.archs.amd64.disasm.Amd64RipRelOper):
v = oper.getOperAddr(insn)
else:
continue
try:
s = read_string(f.vw, v)
except ValueError:
continue
else:
yield String(s.rstrip('\x00')), insn.va
def extract_insn_offset_features(f, bb, insn):
'''parse structure offset features from the given instruction.'''
# example:
#
# .text:0040112F cmp [esi+4], ebx
for oper in insn.opers:
# this is for both x32 and x64
if not isinstance(oper, envi.archs.i386.disasm.i386RegMemOper):
continue
if oper.reg == envi.archs.i386.disasm.REG_ESP:
continue
if oper.reg == envi.archs.i386.disasm.REG_EBP:
continue
# TODO: do x64 support for real.
if oper.reg == envi.archs.amd64.disasm.REG_RBP:
continue
yield Offset(oper.disp), insn.va
def is_security_cookie(f, bb, insn):
'''
check if an instruction is related to security cookie checks
'''
# security cookie check should use SP or BP
oper = insn.opers[1]
if oper.isReg() \
and oper.reg not in [envi.archs.i386.disasm.REG_ESP, envi.archs.i386.disasm.REG_EBP,
# TODO: do x64 support for real.
envi.archs.amd64.disasm.REG_RBP, envi.archs.amd64.disasm.REG_RSP]:
return False
# expect security cookie init in first basic block within first bytes (instructions)
bb0 = f.basic_blocks[0]
if bb == bb0 and insn.va < bb.va + 30:
return True
# ... or within last bytes (instructions) before a return
elif bb.instructions[-1].isReturn() and insn.va > bb.va + bb.size - 30:
return True
return False
def extract_insn_nzxor_characteristic_features(f, bb, insn):
'''
parse non-zeroing XOR instruction from the given instruction.
ignore expected non-zeroing XORs, e.g. security cookies.
'''
if insn.mnem != 'xor':
return
if insn.opers[0] == insn.opers[1]:
return
if is_security_cookie(f, bb, insn):
return
yield Characteristic('nzxor', True), insn.va
def extract_insn_mnemonic_features(f, bb, insn):
'''parse mnemonic features from the given instruction.'''
yield Mnemonic(insn.mnem), insn.va
def extract_insn_peb_access_characteristic_features(f, bb, insn):
'''
parse peb access from the given function. fs:[0x30] on x86, gs:[0x60] on x64
'''
# TODO extract x64
if insn.mnem not in ['push', 'mov']:
return
if 'fs' in insn.getPrefixName():
for oper in insn.opers:
# examples
#
# IDA: mov eax, large fs:30h
# viv: fs: mov eax,dword [0x00000030] ; i386ImmMemOper
# IDA: push large dword ptr fs:30h
# viv: fs: push dword [0x00000030]
# fs: push dword [eax + 0x30] ; i386RegMemOper, with eax = 0
if (isinstance(oper, envi.archs.i386.disasm.i386RegMemOper) and oper.disp == 0x30) or \
(isinstance(oper, envi.archs.i386.disasm.i386ImmMemOper) and oper.imm == 0x30):
yield Characteristic('peb access', True), insn.va
elif 'gs' in insn.getPrefixName():
for oper in insn.opers:
if (isinstance(oper, envi.archs.amd64.disasm.i386RegMemOper) and oper.disp == 0x60) or \
(isinstance(oper, envi.archs.amd64.disasm.i386ImmMemOper) and oper.imm == 0x60):
yield Characteristic('peb access', True), insn.va
def extract_insn_segment_access_features(f, bb, insn):
''' parse the instruction for access to fs or gs '''
prefix = insn.getPrefixName()
if prefix == 'fs':
yield Characteristic('fs access', True), insn.va
if prefix == 'gs':
yield Characteristic('gs access', True), insn.va
def get_section(vw, va):
for start, length, _, __ in vw.getMemoryMaps():
if start <= va < start + length:
return start
raise KeyError(va)
def extract_insn_cross_section_cflow(f, bb, insn):
'''
inspect the instruction for a CALL or JMP that crosses section boundaries.
'''
for va, flags in insn.getBranches():
if flags & envi.BR_FALL:
continue
try:
# skip 32-bit calls to imports
if insn.mnem == 'call' and isinstance(insn.opers[0], envi.archs.i386.disasm.i386ImmMemOper):
oper = insn.opers[0]
target = oper.getOperAddr(insn)
if target in get_imports(f.vw):
continue
# skip 64-bit calls to imports
elif insn.mnem == 'call' and isinstance(insn.opers[0], envi.archs.amd64.disasm.Amd64RipRelOper):
op = insn.opers[0]
target = op.getOperAddr(insn)
if target in get_imports(f.vw):
continue
if get_section(f.vw, insn.va) != get_section(f.vw, va):
yield Characteristic('cross section flow', True), insn.va
except KeyError:
continue
# this is a feature that's most relevant at the function scope,
# however, it's most efficient to extract at the instruction scope.
def extract_function_calls_from(f, bb, insn):
if insn.mnem != 'call':
return
target = None
# traditional call via IAT, x32
if isinstance(insn.opers[0], envi.archs.i386.disasm.i386ImmMemOper):
oper = insn.opers[0]
target = oper.getOperAddr(insn)
yield Characteristic('calls from', True), target
# call via thunk on x86,
# see 9324d1a8ae37a36ae560c37448c9705a at 0x407985
#
# call to internal function on x64
# see Lab21-01.exe_:0x140001178
elif isinstance(insn.opers[0], envi.archs.i386.disasm.i386PcRelOper):
target = insn.opers[0].getOperValue(insn)
yield Characteristic('calls from', True), target
# call via IAT, x64
elif isinstance(insn.opers[0], envi.archs.amd64.disasm.Amd64RipRelOper):
op = insn.opers[0]
target = op.getOperAddr(insn)
yield Characteristic('calls from', True), target
if target and target == f.va:
# if we found a jump target and it's the function address
# mark as recursive
yield Characteristic('recursive call', True), target
# this is a feature that's most relevant at the function or basic block scope,
# however, it's most efficient to extract at the instruction scope.
def extract_function_indirect_call_characteristic_features(f, bb, insn):
'''
extract indirect function call characteristic (e.g., call eax or call dword ptr [edx+4])
does not include calls like => call ds:dword_ABD4974
'''
if insn.mnem != 'call':
return
# Checks below work for x86 and x64
if isinstance(insn.opers[0], envi.archs.i386.disasm.i386RegOper):
# call edx
yield Characteristic('indirect call', True), insn.va
elif isinstance(insn.opers[0], envi.archs.i386.disasm.i386RegMemOper):
# call dword ptr [eax+50h]
yield Characteristic('indirect call', True), insn.va
elif isinstance(insn.opers[0], envi.archs.i386.disasm.i386SibOper):
# call qword ptr [rsp+78h]
yield Characteristic('indirect call', True), insn.va
def extract_features(f, bb, insn):
'''
extract features from the given insn.
args:
f (viv_utils.Function): the function from which to extract features
bb (viv_utils.BasicBlock): the basic block to process.
insn (vivisect...Instruction): the instruction to process.
yields:
Feature, set[VA]: the features and their location found in this insn.
'''
for insn_handler in INSTRUCTION_HANDLERS:
for feature, va in insn_handler(f, bb, insn):
yield feature, va
INSTRUCTION_HANDLERS = (
extract_insn_api_features,
extract_insn_number_features,
extract_insn_string_features,
extract_insn_bytes_features,
extract_insn_offset_features,
extract_insn_nzxor_characteristic_features,
extract_insn_mnemonic_features,
extract_insn_peb_access_characteristic_features,
extract_insn_cross_section_cflow,
extract_insn_segment_access_features,
extract_function_calls_from,
extract_function_indirect_call_characteristic_features
)

31
capa/features/file.py Normal file

@@ -0,0 +1,31 @@
from capa.features import Feature
class Export(Feature):
def __init__(self, value):
# value is export name
super(Export, self).__init__([value])
self.value = value
def __str__(self):
return 'Export(%s)' % (self.value)
class Import(Feature):
def __init__(self, value):
# value is import name
super(Import, self).__init__([value])
self.value = value
def __str__(self):
return 'Import(%s)' % (self.value)
class Section(Feature):
def __init__(self, value):
# value is section name
super(Section, self).__init__([value])
self.value = value
def __str__(self):
return 'Section(%s)' % (self.value)

276
capa/features/freeze.py Normal file

@@ -0,0 +1,276 @@
'''
capa freeze file format: `| capa0000 | + zlib(utf-8(json(...)))`
json format:
{
'version': 1,
'functions': {
int(function va): {
'basic blocks': {
int(basic block va): {
'instructions': [instruction va, ...]
},
...
},
...
},
...
},
'scopes': {
'file': [
(str(name), [any(arg), ...], int(va), ()),
...
],
'function': [
(str(name), [any(arg), ...], int(va), (int(function va), )),
...
],
'basic block': [
(str(name), [any(arg), ...], int(va), (int(function va),
int(basic block va))),
...
],
'instruction': [
(str(name), [any(arg), ...], int(va), (int(function va),
int(basic block va),
int(instruction va))),
...
],
}
}
'''
import json
import zlib
import logging
import capa.features.extractors
import capa.features
import capa.features.file
import capa.features.function
import capa.features.basicblock
import capa.features.insn
from capa.helpers import hex
logger = logging.getLogger(__name__)
def serialize_feature(feature):
return feature.freeze_serialize()
KNOWN_FEATURES = {
F.__name__: F
for F in capa.features.Feature.__subclasses__()
}
def deserialize_feature(doc):
F = KNOWN_FEATURES[doc[0]]
return F.freeze_deserialize(doc[1])
def dumps(extractor):
'''
serialize the given extractor to a string
args:
extractor: capa.features.extractor.FeatureExtractor:
returns:
str: the serialized features.
'''
ret = {
'version': 1,
'functions': {},
'scopes': {
'file': [],
'function': [],
'basic block': [],
'instruction': [],
}
}
for feature, va in extractor.extract_file_features():
ret['scopes']['file'].append(
serialize_feature(feature) + (hex(va), ())
)
for f in extractor.get_functions():
ret['functions'][hex(f)] = {}
for feature, va in extractor.extract_function_features(f):
ret['scopes']['function'].append(
serialize_feature(feature) + (hex(va), (hex(f), ))
)
for bb in extractor.get_basic_blocks(f):
ret['functions'][hex(f)][hex(bb)] = []
for feature, va in extractor.extract_basic_block_features(f, bb):
ret['scopes']['basic block'].append(
serialize_feature(feature) + (hex(va), (hex(f), hex(bb), ))
)
for insn, insnva in sorted([(insn, int(insn)) for insn in extractor.get_instructions(f, bb)]):
ret['functions'][hex(f)][hex(bb)].append(hex(insnva))
for feature, va in extractor.extract_insn_features(f, bb, insn):
ret['scopes']['instruction'].append(
serialize_feature(feature) + (hex(va), (hex(f), hex(bb), hex(insnva), ))
)
return json.dumps(ret)
def loads(s):
'''deserialize a set of features (as a NullFeatureExtractor) from a string.'''
doc = json.loads(s)
if doc.get('version') != 1:
raise ValueError('unsupported freeze format version: %s' % (doc.get('version')))
features = {
'file features': [],
'functions': {},
}
for fva, function in doc.get('functions', {}).items():
fva = int(fva, 0x10)
features['functions'][fva] = {
'features': [],
'basic blocks': {},
}
for bbva, bb in function.items():
bbva = int(bbva, 0x10)
features['functions'][fva]['basic blocks'][bbva] = {
'features': [],
'instructions': {},
}
for insnva in bb:
insnva = int(insnva, 0x10)
features['functions'][fva]['basic blocks'][bbva]['instructions'][insnva] = {
'features': [],
}
# in the following blocks, each entry looks like:
#
# ('MatchedRule', ('foo', ), '0x401000', ('0x401000', ))
# ^^^^^^^^^^^^^ ^^^^^^^^^ ^^^^^^^^^^ ^^^^^^^^^^^^^^
# feature name args addr func/bb/insn
for feature in doc.get('scopes', {}).get('file', []):
va, loc = feature[2:]
va = int(va, 0x10)
feature = deserialize_feature(feature[:2])
features['file features'].append((va, feature))
for feature in doc.get('scopes', {}).get('function', []):
# fetch the pair like:
#
# ('0x401000', ('0x401000', ))
# ^^^^^^^^^^ ^^^^^^^^^^^^^^
# addr func/bb/insn
va, loc = feature[2:]
va = int(va, 0x10)
loc = [int(lo, 0x10) for lo in loc]
# decode the feature from the pair like:
#
# ('MatchedRule', ('foo', ))
# ^^^^^^^^^^^^^ ^^^^^^^^^
# feature name args
feature = deserialize_feature(feature[:2])
features['functions'][loc[0]]['features'].append((va, feature))
for feature in doc.get('scopes', {}).get('basic block', []):
va, loc = feature[2:]
va = int(va, 0x10)
loc = [int(lo, 0x10) for lo in loc]
feature = deserialize_feature(feature[:2])
features['functions'][loc[0]]['basic blocks'][loc[1]]['features'].append((va, feature))
for feature in doc.get('scopes', {}).get('instruction', []):
va, loc = feature[2:]
va = int(va, 0x10)
loc = [int(lo, 0x10) for lo in loc]
feature = deserialize_feature(feature[:2])
features['functions'][loc[0]]['basic blocks'][loc[1]]['instructions'][loc[2]]['features'].append((va, feature))
return capa.features.extractors.NullFeatureExtractor(features)
MAGIC = 'capa0000'.encode('ascii')
def dump(extractor):
'''serialize the given extractor to a byte array.'''
return MAGIC + zlib.compress(dumps(extractor).encode('utf-8'))
def is_freeze(buf):
return buf[:len(MAGIC)] == MAGIC
def load(buf):
'''deserialize a set of features (as a NullFeatureExtractor) from a byte array.'''
if not is_freeze(buf):
raise ValueError('missing magic header')
return loads(zlib.decompress(buf[len(MAGIC):]).decode('utf-8'))
def main(argv=None):
import sys
import argparse
import capa.main
if argv is None:
argv = sys.argv[1:]
formats = [
('auto', '(default) detect file type automatically'),
('pe', 'Windows PE file'),
('sc32', '32-bit shellcode'),
('sc64', '64-bit shellcode'),
]
format_help = ', '.join(['%s: %s' % (f[0], f[1]) for f in formats])
parser = argparse.ArgumentParser(description="save capa features to a file")
parser.add_argument("sample", type=str,
help="Path to sample to analyze")
parser.add_argument("output", type=str,
help="Path to output file")
parser.add_argument("-v", "--verbose", action="store_true",
help="Enable verbose output")
parser.add_argument("-q", "--quiet", action="store_true",
help="Disable all output but errors")
parser.add_argument("-f", "--format", choices=[f[0] for f in formats], default="auto",
help="Select sample format, %s" % format_help)
args = parser.parse_args(args=argv)
if args.quiet:
logging.basicConfig(level=logging.ERROR)
logging.getLogger().setLevel(logging.ERROR)
elif args.verbose:
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
else:
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
vw = capa.main.get_workspace(args.sample, args.format)
# don't import this at top level to support ida/py3 backend
import capa.features.extractors.viv
extractor = capa.features.extractors.viv.VivisectFeatureExtractor(vw, args.sample)
with open(args.output, 'wb') as f:
f.write(dump(extractor))
return 0
if __name__ == "__main__":
import sys
sys.exit(main())
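The on-disk container produced by `dump()` and consumed by `load()` above is simply the 8-byte magic `capa0000` followed by zlib-compressed JSON. A minimal standalone sketch of the round trip (`dump_doc` and `load_doc` are hypothetical names; no capa imports needed):

```python
import zlib

MAGIC = b"capa0000"

def dump_doc(doc):
    # prepend the magic header to the zlib-compressed UTF-8 payload
    return MAGIC + zlib.compress(doc.encode("utf-8"))

def is_freeze(buf):
    # a freeze file is identified purely by its leading magic bytes
    return buf[:len(MAGIC)] == MAGIC

def load_doc(buf):
    # refuse anything that does not carry the magic header
    if not is_freeze(buf):
        raise ValueError("missing magic header")
    return zlib.decompress(buf[len(MAGIC):]).decode("utf-8")
```

Because the check only inspects the first eight bytes, `is_freeze()` is cheap enough to probe arbitrary input files before attempting decompression.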

capa/features/insn.py (new file, 46 lines)
@@ -0,0 +1,46 @@
from capa.features import Feature
class API(Feature):
def __init__(self, name):
# downcase the library name, if given;
# partition() tolerates extra dots in the symbol name
if '.' in name:
modname, _, impname = name.partition('.')
name = modname.lower() + '.' + impname
super(API, self).__init__([name])
class Number(Feature):
def __init__(self, value, symbol=None):
super(Number, self).__init__([value])
self.value = value
self.symbol = symbol
def __str__(self):
if self.symbol:
return 'number(0x%x = %s)' % (self.value, self.symbol)
else:
return 'number(0x%x)' % (self.value)
class Offset(Feature):
def __init__(self, value, symbol=None):
super(Offset, self).__init__([value])
self.value = value
self.symbol = symbol
def __str__(self):
if self.symbol:
return 'offset(0x%x = %s)' % (self.value, self.symbol)
else:
return 'offset(0x%x)' % (self.value)
class Mnemonic(Feature):
def __init__(self, value):
super(Mnemonic, self).__init__([value])
self.value = value
def __str__(self):
return 'mnemonic(%s)' % (self.value)
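These feature classes participate in the `freeze_serialize()`/`freeze_deserialize()` protocol that the freeze module's `KNOWN_FEATURES` registry relies on. A simplified sketch of that pattern (a toy hierarchy, not capa's actual class layout):

```python
class Feature(object):
    def __init__(self, args):
        self.args = list(args)

    def freeze_serialize(self):
        # encode as (class name, constructor args) so the pair survives JSON
        return (type(self).__name__, self.args)

    @classmethod
    def freeze_deserialize(cls, args):
        return cls(args)


class Mnemonic(Feature):
    pass


# map class names back to classes, discovered via __subclasses__()
KNOWN_FEATURES = {F.__name__: F for F in Feature.__subclasses__()}

def deserialize_feature(doc):
    return KNOWN_FEATURES[doc[0]].freeze_deserialize(doc[1])
```

Building the registry from `__subclasses__()` means a new feature type only has to subclass `Feature` to become serializable.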

capa/helpers.py (new file, 18 lines)
@@ -0,0 +1,18 @@
_hex = hex
def hex(i):
# under py2.7, long integers get formatted with a trailing `L`
# and this is not pretty. so strip it out.
return _hex(oint(i)).rstrip('L')
def oint(i):
# there seems to be some trouble with using `int(viv_utils.Function)`
# with the black magic we do with binding the `__int__()` routine.
# i haven't had a chance to debug this yet (and i have no hotel wifi).
# so in the meantime, detect this, and call the method directly.
try:
return int(i)
except TypeError:
return i.__int__()
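The fallback in `oint()` exists because `int()` resolves `__int__` on the type, not the instance, while viv_utils binds `__int__` directly onto function objects. A sketch reproducing that situation (`capa_hex` is a hypothetical rename to avoid shadowing the builtin, and `FakeFunction` only mimics the assumed viv_utils binding):

```python
_hex = hex

def capa_hex(i):
    # py2 long ints format with a trailing "L"; strip it for clean output
    return _hex(oint(i)).rstrip("L")

def oint(i):
    # int() looks up __int__ on the type, so an instance-bound __int__
    # makes int() raise TypeError; call the bound method directly instead
    try:
        return int(i)
    except TypeError:
        return i.__int__()

class FakeFunction(object):
    def __init__(self, va):
        # bind __int__ on the instance, mimicking viv_utils' binding
        self.__int__ = lambda: va
```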

capa/ida/__init__.py (new empty file)

capa/ida/explorer/item.py (new file, 250 lines)
@@ -0,0 +1,250 @@
import binascii
import codecs
import sys
from PyQt5 import QtCore
import idaapi
import idc
import capa.ida.helpers
def info_to_name(s):
''' extract name from view format, e.g. "function(sub_401000)" -> "sub_401000" '''
try:
return s.split('(')[1].rstrip(')')
except IndexError:
return ''
def ea_to_hex_str(ea):
''' format effective address as zero-padded hex string '''
return '%08X' % ea
class CapaExplorerDataItem(object):
''' store data for CapaExplorerDataModel
TODO
'''
def __init__(self, parent, data):
''' '''
self._parent = parent
self._data = data
self._children = []
self._checked = False
self.flags = (QtCore.Qt.ItemIsEnabled | QtCore.Qt.ItemIsSelectable | QtCore.Qt.ItemIsTristate | QtCore.Qt.ItemIsUserCheckable)
if self._parent:
self._parent.appendChild(self)
def setIsEditable(self, isEditable=False):
''' modify item flags to be editable or not '''
if isEditable:
self.flags |= QtCore.Qt.ItemIsEditable
else:
self.flags &= ~QtCore.Qt.ItemIsEditable
def setChecked(self, checked):
''' set item as checked '''
self._checked = checked
def isChecked(self):
''' get item is checked '''
return self._checked
def appendChild(self, item):
''' add child item
@param item: CapaExplorerDataItem*
'''
self._children.append(item)
def child(self, row):
''' get child row
@param row: TODO
'''
return self._children[row]
def childCount(self):
''' get child count '''
return len(self._children)
def columnCount(self):
''' get column count '''
return len(self._data)
def data(self, column):
''' get data at column '''
try:
return self._data[column]
except IndexError:
return None
def parent(self):
''' get parent '''
return self._parent
def row(self):
''' get row location '''
if self._parent:
return self._parent._children.index(self)
return 0
def setData(self, column, value):
''' set data in column '''
self._data[column] = value
def children(self):
''' yield children '''
for child in self._children:
yield child
def removeChildren(self):
''' remove all child items '''
del self._children[:]
def __str__(self):
''' get string representation of columns '''
return ' '.join([data for data in self._data if data])
@property
def info(self):
''' '''
return self._data[0]
@property
def ea(self):
''' '''
try:
return int(self._data[1], 16)
except ValueError:
return None
@property
def details(self):
''' '''
return self._data[2]
class CapaExplorerRuleItem(CapaExplorerDataItem):
''' store data relevant to capa rule result '''
view_fmt = '%s (%d)'
def __init__(self, parent, name, count, definition):
''' '''
self._definition = definition
name = CapaExplorerRuleItem.view_fmt % (name, count) if count else name
super(CapaExplorerRuleItem, self).__init__(parent, [name, '', ''])
@property
def definition(self):
''' '''
return self._definition
class CapaExplorerFunctionItem(CapaExplorerDataItem):
''' store data relevant to capa function result '''
view_fmt = 'function(%s)'
def __init__(self, parent, name, ea):
''' '''
address = ea_to_hex_str(ea)
name = CapaExplorerFunctionItem.view_fmt % name
super(CapaExplorerFunctionItem, self).__init__(parent, [name, address, ''])
@property
def info(self):
''' '''
info = super(CapaExplorerFunctionItem, self).info
name = info_to_name(info)
return name if name else info
@info.setter
def info(self, name):
''' '''
self._data[0] = CapaExplorerFunctionItem.view_fmt % name
class CapaExplorerBlockItem(CapaExplorerDataItem):
''' store data relevant to capa basic block results '''
view_fmt = 'basic block(loc_%s)'
def __init__(self, parent, ea):
''' '''
address = ea_to_hex_str(ea)
name = CapaExplorerBlockItem.view_fmt % address
super(CapaExplorerBlockItem, self).__init__(parent, [name, address, ''])
class CapaExplorerDefaultItem(CapaExplorerDataItem):
''' store data relevant to capa default result '''
def __init__(self, parent, name, ea=None):
''' '''
if ea:
address = ea_to_hex_str(ea)
else:
address = ''
super(CapaExplorerDefaultItem, self).__init__(parent, [name, address, ''])
class CapaExplorerFeatureItem(CapaExplorerDataItem):
''' store data relevant to capa feature result '''
def __init__(self, parent, data):
super(CapaExplorerFeatureItem, self).__init__(parent, data)
class CapaExplorerInstructionViewItem(CapaExplorerFeatureItem):
def __init__(self, parent, name, ea):
''' '''
details = capa.ida.helpers.get_disasm_line(ea)
address = ea_to_hex_str(ea)
super(CapaExplorerInstructionViewItem, self).__init__(parent, [name, address, details])
self.ida_highlight = idc.get_color(ea, idc.CIC_ITEM)
class CapaExplorerByteViewItem(CapaExplorerFeatureItem):
def __init__(self, parent, name, ea):
''' '''
address = ea_to_hex_str(ea)
byte_snap = idaapi.get_bytes(ea, 32)
if byte_snap:
byte_snap = codecs.encode(byte_snap, 'hex').upper()
# TODO: better way?
if sys.version_info >= (3, 0):
details = ' '.join([byte_snap[i:i + 2].decode() for i in range(0, len(byte_snap), 2)])
else:
details = ' '.join([byte_snap[i:i + 2] for i in range(0, len(byte_snap), 2)])
else:
details = ''
super(CapaExplorerByteViewItem, self).__init__(parent, [name, address, details])
self.ida_highlight = idc.get_color(ea, idc.CIC_ITEM)
class CapaExplorerStringViewItem(CapaExplorerFeatureItem):
def __init__(self, parent, name, ea, value):
''' '''
address = ea_to_hex_str(ea)
super(CapaExplorerStringViewItem, self).__init__(parent, [name, address, value])
self.ida_highlight = idc.get_color(ea, idc.CIC_ITEM)
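The parent/child bookkeeping these item classes inherit from CapaExplorerDataItem is what lets a QAbstractItemModel navigate the tree; it can be sketched without PyQt or IDA (a hypothetical minimal `TreeItem`):

```python
class TreeItem(object):
    # hypothetical minimal version of CapaExplorerDataItem's bookkeeping
    def __init__(self, parent, data):
        self._parent = parent
        self._data = data
        self._children = []
        if parent is not None:
            # children register themselves with their parent on construction
            parent._children.append(self)

    def child(self, row):
        return self._children[row]

    def childCount(self):
        return len(self._children)

    def row(self):
        # an item's row is its position among its parent's children
        if self._parent is not None:
            return self._parent._children.index(self)
        return 0
```

Registering children in the constructor is why the real item classes never call `appendChild()` explicitly when building the result tree.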

capa/ida/explorer/model.py (new file, 423 lines)
@@ -0,0 +1,423 @@
from PyQt5 import QtCore, QtGui
from collections import deque
import binascii
import idaapi
import idc
from capa.ida.explorer.item import (
CapaExplorerDataItem,
CapaExplorerDefaultItem,
CapaExplorerFeatureItem,
CapaExplorerFunctionItem,
CapaExplorerRuleItem,
CapaExplorerStringViewItem,
CapaExplorerInstructionViewItem,
CapaExplorerByteViewItem,
CapaExplorerBlockItem
)
import capa.ida.helpers
# default highlight color used in IDA window
DEFAULT_HIGHLIGHT = 0xD096FF
class CapaExplorerDataModel(QtCore.QAbstractItemModel):
''' '''
COLUMN_INDEX_RULE_INFORMATION = 0
COLUMN_INDEX_VIRTUAL_ADDRESS = 1
COLUMN_INDEX_DETAILS = 2
COLUMN_COUNT = 3
def __init__(self, parent=None):
''' '''
super(CapaExplorerDataModel, self).__init__(parent)
self._root = CapaExplorerDataItem(None, ['Rule Information', 'Address', 'Details'])
def reset(self):
''' '''
# reset checkboxes and color highlights
# TODO: make less hacky
for idx in range(self._root.childCount()):
rindex = self.index(idx, 0, QtCore.QModelIndex())
for mindex in self.iterateChildrenIndexFromRootIndex(rindex, ignore_root=False):
mindex.internalPointer().setChecked(False)
self._util_reset_ida_highlighting(mindex.internalPointer(), False)
self.dataChanged.emit(mindex, mindex)
def clear(self):
''' '''
self.beginResetModel()
# TODO: make sure this isn't for memory
self._root.removeChildren()
self.endResetModel()
def columnCount(self, mindex):
''' get the number of columns for the children of the given parent
@param mindex: QModelIndex*
@retval column count
'''
if mindex.isValid():
return mindex.internalPointer().columnCount()
else:
return self._root.columnCount()
def data(self, mindex, role):
''' get data stored under the given role for the item referred to by the index
@param mindex: QModelIndex*
@param role: QtCore.Qt.*
@retval data to be displayed
'''
if not mindex.isValid():
return None
if role == QtCore.Qt.DisplayRole:
# display data in corresponding column
return mindex.internalPointer().data(mindex.column())
if role == QtCore.Qt.ToolTipRole and \
CapaExplorerDataModel.COLUMN_INDEX_RULE_INFORMATION == mindex.column() and \
isinstance(mindex.internalPointer(), CapaExplorerRuleItem):
# show tooltip containing rule definition
return mindex.internalPointer().definition
if role == QtCore.Qt.CheckStateRole and mindex.column() == CapaExplorerDataModel.COLUMN_INDEX_RULE_INFORMATION:
# inform view how to display content of checkbox - un/checked
return QtCore.Qt.Checked if mindex.internalPointer().isChecked() else QtCore.Qt.Unchecked
if role == QtCore.Qt.FontRole and mindex.column() in (CapaExplorerDataModel.COLUMN_INDEX_VIRTUAL_ADDRESS, CapaExplorerDataModel.COLUMN_INDEX_DETAILS):
return QtGui.QFont('Courier', weight=QtGui.QFont.Medium)
if role == QtCore.Qt.FontRole and mindex.internalPointer() == self._root:
font = QtGui.QFont()
font.setBold(True)
return font
return None
def flags(self, mindex):
''' get item flags for given index
@param mindex: QModelIndex*
@retval QtCore.Qt.ItemFlags
'''
if not mindex.isValid():
return QtCore.Qt.NoItemFlags
return mindex.internalPointer().flags
def headerData(self, section, orientation, role):
''' get data for the given role and section in the header with the specified orientation
@param section: int
@param orientation: QtCore.Qt.Orientation
@param role: QtCore.Qt.DisplayRole
@retval header data list()
'''
if orientation == QtCore.Qt.Horizontal and role == QtCore.Qt.DisplayRole:
return self._root.data(section)
return None
def index(self, row, column, parent):
''' get index of the item in the model specified by the given row, column and parent index
@param row: int
@param column: int
@param parent: QModelIndex*
@retval QModelIndex*
'''
if not self.hasIndex(row, column, parent):
return QtCore.QModelIndex()
if not parent.isValid():
parent_item = self._root
else:
parent_item = parent.internalPointer()
child_item = parent_item.child(row)
if child_item:
return self.createIndex(row, column, child_item)
else:
return QtCore.QModelIndex()
def parent(self, mindex):
''' get parent of the model item with the given index
if the item has no parent, an invalid QModelIndex* is returned
@param mindex: QModelIndex*
@retval QModelIndex*
'''
if not mindex.isValid():
return QtCore.QModelIndex()
child = mindex.internalPointer()
parent = child.parent()
if parent == self._root:
return QtCore.QModelIndex()
return self.createIndex(parent.row(), 0, parent)
def iterateChildrenIndexFromRootIndex(self, mindex, ignore_root=True):
''' depth-first traversal of child nodes
@param mindex: QModelIndex*
@retval yield QModelIndex*
'''
visited = set()
stack = deque((mindex,))
while True:
try:
cmindex = stack.pop()
except IndexError:
break
if cmindex not in visited:
if not ignore_root or cmindex is not mindex:
# ignore root
yield cmindex
visited.add(cmindex)
for idx in range(self.rowCount(cmindex)):
stack.append(cmindex.child(idx, 0))
def _util_reset_ida_highlighting(self, item, checked):
''' '''
if not isinstance(item, (CapaExplorerStringViewItem, CapaExplorerInstructionViewItem, CapaExplorerByteViewItem)):
# ignore other item types
return
curr_highlight = idc.get_color(item.ea, idc.CIC_ITEM)
if checked:
# item checked - record current highlight and set to new
item.ida_highlight = curr_highlight
idc.set_color(item.ea, idc.CIC_ITEM, DEFAULT_HIGHLIGHT)
else:
# item unchecked - reset highlight
if curr_highlight != DEFAULT_HIGHLIGHT:
# user modified highlight - record new highlight and do not modify
item.ida_highlight = curr_highlight
else:
# reset highlight to previous
idc.set_color(item.ea, idc.CIC_ITEM, item.ida_highlight)
def setData(self, mindex, value, role):
''' set the role data for the item at index to value
@param mindex: QModelIndex*
@param value: QVariant*
@param role: QtCore.Qt.EditRole
@retval True/False
'''
if not mindex.isValid():
return False
if role == QtCore.Qt.CheckStateRole and mindex.column() == CapaExplorerDataModel.COLUMN_INDEX_RULE_INFORMATION:
# user un/checked box - un/check parent and children
for cindex in self.iterateChildrenIndexFromRootIndex(mindex, ignore_root=False):
cindex.internalPointer().setChecked(value)
self._util_reset_ida_highlighting(cindex.internalPointer(), value)
self.dataChanged.emit(cindex, cindex)
return True
if role == QtCore.Qt.EditRole and value and \
mindex.column() == CapaExplorerDataModel.COLUMN_INDEX_RULE_INFORMATION and \
isinstance(mindex.internalPointer(), CapaExplorerFunctionItem):
# user renamed function - update IDA database and data model
old_name = mindex.internalPointer().info
new_name = str(value)
if idaapi.set_name(mindex.internalPointer().ea, new_name):
# success update IDA database - update data model
self.update_function_name(old_name, new_name)
return True
# no handle
return False
def rowCount(self, mindex):
''' get the number of rows under the given parent
when the parent index is valid, return the number of
children of that parent
@param mindex: QModelIndex*
@retval row count
'''
if mindex.column() > 0:
return 0
if not mindex.isValid():
item = self._root
else:
item = mindex.internalPointer()
return item.childCount()
def render_capa_results(self, rule_set, results):
''' populate data model with capa results
@param rule_set: TODO
@param results: TODO
'''
# prepare data model for changes
self.beginResetModel()
for (rule, ress) in results.items():
if rule_set.rules[rule].meta.get('lib', False):
# skip library rules
continue
# top level item is rule
parent = CapaExplorerRuleItem(self._root, rule, len(ress), rule_set.rules[rule].definition)
for (ea, res) in sorted(ress, key=lambda p: p[0]):
if rule_set.rules[rule].scope == capa.rules.FILE_SCOPE:
# file scope - parent is rule
parent2 = parent
elif rule_set.rules[rule].scope == capa.rules.FUNCTION_SCOPE:
parent2 = CapaExplorerFunctionItem(parent, idaapi.get_name(ea), ea)
elif rule_set.rules[rule].scope == capa.rules.BASIC_BLOCK_SCOPE:
parent2 = CapaExplorerBlockItem(parent, ea)
else:
# TODO: better way to notify a missed scope?
parent2 = CapaExplorerDefaultItem(parent, '', ea)
self._render_result(rule_set, res, parent2)
# reset data model after making changes
self.endResetModel()
def _render_result(self, rule_set, result, parent):
''' '''
if not result.success:
# TODO: display failed branches??
return
if isinstance(result.statement, capa.engine.Some):
if result.statement.count == 0:
if sum(map(lambda c: c.success, result.children)) > 0:
parent2 = CapaExplorerDefaultItem(parent, 'optional')
else:
parent2 = parent
else:
parent2 = CapaExplorerDefaultItem(parent, '%d or more' % result.statement.count)
elif not isinstance(result.statement, (capa.features.Feature, capa.engine.Element, capa.engine.Range, capa.engine.Regex)):
# when rendering a structural node (and/or/not), we only care about the node name.
'''
succs = list(filter(lambda c: bool(c), result.children))
if len(succs) == 1:
# skip structural node with single succeeding child
parent2 = parent
else:
parent2 = CapaExplorerDefaultItem(parent, result.statement.name.lower())
'''
parent2 = CapaExplorerDefaultItem(parent, result.statement.name.lower())
else:
# but when rendering a Feature, want to see any arguments to it
if len(result.locations) == 1:
# ea = result.locations.pop()
ea = next(iter(result.locations))
parent2 = self._render_feature(rule_set, parent, result.statement, ea, str(result.statement))
else:
parent2 = CapaExplorerDefaultItem(parent, str(result.statement))
for ea in sorted(result.locations):
self._render_feature(rule_set, parent2, result.statement, ea)
for child in result.children:
self._render_result(rule_set, child, parent2)
def _render_feature(self, rule_set, parent, feature, ea, name='-'):
''' render a given feature
@param rule_set: TODO
@param parent: TODO
@param result: TODO
@param ea: virtual address
@param name: TODO
'''
instruction_view = (
capa.features.Bytes,
capa.features.String,
capa.features.insn.API,
capa.features.insn.Mnemonic,
capa.features.insn.Number,
capa.features.insn.Offset
)
byte_view = (
capa.features.file.Section,
)
string_view = (
capa.engine.Regex,
)
if isinstance(feature, instruction_view):
return CapaExplorerInstructionViewItem(parent, name, ea)
if isinstance(feature, byte_view):
return CapaExplorerByteViewItem(parent, name, ea)
if isinstance(feature, string_view):
# TODO: move string collection to item constructor
if isinstance(feature, capa.engine.Regex):
return CapaExplorerStringViewItem(parent, name, ea, feature.match)
if isinstance(feature, capa.features.Characteristic):
# special rendering for characteristics
if feature.name in ('loop', 'recursive call', 'tight loop', 'switch'):
return CapaExplorerDefaultItem(parent, name)
if feature.name in ('embedded pe',):
return CapaExplorerByteViewItem(parent, name, ea)
return CapaExplorerInstructionViewItem(parent, name, ea)
if isinstance(feature, capa.features.MatchedRule):
# render feature as a rule item
return CapaExplorerRuleItem(parent, name, 0, rule_set.rules[feature.rule_name].definition)
if isinstance(feature, capa.engine.Range):
# render feature based upon type child
return self._render_feature(rule_set, parent, feature.child, ea, name)
# no handle, default to name and virtual address display
return CapaExplorerDefaultItem(parent, name, ea)
def update_function_name(self, old_name, new_name):
''' update all instances of function name
@param old_name: previous function name
@param new_name: new function name
'''
rmindex = self.index(0, 0, QtCore.QModelIndex())
# convert name to view format for matching
# TODO: handle this better
old_name = CapaExplorerFunctionItem.view_fmt % old_name
for mindex in self.match(rmindex, QtCore.Qt.DisplayRole, old_name, hits=-1, flags=QtCore.Qt.MatchRecursive):
if not isinstance(mindex.internalPointer(), CapaExplorerFunctionItem):
continue
mindex.internalPointer().info = new_name
self.dataChanged.emit(mindex, mindex)
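The iterative depth-first traversal in `iterateChildrenIndexFromRootIndex()` above can be demonstrated on a plain tree, independent of Qt model indexes (hypothetical `Node`; the visited-set bookkeeping is omitted since a tree has no cycles):

```python
from collections import deque

class Node(object):
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def iter_descendants(root, ignore_root=True):
    # same shape as the model's traversal: an explicit stack, popped
    # from the right, optionally skipping the starting node
    stack = deque((root,))
    while stack:
        node = stack.pop()
        if not ignore_root or node is not root:
            yield node
        for child in node.children:
            stack.append(child)
```

Using an explicit stack instead of recursion keeps the traversal safe for deep result trees.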

@@ -0,0 +1,75 @@
from PyQt5 import QtCore
from capa.ida.explorer.model import CapaExplorerDataModel
class CapaExplorerSortFilterProxyModel(QtCore.QSortFilterProxyModel):
def __init__(self, parent=None):
''' '''
super(CapaExplorerSortFilterProxyModel, self).__init__(parent)
def lessThan(self, left, right):
''' return True if the value of the left item is less than the value of the right item
@param left: QModelIndex*
@param right: QModelIndex*
@retval True/False
'''
ldata = left.internalPointer().data(left.column())
rdata = right.internalPointer().data(right.column())
if ldata and rdata and left.column() == CapaExplorerDataModel.COLUMN_INDEX_VIRTUAL_ADDRESS and left.column() == right.column():
# convert virtual address before compare
return int(ldata, 16) < int(rdata, 16)
else:
# compare as lowercase
return ldata.lower() < rdata.lower()
def filterAcceptsRow(self, row, parent):
''' true if the item in the row indicated by the given row and parent
should be included in the model; otherwise returns false
@param row: int
@param parent: QModelIndex*
@retval True/False
'''
if self._filter_accepts_row_self(row, parent):
return True
alpha = parent
while alpha.isValid():
if self._filter_accepts_row_self(alpha.row(), alpha.parent()):
return True
alpha = alpha.parent()
if self._index_has_accepted_children(row, parent):
return True
return False
def add_single_string_filter(self, column, string):
''' add fixed string filter
@param column: key column
@param string: string to sort
'''
self.setFilterKeyColumn(column)
self.setFilterFixedString(string)
def _index_has_accepted_children(self, row, parent):
''' '''
mindex = self.sourceModel().index(row, 0, parent)
if mindex.isValid():
for idx in range(self.sourceModel().rowCount(mindex)):
if self._filter_accepts_row_self(idx, mindex):
return True
if self._index_has_accepted_children(idx, mindex):
return True
return False
def _filter_accepts_row_self(self, row, parent):
''' '''
return super(CapaExplorerSortFilterProxyModel, self).filterAcceptsRow(row, parent)
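`filterAcceptsRow()` above keeps a row when the row itself matches, any ancestor matches, or any descendant matches, so a matching rule keeps its whole subtree visible. A Qt-free sketch of that acceptance rule (hypothetical dict-based nodes and helper names):

```python
def row_matches(node, needle):
    # case-insensitive fixed-string match, like setFilterFixedString
    return needle.lower() in node["name"].lower()

def any_descendant_matches(node, needle):
    return any(
        row_matches(child, needle) or any_descendant_matches(child, needle)
        for child in node.get("children", [])
    )

def filter_accepts(node, needle, ancestors=()):
    # keep the row if it, an ancestor, or a descendant matches
    if row_matches(node, needle):
        return True
    if any(row_matches(a, needle) for a in ancestors):
        return True
    return any_descendant_matches(node, needle)
```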

capa/ida/explorer/view.py (new file, 281 lines)
@@ -0,0 +1,281 @@
from PyQt5 import QtWidgets, QtCore, QtGui
import idaapi
import idc
from capa.ida.explorer.model import CapaExplorerDataModel
from capa.ida.explorer.item import CapaExplorerFunctionItem
class CapaExplorerQtreeView(QtWidgets.QTreeView):
''' capa explorer QTreeView implementation
view controls UI action responses and displays data from
CapaExplorerDataModel
view does not modify CapaExplorerDataModel directly - data
modifications should be implemented in CapaExplorerDataModel
'''
def __init__(self, model, parent=None):
''' initialize CapaExplorerQTreeView
TODO
@param model: TODO
@param parent: TODO
'''
super(CapaExplorerQtreeView, self).__init__(parent)
self.setModel(model)
# TODO: get from parent??
self._model = model
self._parent = parent
# configure custom UI controls
self.setContextMenuPolicy(QtCore.Qt.CustomContextMenu)
self.setExpandsOnDoubleClick(False)
self.setSortingEnabled(True)
self._model.setDynamicSortFilter(False)
# configure view columns to auto-resize
for idx in range(CapaExplorerDataModel.COLUMN_COUNT):
self.header().setSectionResizeMode(idx, QtWidgets.QHeaderView.Interactive)
# connect slots to resize columns when expanded or collapsed
self.expanded.connect(self.resize_columns_to_content)
self.collapsed.connect(self.resize_columns_to_content)
# connect slots
self.customContextMenuRequested.connect(self._slot_custom_context_menu_requested)
self.doubleClicked.connect(self._slot_double_click)
# self.clicked.connect(self._slot_click)
self.setStyleSheet('QTreeView::item {padding-right: 15 px;padding-bottom: 2 px;}')
def reset(self):
''' reset user interface changes
called when view should reset any user interface changes
made since the last reset e.g. IDA window highlighting
'''
self.collapseAll()
self.resize_columns_to_content()
def resize_columns_to_content(self):
''' reset view columns to contents
TODO: prevent columns from shrinking
'''
self.header().resizeSections(QtWidgets.QHeaderView.ResizeToContents)
def _map_index_to_source_item(self, mindex):
''' map proxy model index to source model item
@param mindex: QModelIndex*
@retval QObject*
'''
return self._model.mapToSource(mindex).internalPointer()
def _send_data_to_clipboard(self, data):
''' copy data to the clipboard
@param data: data to be copied
'''
clip = QtWidgets.QApplication.clipboard()
clip.clear(mode=clip.Clipboard)
clip.setText(data, mode=clip.Clipboard)
def _new_action(self, display, data, slot):
''' create action for context menu
@param display: text displayed to user in context menu
@param data: data passed to slot
@param slot: slot to connect
@retval QAction*
'''
action = QtWidgets.QAction(display, self._parent)
action.setData(data)
action.triggered.connect(lambda checked: slot(action))
return action
def _load_default_context_menu_actions(self, data):
''' yield default custom context menu actions
@param data: tuple
@yield QAction*
'''
default_actions = [
('Copy column', data, self._slot_copy_column),
('Copy row', data, self._slot_copy_row),
# ('Filter', data, self._slot_filter),
]
# add default actions
for action in default_actions:
yield self._new_action(*action)
def _load_function_context_menu_actions(self, data):
''' yield actions specific to function custom context menu
@param data: tuple
@yield QAction*
'''
function_actions = [
('Rename function', data, self._slot_rename_function),
]
# add function actions
for action in function_actions:
yield self._new_action(*action)
# add default actions
for action in self._load_default_context_menu_actions(data):
yield action
def _load_default_context_menu(self, pos, item, mindex):
''' create default custom context menu
creates custom context menu containing default actions
@param pos: TODO
@param item: TODO
@param mindex: TODO
@retval QMenu*
'''
menu = QtWidgets.QMenu()
for action in self._load_default_context_menu_actions((pos, item, mindex)):
menu.addAction(action)
return menu
def _load_function_item_context_menu(self, pos, item, mindex):
''' create function custom context menu
creates custom context menu containing actions specific to functions
and the default actions
@param pos: TODO
@param item: TODO
@param mindex: TODO
@retval QMenu*
'''
menu = QtWidgets.QMenu()
for action in self._load_function_context_menu_actions((pos, item, mindex)):
menu.addAction(action)
return menu
def _show_custom_context_menu(self, menu, pos):
''' display custom context menu in view
@param menu: TODO
@param pos: TODO
'''
if not menu:
return
menu.exec_(self.viewport().mapToGlobal(pos))
def _slot_copy_column(self, action):
''' slot connected to custom context menu
allows user to select a column and copy the data
to clipboard
@param action: QAction*
'''
_, item, mindex = action.data()
self._send_data_to_clipboard(item.data(mindex.column()))
def _slot_copy_row(self, action):
''' slot connected to custom context menu
allows user to select a row and copy the space-delimited
data to clipboard
@param action: QAction*
'''
_, item, _ = action.data()
self._send_data_to_clipboard(str(item))
def _slot_rename_function(self, action):
''' slot connected to custom context menu
allows user to edit a function name and push
changes to IDA
@param action: QAction*
'''
_, item, mindex = action.data()
# make item temporary edit, reset after user is finished
item.setIsEditable(True)
self.edit(mindex)
item.setIsEditable(False)
def _slot_custom_context_menu_requested(self, pos):
''' slot connected to custom context menu request
displays custom context menu to user containing action
relevant to the data item selected
@param pos: TODO
'''
mindex = self.indexAt(pos)
if not mindex.isValid():
return
item = self._map_index_to_source_item(mindex)
column = mindex.column()
menu = None
if CapaExplorerDataModel.COLUMN_INDEX_RULE_INFORMATION == column and isinstance(item, CapaExplorerFunctionItem):
# user hovered function item
menu = self._load_function_item_context_menu(pos, item, mindex)
else:
# user hovered default item
menu = self._load_default_context_menu(pos, item, mindex)
# show custom context menu at view position
self._show_custom_context_menu(menu, pos)
def _slot_click(self):
''' slot connected to single click event '''
pass
def _slot_double_click(self, mindex):
''' slot connected to double click event
@param mindex: QModelIndex*
'''
if not mindex.isValid():
return
item = self._map_index_to_source_item(mindex)
column = mindex.column()
if CapaExplorerDataModel.COLUMN_INDEX_VIRTUAL_ADDRESS == column:
# user double-clicked virtual address column - navigate IDA to address
try:
idc.jumpto(int(item.data(1), 16))
except ValueError:
pass
if CapaExplorerDataModel.COLUMN_INDEX_RULE_INFORMATION == column:
# user double-clicked information column - un/expand
if self.isExpanded(mindex):
self.collapse(mindex)
else:
self.expand(mindex)

@@ -0,0 +1,19 @@
import idaapi
import idc
def get_disasm_line(va):
''' get disassembly line at the given virtual address '''
return idc.generate_disasm_line(va, idc.GENDSM_FORCE_CODE)
def is_func_start(ea):
''' check if a function starts at the given virtual address '''
f = idaapi.get_func(ea)
return f and f.start_ea == ea
def get_func_start_ea(ea):
''' get start address of the function containing ea, or None '''
f = idaapi.get_func(ea)
return f if f is None else f.start_ea

@@ -0,0 +1,459 @@
import os
import logging
import collections
from PyQt5.QtWidgets import (
QHeaderView,
QAbstractItemView,
QMenuBar,
QAction,
QTabWidget,
QWidget,
QTextEdit,
QMenu,
QApplication,
QVBoxLayout,
QToolTip,
QCheckBox,
QTableWidget,
QTableWidgetItem
)
from PyQt5.QtGui import QCursor, QIcon
from PyQt5.QtCore import Qt
import idaapi
import capa.main
import capa.rules
import capa.ida.helpers
import capa.ida.explorer.item
import capa.features.extractors.ida
from capa.ida.explorer.view import CapaExplorerQtreeView
from capa.ida.explorer.model import CapaExplorerDataModel
from capa.ida.explorer.proxy import CapaExplorerSortFilterProxyModel
PLUGIN_NAME = 'capaex'
logger = logging.getLogger(PLUGIN_NAME)
class CapaExplorerIdaHooks(idaapi.UI_Hooks):
def __init__(self, screen_ea_changed_hook, action_hooks):
''' facilitate IDA UI hooks
@param screen_ea_changed_hook: callback invoked when the screen ea changes
@param action_hooks: mapping of IDA action name to callback
'''
super(CapaExplorerIdaHooks, self).__init__()
self._screen_ea_changed_hook = screen_ea_changed_hook
self._process_action_hooks = action_hooks
self._process_action_handle = None
self._process_action_meta = {}
def preprocess_action(self, name):
''' called prior to action completed
@param name: name of action defined by idagui.cfg
@retval must be 0
'''
self._process_action_handle = self._process_action_hooks.get(name, None)
if self._process_action_handle:
self._process_action_handle(self._process_action_meta)
# must return 0 for IDA
return 0
def postprocess_action(self):
''' called after action completed '''
if not self._process_action_handle:
return
self._process_action_handle(self._process_action_meta, post=True)
self._reset()
def screen_ea_changed(self, curr_ea, prev_ea):
''' called after screen ea is changed
@param curr_ea: current ea
@param prev_ea: prev ea
'''
self._screen_ea_changed_hook(idaapi.get_current_widget(), curr_ea, prev_ea)
def _reset(self):
''' reset internal state '''
self._process_action_handle = None
self._process_action_meta.clear()
class CapaExplorerForm(idaapi.PluginForm):
def __init__(self):
''' initialize plugin form '''
super(CapaExplorerForm, self).__init__()
self.form_title = PLUGIN_NAME
self.parent = None
self._file_loc = __file__
self._ida_hooks = None
# models
self._model_data = None
self._model_proxy = None
# user interface elements
self._view_checkbox_limit_by = None
self._view_tree = None
self._view_summary = None
self._view_tabs = None
self._view_menu_bar = None
def OnCreate(self, form):
''' called when the plugin form is created '''
self.parent = self.FormToPyQtWidget(form)
self._load_interface()
self._load_capa_results()
self._load_ida_hooks()
self._view_tree.reset()
logger.info('form created.')
def Show(self):
''' show the plugin form '''
return idaapi.PluginForm.Show(self, self.form_title, options=(
idaapi.PluginForm.WOPN_TAB | idaapi.PluginForm.WCLS_CLOSE_LATER
))
def OnClose(self, form):
''' form is closed '''
self._unload_ida_hooks()
self._ida_reset()
logger.info('form closed.')
def _load_interface(self):
''' load user interface '''
# load models
self._model_data = CapaExplorerDataModel()
self._model_proxy = CapaExplorerSortFilterProxyModel()
self._model_proxy.setSourceModel(self._model_data)
# load tree
self._view_tree = CapaExplorerQtreeView(self._model_proxy, self.parent)
# load summary table
self._load_view_summary()
# load parent tab and children tab views
self._load_view_tabs()
self._load_view_checkbox_limit_by()
self._load_view_summary_tab()
self._load_view_tree_tab()
# load menu bar and sub menus
self._load_view_menu_bar()
self._load_file_menu()
# load parent view
self._load_view_parent()
def _load_view_tabs(self):
''' load tab widget '''
tabs = QTabWidget()
self._view_tabs = tabs
def _load_view_menu_bar(self):
''' load menu bar '''
bar = QMenuBar()
# bar.hovered.connect(self._slot_menu_bar_hovered)
self._view_menu_bar = bar
def _load_view_summary(self):
''' load results summary table '''
table = QTableWidget()
table.setColumnCount(4)
table.verticalHeader().setVisible(False)
table.setSortingEnabled(False)
table.setEditTriggers(QAbstractItemView.NoEditTriggers)
table.setFocusPolicy(Qt.NoFocus)
table.setSelectionMode(QAbstractItemView.NoSelection)
table.setHorizontalHeaderLabels([
'Objectives',
'Behaviors',
'Techniques',
'Rule Hits'
])
table.horizontalHeader().setDefaultAlignment(Qt.AlignLeft)
table.setStyleSheet('QTableWidget::item { border: none; padding: 15px; }')
table.setShowGrid(False)
self._view_summary = table
def _load_view_checkbox_limit_by(self):
''' load checkbox to limit results to the current function '''
check = QCheckBox('Limit results to current function')
check.setChecked(False)
check.stateChanged.connect(self._slot_checkbox_limit_by_changed)
self._view_checkbox_limit_by = check
def _load_view_parent(self):
''' load view parent '''
layout = QVBoxLayout()
layout.addWidget(self._view_tabs)
layout.setMenuBar(self._view_menu_bar)
self.parent.setLayout(layout)
def _load_view_tree_tab(self):
''' load view tree tab '''
layout = QVBoxLayout()
layout.addWidget(self._view_checkbox_limit_by)
layout.addWidget(self._view_tree)
tab = QWidget()
tab.setLayout(layout)
self._view_tabs.addTab(tab, 'Tree View')
def _load_view_summary_tab(self):
''' load summary tab '''
layout = QVBoxLayout()
layout.addWidget(self._view_summary)
tab = QWidget()
tab.setLayout(layout)
self._view_tabs.addTab(tab, 'Summary')
def _load_file_menu(self):
''' load file menu actions '''
actions = (
('Reset view', 'Reset plugin view', self.reset),
('Run analysis', 'Run capa analysis on current database', self.reload),
)
menu = self._view_menu_bar.addMenu('File')
for name, _, handle in actions:
action = QAction(name, self.parent)
action.triggered.connect(handle)
# action.setToolTip(tip)
menu.addAction(action)
def _load_ida_hooks(self):
''' install IDA UI hooks '''
action_hooks = {
'MakeName': self._ida_hook_rename,
'EditFunction': self._ida_hook_rename,
}
self._ida_hooks = CapaExplorerIdaHooks(self._ida_hook_screen_ea_changed, action_hooks)
self._ida_hooks.hook()
def _unload_ida_hooks(self):
''' unhook IDA user interface '''
if self._ida_hooks:
self._ida_hooks.unhook()
def _ida_hook_rename(self, meta, post=False):
''' hook for IDA rename action
called twice, once before action and once after
action completes
@param meta: dict used to persist state between the pre and post callbacks
@param post: False before the action runs, True after it completes
'''
ea = idaapi.get_screen_ea()
if not ea or not capa.ida.helpers.is_func_start(ea):
return
curr_name = idaapi.get_name(ea)
if post:
# post action update data model w/ current name
self._model_data.update_function_name(meta.get('prev_name', ''), curr_name)
else:
# pre action so save current name for replacement later
meta['prev_name'] = curr_name
def _ida_hook_screen_ea_changed(self, widget, new_ea, old_ea):
''' hook for screen ea changed; update function filter when limit-by checkbox is checked '''
if not self._view_checkbox_limit_by.isChecked():
# ignore if checkbox not selected
return
if idaapi.get_widget_type(widget) != idaapi.BWN_DISASM:
# ignore views other than asm
return
# attempt to map virtual addresses to function start addresses
new_func_start = capa.ida.helpers.get_func_start_ea(new_ea)
old_func_start = capa.ida.helpers.get_func_start_ea(old_ea)
if new_func_start and new_func_start == old_func_start:
# navigated within the same function - do nothing
return
if new_func_start:
# navigated to new function - filter for function start virtual address
match = capa.ida.explorer.item.ea_to_hex_str(new_func_start)
else:
# navigated to virtual address not in valid function - clear filter
match = ''
# filter on virtual address to avoid updating filter string if function name is changed
self._model_proxy.add_single_string_filter(CapaExplorerDataModel.COLUMN_INDEX_VIRTUAL_ADDRESS, match)
def _load_capa_results(self):
''' run capa analysis against the current database and render the results '''
logger.info('-' * 80)
logger.info(' Using default embedded rules.')
logger.info(' ')
logger.info(' You can see the current default rule set here:')
logger.info(' https://github.com/fireeye/capa-rules')
logger.info('-' * 80)
rules_path = os.path.join(os.path.dirname(self._file_loc), '../..', 'rules')
rules = capa.main.get_rules(rules_path)
rules = capa.rules.RuleSet(rules)
results = capa.main.find_capabilities(rules, capa.features.extractors.ida.IdaFeatureExtractor(), True)
logger.info('analysis completed.')
self._model_data.render_capa_results(rules, results)
self._render_capa_summary(rules, results)
logger.info('render views completed.')
def _render_capa_summary(self, ruleset, results):
''' render results summary table
keep sync with capa.main
@param ruleset: capa.rules.RuleSet used for analysis
@param results: mapping of rule name to match results
'''
rules = set(filter(lambda x: not ruleset.rules[x].meta.get('lib', False), results.keys()))
objectives = set()
behaviors = set()
techniques = set()
for rule in rules:
parts = ruleset.rules[rule].meta.get(capa.main.RULE_CATEGORY, '').split('/')
if len(parts) == 0 or list(parts) == ['']:
continue
if len(parts) > 0:
objective = parts[0].replace('-', ' ')
objectives.add(objective)
if len(parts) > 1:
behavior = parts[1].replace('-', ' ')
behaviors.add(behavior)
if len(parts) > 2:
technique = parts[2].replace('-', ' ')
techniques.add(technique)
if len(parts) > 3:
raise capa.rules.InvalidRule(capa.main.RULE_CATEGORY + " tag must have at most three components")
# set row count to max set size
self._view_summary.setRowCount(max(map(len, (rules, objectives, behaviors, techniques))))
# format rule hits
rules = map(lambda x: '%s (%d)' % (x, len(results[x])), rules)
# sort results
columns = list(map(lambda x: sorted(x, key=lambda s: s.lower()), (objectives, behaviors, techniques, rules)))
# load results into table by column
for idx, column in enumerate(columns):
self._load_view_summary_column(idx, column)
# resize columns to content
self._view_summary.resizeColumnsToContents()
def _load_view_summary_column(self, column, texts):
''' load texts into the given summary table column '''
for row, text in enumerate(texts):
self._view_summary.setItem(row, column, QTableWidgetItem(text))
def _ida_reset(self):
''' reset IDA user interface '''
self._model_data.reset()
self._view_tree.reset()
self._view_checkbox_limit_by.setChecked(False)
def reload(self):
''' reload views and re-run capa analysis '''
self._ida_reset()
self._model_proxy.invalidate()
self._model_data.clear()
self._view_summary.setRowCount(0)
self._load_capa_results()
logger.info('reload complete.')
idaapi.info('%s reload completed.' % PLUGIN_NAME)
def reset(self):
''' reset user interface elements
e.g. checkboxes and IDA highlighting
'''
self._ida_reset()
logger.info('reset completed.')
idaapi.info('%s reset completed.' % PLUGIN_NAME)
def _slot_menu_bar_hovered(self, action):
''' display menu action tooltip
@param action: QAction*
@reference: https://stackoverflow.com/questions/21725119/why-wont-qtooltips-appear-on-qactions-within-a-qmenu
'''
QToolTip.showText(QCursor.pos(), action.toolTip(), self._view_menu_bar, self._view_menu_bar.actionGeometry(action))
def _slot_checkbox_limit_by_changed(self):
''' slot activated if checkbox clicked
if checked, configure function filter if screen ea is located
in function, otherwise clear filter
'''
match = ''
if self._view_checkbox_limit_by.isChecked():
ea = capa.ida.helpers.get_func_start_ea(idaapi.get_screen_ea())
if ea:
match = capa.ida.explorer.item.ea_to_hex_str(ea)
self._model_proxy.add_single_string_filter(CapaExplorerDataModel.COLUMN_INDEX_VIRTUAL_ADDRESS, match)
self._view_tree.resize_columns_to_content()
def main():
''' plugin entry point; TODO: move to idaapi.plugin_t class '''
logging.basicConfig(level=logging.INFO)
global CAPA_EXPLORER_FORM
try:
# there is an instance, reload it
CAPA_EXPLORER_FORM
CAPA_EXPLORER_FORM.Close()
CAPA_EXPLORER_FORM = CapaExplorerForm()
except Exception:
# there is no instance yet
CAPA_EXPLORER_FORM = CapaExplorerForm()
CAPA_EXPLORER_FORM.Show()
if __name__ == '__main__':
main()
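The summary table above buckets each matched rule by its `rule-category` meta tag, of the form `$objective[/$behavior[/$technique]]`. That parsing can be sketched outside IDA; `parse_rule_category` is a hypothetical helper for illustration, not part of capa:

```python
def parse_rule_category(tag):
    ''' split a rule-category tag into (objective, behavior, technique), hyphens as spaces '''
    parts = tag.split('/')
    if parts == ['']:
        # rule has no rule-category meta
        return (None, None, None)
    if len(parts) > 3:
        raise ValueError('rule-category tag must have at most three components')
    parts = [p.replace('-', ' ') for p in parts]
    # pad missing trailing components with None
    return tuple(parts + [None] * (3 - len(parts)))

print(parse_rule_category('host-interaction/process/create'))
# -> ('host interaction', 'process', 'create')
```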


@@ -0,0 +1,284 @@
# TODO documentation
import logging
import binascii
import textwrap
from collections import Counter, defaultdict
from PyQt5 import QtWidgets, QtCore
from PyQt5.QtWidgets import QTreeWidget, QTreeWidgetItem, QTextEdit, QHeaderView
import idc
import idaapi
import capa
import capa.main
from capa.ida import plugin_helpers
import capa.features.extractors.ida.helpers
logger = logging.getLogger('rulegen')
AUTHOR_NAME = ''
COLOR_HIGHLIGHT = 0xD096FF
def get_func_start(ea):
f = idaapi.get_func(ea)
if f:
return f.start_ea
else:
return None
class Hooks(idaapi.UI_Hooks):
'''
Notifies the plugin when navigating to another function
NOTE: it uses the global variable RULE_GEN_FORM to access the
PluginForm object. This looks nasty, maybe there is a better way?
'''
def screen_ea_changed(self, ea, prev_ea):
widget = idaapi.get_current_widget()
if idaapi.get_widget_type(widget) != idaapi.BWN_DISASM:
# Ignore non disassembly views
return
try:
f1 = get_func_start(ea)
f2 = get_func_start(prev_ea)
if f1 != f2:
# changed to another function
RULE_GEN_FORM.reload_features_tree()
except Exception as e:
logger.warning('exception: %s', e)
class RuleGeneratorForm(idaapi.PluginForm):
def __init__(self):
super(RuleGeneratorForm, self).__init__()
self.title = 'capa rule generator'
self.parent = None
self.parent_items = {}
self.orig_colors = None
self.hooks = Hooks() # dirty?
if self.hooks.hook():
logger.info('UI notification hook installed successfully')
def init_ui(self):
self.tree = QTreeWidget()
self.rule_text = QTextEdit()
self.rule_text.setMinimumWidth(350)
self.reload_features_tree()
button_reset = QtWidgets.QPushButton('&Reset')
button_reset.clicked.connect(self.reset)
h_layout = QtWidgets.QHBoxLayout()
v_layout = QtWidgets.QVBoxLayout()
h_layout.addWidget(self.tree)
h_layout.addWidget(self.rule_text)
v_layout.addLayout(h_layout)
v_layout.addWidget(button_reset)
self.parent.setLayout(v_layout)
def reset(self):
plugin_helpers.reset_selection(self.tree)
plugin_helpers.reset_colors(self.orig_colors)
self.rule_text.setText('')
def reload_features_tree(self):
self.reset()
self.tree.clear()
self.orig_colors = None
self.parent_items = {}
features = self.get_features()
if not features:
return
feature_vas = set().union(*features.values())
self.orig_colors = plugin_helpers.get_orig_color_feature_vas(feature_vas)
self.create_tree(features)
self.tree.update()
def get_features(self):
# load like standalone tool
extractor = capa.features.extractors.ida.IdaFeatureExtractor()
f = idaapi.get_func(idaapi.get_screen_ea())
if not f:
logger.info('function does not exist at 0x%x', idaapi.get_screen_ea())
return
return self.extract_function_features(f)
def extract_function_features(self, f):
features = defaultdict(set)
for bb in idaapi.FlowChart(f, flags=idaapi.FC_PREDS):
for insn in capa.features.extractors.ida.helpers.get_instructions_in_range(bb.start_ea, bb.end_ea):
for feature, va in capa.features.extractors.ida.insn.extract_features(f, bb, insn):
features[feature].add(va)
for feature, va in capa.features.extractors.ida.basicblock.extract_features(f, bb):
features[feature].add(va)
return features
def create_tree(self, features):
self.tree.setMinimumWidth(400)
# self.tree.setMinimumHeight(300)
self.tree.setHeaderLabels(['Feature', 'Virtual Address', 'Disassembly'])
# auto resize columns
self.tree.header().setSectionResizeMode(QHeaderView.ResizeToContents)
self.tree.itemClicked.connect(self.on_item_clicked)
# features sorted by location of first occurrence
# TODO fix characteristic features display and rule text
for feature, vas in sorted(features.items(), key=lambda k: sorted(k[1])):
# level 0
if type(feature) not in self.parent_items:
self.parent_items[type(feature)] = plugin_helpers.add_child_item(self.tree, [feature.name.lower()])
# level 1
if feature not in self.parent_items:
self.parent_items[feature] = plugin_helpers.add_child_item(self.parent_items[type(feature)], [str(feature)])
# level n > 1
if len(vas) > 1:
for va in sorted(vas):
plugin_helpers.add_child_item(self.parent_items[feature], [str(feature), '0x%X' % va, plugin_helpers.get_disasm_line(va)], feature)
else:
va = vas.pop()
self.parent_items[feature].setText(0, str(feature))
self.parent_items[feature].setText(1, '0x%X' % va)
self.parent_items[feature].setText(2, plugin_helpers.get_disasm_line(va))
self.parent_items[feature].setData(0, 0x100, feature)
# @QtCore.pyqtSlot(QTreeWidgetItem, int)
def on_item_clicked(self, it, col):
# logger.debug('clicked %s, %s, %s', it, col, it.text(col))
# jump to address
if col == 1 and it.text(col):
va = int(it.text(col), 0x10)
if va:
idc.jumpto(va)
# highlight in disassembly
plugin_helpers.reset_colors(self.orig_colors)
selected = self.get_selected_items()
for va in selected.keys():
idc.set_color(va, idc.CIC_ITEM, COLOR_HIGHLIGHT)
self.update_rule_text()
def update_rule_text(self):
features = self.get_selected_items().values()
rule = self.get_rule_from_features(features)
self.rule_text.setText(rule)
def get_rule_from_features(self, features):
rule_parts = []
# map each feature to the number of times it occurred
counted = Counter(features).items()
# single features
for k, v in filter(lambda t: t[1] == 1, counted):
# TODO args to hex if int
if k.name.lower() == 'bytes':
# Convert raw bytes to uppercase hex representation (e.g., '12 34 56')
upper_hex_bytes = binascii.hexlify(args_to_str(k.args)).upper()
rule_value_str = ' '.join(upper_hex_bytes[i:i + 2] for i in range(0, len(upper_hex_bytes), 2))
r = ' - %s: %s' % (k.name.lower(), rule_value_str)
else:
r = ' - %s: %s' % (k.name.lower(), args_to_str(k.args))
rule_parts.append(r)
# counted features
for k, v in filter(lambda t: t[1] > 1, counted):
r = ' - count(%s): %d' % (str(k), v)
rule_parts.append(r)
rule_prefix = textwrap.dedent('''
rule:
meta:
name:
author: %s
scope: function
examples:
- %s:0x%X
features:
''' % (AUTHOR_NAME, idc.retrieve_input_file_md5(), get_func_start(idc.here()))).strip()
return '%s\n%s' % (rule_prefix, '\n'.join(sorted(rule_parts)))
# TODO merge into capa_idautils, get feature data
def get_selected_items(self):
selected = {}
iterator = QtWidgets.QTreeWidgetItemIterator(self.tree, QtWidgets.QTreeWidgetItemIterator.Checked)
while iterator.value():
item = iterator.value()
if item.text(1):
# logger.debug('selected %s, %s, %s', item.text(1), item.text(0), item.data(0, 0x100))
selected[int(item.text(1), 0x10)] = item.data(0, 0x100)
iterator += 1
return selected
# ----------------------------------------------------------
# IDA Plugin API
# ----------------------------------------------------------
def OnCreate(self, form):
self.parent = self.FormToPyQtWidget(form)
self.init_ui()
def Show(self):
return idaapi.PluginForm.Show(self, self.title, options=(
idaapi.PluginForm.WOPN_RESTORE
| idaapi.PluginForm.WOPN_PERSIST
))
def OnClose(self, form):
self.reset()
if self.hooks.unhook():
logger.info('UI notification hook uninstalled successfully')
logger.info('RuleGeneratorForm closed')
def args_to_str(args):
a = []
for arg in args:
if (isinstance(arg, int) or isinstance(arg, long)) and arg > 10:
a.append('0x%X' % arg)
else:
a.append(str(arg))
return ','.join(a)
def main():
logging.basicConfig(level=logging.INFO)
global RULE_GEN_FORM
try:
# there is an instance, reload it
RULE_GEN_FORM
RULE_GEN_FORM.Close()
RULE_GEN_FORM = RuleGeneratorForm()
except Exception:
# there is no instance yet
RULE_GEN_FORM = RuleGeneratorForm()
RULE_GEN_FORM.Show()
if __name__ == '__main__':
main()
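get_rule_from_features above splits features into single occurrences and repeated ones (rendered as `count(...)`) via `collections.Counter`. A minimal sketch of that split, with plain strings standing in for capa feature objects; `split_features` is a hypothetical name used only here:

```python
from collections import Counter

def split_features(features):
    ''' partition features into sorted single occurrences and (feature, count) repeats '''
    counted = Counter(features).items()
    singles = sorted(k for k, v in counted if v == 1)
    repeats = sorted((k, v) for k, v in counted if v > 1)
    return singles, repeats

features = ['api: CreateFile', 'mnemonic: xor', 'mnemonic: xor', 'string: ACR > ']
print(split_features(features))
# -> (['api: CreateFile', 'string: ACR > '], [('mnemonic: xor', 2)])
```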


@@ -0,0 +1,93 @@
import os
import logging
from PyQt5.QtWidgets import QTreeWidgetItem, QTreeWidgetItemIterator
from PyQt5.QtCore import Qt
import idc
import idaapi
CAPA_EXTENSION = '.capas'
logger = logging.getLogger('capa_ida')
def get_input_file(freeze=True):
'''
get input file path
freeze (bool): if True, get freeze file if it exists
'''
# try original file in same directory as idb/i64 without idb/i64 file extension
input_file = idc.get_idb_path()[:-4]
if freeze:
# use frozen file if it exists
freeze_file_cand = '%s%s' % (input_file, CAPA_EXTENSION)
if os.path.isfile(freeze_file_cand):
return freeze_file_cand
if not os.path.isfile(input_file):
# TM naming
input_file = '%s.mal_' % idc.get_idb_path()[:-4]
if not os.path.isfile(input_file):
input_file = idaapi.ask_file(0, '*.*', 'Please specify input file.')
if not input_file:
raise ValueError('could not find input file')
return input_file
def get_orig_color_feature_vas(vas):
orig_colors = {}
for va in vas:
orig_colors[va] = idc.get_color(va, idc.CIC_ITEM)
return orig_colors
def reset_colors(orig_colors):
if orig_colors:
for va, color in orig_colors.iteritems():
idc.set_color(va, idc.CIC_ITEM, color)
def reset_selection(tree):
iterator = QTreeWidgetItemIterator(tree, QTreeWidgetItemIterator.Checked)
while iterator.value():
item = iterator.value()
item.setCheckState(0, Qt.Unchecked) # column, state
iterator += 1
def get_disasm_line(va):
return idc.generate_disasm_line(va, idc.GENDSM_FORCE_CODE)
def get_selected_items(tree, skip_level_1=False):
selected = []
iterator = QTreeWidgetItemIterator(tree, QTreeWidgetItemIterator.Checked)
while iterator.value():
item = iterator.value()
if skip_level_1:
# hacky way to check if item is at level 1, if so, skip
# alternative, check if text in disasm column
if item.parent() and item.parent().parent() is None:
iterator += 1
continue
if item.text(1):
# logger.debug('selected %s, %s', item.text(0), item.text(1))
selected.append(int(item.text(1), 0x10))
iterator += 1
return selected
def add_child_item(parent, values, feature=None):
child = QTreeWidgetItem(parent)
child.setFlags(child.flags() | Qt.ItemIsTristate | Qt.ItemIsUserCheckable)
for i, v in enumerate(values):
child.setText(i, v)
if feature:
child.setData(0, 0x100, feature)
child.setCheckState(0, Qt.Unchecked)
return child
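get_input_file above probes several candidate paths derived from the IDB path. The candidate ordering can be factored out of IDA for testing; this is a sketch, and `candidate_input_files` is a hypothetical helper mirroring the plugin's `.capas` freeze-file and `.mal_` naming conventions:

```python
def candidate_input_files(idb_path, freeze=True):
    ''' return candidate input file paths, in the order get_input_file probes them '''
    base = idb_path[:-4]  # strip '.idb' / '.i64'
    cands = []
    if freeze:
        # frozen feature file takes priority if present
        cands.append(base + '.capas')
    cands.append(base)            # original file next to the idb
    cands.append(base + '.mal_')  # TM naming convention
    return cands

print(candidate_input_files('/samples/suspicious.idb'))
# -> ['/samples/suspicious.capas', '/samples/suspicious', '/samples/suspicious.mal_']
```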

777 capa/main.py Normal file

@@ -0,0 +1,777 @@
#!/usr/bin/env python2
'''
capa - detect capabilities in programs.
'''
import os
import os.path
import sys
import logging
import collections
import tqdm
import argparse
import capa.rules
import capa.engine
import capa.features
import capa.features.freeze
import capa.features.extractors
from capa.helpers import oint
SUPPORTED_FILE_MAGIC = set(['MZ'])
logger = logging.getLogger('capa')
def set_vivisect_log_level(level):
logging.getLogger('vivisect').setLevel(level)
logging.getLogger('vtrace').setLevel(level)
logging.getLogger('envi').setLevel(level)
def find_function_capabilities(ruleset, extractor, f):
# contains features from:
# - insns
# - function
function_features = collections.defaultdict(set)
bb_matches = collections.defaultdict(list)
for feature, va in extractor.extract_function_features(f):
function_features[feature].add(va)
for bb in extractor.get_basic_blocks(f):
# contains features from:
# - insns
# - basic blocks
bb_features = collections.defaultdict(set)
for feature, va in extractor.extract_basic_block_features(f, bb):
bb_features[feature].add(va)
for insn in extractor.get_instructions(f, bb):
for feature, va in extractor.extract_insn_features(f, bb, insn):
bb_features[feature].add(va)
function_features[feature].add(va)
_, matches = capa.engine.match(ruleset.basic_block_rules, bb_features, oint(bb))
for rule_name, res in matches.items():
bb_matches[rule_name].extend(res)
for va, _ in res:
function_features[capa.features.MatchedRule(rule_name)].add(va)
_, function_matches = capa.engine.match(ruleset.function_rules, function_features, oint(f))
return function_matches, bb_matches
def find_file_capabilities(ruleset, extractor, function_features):
file_features = collections.defaultdict(set)
for feature, va in extractor.extract_file_features():
# not all file features may have virtual addresses.
# if not, then at least ensure the feature shows up in the index.
# the set of addresses will still be empty.
if va:
file_features[feature].add(va)
else:
if feature not in file_features:
file_features[feature] = set()
logger.info('analyzed file and extracted %d features', len(file_features))
file_features.update(function_features)
_, matches = capa.engine.match(ruleset.file_rules, file_features, 0x0)
return matches
def find_capabilities(ruleset, extractor, disable_progress=None):
all_function_matches = collections.defaultdict(list)
all_bb_matches = collections.defaultdict(list)
for f in tqdm.tqdm(extractor.get_functions(), disable=disable_progress, unit=' functions'):
function_matches, bb_matches = find_function_capabilities(ruleset, extractor, f)
for rule_name, res in function_matches.items():
all_function_matches[rule_name].extend(res)
for rule_name, res in bb_matches.items():
all_bb_matches[rule_name].extend(res)
# mapping from matched rule feature to set of addresses at which it matched.
# type: Dict[MatchedRule, Set[int]]
function_features = {capa.features.MatchedRule(rule_name): set(map(lambda p: p[0], results))
for rule_name, results in all_function_matches.items()}
all_file_matches = find_file_capabilities(ruleset, extractor, function_features)
matches = {}
matches.update(all_bb_matches)
matches.update(all_function_matches)
matches.update(all_file_matches)
return matches
def pluck_meta(rules, key):
for rule in rules:
value = rule.meta.get(key)
if value:
yield value
def get_dispositions(matched_rules):
for disposition in pluck_meta(matched_rules, 'maec/analysis-conclusion'):
yield disposition
for disposition in pluck_meta(matched_rules, 'maec/analysis-conclusion-ov'):
yield disposition
def get_roles(matched_rules):
for role in pluck_meta(matched_rules, 'maec/malware-category'):
yield role
for role in pluck_meta(matched_rules, 'maec/malware-category-ov'):
yield role
RULE_CATEGORY = 'rule-category'
def is_other_feature_rule(rule):
'''
does this rule *not* have any of:
- maec/malware-category
- maec/analysis-conclusion
- rule-category
if so, it will be placed into the "other features" bucket
'''
if rule.meta.get('lib', False):
return False
for meta in ('maec/analysis-conclusion',
'maec/analysis-conclusion-ov',
'maec/malware-category',
'maec/malware-category-ov',
RULE_CATEGORY):
if meta in rule.meta:
return False
return True
def render_capabilities_default(ruleset, results):
rules = [ruleset.rules[rule_name] for rule_name in results.keys()]
# we render the highest level conclusions first:
#
# 1. is it malware?
# 2. what is the role? (dropper, backdoor, etc.)
#
# after this, we'll enumerate the specific objectives, behaviors, and techniques.
dispositions = list(sorted(get_dispositions(rules)))
if dispositions:
print('disposition: ' + ', '.join(dispositions))
categories = list(sorted(get_roles(rules)))
if categories:
print('role: ' + ', '.join(categories))
# rules may have a meta tag `rule-category` that specifies:
#
# rule-category: $objective[/$behavior[/$technique]]
#
# this classification describes a tree of increasingly specific conclusions.
# the tree allows us to tie a high-level conclusion, e.g. an objective, to
# the evidence of this - the behaviors, techniques, rules, and ultimately, features.
# this data structure is a nested map:
#
# objective name -> behavior name -> technique name -> rule name -> rule
#
# at each level, a matched rule is also legal.
# this indicates that only a portion of the rule-category was provided.
o = collections.defaultdict(
lambda: collections.defaultdict(
lambda: collections.defaultdict(
dict
)
)
)
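# for example, a rule named "create service" tagged with
#   rule-category: persistence/service/create-service
# is stored as:
#   o['persistence']['service']['create service']['create service'] = rule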
objectives = set()
behaviors = set()
techniques = set()
for rule in rules:
objective = None
behavior = None
technique = None
parts = rule.meta.get(RULE_CATEGORY, '').split('/')
if len(parts) == 0 or list(parts) == ['']:
continue
if len(parts) > 0:
objective = parts[0].replace('-', ' ')
objectives.add(objective)
if len(parts) > 1:
behavior = parts[1].replace('-', ' ')
behaviors.add(behavior)
if len(parts) > 2:
technique = parts[2].replace('-', ' ')
techniques.add(technique)
if len(parts) > 3:
raise capa.rules.InvalidRule(RULE_CATEGORY + " tag must have at most three components")
if technique:
o[objective][behavior][technique][rule.name] = rule
elif behavior:
o[objective][behavior][rule.name] = rule
elif objective:
o[objective][rule.name] = rule
if objectives:
print('\nobjectives:')
for objective in sorted(objectives):
print(' ' + objective)
if behaviors:
print('\nbehaviors:')
for behavior in sorted(behaviors):
print(' ' + behavior)
if techniques:
print('\ntechniques:')
for technique in sorted(techniques):
print(' ' + technique)
other_features = list(filter(is_other_feature_rule, rules))
if other_features:
print('\nother features:')
for rule in sorted(map(lambda r: r.name, other_features)):
print(' ' + rule)
# now, render a tree of the objectives, behaviors, techniques, and matched rule names.
# it will look something like:
#
# details:
# load data
# load data from self
# load data from resource
# extract resource via API
#
# implementation note:
# when we enumerate the items in this tree, we have two cases:
#
# 1. usually, we'll get a pair (objective name, map of children); but it's possible that
# 2. we'll get a pair (rule name, rule instance)
#
# this is why we do the `isinstance(..., Rule)` check below.
#
# i believe the alternative, to have separate data structures for the tree and rules,
# is probably more code and more confusing.
if o:
print('\ndetails:')
for objective, behaviors in o.items():
print(' ' + objective)
if isinstance(behaviors, capa.rules.Rule):
continue
for behavior, techniques in behaviors.items():
print(' ' + behavior)
if isinstance(techniques, capa.rules.Rule):
continue
for technique, rules in techniques.items():
print(' ' + technique)
if isinstance(rules, capa.rules.Rule):
continue
for rule in rules.keys():
print(' ' + rule)
def render_capabilities_concise(results):
'''
print the matching rules, newline separated.
example:
foo
bar
mimikatz::kull_m_arc_sendrecv
'''
for rule in sorted(results.keys()):
print(rule)
def render_capabilities_verbose(results):
'''
print the matching rules, and the functions in which they matched.
example:
foo:
- 0x401000
- 0x401005
bar:
- 0x402044
- 0x402076
mimikatz::kull_m_arc_sendrecv:
- 0x40105d
'''
for rule, ress in results.items():
print('%s:' % (rule))
seen = set([])
for (fva, _) in sorted(ress, key=lambda p: p[0]):
if fva in seen:
continue
print(' - 0x%x' % (fva))
seen.add(fva)
def render_result(res, indent=''):
'''
render the given Result to stdout.
args:
res (capa.engine.Result)
indent (str)
'''
# prune failing branches
if not res.success:
return
if isinstance(res.statement, capa.engine.Some):
if res.statement.count == 0:
# we asked for optional, so we'll match even if no children matched.
# but in this case, it's not worth rendering the optional node.
if sum(map(lambda c: c.success, res.children)) > 0:
print('%soptional:' % indent)
else:
print("%s%d or more" % (indent, res.statement.count))
elif not isinstance(res.statement, (capa.features.Feature, capa.engine.Element, capa.engine.Range, capa.engine.Regex)):
# when rendering a structural node (and/or/not),
# then we only care about the node name.
#
# for example:
#
# and:
# Number(0x3136b0): True
# Number(0x3136b0): True
print('%s%s:' % (indent, res.statement.name.lower()))
else:
# but when rendering a Feature, we want to see any arguments to it
#
# for example:
#
# Number(0x3136b0): True
print('%s%s:' % (indent, res.statement))
for location in sorted(res.locations):
print('%s - virtual address: 0x%x' % (indent, location))
for children in res.children:
render_result(children, indent=indent + ' ')
def render_capabilities_vverbose(results):
'''
print the matching rules, the functions in which they matched,
and the logic tree with annotated matching features.
example:
function mimikatz::kull_m_arc_sendrecv:
- 0x40105d
Or:
And:
string("ACR > "):
- virtual address: 0x401089
number(0x3136b0):
- virtual address: 0x4010c8
'''
for rule, ress in results.items():
print('rule %s:' % (rule))
for (fva, res) in sorted(ress, key=lambda p: p[0]):
print(' - function 0x%x:' % (fva))
render_result(res, indent=' ')
def appears_rule_cat(rules, capabilities, rule_cat):
for rule_name in capabilities.keys():
if rules.rules[rule_name].meta.get('rule-category', '').startswith(rule_cat):
return True
return False
def is_supported_file_type(sample):
'''
Return if this is a supported file based on magic header values
'''
with open(sample, 'rb') as f:
magic = f.read(2)
return magic in SUPPORTED_FILE_MAGIC
def get_shellcode_vw(sample, arch='auto'):
'''
Return shellcode workspace using explicit arch or via auto detect
'''
import viv_utils
with open(sample, 'rb') as f:
sample_bytes = f.read()
if arch == 'auto':
# choose arch with most functions, idea by Jay G.
vw_cands = []
for arch in ['i386', 'amd64']:
vw_cands.append(viv_utils.getShellcodeWorkspace(sample_bytes, arch))
if not vw_cands:
raise ValueError('could not generate vivisect workspace')
vw = max(vw_cands, key=lambda vw: len(vw.getFunctions()))
else:
vw = viv_utils.getShellcodeWorkspace(sample_bytes, arch)
vw.setMeta('Format', 'blob') # TODO fix in viv_utils
return vw
def get_meta_str(vw):
'''
Return workspace meta information string
'''
meta = []
for k in ['Format', 'Platform', 'Architecture']:
if k in vw.metadata:
meta.append('%s: %s' % (k.lower(), vw.metadata[k]))
return '%s, number of functions: %d' % (', '.join(meta), len(vw.getFunctions()))
class UnsupportedFormatError(ValueError):
pass
def get_workspace(path, format):
import viv_utils
logger.info('generating vivisect workspace for: %s', path)
if format == 'auto':
if not is_supported_file_type(path):
raise UnsupportedFormatError()
vw = viv_utils.getWorkspace(path)
elif format == 'pe':
vw = viv_utils.getWorkspace(path)
elif format == 'sc32':
vw = get_shellcode_vw(path, arch='i386')
elif format == 'sc64':
vw = get_shellcode_vw(path, arch='amd64')
logger.info('%s', get_meta_str(vw))
return vw
def get_extractor_py2(path, format):
import capa.features.extractors.viv
vw = get_workspace(path, format)
return capa.features.extractors.viv.VivisectFeatureExtractor(vw, path)
class UnsupportedRuntimeError(RuntimeError):
pass
def get_extractor_py3(path, format):
raise UnsupportedRuntimeError()
def get_extractor(path, format):
'''
raises:
UnsupportedFormatError:
'''
if sys.version_info >= (3, 0):
return get_extractor_py3(path, format)
else:
return get_extractor_py2(path, format)
def is_nursery_rule_path(path):
'''
The nursery is a spot for rules that have not yet been fully polished.
    For example, they may not have references to a public example of a technique.
Yet, we still want to capture and report on their matches.
The nursery is currently a subdirectory of the rules directory with that name.
When nursery rules are loaded, their metadata section should be updated with:
`nursery=True`.
'''
return 'nursery' in path
def get_rules(rule_path):
if not os.path.exists(rule_path):
raise IOError('%s does not exist or cannot be accessed' % rule_path)
rules = []
if os.path.isfile(rule_path):
logger.info('reading rule file: %s', rule_path)
with open(rule_path, 'rb') as f:
rule = capa.rules.Rule.from_yaml(f.read().decode('utf-8'))
        if is_nursery_rule_path(rule_path):
rule.meta['nursery'] = True
rules.append(rule)
logger.debug('rule: %s scope: %s', rule.name, rule.scope)
elif os.path.isdir(rule_path):
logger.info('reading rules from directory %s', rule_path)
for root, dirs, files in os.walk(rule_path):
for file in files:
if not file.endswith('.yml'):
logger.warning('skipping non-.yml file: %s', file)
continue
path = os.path.join(root, file)
logger.debug('reading rule file: %s', path)
try:
rule = capa.rules.Rule.from_yaml_file(path)
except capa.rules.InvalidRule:
raise
else:
if is_nursery_rule_path(root):
rule.meta['nursery'] = True
rules.append(rule)
logger.debug('rule: %s scope: %s', rule.name, rule.scope)
return rules
def main(argv=None):
if argv is None:
argv = sys.argv[1:]
formats = [
('auto', '(default) detect file type automatically'),
('pe', 'Windows PE file'),
('sc32', '32-bit shellcode'),
('sc64', '64-bit shellcode'),
('freeze', 'features previously frozen by capa'),
]
format_help = ', '.join(['%s: %s' % (f[0], f[1]) for f in formats])
parser = argparse.ArgumentParser(description='detect capabilities in programs.')
parser.add_argument('sample', type=str,
help='Path to sample to analyze')
parser.add_argument('-r', '--rules', type=str, default='(embedded rules)',
help='Path to rule file or directory, use embedded rules by default')
parser.add_argument('-t', '--tag', type=str,
help='Filter on rule meta field values')
parser.add_argument('-v', '--verbose', action='store_true',
help='Enable verbose output')
parser.add_argument('-vv', '--vverbose', action='store_true',
help='Enable very verbose output')
parser.add_argument('-q', '--quiet', action='store_true',
help='Disable all output but errors')
parser.add_argument('-f', '--format', choices=[f[0] for f in formats], default='auto',
help='Select sample format, %s' % format_help)
args = parser.parse_args(args=argv)
if args.quiet:
logging.basicConfig(level=logging.ERROR)
logging.getLogger().setLevel(logging.ERROR)
elif args.verbose:
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
else:
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
# disable vivisect-related logging, it's verbose and not relevant for capa users
set_vivisect_log_level(logging.CRITICAL)
# py2 doesn't know about cp65001, which is a variant of utf-8 on windows
# tqdm bails when trying to render the progress bar in this setup.
# because cp65001 is utf-8, we just map that codepage to the utf-8 codec.
# see #380 and: https://stackoverflow.com/a/3259271/87207
import codecs
codecs.register(lambda name: codecs.lookup('utf-8') if name == 'cp65001' else None)
if args.rules == '(embedded rules)':
logger.info('-' * 80)
logger.info(' Using default embedded rules.')
logger.info(' To provide your own rules, use the form `capa.exe ./path/to/rules/ /path/to/mal.exe`.')
logger.info(' You can see the current default rule set here:')
logger.info(' https://github.com/fireeye/capa-rules')
logger.info('-' * 80)
if hasattr(sys, 'frozen') and hasattr(sys, '_MEIPASS'):
logger.debug('detected running under PyInstaller')
args.rules = os.path.join(sys._MEIPASS, 'rules')
logger.debug('default rule path (PyInstaller method): %s', args.rules)
else:
logger.debug('detected running from source')
args.rules = os.path.join(os.path.dirname(__file__), '..', 'rules')
logger.debug('default rule path (source method): %s', args.rules)
else:
logger.info('using rules path: %s', args.rules)
try:
rules = get_rules(args.rules)
rules = capa.rules.RuleSet(rules)
logger.info('successfully loaded %s rules', len(rules))
if args.tag:
rules = rules.filter_rules_by_meta(args.tag)
logger.info('selected %s rules', len(rules))
except (IOError, capa.rules.InvalidRule, capa.rules.InvalidRuleSet) as e:
logger.error('%s', str(e))
return -1
with open(args.sample, 'rb') as f:
taste = f.read(8)
if ((args.format == 'freeze')
or (args.format == 'auto' and capa.features.freeze.is_freeze(taste))):
with open(args.sample, 'rb') as f:
extractor = capa.features.freeze.load(f.read())
else:
try:
extractor = get_extractor(args.sample, args.format)
except UnsupportedFormatError:
logger.error("-" * 80)
logger.error(" Input file does not appear to be a PE file.")
logger.error(" ")
            logger.error(" Today, capa only supports analyzing PE files (or shellcode, when using --format sc32|sc64).")
logger.error(" If you don't know the input file type, you can try using the `file` utility to guess it.")
logger.error("-" * 80)
return -1
except UnsupportedRuntimeError:
logger.error("-" * 80)
logger.error(" Unsupported runtime or Python interpreter.")
logger.error(" ")
logger.error(" Today, capa supports running under Python 2.7 using Vivisect for binary analysis.")
logger.error(" It can also run within IDA Pro, using either Python 2.7 or 3.5+.")
logger.error(" ")
logger.error(" If you're seeing this message on the command line, please ensure you're running Python 2.7.")
logger.error("-" * 80)
return -1
capabilities = find_capabilities(rules, extractor)
if appears_rule_cat(rules, capabilities, 'other-features/installer/'):
logger.warning("-" * 80)
logger.warning(" This sample appears to be an installer.")
logger.warning(" ")
logger.warning(" capa cannot handle installers well. This means the results may be misleading or incomplete.")
logger.warning(" You should try to understand the install mechanism and analyze created files with capa.")
logger.warning(" ")
logger.warning(" Use -v or -vv if you really want to see the capabilities identified by capa.")
logger.warning("-" * 80)
# capa will likely detect installer specific functionality.
# this is probably not what the user wants.
#
# do show the output in verbose mode, though.
if not (args.verbose or args.vverbose):
return -1
if appears_rule_cat(rules, capabilities, 'other-features/compiled-to-dot-net'):
logger.warning("-" * 80)
logger.warning(" This sample appears to be a .NET module.")
logger.warning(" ")
logger.warning(" .NET is a cross-platform framework for running managed applications.")
logger.warning(
" Today, capa cannot handle non-native files. This means that the results may be misleading or incomplete.")
logger.warning(" You may have to analyze the file manually, using a tool like the .NET decompiler dnSpy.")
logger.warning(" ")
logger.warning(" Use -v or -vv if you really want to see the capabilities identified by capa.")
logger.warning("-" * 80)
# capa won't detect much in .NET samples.
# it might match some file-level things.
# for consistency, bail on things that we don't support.
#
# do show the output in verbose mode, though.
if not (args.verbose or args.vverbose):
return -1
if appears_rule_cat(rules, capabilities, 'other-features/compiled-with-autoit'):
logger.warning("-" * 80)
logger.warning(" This sample appears to be compiled with AutoIt.")
logger.warning(" ")
logger.warning(" AutoIt is a freeware BASIC-like scripting language designed for automating the Windows GUI.")
logger.warning(
" Today, capa cannot handle AutoIt scripts. This means that the results will be misleading or incomplete.")
logger.warning(" You may have to analyze the file manually, using a tool like the AutoIt decompiler MyAut2Exe.")
logger.warning(" ")
logger.warning(" Use -v or -vv if you really want to see the capabilities identified by capa.")
logger.warning("-" * 80)
# capa will detect dozens of capabilities for AutoIt samples,
# but these are due to the AutoIt runtime, not the payload script.
# so, don't confuse the user with FP matches - bail instead
#
# do show the output in verbose mode, though.
if not (args.verbose or args.vverbose):
return -1
if appears_rule_cat(rules, capabilities, 'anti-analysis/packing/'):
logger.warning("-" * 80)
logger.warning(" This sample appears packed.")
logger.warning(" ")
logger.warning(" Packed samples have often been obfuscated to hide their logic.")
logger.warning(" capa cannot handle obfuscation well. This means the results may be misleading or incomplete.")
logger.warning(" If possible, you should try to unpack this input file before analyzing it with capa.")
logger.warning("-" * 80)
if args.vverbose:
render_capabilities_vverbose(capabilities)
elif args.verbose:
render_capabilities_verbose(capabilities)
else:
render_capabilities_default(rules, capabilities)
logger.info('done.')
return 0
def ida_main():
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
logger.info('-' * 80)
logger.info(' Using default embedded rules.')
logger.info(' ')
logger.info(' You can see the current default rule set here:')
logger.info(' https://github.com/fireeye/capa-rules')
logger.info('-' * 80)
if hasattr(sys, 'frozen') and hasattr(sys, '_MEIPASS'):
logger.debug('detected running under PyInstaller')
rules_path = os.path.join(sys._MEIPASS, 'rules')
logger.debug('default rule path (PyInstaller method): %s', rules_path)
else:
logger.debug('detected running from source')
rules_path = os.path.join(os.path.dirname(__file__), '..', 'rules')
logger.debug('default rule path (source method): %s', rules_path)
rules = get_rules(rules_path)
import capa.rules
rules = capa.rules.RuleSet(rules)
import capa.features.extractors.ida
capabilities = find_capabilities(rules, capa.features.extractors.ida.IdaFeatureExtractor())
render_capabilities_default(rules, capabilities)
def is_runtime_ida():
try:
import idc
except ImportError:
return False
else:
return True
if __name__ == "__main__":
if is_runtime_ida():
ida_main()
else:
sys.exit(main())

669
capa/rules.py Normal file

@@ -0,0 +1,669 @@
import re
import yaml
import uuid
import codecs
import logging
import binascii
import capa.engine
from capa.engine import *
import capa.features
import capa.features.file
import capa.features.function
import capa.features.basicblock
import capa.features.insn
from capa.features import MAX_BYTES_FEATURE_SIZE
logger = logging.getLogger(__name__)
FILE_SCOPE = 'file'
FUNCTION_SCOPE = 'function'
BASIC_BLOCK_SCOPE = 'basic block'
SUPPORTED_FEATURES = {
FILE_SCOPE: set([
capa.engine.Element,
capa.features.MatchedRule,
capa.features.file.Export,
capa.features.file.Import,
capa.features.file.Section,
capa.features.Characteristic('embedded pe'),
capa.features.String,
]),
FUNCTION_SCOPE: set([
capa.engine.Element,
capa.features.MatchedRule,
capa.features.insn.API,
capa.features.insn.Number,
capa.features.String,
capa.features.Bytes,
capa.features.insn.Offset,
capa.features.insn.Mnemonic,
capa.features.basicblock.BasicBlock,
capa.features.Characteristic('switch'),
capa.features.Characteristic('nzxor'),
capa.features.Characteristic('peb access'),
capa.features.Characteristic('fs access'),
capa.features.Characteristic('gs access'),
capa.features.Characteristic('cross section flow'),
capa.features.Characteristic('stack string'),
capa.features.Characteristic('calls from'),
capa.features.Characteristic('calls to'),
capa.features.Characteristic('indirect call'),
capa.features.Characteristic('loop'),
capa.features.Characteristic('recursive call')
]),
BASIC_BLOCK_SCOPE: set([
capa.engine.Element,
capa.features.MatchedRule,
capa.features.insn.API,
capa.features.insn.Number,
capa.features.String,
capa.features.Bytes,
capa.features.insn.Offset,
capa.features.insn.Mnemonic,
capa.features.Characteristic('nzxor'),
capa.features.Characteristic('peb access'),
capa.features.Characteristic('fs access'),
capa.features.Characteristic('gs access'),
capa.features.Characteristic('cross section flow'),
capa.features.Characteristic('tight loop'),
capa.features.Characteristic('stack string'),
capa.features.Characteristic('indirect call')
]),
}
class InvalidRule(ValueError):
def __init__(self, msg):
super(InvalidRule, self).__init__()
self.msg = msg
def __str__(self):
return 'invalid rule: %s' % (self.msg)
def __repr__(self):
return str(self)
class InvalidRuleWithPath(InvalidRule):
def __init__(self, path, msg):
super(InvalidRuleWithPath, self).__init__(msg)
self.path = path
self.msg = msg
self.__cause__ = None
def __str__(self):
return 'invalid rule: %s: %s' % (self.path, self.msg)
class InvalidRuleSet(ValueError):
def __init__(self, msg):
super(InvalidRuleSet, self).__init__()
self.msg = msg
def __str__(self):
return 'invalid rule set: %s' % (self.msg)
def __repr__(self):
return str(self)
def ensure_feature_valid_for_scope(scope, feature):
if isinstance(feature, capa.features.Characteristic):
if capa.features.Characteristic(feature.name) not in SUPPORTED_FEATURES[scope]:
            raise InvalidRule('feature %s not supported for scope %s' % (feature, scope))
    elif not isinstance(feature, tuple(filter(lambda t: isinstance(t, type), SUPPORTED_FEATURES[scope]))):
        raise InvalidRule('feature %s not supported for scope %s' % (feature, scope))
def parse_int(s):
if s.startswith('0x'):
return int(s, 0x10)
else:
return int(s, 10)
def parse_range(s):
'''
parse a string "(0, 1)" into a range (min, max).
min and/or max may by None to indicate an unbound range.
'''
# we want to use `{` characters, but this is a dict in yaml.
if not s.startswith('('):
raise InvalidRule('invalid range: %s' % (s))
if not s.endswith(')'):
raise InvalidRule('invalid range: %s' % (s))
s = s[len('('):-len(')')]
min, _, max = s.partition(',')
min = min.strip()
max = max.strip()
if min:
min = parse_int(min.strip())
if min < 0:
raise InvalidRule('range min less than zero')
else:
min = None
if max:
max = parse_int(max.strip())
if max < 0:
raise InvalidRule('range max less than zero')
else:
max = None
if min is not None and max is not None:
if max < min:
raise InvalidRule('range max less than min')
return min, max
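The accepted range grammar can be sketched standalone (a simplified reimplementation for illustration, not the capa API; it raises `ValueError` instead of `InvalidRule`):

```python
def parse_range(s):
    # parse "(min, max)" into a (min, max) tuple;
    # either bound may be omitted to leave the range unbounded
    if not (s.startswith('(') and s.endswith(')')):
        raise ValueError('invalid range: %s' % s)
    min_s, _, max_s = s[1:-1].partition(',')
    # base 0 lets int() accept both decimal and 0x-prefixed hex
    lo = int(min_s.strip(), 0) if min_s.strip() else None
    hi = int(max_s.strip(), 0) if max_s.strip() else None
    if lo is not None and hi is not None and hi < lo:
        raise ValueError('range max less than min')
    return lo, hi
```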
def parse_feature(key):
# keep this in sync with supported features
if key == 'api':
return capa.features.insn.API
elif key == 'string':
return capa.features.String
elif key == 'bytes':
return capa.features.Bytes
elif key == 'number':
return capa.features.insn.Number
elif key == 'offset':
return capa.features.insn.Offset
elif key == 'mnemonic':
return capa.features.insn.Mnemonic
elif key == 'basic blocks':
return capa.features.basicblock.BasicBlock
elif key == 'element':
return Element
elif key.startswith('characteristic(') and key.endswith(')'):
characteristic = key[len('characteristic('):-len(')')]
return lambda v: capa.features.Characteristic(characteristic, v)
elif key == 'export':
return capa.features.file.Export
elif key == 'import':
return capa.features.file.Import
elif key == 'section':
return capa.features.file.Section
elif key == 'match':
return capa.features.MatchedRule
else:
raise InvalidRule('unexpected statement: %s' % key)
def parse_symbol(s, value_type):
'''
s can be an int or a string
'''
if isinstance(s, str) and '=' in s:
value, symbol = s.split('=', 1)
symbol = symbol.strip()
if symbol == '':
raise InvalidRule('unexpected value: "%s", symbol name cannot be empty' % s)
else:
value = s
symbol = None
if isinstance(value, str):
if value_type == 'bytes':
try:
value = codecs.decode(value.replace(' ', ''), 'hex')
# TODO: Remove TypeError when Python2 is not used anymore
except (TypeError, binascii.Error):
raise InvalidRule('unexpected bytes value: "%s", must be a valid hex sequence' % value)
if len(value) > MAX_BYTES_FEATURE_SIZE:
raise InvalidRule('unexpected bytes value: byte sequences must be no larger than %s bytes' %
MAX_BYTES_FEATURE_SIZE)
else:
try:
value = parse_int(value)
except ValueError:
raise InvalidRule('unexpected value: "%s", must begin with numerical value' % value)
return value, symbol
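The `value = symbol` syntax handled above (e.g. `0x4550 = IMAGE_DOS_SIGNATURE` in a `number:` feature) can be sketched standalone for the integer case — a simplified illustration, not the capa API:

```python
def parse_symbol(s):
    # split "0x4550 = IMAGE_DOS_SIGNATURE" into (0x4550, 'IMAGE_DOS_SIGNATURE');
    # plain values yield (value, None)
    if isinstance(s, str) and '=' in s:
        value, _, symbol = s.partition('=')
        symbol = symbol.strip()
        if not symbol:
            raise ValueError('symbol name cannot be empty: %s' % s)
        return int(value.strip(), 0), symbol
    return (int(s, 0), None) if isinstance(s, str) else (s, None)
```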
def build_statements(d, scope):
if len(d.keys()) != 1:
raise InvalidRule('too many statements')
key = list(d.keys())[0]
if key == 'and':
return And(*[build_statements(dd, scope) for dd in d[key]])
elif key == 'or':
return Or(*[build_statements(dd, scope) for dd in d[key]])
elif key == 'not':
if len(d[key]) != 1:
raise InvalidRule('not statement must have exactly one child statement')
return Not(*[build_statements(dd, scope) for dd in d[key]])
elif key.endswith(' or more'):
        count = int(key[:-len(' or more')])
return Some(count, *[build_statements(dd, scope) for dd in d[key]])
elif key == 'optional':
# `optional` is an alias for `0 or more`
# which is useful for documenting behaviors,
# like with `write file`, we might say that `WriteFile` is optionally found alongside `CreateFileA`.
return Some(0, *[build_statements(dd, scope) for dd in d[key]])
elif key == 'function':
if scope != FILE_SCOPE:
raise InvalidRule('function subscope supported only for file scope')
if len(d[key]) != 1:
raise InvalidRule('subscope must have exactly one child statement')
return Subscope(FUNCTION_SCOPE, *[build_statements(dd, FUNCTION_SCOPE) for dd in d[key]])
elif key == 'basic block':
if scope != FUNCTION_SCOPE:
raise InvalidRule('basic block subscope supported only for function scope')
if len(d[key]) != 1:
raise InvalidRule('subscope must have exactly one child statement')
return Subscope(BASIC_BLOCK_SCOPE, *[build_statements(dd, BASIC_BLOCK_SCOPE) for dd in d[key]])
elif key.startswith('count(') and key.endswith(')'):
# e.g.:
#
# count(basic block)
# count(mnemonic(mov))
# count(characteristic(nzxor))
term = key[len('count('):-len(')')]
if term.startswith('characteristic('):
# characteristic features are specified a bit specially:
# they simply indicate the presence of something unusual/interesting,
# and we embed the name in the feature name, like `characteristic(nzxor)`.
#
# when we're dealing with counts, like `count(characteristic(nzxor))`,
# we can simply extract the feature and assume we're looking for `True` values.
Feature = parse_feature(term)
feature = Feature(True)
ensure_feature_valid_for_scope(scope, feature)
else:
# however, for remaining counted features, like `count(mnemonic(mov))`,
# we have to jump through hoops.
#
            # when looking for the existence of such a feature, our rule might look like:
# - mnemonic: mov
#
# but here we deal with the form: `mnemonic(mov)`.
term, _, arg = term.partition('(')
Feature = parse_feature(term)
if arg:
arg = arg[:-len(')')]
# can't rely on yaml parsing ints embedded within strings
# like:
#
# count(offset(0xC))
# count(number(0x11223344))
# count(number(0x100 = symbol name))
if term in ('number', 'offset', 'bytes'):
value, symbol = parse_symbol(arg, term)
feature = Feature(value, symbol)
                elif term == 'element':
arg = parse_int(arg)
feature = Feature(arg)
else:
# arg is string, like:
#
# count(mnemonic(mov))
# count(string(error))
# TODO: what about embedded newlines?
feature = Feature(arg)
else:
feature = Feature()
ensure_feature_valid_for_scope(scope, feature)
count = d[key]
if isinstance(count, int):
return Range(feature, min=count, max=count)
elif count.endswith(' or more'):
min = parse_int(count[:-len(' or more')])
max = None
return Range(feature, min=min, max=max)
elif count.endswith(' or fewer'):
min = None
max = parse_int(count[:-len(' or fewer')])
return Range(feature, min=min, max=max)
elif count.startswith('('):
min, max = parse_range(count)
return Range(feature, min=min, max=max)
else:
raise InvalidRule('unexpected range: %s' % (count))
elif key == 'string' and d[key].startswith('/') and (d[key].endswith('/') or d[key].endswith('/i')):
try:
return Regex(d[key])
except re.error:
if d[key].endswith('/i'):
d[key] = d[key][:-len('i')]
            raise InvalidRule('invalid regular expression: %s, it should use Python syntax; try it at https://pythex.org' % d[key])
else:
Feature = parse_feature(key)
if key in ('number', 'offset', 'bytes'):
# parse numbers with symbol description, e.g. 0x4550 = IMAGE_DOS_SIGNATURE
# or regular numbers, e.g. 37
value, symbol = parse_symbol(d[key], key)
feature = Feature(value, symbol)
else:
feature = Feature(d[key])
ensure_feature_valid_for_scope(scope, feature)
return feature
def first(s):
return s[0]
def second(s):
return s[1]
class Rule(object):
def __init__(self, name, scope, statement, meta, definition=''):
super(Rule, self).__init__()
self.name = name
self.scope = scope
self.statement = statement
self.meta = meta
self.definition = definition
def __str__(self):
return 'Rule(name=%s)' % (self.name)
def __repr__(self):
return 'Rule(scope=%s, name=%s)' % (self.scope, self.name)
def get_dependencies(self):
'''
fetch the names of rules this rule relies upon.
these are only the direct dependencies; a user must
        compute the transitive dependency graph themselves, if they want it.
Returns:
List[str]: names of rules upon which this rule depends.
'''
deps = set([])
def rec(statement):
if isinstance(statement, capa.features.MatchedRule):
deps.add(statement.rule_name)
elif isinstance(statement, Statement):
for child in statement.get_children():
rec(child)
# else: might be a Feature, etc.
# which we don't care about here.
rec(self.statement)
return deps
def _extract_subscope_rules_rec(self, statement):
if isinstance(statement, Statement):
# for each child that is a subscope,
for subscope in filter(lambda statement: isinstance(statement, capa.engine.Subscope), statement.get_children()):
# create a new rule from it.
# the name is a randomly generated, hopefully unique value.
                # ideally, this won't ever be rendered to a user.
name = self.name + '/' + uuid.uuid4().hex
new_rule = Rule(name, subscope.scope, subscope.child, {
'name': name,
'scope': subscope.scope,
# these derived rules are never meant to be inspected separately,
# they are dependencies for the parent rule,
# so mark it as such.
'lib': True,
# metadata that indicates this is derived from a subscope statement
'capa/subscope-rule': True,
                # metadata that links the child rule to the parent rule
'capa/parent': self.name,
})
# update the existing statement to `match` the new rule
new_node = capa.features.MatchedRule(name)
statement.replace_child(subscope, new_node)
# and yield the new rule to our caller
yield new_rule
# now recurse to other nodes in the logic tree.
# note: we cannot recurse into the subscope sub-tree,
            # because it's been replaced by a `match` statement.
for child in statement.get_children():
for new_rule in self._extract_subscope_rules_rec(child):
yield new_rule
def extract_subscope_rules(self):
'''
scan through the statements of this rule,
replacing subscope statements with `match` references to a newly created rule,
which are yielded from this routine.
note: this mutates the current rule.
example::
for derived_rule in rule.extract_subscope_rules():
assert derived_rule.meta['capa/parent'] == rule.name
'''
# recurse through statements
# when encounter Subscope statement
# create new transient rule
# copy logic into the new rule
# replace old node with reference to new rule
# yield new rule
for new_rule in self._extract_subscope_rules_rec(self.statement):
yield new_rule
def evaluate(self, features):
return self.statement.evaluate(features)
@classmethod
def from_dict(cls, d, s):
name = d['rule']['meta']['name']
# if scope is not specified, default to function scope.
# this is probably the mode that rule authors will start with.
scope = d['rule']['meta'].get('scope', FUNCTION_SCOPE)
statements = d['rule']['features']
# the rule must start with a single logic node.
# doing anything else is too implicit and difficult to remove (AND vs OR ???).
if len(statements) != 1:
raise InvalidRule('rule must begin with a single top level statement')
if isinstance(statements[0], capa.engine.Subscope):
raise InvalidRule('top level statement may not be a subscope')
return cls(
name,
scope,
build_statements(statements[0], scope),
d['rule']['meta'],
s
)
@classmethod
def from_yaml(cls, s):
return cls.from_dict(yaml.safe_load(s), s)
@classmethod
def from_yaml_file(cls, path):
with open(path, 'rb') as f:
try:
return cls.from_yaml(f.read().decode('utf-8'))
except InvalidRule as e:
raise InvalidRuleWithPath(path, str(e))
def get_rules_with_scope(rules, scope):
'''
from the given collection of rules, select those with the given scope.
args:
rules (List[capa.rules.Rule]):
scope (str): one of the capa.rules.*_SCOPE constants.
returns:
List[capa.rules.Rule]:
'''
return list(rule for rule in rules if rule.scope == scope)
def get_rules_and_dependencies(rules, rule_name):
'''
from the given collection of rules, select a rule and its dependencies (transitively).
args:
rules (List[Rule]):
rule_name (str):
yields:
Rule:
'''
rules = {rule.name: rule for rule in rules}
wanted = set([rule_name])
def rec(rule):
wanted.add(rule.name)
for dep in rule.get_dependencies():
rec(rules[dep])
rec(rules[rule_name])
for rule in rules.values():
if rule.name in wanted:
yield rule
def ensure_rules_are_unique(rules):
seen = set([])
for rule in rules:
if rule.name in seen:
raise InvalidRule('duplicate rule name: ' + rule.name)
seen.add(rule.name)
def ensure_rule_dependencies_are_met(rules):
'''
raise an exception if a rule dependency does not exist.
raises:
InvalidRule: if a dependency is not met.
'''
rules = {rule.name: rule for rule in rules}
for rule in rules.values():
for dep in rule.get_dependencies():
if dep not in rules:
raise InvalidRule('rule "%s" depends on missing rule "%s"' % (rule.name, dep))
class RuleSet(object):
'''
a ruleset is initialized with a collection of rules, which it verifies and sorts into scopes.
each set of scoped rules is sorted topologically, which enables rules to match on past rule matches.
example:
ruleset = RuleSet([
Rule(...),
Rule(...),
...
])
capa.engine.match(ruleset.file_rules, ...)
'''
def __init__(self, rules):
super(RuleSet, self).__init__()
ensure_rules_are_unique(rules)
rules = self._extract_subscope_rules(rules)
ensure_rule_dependencies_are_met(rules)
if len(rules) == 0:
raise InvalidRuleSet('no rules selected')
self.file_rules = self._get_rules_for_scope(rules, FILE_SCOPE)
self.function_rules = self._get_rules_for_scope(rules, FUNCTION_SCOPE)
self.basic_block_rules = self._get_rules_for_scope(rules, BASIC_BLOCK_SCOPE)
self.rules = {rule.name: rule for rule in rules}
def __len__(self):
return len(self.rules)
@staticmethod
def _get_rules_for_scope(rules, scope):
'''
given a collection of rules, collect the rules that are needed at the given scope.
these rules are ordered topologically.
don't include "lib" rules, unless they are dependencies of other rules.
'''
scope_rules = set([])
# we need to process all rules, not just rules with the given scope.
# this is because rules with a higher scope, e.g. file scope, may have subscope rules
# at lower scope, e.g. function scope.
# so, we find all dependencies of all rules, and later will filter them down.
for rule in rules:
if rule.meta.get('lib', False):
continue
scope_rules.update(get_rules_and_dependencies(rules, rule.name))
return get_rules_with_scope(capa.engine.topologically_order_rules(scope_rules), scope)
@staticmethod
def _extract_subscope_rules(rules):
'''
process the given sequence of rules.
for each one, extract any embedded subscope rules into their own rule.
process these recursively.
then return a list of the refactored rules.
note: this operation mutates the rules passed in - they may now have `match` statements
for the extracted subscope rules.
'''
done = []
# use a queue of rules, because we'll be modifying the list (appending new items) as we go.
while rules:
rule = rules.pop(0)
for subscope_rule in rule.extract_subscope_rules():
rules.append(subscope_rule)
done.append(rule)
return done
def filter_rules_by_meta(self, tag):
        '''
        return a new rule set with rules filtered on their meta field values, adding all dependency rules.
        applies the tag-based rule filter, assuming that all required rules are loaded.
        this can be used to select specific rules, versus providing a rules child directory
        where capa cannot resolve dependencies from unknown paths.
        TODO: handle circular dependencies?
        TODO: support -t=metafield <k>
        '''
rules = self.rules.values()
rules_filtered = set([])
for rule in rules:
for k, v in rule.meta.items():
if isinstance(v, str) and tag in v:
logger.debug('using rule "%s" and dependencies, found tag in meta.%s: %s', rule.name, k, v)
rules_filtered.update(set(capa.rules.get_rules_and_dependencies(rules, rule.name)))
break
return RuleSet(list(rules_filtered))
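`RuleSet` sorts each scope's rules topologically so that `match` statements can see earlier results. The idea can be sketched with a depth-first post-order walk over a dependency map — a simplified stand-in for `capa.engine.topologically_order_rules`, which operates on Rule objects rather than names:

```python
def topologically_order_rules(deps):
    # deps: dict mapping rule name -> names of rules it matches on.
    # emit each rule only after all of its dependencies (post-order DFS).
    seen, order = set(), []

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in deps.get(name, ()):
            visit(dep)
        order.append(name)

    for name in deps:
        visit(name)
    return order
```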

2
capa/version.py Normal file

@@ -0,0 +1,2 @@
__version__ = '0.0.0'
__commit__ = '00000000'

BIN
ci/logo.ico Normal file


BIN
ci/logo.png Normal file


8
ci/tox.ini Normal file

@@ -0,0 +1,8 @@
[pycodestyle]
; E402: module level import not at top of file
; W503: line break before binary operator
ignore = E402,W503
max-line-length = 160
statistics = True
count = True
exclude = .*

BIN
doc/capa_explorer.png Normal file


44
doc/installation.md Normal file

@@ -0,0 +1,44 @@
# Installation
You can install capa in a few different ways. First, if you simply want to use capa, just download the [standalone binary](https://github.com/fireeye/capa/releases). If you want to use capa as a Python library, you can install the package directly from GitHub using `pip`. If you'd like to contribute patches or features to capa, you can work with a local copy of the source code.
## Method 1: Standalone installation
If you simply want to use capa, use the standalone binaries we host on GitHub: https://github.com/fireeye/capa/releases. These binary executable files contain all the source code, Python interpreter, and associated resources needed to make capa run. This means you can run it without any installation! Just invoke the file using your terminal shell to see the help documentation.
We used PyInstaller to create these packages.
## Method 2: Using capa as a Python library
To install capa as a Python library, you'll need to install a few dependencies, and then use `pip` to fetch the capa module.
### 1. Install requirements
First, install the requirements.
`$ pip install https://github.com/williballenthin/vivisect/zipball/master`
### 2. Install capa module
Second, use `pip` to install the capa module to your local Python environment. This fetches the library code to your computer, but does not keep editable source files around for you to hack on. If you'd like to edit the source files, see below.
`$ pip install https://github.com/fireeye/capa/archive/master.zip`
### 3. Use capa
You can now import the `capa` module from a Python script or use the IDA Pro plugins from the `capa/ida` directory. For more information please see the [usage](usage.md) documentation.
## Method 3: Inspecting the capa source code
If you'd like to review and modify the capa source code, you'll need to check it out from Github and install it locally. By following these instructions, you'll maintain a local directory of source code that you can modify and run easily.
### 1. Install requirements
First, install the requirements.
`$ pip install https://github.com/williballenthin/vivisect/zipball/master`
### 2. Check out source code
Next, clone the capa git repository.
#### SSH
`$ git clone git@github.com:fireeye/capa.git /local/path/to/src`
#### HTTPS
`$ git clone https://github.com/fireeye/capa.git /local/path/to/src`
### 3. Install the local source code
Next, use `pip` to install the source code in "editable" mode. This means that Python will load the capa module from this local directory rather than copying it to `site-packages` or `dist-packages`. This is good because it is easy to modify files and see the effects reflected immediately. But be careful not to remove this directory unless you are uninstalling capa.
`$ pip install -e /local/path/to/src`
You'll find that the `capa.exe` (Windows) or `capa` (Linux) executables in your path now invoke the capa binary from this directory.

61
doc/limitations.md Normal file
View File

@@ -0,0 +1,61 @@
# Packers
Packed programs have often been obfuscated to hide their logic. Since capa cannot handle obfuscation well, results may be misleading or incomplete. If possible, users should unpack input files before analyzing them with capa.
If capa's rules detect that a program may be packed, it warns the user.
# Installers, run-time programs, etc.
capa cannot handle installers, run-time programs like .NET applications, or other packaged applications like AutoIt well. This means that the results may be misleading or incomplete.
If capa detects an installer, run-time program, etc. it warns the user.
# Wrapper functions and matches in child functions
Currently capa does not handle wrapper functions or other matches in child functions.
Consider this example call tree where `f1` calls a wrapper function `f2` and the `CreateProcess` API. `f2` writes to a file.
```
f1
f2 (WriteFile wrapper)
CreateFile
WriteFile
CreateProcess
```
Here capa does not match a rule combining file creation and process execution on function `f1`, because the file-writing APIs are only referenced in the child function `f2`.
Software often contains such nested calls because programmers wrap API calls in helper functions or because specific compilers or languages, such as Go, layer calls.
While a feature to capture nested functionality is desirable it introduces various issues and complications. These include:
- how to assign matches from child to parent functions?
- a potential significant increase in analysis requirements and rule matching complexity
Moreover, we require more real-world samples to see how prevalent this really is and how much it would improve capa's results.
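As a rough illustration of the first question, propagating matches from child to parent functions could look like the following toy sketch. This is hypothetical and not how capa works today; the function names and feature strings are made up:

```python
# Toy sketch (not part of capa): let each function inherit the features
# of its (transitive) callees, so a rule requiring both WriteFile and
# CreateProcess could match on f1 even though WriteFile lives in f2.
def propagate_features(call_graph, features):
    """Return per-function feature sets including callee features."""
    def collect(func, seen):
        if func in seen:  # guard against recursive call cycles
            return set()
        seen.add(func)
        result = set(features.get(func, ()))
        for callee in call_graph.get(func, ()):
            result |= collect(callee, seen)
        return result

    return {f: collect(f, set()) for f in features}


# call tree from the example above: f1 calls f2; f2 wraps CreateFile/WriteFile
call_graph = {'f1': ['f2'], 'f2': []}
features = {
    'f1': {'api(CreateProcess)'},
    'f2': {'api(CreateFile)', 'api(WriteFile)'},
}
merged = propagate_features(call_graph, features)
assert 'api(WriteFile)' in merged['f1']  # f1 now "sees" the wrapped call
```

Even this toy version hints at the cost: every parent's feature set grows with its whole call subtree, which is part of the analysis and matching complexity mentioned above.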
# Loop scope
Encryption, encoding, or processing functions often contain loops, and it could be beneficial to capture functionality within loops.
However, tracking all basic blocks that are part of a loop, especially with nested loop constructs, is not trivial.
As a compromise, capa provides the `characteristic(loop)` feature to filter on functions that contain a loop.
We need more practical use cases and test samples to justify the additional workload to implement a full loop scope feature.
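For intuition, a `characteristic(loop)`-style check can be sketched as back-edge detection over a function's control flow graph. This is a toy model, not capa's actual implementation:

```python
# Toy sketch (not capa's implementation): a function "contains a loop"
# if its control flow graph has a back edge, found via iterative DFS.
def has_loop(cfg, entry):
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on DFS stack / finished
    color = {bb: WHITE for bb in cfg}
    stack = [(entry, iter(cfg[entry]))]
    color[entry] = GRAY
    while stack:
        node, succs = stack[-1]
        for succ in succs:
            if color[succ] == GRAY:  # edge back into the DFS stack: loop
                return True
            if color[succ] == WHITE:
                color[succ] = GRAY
                stack.append((succ, iter(cfg[succ])))
                break
        else:
            color[node] = BLACK
            stack.pop()
    return False


# basic blocks as nodes, branch targets as edges; c -> b forms a loop
looping = {'a': ['b'], 'b': ['c'], 'c': ['b', 'd'], 'd': []}
straight = {'a': ['b'], 'b': ['c'], 'c': []}
assert has_loop(looping, 'a') is True
assert has_loop(straight, 'a') is False
```

Note that this only answers "does a loop exist"; identifying exactly which blocks belong to which (possibly nested) loop is the harder problem a full loop scope would require.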
# ATT&CK, MAEC, MBC, and other capability tagging
capa uses a custom category-tagging scheme that labels capabilities with objective, behavior, and technique (see https://github.com/fireeye/capa#meta-block).
The category tagging is loosely based on the ELWUN/Nucleus capability tags.
While exploring other tagging mechanisms we discovered the following shortcomings:
- ATT&CK: does not cover all the capabilities we are trying to express and is intended for a different purpose (general adversary tactics and techniques)
- MAEC: the ELWUN tags are related to the MAEC format, but express capabilities more appropriately for us
- MBC: this has the right scope, but is a rather new project; if there is more support and demand in the community for this schema, further work in this direction could be promising
Adding tags from a new schema to the existing rules is a cumbersome process. We will hold off on amending rules until we have identified an appropriate schema.
Additionally, if we choose to support a public standard, we would like to provide expertise back to the community.

26
doc/usage.md Normal file
View File

@@ -0,0 +1,26 @@
# Usage
## Command line
After you have downloaded the standalone version of capa or installed it via `pip` (see the [installation](installation.md) documentation) you can run capa directly from your terminal shell.
- `$ capa -h`
- `$ capa malware.exe`
In this mode, capa relies on vivisect, which only runs under Python 2.
## IDA Pro
capa runs from within IDA Pro. Run `capa/main.py` via File - Script file... (ALT + F7).
When running in IDA, capa uses IDA's disassembly and file analysis as its backend. These results may vary from the standalone version that uses vivisect.
In IDA, capa supports Python 2 and Python 3. If you encounter issues with your specific setup please open a new [Issue](https://github.com/fireeye/capa/issues).
## IDA plugins
capa comes with two IDA Pro plugins located in the `capa/ida` directory.
### capa explorer
The capa explorer allows you to interactively display and browse capabilities capa identified in a binary.
![capa explorer](capa_explorer.png)
### Rule generator
The rule generator helps you to easily write new rules based on the function you are currently analyzing in your IDA disassembly view.

403
scripts/lint.py Normal file
View File

@@ -0,0 +1,403 @@
'''
Check the given capa rules for style issues.
Usage:
$ python scripts/lint.py rules/
'''
import os
import sys
import string
import hashlib
import logging
import os.path
import itertools
import argparse
import capa.main
import capa.rules
import capa.engine
import capa.features
logger = logging.getLogger('capa.lint')
class Lint(object):
name = 'lint'
recommendation = ''
def check_rule(self, ctx, rule):
return False
class NameCasing(Lint):
name = 'rule name casing'
recommendation = 'Rename rule to start with a lowercase letter'
def check_rule(self, ctx, rule):
return (rule.name[0] in string.ascii_uppercase and
rule.name[1] not in string.ascii_uppercase)
class MissingRuleCategory(Lint):
name = 'missing rule category'
recommendation = 'Add meta.rule-category so that the rule is emitted correctly'
def check_rule(self, ctx, rule):
return ('rule-category' not in rule.meta and
'maec/malware-category' not in rule.meta and
'lib' not in rule.meta)
class MissingScope(Lint):
name = 'missing scope'
recommendation = 'Add meta.scope so that the scope is explicit (defaults to `function`)'
def check_rule(self, ctx, rule):
return 'scope' not in rule.meta
class InvalidScope(Lint):
name = 'invalid scope'
recommendation = 'Use only file, function, or basic block rule scopes'
def check_rule(self, ctx, rule):
return rule.meta.get('scope') not in ('file', 'function', 'basic block')
class MissingAuthor(Lint):
name = 'missing author'
recommendation = 'Add meta.author so that users know who to contact with questions'
def check_rule(self, ctx, rule):
return 'author' not in rule.meta
class MissingExamples(Lint):
name = 'missing examples'
recommendation = 'Add meta.examples so that the rule can be tested and verified'
def check_rule(self, ctx, rule):
return ('examples' not in rule.meta or
not isinstance(rule.meta['examples'], list) or
len(rule.meta['examples']) == 0 or
rule.meta['examples'] == [None])
class MissingExampleOffset(Lint):
name = 'missing example offset'
recommendation = 'Add offset of example function'
def check_rule(self, ctx, rule):
if rule.meta.get('scope') in ('function', 'basic block'):
for example in rule.meta.get('examples', []):
if example and ':' not in example:
logger.debug('example: %s', example)
return True
return False
class ExampleFileDNE(Lint):
name = 'referenced example doesn\'t exist'
recommendation = 'Add the referenced example to samples directory ($capa-root/tests/data or supplied via --samples)'
def check_rule(self, ctx, rule):
if not rule.meta.get('examples'):
# let the MissingExamples lint catch this case, don't double report.
return False
found = False
for example in rule.meta.get('examples', []):
if example:
example_id = example.partition(':')[0]
if example_id in ctx['samples']:
found = True
break
return not found
class DoesntMatchExample(Lint):
name = 'doesn\'t match on referenced example'
recommendation = 'Fix the rule logic or provide a different example'
def check_rule(self, ctx, rule):
if not ctx['is_thorough']:
return False
for example in rule.meta.get('examples', []):
example_id = example.partition(':')[0]
try:
path = ctx['samples'][example_id]
except KeyError:
# lint ExampleFileDNE will catch this.
# don't double report.
continue
try:
extractor = capa.main.get_extractor(path, 'auto')
capabilities = capa.main.find_capabilities(ctx['rules'], extractor, disable_progress=True)
except Exception as e:
logger.error('failed to extract capabilities: %s %s %s', rule.name, path, e)
return True
if rule.name not in capabilities:
return True
return False
class FeatureStringTooShort(Lint):
name = 'feature string too short'
recommendation_template = 'capa only extracts strings with length >= 4; will not match on "{:s}"'
def check_features(self, ctx, features):
for feature in features:
if isinstance(feature, capa.features.String):
if len(feature.value) < 4:
# format a fresh copy so a prior match doesn't leave a stale value
self.recommendation = self.recommendation_template.format(feature.value)
return True
return False
def run_lints(lints, ctx, rule):
for lint in lints:
if lint.check_rule(ctx, rule):
yield lint
def run_feature_lints(lints, ctx, features):
for lint in lints:
if lint.check_features(ctx, features):
yield lint
NAME_LINTS = (
NameCasing(),
)
def lint_name(ctx, rule):
return run_lints(NAME_LINTS, ctx, rule)
SCOPE_LINTS = (
MissingScope(),
InvalidScope(),
)
def lint_scope(ctx, rule):
return run_lints(SCOPE_LINTS, ctx, rule)
META_LINTS = (
MissingRuleCategory(),
MissingAuthor(),
MissingExamples(),
MissingExampleOffset(),
ExampleFileDNE(),
)
def lint_meta(ctx, rule):
return run_lints(META_LINTS, ctx, rule)
FEATURE_LINTS = (
FeatureStringTooShort(),
)
def lint_features(ctx, rule):
features = get_features(ctx, rule)
return run_feature_lints(FEATURE_LINTS, ctx, features)
def get_features(ctx, rule):
# get features from rule and all dependencies including subscopes and matched rules
features = []
deps = [ctx['rules'].rules[dep] for dep in rule.get_dependencies()]
for r in [rule] + deps:
features.extend(get_rule_features(r))
return features
def get_rule_features(rule):
features = []
def rec(statement):
if isinstance(statement, capa.engine.Statement):
for child in statement.get_children():
rec(child)
else:
features.append(statement)
rec(rule.statement)
return features
LOGIC_LINTS = (
DoesntMatchExample(),
)
def lint_logic(ctx, rule):
return run_lints(LOGIC_LINTS, ctx, rule)
def is_nursery_rule(rule):
'''
The nursery is a spot for rules that have not yet been fully polished.
For example, they may not have references to public example of a technique.
Yet, we still want to capture and report on their matches.
'''
return rule.meta.get('nursery')
def lint_rule(ctx, rule):
logger.debug(rule.name)
violations = list(itertools.chain(
lint_name(ctx, rule),
lint_scope(ctx, rule),
lint_meta(ctx, rule),
lint_logic(ctx, rule),
lint_features(ctx, rule),
))
if len(violations) > 0:
category = rule.meta.get('rule-category')
print('')
print('%s%s %s' % (' (nursery) ' if is_nursery_rule(rule) else '',
rule.name,
('(%s)' % category) if category else ''))
level = 'WARN' if is_nursery_rule(rule) else 'FAIL'
for violation in violations:
print('%s %s: %s: %s' % (
' ' if is_nursery_rule(rule) else '', level, violation.name, violation.recommendation))
return len(violations) > 0 and not is_nursery_rule(rule)
def lint(ctx, rules):
'''
Args:
ctx (dict): lint context with keys `samples` (map from sample id to path,
keyed by sha256, md5, and filename; see `collect_samples(path)`),
`rules`, and `is_thorough`.
rules (RuleSet): the rules to lint.
'''
did_suggest_fix = False
for rule in rules.rules.values():
if rule.meta.get('capa/subscope-rule', False):
continue
did_suggest_fix = lint_rule(ctx, rule) or did_suggest_fix
return did_suggest_fix
def collect_samples(path):
'''
recurse through the given path, collecting all file paths, indexed by their content sha256, md5, and filename.
'''
samples = {}
for root, dirs, files in os.walk(path):
for name in files:
if name.endswith('.viv'):
continue
if name.endswith('.idb'):
continue
if name.endswith('.i64'):
continue
path = os.path.join(root, name)
try:
with open(path, 'rb') as f:
buf = f.read()
except IOError:
continue
sha256 = hashlib.sha256()
sha256.update(buf)
md5 = hashlib.md5()
md5.update(buf)
samples[sha256.hexdigest().lower()] = path
samples[sha256.hexdigest().upper()] = path
samples[md5.hexdigest().lower()] = path
samples[md5.hexdigest().upper()] = path
samples[name] = path
return samples
def main(argv=None):
if argv is None:
argv = sys.argv[1:]
samples_path = os.path.join(os.path.dirname(__file__), '..', 'tests', 'data')
parser = argparse.ArgumentParser(description='Check capa rules for style issues.')
parser.add_argument('rules', type=str,
help='Path to rules')
parser.add_argument('--samples', type=str, default=samples_path,
help='Path to samples')
parser.add_argument('--thorough', action='store_true',
help='Enable thorough linting - takes more time, but does a better job')
parser.add_argument('-v', '--verbose', action='store_true',
help='Enable debug logging')
parser.add_argument('-q', '--quiet', action='store_true',
help='Disable all output but errors')
args = parser.parse_args(args=argv)
if args.verbose:
level = logging.DEBUG
elif args.quiet:
level = logging.ERROR
else:
level = logging.INFO
logging.basicConfig(level=level)
logging.getLogger('capa.lint').setLevel(level)
capa.main.set_vivisect_log_level(logging.CRITICAL)
logging.getLogger('capa').setLevel(logging.CRITICAL)
try:
rules = capa.main.get_rules(args.rules)
rules = capa.rules.RuleSet(rules)
logger.info('successfully loaded %s rules', len(rules))
except IOError as e:
logger.error('%s', str(e))
return -1
except capa.rules.InvalidRule as e:
logger.error('%s', str(e))
return -1
logger.info('collecting potentially referenced samples')
if not os.path.exists(args.samples):
logger.error('samples path %s does not exist', args.samples)
return -1
samples = collect_samples(args.samples)
ctx = {
'samples': samples,
'rules': rules,
'is_thorough': args.thorough,
}
did_violate = lint(ctx, rules)
if not did_violate:
logger.info('no suggestions, nice!')
return 0
else:
return 1
if __name__ == '__main__':
sys.exit(main())

81
scripts/show-features.py Normal file
View File

@@ -0,0 +1,81 @@
#!/usr/bin/env python2
'''
show the features extracted by capa.
'''
import sys
import logging
import argparse
import capa.main
import capa.rules
import capa.engine
import capa.features
import capa.features.freeze
import capa.features.extractors.viv
def main(argv=None):
if argv is None:
argv = sys.argv[1:]
formats = [
('auto', '(default) detect file type automatically'),
('pe', 'Windows PE file'),
('sc32', '32-bit shellcode'),
('sc64', '64-bit shellcode'),
('freeze', 'features previously frozen by capa'),
]
format_help = ', '.join(['%s: %s' % (f[0], f[1]) for f in formats])
parser = argparse.ArgumentParser(description="detect capabilities in programs.")
parser.add_argument("sample", type=str,
help="Path to sample to analyze")
parser.add_argument("-f", "--format", choices=[f[0] for f in formats], default="auto",
help="Select sample format, %s" % format_help)
parser.add_argument("-F", "--function", type=lambda x: int(x, 0),
help="Show features for specific function")
args = parser.parse_args(args=argv)
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
if args.format == 'freeze':
with open(args.sample, 'rb') as f:
extractor = capa.features.freeze.load(f.read())
else:
vw = capa.main.get_workspace(args.sample, args.format)
extractor = capa.features.extractors.viv.VivisectFeatureExtractor(vw, args.sample)
if not args.function:
for feature, va in extractor.extract_file_features():
if va:
print('file: 0x%08x: %s' % (va, feature))
else:
print('file: 0x00000000: %s' % (feature))
functions = extractor.get_functions()
if args.function:
if args.format == 'freeze':
functions = filter(lambda f: f == args.function, functions)
else:
functions = filter(lambda f: f.va == args.function, functions)
for f in functions:
for feature, va in extractor.extract_function_features(f):
print('func: 0x%08x: %s' % (va, feature))
for bb in extractor.get_basic_blocks(f):
for feature, va in extractor.extract_basic_block_features(f, bb):
print('bb : 0x%08x: %s' % (va, feature))
for insn in extractor.get_instructions(f, bb):
for feature, va in extractor.extract_insn_features(f, bb, insn):
print('insn: 0x%08x: %s' % (va, feature))
return 0
if __name__ == "__main__":
sys.exit(main())

71
scripts/testbed/README.md Normal file
View File

@@ -0,0 +1,71 @@
# Testbed
The goal of the testbed is to support the development of new `capa` rules. Its scripts allow testing rules against a large sample set and batch processing samples, e.g. to freeze features or to generate other metadata used for testing.
The testbed contains malicious and benign files. Data sources are:
- Microsoft EXE and DLL files from `C:\Windows\System32`, `C:\Windows\SysWOW64`, etc.
- samples analyzed and annotated by FLARE analysts during malware analysis
Samples containing the keyword `slow` in their path indicate a longer test run time (>20 seconds) and can be ignored via the `-f` argument.
Running a rule against a large set of executable programs quickly shows on which functions/samples the rule hits. This helps to identify:
- true positives: hits on expected functions
- false positives: hits on unexpected functions, for example
- if a rule is too generic, or
- if a rule hits on a capability present in many (benign) samples
To provide additional context the testbed contains function names from the following data sources:
- benign files: function names from Microsoft's PDB information
- malicious files: function names provided by FLARE analysts and obtained from
the LabelMaker 2000 (LM2k) annotations repository
For each test sample the testbed contains the following files:
- a `.frz` file storing the extracted `capa` features
- `capa`'s serialized features, via `capa.features.freeze`
- a `.fnames` file mapping function addresses to function names
- JSON file that maps fvas to function names or
- CSV file with entries `idbmd5;md5;fva;fname`
- (optional) the binary file with extension `.exe_`, `.dll_`, or `.mal_`
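A minimal sketch of loading either `.fnames` format into a `{fva: fname}` mapping (assumptions: JSON keys may be decimal or hex strings; malformed CSV lines are skipped):

```python
import json


def parse_fnames(text):
    """Parse .fnames content: JSON {"fva": "fname"} or CSV idbmd5;md5;fva;fname."""
    try:
        fnames = json.loads(text)
    except ValueError:
        # not JSON; fall back to semicolon-separated CSV lines
        fnames = {}
        for line in text.splitlines():
            try:
                _idbmd5, _md5, fva, fname = line.split(';', 3)
            except ValueError:
                continue  # skip malformed lines
            fnames[fva] = fname
    # base 0 accepts both decimal ("4198400") and hex ("0x401000") keys
    return {int(fva, 0): fname for fva, fname in fnames.items()}


assert parse_fnames('{"4198400": "main"}') == {4198400: 'main'}
assert parse_fnames('abc;def;0x401000;WinMain') == {0x401000: 'WinMain'}
```

`run_rule_on_testbed.py` performs an equivalent conversion before resolving hit addresses to names.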
## Scripts
### `run_rule_on_testbed.py`
Run a `capa` rule file against the testbed (frozen features in a directory).
Meant to be run on directories that contain `.frz` and `.fnames` files.
Example usage:
run_rule_on_testbed.py <testbed dir>
run_rule_on_testbed.py samples
With the `-s <image_path>` argument, the script exports images of function graphs to the provided path.
Converting the images requires `graphviz`. See https://graphviz.gitlab.io/about/; get the Python interface via `pip install graphviz`.
## Helper Scripts
### `freeze_features.py`
Use `freeze_features.py` to freeze `capa` features of a file or of files in a directory.
Example usage:
freeze_features.py <testbed dir>
freeze_features.py samples
### `start_ida_dump_fnames.py`
Start IDA Pro in autonomous mode to dump JSON file of function names `{fva: fname}`. Processes a single file or a directory.
This script uses `_dump_fnames.py` to dump the JSON file of function names and is meant to be run on benign files with PDB information. IDA should apply function names from the PDB information automatically.
Example usage:
start_ida_dump_fnames.py <candidate files dir>
start_ida_dump_fnames.py samples\benign
### `start_ida_export_fimages.py`
Start IDA Pro in autonomous mode to export images of function graphs.
`run_rule_on_testbed.py` integrates the export mechanism (`-s` option).
This script uses `_export_fimages.py` to export DOT files of function graphs and then converts them to PNG images using `graphviz`.
Example usage:
start_ida_export_fimages.py <target file> <output dir> -f <function list>
start_ida_export_fimages.py test.exe imgs -f 0x401000,0x402F90

View File

@@ -0,0 +1,2 @@
FNAMES_EXTENSION = '.fnames'
FREEZE_EXTENSION = '.frz'

View File

@@ -0,0 +1,46 @@
'''
IDAPython script to dump JSON file of function names { fva: fname }.
Meant to be run on benign files with PDB information. IDA should apply function names from the PDB files automatically.
Can also be run on annotated IDA database files.
Example usage (via IDA autonomous mode):
ida.exe -A -S_dump_fnames.py "<output path>" <sample_path>
'''
import json
import idc
import idautils
def main():
if len(idc.ARGV) != 2:
# requires output file path argument
idc.qexit(-1)
# wait for auto-analysis to finish
idc.auto_wait()
INF_SHORT_DN_ATTR = idc.get_inf_attr(idc.INF_SHORT_DN) # short form of demangled names
fnames = {}
for f in idautils.Functions():
fname = idc.get_name(f)
if fname.startswith("sub_"):
continue
name_demangled = idc.demangle_name(fname, INF_SHORT_DN_ATTR)
if name_demangled:
fname = name_demangled
fnames[f] = fname
with open(idc.ARGV[1], "w") as f:
json.dump(fnames, f)
# exit IDA
idc.qexit(0)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,44 @@
'''
IDAPython script to export DOT files of function graphs.
Example usage (via IDA autonomous mode):
ida.exe -A -S_export_fimages.py "<output dir>" <fva1> [<fva2> ...] <sample_path>
'''
import os
import idc
import idaapi
import ida_gdl
def main():
if len(idc.ARGV) < 3:
# requires output directory and function VAs argument(s)
idc.qexit(-1)
# wait for auto-analysis to finish
idc.auto_wait()
out_dir = idc.ARGV[1]
fvas = [int(fva, 0x10) for fva in idc.ARGV[2:]]
idb_name = os.path.split(idc.get_idb_path())[-1]
for fva in fvas:
fstart = idc.get_func_attr(fva, idc.FUNCATTR_START)
name = '%s_0x%x' % (idb_name.replace('.', '_'), fstart)
out_path = os.path.join(out_dir, name)
fname = idc.get_name(fstart)
if not ida_gdl.gen_flow_graph(out_path, '%s (0x%x)' % (fname, fstart), idaapi.get_func(fstart), 0, 0,
ida_gdl.CHART_GEN_DOT | ida_gdl.CHART_PRINT_NAMES):
print('IDA error generating flow graph')
# TODO add label to DOT file, see https://stackoverflow.com/a/6452088/10548020
# TODO highlight where rule matched
# exit IDA
idc.qexit(0)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,102 @@
'''
Freeze capa features.
Example usage:
freeze_features.py <test files dir>
freeze_features.py samples\benign
'''
import os
import sys
import time
import logging
import argparse
from scripts.testbed import FREEZE_EXTENSION
from capa.features.freeze import main as freeze_features
# only process files with these extensions
TARGET_EXTENSIONS = [
'.mal_',
'.exe_',
'.dll_',
'.sys_'
]
logger = logging.getLogger('freeze_features')
def freeze(input_path, reprocess):
if not os.path.exists(input_path):
raise IOError('%s does not exist or cannot be accessed' % input_path)
if os.path.isfile(input_path):
outfile = '%s%s' % (input_path, FREEZE_EXTENSION)
freeze_file(input_path, outfile, reprocess)
elif os.path.isdir(input_path):
logger.info('freezing features of %s files in %s', '|'.join(TARGET_EXTENSIONS), input_path)
for root, dirs, files in os.walk(input_path):
for file in files:
if not os.path.splitext(file)[1] in TARGET_EXTENSIONS:
logger.debug('skipping non-target file: %s', file)
continue
path = os.path.join(root, file)
outfile = '%s%s' % (path, FREEZE_EXTENSION)
freeze_file(path, outfile, reprocess)
def freeze_file(path, output, reprocess=False):
logger.info('freezing features of %s', path)
if os.path.exists(output) and not reprocess:
logger.info('%s already exists, provide -r argument to reprocess', output)
return
try:
freeze_features([path, output]) # args: sample, output
except Exception as e:
logger.error('could not freeze features for %s: %s', path, str(e))
def main(argv=None):
if argv is None:
argv = sys.argv[1:]
parser = argparse.ArgumentParser(description="Freeze capa features of a file or of files in a directory")
parser.add_argument("file_path", type=str,
help="Path to file or directory to analyze")
parser.add_argument("-r", "--reprocess", action="store_true", default=False,
help="Overwrite existing analysis")
parser.add_argument("-v", "--verbose", action="store_true",
help="Enable verbose output")
parser.add_argument("-q", "--quiet", action="store_true",
help="Disable all output but errors")
args = parser.parse_args(args=argv)
if args.quiet:
logging.basicConfig(level=logging.ERROR)
logging.getLogger().setLevel(logging.ERROR)
elif args.verbose:
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
else:
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
time0 = time.time()
try:
freeze(args.file_path, args.reprocess)
except IOError as e:
logger.error('%s', str(e))
return -1
logger.info('freezing features took %d seconds', time.time() - time0)
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@@ -0,0 +1,297 @@
'''
Run a capa rule file against the testbed (frozen features in a directory).
Example usage:
run_rule_on_testbed.py <path to rules> <rule name> <testbed dir>
run_rule_on_testbed.py ..\\rules "create pipe" samples
'''
import os
import sys
import json
import time
import logging
from collections import defaultdict
import argparse
import capa.main
import capa.rules
import capa.features.freeze
from scripts.testbed import FNAMES_EXTENSION, FREEZE_EXTENSION
from start_ida_export_fimages import export_fimages
logger = logging.getLogger(__name__)
# sorry globals...
file_count = 0
file_hits = 0
mal_hits = 0
other_hits = 0
function_hits = 0
errors = 0
function_names = set([])
CATEGORY = {
'malicious': 'MAL',
'benign': 'BEN',
}
def check_rule(path, rules, rule_name, only_matching, save_image, verbose):
global file_count, file_hits, mal_hits, other_hits, function_hits, errors
try:
capabilities = get_capabilities(path, rules)
except (ValueError, KeyError) as e:
logger.error('cannot load %s due to %s: %s', path, type(e).__name__, str(e))
errors += 1
return
file_count += 1
hits = get_function_hits(capabilities, rule_name)
if hits == 0:
if not only_matching:
render_no_hit(path)
else:
print('[x] rule matches %d function(s) in %s (%s)' % (hits, path, get_category(path)))
file_hits += 1
function_hits += hits
if get_category(path) == 'MAL':
mal_hits += 1
else:
other_hits += 1
if verbose:
render_hit_verbose(capabilities, path, verbose > 1)
if save_image:
fvas = ['0x%x' % fva for fva in get_hit_fvas(capabilities)]
file_path = get_idb_or_sample_path(path)
if file_path:
if not export_fimages(file_path, save_image, fvas):
logger.warning('exporting images failed')
else:
logger.warning('could not get IDB or sample path')
def get_idb_or_sample_path(path):
exts = ['.idb', '.i64', '.exe_', '.dll_', '.mal_']
roots = [os.path.splitext(path)[0], path]
for e in exts:
for r in roots:
p = '%s%s' % (r, e)
if os.path.exists(p):
return p
return None
def get_capabilities(path, rules):
logger.debug('matching rules in %s', path)
with open(path, 'rb') as f:
extractor = capa.features.freeze.load(f.read())
return capa.main.find_capabilities(rules, extractor, disable_progress=True)
def get_function_hits(capabilities, rule_name):
return len(capabilities.get(rule_name, []))
def get_category(path):
for c in CATEGORY:
if c in path:
return CATEGORY[c]
return 'UNK'
def render_no_hit(path):
print('[ ] no match in %s (%s)' % (path, get_category(path)))
def render_hit_verbose(capabilities, path, vverbose):
try:
fnames = load_fnames(path)
except IOError as e:
logger.error('%s', str(e))
fnames = None
for rule, ress in capabilities.items():
for (fva, res) in sorted(ress, key=lambda p: p[0]):
if fnames and fva in fnames:
fname = fnames[fva]
function_names.add(fname)
else:
fname = '<name unknown>'
print(' - function 0x%x (%s)' % (fva, fname))
if vverbose:
capa.main.render_result(res, indent=' ')
def get_hit_fvas(capabilities):
fvas = []
for rule, ress in capabilities.items():
for (fva, res) in sorted(ress, key=lambda p: p[0]):
fvas.append(fva)
return fvas
def load_fnames(path):
fnames_path = path.replace(FREEZE_EXTENSION, FNAMES_EXTENSION)
if not os.path.exists(fnames_path):
raise IOError('%s does not exist' % fnames_path)
logger.debug('fnames path: %s', fnames_path)
try:
# json file with format { fva: fname }
fnames = load_json(fnames_path)
logger.debug('loaded JSON file')
except TypeError:
# csv file with format idbmd5;md5;fva;fname
fnames = load_csv(fnames_path)
logger.debug('loaded CSV file')
fnames = convert_keys_to_int(fnames)
logger.debug('read %d function names' % len(fnames))
return fnames
def load_json(path):
with open(path, 'r') as f:
try:
funcs = json.load(f)
except ValueError as e:
logger.debug('not a JSON file, %s', str(e))
raise TypeError
return funcs
def load_csv(path):
funcs = defaultdict(str)
with open(path, 'r') as f:
data = f.read().splitlines()
for line in data:
try:
# semicolon-separated entries: idbmd5;md5;fva;fname
idbmd5, md5, fva, name = line.split(';', 3)
except ValueError as e:
logger.warning('%s: "%s"', str(e), line)
continue
funcs[fva] = name
return funcs
def convert_keys_to_int(funcs_in):
funcs = {}
for k, v in funcs_in.items():
try:
k = int(k)
except ValueError:
k = int(k, 0x10)
funcs[k] = v
return funcs
def print_summary(verbose, start_time):
global file_count, file_hits, function_hits, errors
print('\n[SUMMARY]')
m, s = divmod(time.time() - start_time, 60)
logger.info('ran for %d:%02d minutes', m, s)
ratio = ' (%d%%)' % ((float(file_hits) / file_count) * 100) if file_count else ''
print('matched %d function(s) in %d/%d%s sample(s), encountered %d error(s)' % (
function_hits, file_hits, file_count, ratio, errors))
print('%d hits on (MAL) files; %d hits on other files' % (mal_hits, other_hits))
if verbose:
if len(function_names) > 0:
print('matched function names (unique):')
for fname in function_names:
print(' - %s' % fname)
def main(argv=None):
if argv is None:
argv = sys.argv[1:]
parser = argparse.ArgumentParser(description="Run capa rule file against frozen features in a directory")
parser.add_argument("rules", type=str,
help="Path to directory containing rules")
parser.add_argument("rule_name", type=str,
help="Name of rule to test")
parser.add_argument("frozen_path", type=str,
help="Path to frozen feature file or directory")
parser.add_argument("-f", "--fast", action="store_true",
help="Don't test slow files")
parser.add_argument("-o", "--only_matching", action="store_true",
help="Print only if rule matches")
parser.add_argument("-s", "--save_image", action="store",
help="Directory to save exported images of function graphs")
parser.add_argument("-v", "--verbose", action="count", default=0,
help="Increase output verbosity")
parser.add_argument("-q", "--quiet", action="store_true",
help="Disable all output but errors")
args = parser.parse_args(args=argv)
if args.quiet:
logging.basicConfig(level=logging.ERROR)
logging.getLogger().setLevel(logging.ERROR)
elif args.verbose:
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
else:
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
if not os.path.isdir(args.rules):
logger.error('%s is not a directory', args.rules)
return -1
# load rule
try:
rules = capa.main.get_rules(args.rules)
rules = list(capa.rules.get_rules_and_dependencies(rules, args.rule_name))
rules = capa.rules.RuleSet(rules)
except IOError as e:
logger.error('%s', str(e))
return -1
except capa.rules.InvalidRule as e:
logger.error('%s', str(e))
return -1
time0 = time.time()
print('[RULE %s]' % args.rule_name)
if os.path.isfile(args.frozen_path):
check_rule(args.frozen_path, rules, args.rule_name, args.only_matching, args.save_image, args.verbose)
try:
# get only freeze files from directory
freeze_files = []
for root, dirs, files in os.walk(args.frozen_path):
for file in files:
if not file.endswith(FREEZE_EXTENSION):
continue
path = os.path.join(root, file)
if args.fast and 'slow' in path:
logger.debug('fast mode skipping %s', path)
continue
freeze_files.append(path)
for path in sorted(freeze_files):
sample_time0 = time.time()
check_rule(path, rules, args.rule_name, args.only_matching, args.save_image, args.verbose)
logger.debug('rule check took %d seconds', time.time() - sample_time0)
except KeyboardInterrupt:
logger.info('Received keyboard interrupt, terminating')
print_summary(args.verbose, time0)
if __name__ == "__main__":
sys.exit(main())


@@ -0,0 +1,131 @@
'''
Start IDA Pro in autonomous mode to dump JSON file of function names { fva: fname }.
Processes a single file or a directory.
Only runs on files with supported file extensions.
Example usage:
start_ida_dump_fnames.py <candidate files dir>
start_ida_dump_fnames.py samples\benign
'''
import os
import sys
import json
import hashlib
import logging
import subprocess
import argparse
from scripts.testbed import FNAMES_EXTENSION
IDA32_PATH = 'C:\\Program Files\\IDA Pro 7.3\\ida.exe'
IDA64_PATH = 'C:\\Program Files\\IDA Pro 7.3\\ida64.exe'
# expected in same directory as this file
DUMP_SCRIPT_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), '_dump_fnames.py')
SUPPORTED_EXTENSIONS = [
'.exe_',
'.dll_',
'.sys_',
'.idb',
'.i64',
]
logger = logging.getLogger(__name__)
def call_ida_dump_script(sample_path, reprocess):
''' call IDA in autonomous mode and return True if success, False on failure '''
logger.info('processing %s (MD5: %s)', sample_path, get_md5_hexdigest(sample_path))
# TODO detect 64-bit binaries
if os.path.splitext(sample_path)[-1] == '.i64':
IDA_PATH = IDA64_PATH
else:
IDA_PATH = IDA32_PATH
if sample_path.endswith('.idb') or sample_path.endswith('.i64'):
sample_path = sample_path[:-4]
fnames = '%s%s' % (sample_path, FNAMES_EXTENSION)
if os.path.exists(fnames) and not reprocess:
logger.info('%s already exists and contains %d function names, provide -r argument to reprocess',
fnames, len(get_function_names(fnames)))
return True
out_path = os.path.split(fnames)[-1] # relative to IDA database file
args = [IDA_PATH, '-A', '-S%s "%s"' % (DUMP_SCRIPT_PATH, out_path), sample_path]
logger.debug('calling "%s"', ' '.join(args))
subprocess.call(args)
if not os.path.exists(fnames):
logger.warning('%s was not created', fnames)
return False
logger.debug('extracted %d function names to %s', len(get_function_names(fnames)), fnames)
return True
def get_md5_hexdigest(sample_path):
m = hashlib.md5()
with open(sample_path, 'rb') as f:
m.update(f.read())
return m.hexdigest()
def get_function_names(fnames_file):
if not os.path.exists(fnames_file):
return None
with open(fnames_file, 'r') as f:
return json.load(f)
def main():
parser = argparse.ArgumentParser(
description="Launch IDA Pro in autonomous mode to dump function names of a file or of files in a directory")
parser.add_argument("file_path", type=str,
help="File or directory path to analyze")
parser.add_argument("-r", "--reprocess", action="store_true", default=False,
help="Overwrite existing analysis")
parser.add_argument("-v", "--verbose", action="store_true",
help="Enable verbose output")
args = parser.parse_args(args=sys.argv[1:])
if args.verbose:
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
else:
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
if not os.path.exists(args.file_path):
logger.warning('%s does not exist', args.file_path)
return -1
if os.path.isfile(args.file_path):
call_ida_dump_script(args.file_path, args.reprocess)
return 0
errors = 0
logger.info('processing files in %s with file extension %s', args.file_path, '|'.join(SUPPORTED_EXTENSIONS))
for root, dirs, files in os.walk(args.file_path):
for file in files:
if os.path.splitext(file)[1] not in SUPPORTED_EXTENSIONS:
logger.debug('%s does not have supported file extension', file)
continue
path = os.path.join(root, file)
if not call_ida_dump_script(path, args.reprocess):
errors += 1
if errors:
logger.warning('encountered %d errors', errors)
return 0
if __name__ == "__main__":
sys.exit(main())
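The docstring above describes the dump script's output as a JSON file of `{ fva: fname }`. A minimal sketch of that format and its round-trip through `json` (the addresses and names here are hypothetical, and the exact key encoding used by the IDA script may differ):

```python
import json

# hypothetical example of the { fva: fname } mapping the IDA script writes
fnames = {
    "0x401000": "sub_401000",
    "0x402f90": "mw_decode_config",
}

# the launcher reads this back via json.load to count extracted names
serialized = json.dumps(fnames)
restored = json.loads(serialized)
assert restored == fnames
```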


@@ -0,0 +1,135 @@
'''
Start IDA Pro in autonomous mode to export images of function graphs.
Example usage:
start_ida_export_fimages.py <target file> <output dir> -f <function list>
start_ida_export_fimages.py test.exe imgs -f 0x401000,0x402F90
'''
import os
import imp
import sys
import hashlib
import logging
import subprocess
import argparse
try:
imp.find_module('graphviz')
from graphviz import Source
graphviz_found = True
except ImportError:
graphviz_found = False
IDA32_PATH = 'C:\\Program Files\\IDA Pro 7.3\\ida.exe'
IDA64_PATH = 'C:\\Program Files\\IDA Pro 7.3\\ida64.exe'
# expected in same directory as this file
EXPORT_SCRIPT_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), '_export_fimages.py')
logger = logging.getLogger(__name__)
def export_fimages(file_path, out_dir, functions, manual=False):
'''
Export images of function graphs.
:param file_path: file to analyze
:param out_dir: output directory
:param functions: list of strings of hex formatted fvas
:param manual: non-autonomous mode
:return: True on success, False otherwise
'''
if not graphviz_found:
logger.warning('please install graphviz to export images')
return False
if not os.path.exists(out_dir):
os.mkdir(out_dir)
script_args = [os.path.abspath(out_dir)] + functions
call_ida_script(EXPORT_SCRIPT_PATH, script_args, file_path, manual)
img_count = 0
for root, dirs, files in os.walk(out_dir):
for file in files:
if not file.endswith('.dot'):
continue
try:
s = Source.from_file(file, directory=out_dir)
s.render(file, directory=out_dir, format='png', cleanup=True)
img_count += 1
except Exception:
logger.warning('graphviz error rendering %s', file)
if img_count > 0:
logger.info('exported %d function graph images to "%s"', img_count, os.path.abspath(out_dir))
return True
else:
logger.warning('failed to export function graph images')
return False
def call_ida_script(script_path, script_args, sample_path, manual):
logger.info('processing %s (MD5: %s)', sample_path, get_md5_hexdigest(sample_path))
# TODO detect 64-bit binaries
if os.path.splitext(sample_path)[-1] == '.i64':
IDA_PATH = IDA64_PATH
else:
IDA_PATH = IDA32_PATH
args = [IDA_PATH, '-A', '-S%s %s' % (script_path, ' '.join(script_args)), sample_path]
if manual:
args.remove('-A')
logger.debug('calling "%s"', ' '.join(args))
return subprocess.call(args) == 0
def get_md5_hexdigest(sample_path):
m = hashlib.md5()
with open(sample_path, 'rb') as f:
m.update(f.read())
return m.hexdigest()
def main():
parser = argparse.ArgumentParser(
description="Launch IDA Pro in autonomous mode to export images of function graphs")
parser.add_argument("file_path", type=str,
help="File to export from")
parser.add_argument("out_dir", type=str,
help="Export target directory")
parser.add_argument("-f", "--functions", action="store",
help="Comma separated list of functions to export")
parser.add_argument("-m", "--manual", action="store_true",
help="Manual mode: show IDA dialog boxes")
parser.add_argument("-v", "--verbose", action="store_true",
help="Enable verbose output")
args = parser.parse_args(args=sys.argv[1:])
if args.verbose:
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
else:
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
if not os.path.isfile(args.file_path):
logger.warning('%s is not a file', args.file_path)
return -1
functions = args.functions.split(',') if args.functions else []
export_fimages(args.file_path, args.out_dir, functions, args.manual)
return 0
if __name__ == "__main__":
sys.exit(main())

setup.py (new file, 62 lines)
@@ -0,0 +1,62 @@
import os
import sys
import setuptools
requirements = [
"six",
"tqdm",
"pyyaml",
"tabulate",
]
if sys.version_info >= (3, 0):
# py3
requirements.append("networkx")
else:
# py2
requirements.append("enum34")
requirements.append("vivisect")
requirements.append("viv-utils")
requirements.append("networkx==2.2") # v2.2 is last version supported by Python 2.7
# this sets __version__
# via: http://stackoverflow.com/a/7071358/87207
# and: http://stackoverflow.com/a/2073599/87207
with open(os.path.join("capa", "version.py"), "rb") as f:
exec(f.read())
def get_rule_paths():
return [os.path.join('..', x[0], '*.yml') for x in os.walk('rules')]
setuptools.setup(
name='capa',
version=__version__,
description="",
long_description="",
author="Willi Ballenthin, Moritz Raabe",
author_email='william.ballenthin@mandiant.com, moritz.raabe@mandiant.com',
url='https://www.github.com/fireeye/capa',
packages=setuptools.find_packages(exclude=['tests', 'testbed']),
package_dir={'capa': 'capa'},
package_data={'capa': get_rule_paths()},
entry_points={
"console_scripts": [
"capa=capa.main:main",
]
},
include_package_data=True,
install_requires=requirements,
zip_safe=False,
keywords='capa',
classifiers=[
'Development Status :: 3 - Alpha',
'Intended Audience :: Developers',
'Natural Language :: English',
"Programming Language :: Python :: 2",
"Programming Language :: Python :: 3",
],
)
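setup.py sets `__version__` by exec'ing the contents of `capa/version.py` rather than importing the package, so the version is available before capa's dependencies are installed. The pattern, with a hypothetical version string standing in for the real file:

```python
# hypothetical stand-in for the contents of capa/version.py
version_py = b'__version__ = "1.0.0"\n'

# exec the file's bytes into a namespace; this defines __version__
# without importing capa (which would require its dependencies)
namespace = {}
exec(version_py, namespace)
assert namespace["__version__"] == "1.0.0"
```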

tests/fixtures.py (new file, 78 lines)
@@ -0,0 +1,78 @@
import os
import os.path
import collections
import pytest
import viv_utils
CD = os.path.dirname(__file__)
Sample = collections.namedtuple('Sample', ['vw', 'path'])
@pytest.fixture
def mimikatz():
path = os.path.join(CD, 'data', 'mimikatz.exe_')
return Sample(viv_utils.getWorkspace(path), path)
@pytest.fixture
def sample_a933a1a402775cfa94b6bee0963f4b46():
path = os.path.join(CD, 'data', 'a933a1a402775cfa94b6bee0963f4b46.dll_')
return Sample(viv_utils.getWorkspace(path), path)
@pytest.fixture
def kernel32():
path = os.path.join(CD, 'data', 'kernel32.dll_')
return Sample(viv_utils.getWorkspace(path), path)
@pytest.fixture
def sample_a198216798ca38f280dc413f8c57f2c2():
path = os.path.join(CD, 'data', 'a198216798ca38f280dc413f8c57f2c2.exe_')
return Sample(viv_utils.getWorkspace(path), path)
@pytest.fixture
def sample_9324d1a8ae37a36ae560c37448c9705a():
path = os.path.join(CD, 'data', '9324d1a8ae37a36ae560c37448c9705a.exe_')
return Sample(viv_utils.getWorkspace(path), path)
@pytest.fixture
def pma_lab_12_04():
path = os.path.join(CD, 'data', 'Practical Malware Analysis Lab 12-04.exe_')
return Sample(viv_utils.getWorkspace(path), path)
@pytest.fixture
def sample_bfb9b5391a13d0afd787e87ab90f14f5():
path = os.path.join(CD, 'data', 'bfb9b5391a13d0afd787e87ab90f14f5.dll_')
return Sample(viv_utils.getWorkspace(path), path)
@pytest.fixture
def sample_lab21_01():
path = os.path.join(CD, 'data', 'Practical Malware Analysis Lab 21-01.exe_')
return Sample(viv_utils.getWorkspace(path), path)
@pytest.fixture
def sample_c91887d861d9bd4a5872249b641bc9f9():
path = os.path.join(CD, 'data', 'c91887d861d9bd4a5872249b641bc9f9.exe_')
return Sample(viv_utils.getWorkspace(path), path)
@pytest.fixture
def sample_39c05b15e9834ac93f206bc114d0a00c357c888db567ba8f5345da0529cbed41():
path = os.path.join(CD, 'data', '39c05b15e9834ac93f206bc114d0a00c357c888db567ba8f5345da0529cbed41.dll_')
return Sample(viv_utils.getWorkspace(path), path)
@pytest.fixture
def sample_499c2a85f6e8142c3f48d4251c9c7cd6_raw32():
path = os.path.join(CD, 'data', '499c2a85f6e8142c3f48d4251c9c7cd6.raw32')
return Sample(viv_utils.getShellcodeWorkspace(path), path)

tests/test_engine.py (new file, 218 lines)
@@ -0,0 +1,218 @@
import textwrap
import capa.rules
import capa.engine
from capa.engine import *
import capa.features
def test_element():
assert Element(1).evaluate(set([0])) == False
assert Element(1).evaluate(set([1])) == True
assert Element(1).evaluate(set([None])) == False
assert Element(1).evaluate(set([''])) == False
assert Element(1).evaluate(set([False])) == False
def test_and():
assert And(Element(1)).evaluate(set([0])) == False
assert And(Element(1)).evaluate(set([1])) == True
assert And(Element(1), Element(2)).evaluate(set([0])) == False
assert And(Element(1), Element(2)).evaluate(set([1])) == False
assert And(Element(1), Element(2)).evaluate(set([2])) == False
assert And(Element(1), Element(2)).evaluate(set([1, 2])) == True
def test_or():
assert Or(Element(1)).evaluate(set([0])) == False
assert Or(Element(1)).evaluate(set([1])) == True
assert Or(Element(1), Element(2)).evaluate(set([0])) == False
assert Or(Element(1), Element(2)).evaluate(set([1])) == True
assert Or(Element(1), Element(2)).evaluate(set([2])) == True
assert Or(Element(1), Element(2)).evaluate(set([1, 2])) == True
def test_not():
assert Not(Element(1)).evaluate(set([0])) == True
assert Not(Element(1)).evaluate(set([1])) == False
def test_some():
assert Some(0, Element(1)).evaluate(set([0])) == True
assert Some(1, Element(1)).evaluate(set([0])) == False
assert Some(2, Element(1), Element(2), Element(3)).evaluate(set([0])) == False
assert Some(2, Element(1), Element(2), Element(3)).evaluate(set([0, 1])) == False
assert Some(2, Element(1), Element(2), Element(3)).evaluate(set([0, 1, 2])) == True
assert Some(2, Element(1), Element(2), Element(3)).evaluate(set([0, 1, 2, 3])) == True
assert Some(2, Element(1), Element(2), Element(3)).evaluate(set([0, 1, 2, 3, 4])) == True
def test_complex():
assert True == Or(
And(Element(1), Element(2)),
Or(Element(3),
Some(2, Element(4), Element(5), Element(6)))
).evaluate(set([5, 6, 7, 8]))
assert False == Or(
And(Element(1), Element(2)),
Or(Element(3),
Some(2, Element(4), Element(5)))
).evaluate(set([5, 6, 7, 8]))
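The boolean-logic assertions above can be reproduced by a small model of the evaluation semantics. This is an illustrative sketch only, not capa.engine's actual implementation; the class and method names mirror the tests, everything else is assumed:

```python
class Element:
    # a bare element matches when its value is present in the feature set
    def __init__(self, value):
        self.value = value

    def evaluate(self, features):
        return self.value in features


class And:
    def __init__(self, *children):
        self.children = children

    def evaluate(self, features):
        return all(c.evaluate(features) for c in self.children)


class Or:
    def __init__(self, *children):
        self.children = children

    def evaluate(self, features):
        return any(c.evaluate(features) for c in self.children)


class Not:
    def __init__(self, child):
        self.child = child

    def evaluate(self, features):
        return not self.child.evaluate(features)


class Some:
    # "N or more": match when at least `count` children match
    def __init__(self, count, *children):
        self.count = count
        self.children = children

    def evaluate(self, features):
        return sum(c.evaluate(features) for c in self.children) >= self.count
```

With these definitions, the complex assertion above holds: over the feature set `{5, 6, 7, 8}`, `Some(2, Element(4), Element(5), Element(6))` matches (two of three children present), so the outer `Or` matches.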
def test_range():
# unbounded range, but no matching feature
assert Range(Element(1)).evaluate({Element(2): {}}) == False
# unbounded range with matching feature should always match
assert Range(Element(1)).evaluate({Element(1): {}}) == True
assert Range(Element(1)).evaluate({Element(1): {0}}) == True
# unbounded max
assert Range(Element(1), min=1).evaluate({Element(1): {0}}) == True
assert Range(Element(1), min=2).evaluate({Element(1): {0}}) == False
assert Range(Element(1), min=2).evaluate({Element(1): {0, 1}}) == True
# unbounded min
assert Range(Element(1), max=0).evaluate({Element(1): {0}}) == False
assert Range(Element(1), max=1).evaluate({Element(1): {0}}) == True
assert Range(Element(1), max=2).evaluate({Element(1): {0}}) == True
assert Range(Element(1), max=2).evaluate({Element(1): {0, 1}}) == True
assert Range(Element(1), max=2).evaluate({Element(1): {0, 1, 3}}) == False
# we can do an exact match by setting min==max
assert Range(Element(1), min=1, max=1).evaluate({Element(1): {}}) == False
assert Range(Element(1), min=1, max=1).evaluate({Element(1): {1}}) == True
assert Range(Element(1), min=1, max=1).evaluate({Element(1): {1, 2}}) == False
# bounded range
assert Range(Element(1), min=1, max=3).evaluate({Element(1): {}}) == False
assert Range(Element(1), min=1, max=3).evaluate({Element(1): {1}}) == True
assert Range(Element(1), min=1, max=3).evaluate({Element(1): {1, 2}}) == True
assert Range(Element(1), min=1, max=3).evaluate({Element(1): {1, 2, 3}}) == True
assert Range(Element(1), min=1, max=3).evaluate({Element(1): {1, 2, 3, 4}}) == False
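The Range assertions describe counting semantics over a mapping from features to the set of locations where each was observed. A minimal sketch consistent with these tests follows; it is not capa's real implementation, and the key-presence check and infinite default maximum are assumptions inferred from the assertions above:

```python
class Element:
    # hashable so it can key the feature-to-locations mapping
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Element) and self.value == other.value

    def __hash__(self):
        return hash(self.value)


class Range:
    # matches when the number of locations at which `child` was seen
    # falls within [min, max]; the feature must be present at all,
    # and an omitted max is treated as unbounded
    def __init__(self, child, min=0, max=None):
        self.child = child
        self.min = min
        self.max = max if max is not None else float("inf")

    def evaluate(self, features):
        if self.child not in features:
            return False
        count = len(features[self.child])
        return self.min <= count <= self.max
```

Note the asymmetry the first two assertions pin down: an unbounded range over an absent feature is False, while the same range over a present feature with zero recorded locations is True.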
def test_match_adds_matched_rule_feature():
'''show that using `match` adds a feature for matched rules.'''
rule = textwrap.dedent('''
rule:
meta:
name: test rule
features:
- number: 100
''')
r = capa.rules.Rule.from_yaml(rule)
features, matches = capa.engine.match([r], {capa.features.insn.Number(100): {1}}, 0x0)
assert capa.features.MatchedRule('test rule') in features
def test_match_matched_rules():
'''show that using `match` adds a feature for matched rules.'''
rules = [
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule1
features:
- number: 100
''')),
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule2
features:
- match: test rule1
''')),
]
features, matches = capa.engine.match(capa.engine.topologically_order_rules(rules),
{capa.features.insn.Number(100): {1}}, 0x0)
assert capa.features.MatchedRule('test rule1') in features
assert capa.features.MatchedRule('test rule2') in features
# the ordering of the rules must not matter,
# the engine should match rules in an appropriate order.
features, matches = capa.engine.match(capa.engine.topologically_order_rules(reversed(rules)),
{capa.features.insn.Number(100): {1}}, 0x0)
assert capa.features.MatchedRule('test rule1') in features
assert capa.features.MatchedRule('test rule2') in features
def test_regex():
rules = [
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
features:
- and:
- string: /.*bbbb.*/
''')),
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: rule with implied wildcards
features:
- and:
- string: /bbbb/
''')),
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: rule with anchor
features:
- and:
- string: /^bbbb/
''')),
]
features, matches = capa.engine.match(capa.engine.topologically_order_rules(rules),
{capa.features.insn.Number(100): {1}}, 0x0)
assert capa.features.MatchedRule('test rule') not in features
features, matches = capa.engine.match(capa.engine.topologically_order_rules(rules),
{capa.features.String('aaaa'): {1}}, 0x0)
assert capa.features.MatchedRule('test rule') not in features
features, matches = capa.engine.match(capa.engine.topologically_order_rules(rules),
{capa.features.String('aBBBBa'): {1}}, 0x0)
assert capa.features.MatchedRule('test rule') not in features
features, matches = capa.engine.match(capa.engine.topologically_order_rules(rules),
{capa.features.String('abbbba'): {1}}, 0x0)
assert capa.features.MatchedRule('test rule') in features
assert capa.features.MatchedRule('rule with implied wildcards') in features
assert capa.features.MatchedRule('rule with anchor') not in features
def test_regex_ignorecase():
rules = [
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
features:
- and:
- string: /.*bbbb.*/i
''')),
]
features, matches = capa.engine.match(capa.engine.topologically_order_rules(rules),
{capa.features.String('aBBBBa'): {1}}, 0x0)
assert capa.features.MatchedRule('test rule') in features
def test_regex_complex():
rules = [
capa.rules.Rule.from_yaml(textwrap.dedent(r'''
rule:
meta:
name: test rule
features:
- or:
- string: /.*HARDWARE\\Key\\key with spaces\\.*/i
''')),
]
features, matches = capa.engine.match(capa.engine.topologically_order_rules(rules),
{capa.features.String(r'Hardware\Key\key with spaces\some value'): {1}}, 0x0)
assert capa.features.MatchedRule('test rule') in features

tests/test_freeze.py (new file, 173 lines)
@@ -0,0 +1,173 @@
import textwrap
import capa.main
import capa.helpers
import capa.features
import capa.features.insn
import capa.features.extractors
import capa.features.freeze
from fixtures import *
EXTRACTOR = capa.features.extractors.NullFeatureExtractor({
'file features': [
(0x402345, capa.features.Characteristic('embedded pe', True)),
],
'functions': {
0x401000: {
'features': [
(0x401000, capa.features.Characteristic('switch', True)),
],
'basic blocks': {
0x401000: {
'features': [
(0x401000, capa.features.Characteristic('tight loop', True)),
],
'instructions': {
0x401000: {
'features': [
(0x401000, capa.features.insn.Mnemonic('xor')),
(0x401000, capa.features.Characteristic('nzxor', True)),
],
},
0x401002: {
'features': [
(0x401002, capa.features.insn.Mnemonic('mov')),
]
}
}
},
}
},
}
})
def test_null_feature_extractor():
assert list(EXTRACTOR.get_functions()) == [0x401000]
assert list(EXTRACTOR.get_basic_blocks(0x401000)) == [0x401000]
assert list(EXTRACTOR.get_instructions(0x401000, 0x0401000)) == [0x401000, 0x401002]
rules = capa.rules.RuleSet([
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: xor loop
scope: basic block
features:
- and:
- characteristic(tight loop): true
- mnemonic: xor
- characteristic(nzxor): true
''')),
])
capabilities = capa.main.find_capabilities(rules, EXTRACTOR)
assert 'xor loop' in capabilities
def compare_extractors(a, b):
'''
args:
a (capa.features.extractors.NullFeatureExtractor)
b (capa.features.extractors.NullFeatureExtractor)
'''
# TODO: ordering of these things probably doesn't work yet
assert list(a.extract_file_features()) == list(b.extract_file_features())
assert list(a.get_functions()) == list(b.get_functions())
for f in a.get_functions():
assert list(a.get_basic_blocks(f)) == list(b.get_basic_blocks(f))
assert list(a.extract_function_features(f)) == list(b.extract_function_features(f))
for bb in a.get_basic_blocks(f):
assert list(a.get_instructions(f, bb)) == list(b.get_instructions(f, bb))
assert list(a.extract_basic_block_features(f, bb)) == list(b.extract_basic_block_features(f, bb))
for insn in a.get_instructions(f, bb):
assert list(a.extract_insn_features(f, bb, insn)) == list(b.extract_insn_features(f, bb, insn))
def compare_extractors_viv_null(viv_ext, null_ext):
'''
almost identical to compare_extractors but adds casts to ints since the VivisectFeatureExtractor returns objects
and NullFeatureExtractor returns ints
args:
viv_ext (capa.features.extractors.viv.VivisectFeatureExtractor)
null_ext (capa.features.extractors.NullFeatureExtractor)
'''
# TODO: ordering of these things probably doesn't work yet
assert list(viv_ext.extract_file_features()) == list(null_ext.extract_file_features())
assert to_int(list(viv_ext.get_functions())) == list(null_ext.get_functions())
for f in viv_ext.get_functions():
assert to_int(list(viv_ext.get_basic_blocks(f))) == list(null_ext.get_basic_blocks(to_int(f)))
assert list(viv_ext.extract_function_features(f)) == list(null_ext.extract_function_features(to_int(f)))
for bb in viv_ext.get_basic_blocks(f):
assert to_int(list(viv_ext.get_instructions(f, bb))) == list(null_ext.get_instructions(to_int(f), to_int(bb)))
assert list(viv_ext.extract_basic_block_features(f, bb)) == list(null_ext.extract_basic_block_features(to_int(f), to_int(bb)))
for insn in viv_ext.get_instructions(f, bb):
assert list(viv_ext.extract_insn_features(f, bb, insn)) == list(null_ext.extract_insn_features(to_int(f), to_int(bb), to_int(insn)))
def to_int(o):
'''helper to get int value of extractor items'''
if isinstance(o, list):
return [capa.helpers.oint(x) for x in o]
else:
return capa.helpers.oint(o)
def test_freeze_s_roundtrip():
load = capa.features.freeze.loads
dump = capa.features.freeze.dumps
reanimated = load(dump(EXTRACTOR))
compare_extractors(EXTRACTOR, reanimated)
def test_freeze_b_roundtrip():
load = capa.features.freeze.load
dump = capa.features.freeze.dump
reanimated = load(dump(EXTRACTOR))
compare_extractors(EXTRACTOR, reanimated)
def roundtrip_feature(feature):
serialize = capa.features.freeze.serialize_feature
deserialize = capa.features.freeze.deserialize_feature
assert feature == deserialize(serialize(feature))
def test_serialize_features():
roundtrip_feature(capa.features.insn.API('advapi32.CryptAcquireContextW'))
roundtrip_feature(capa.features.String('SCardControl'))
roundtrip_feature(capa.features.insn.Number(0xFF))
roundtrip_feature(capa.features.insn.Offset(0x0))
roundtrip_feature(capa.features.insn.Mnemonic('push'))
roundtrip_feature(capa.features.file.Section('.rsrc'))
roundtrip_feature(capa.features.Characteristic('tight loop', True))
roundtrip_feature(capa.features.basicblock.BasicBlock())
roundtrip_feature(capa.features.file.Export('BaseThreadInitThunk'))
roundtrip_feature(capa.features.file.Import('kernel32.IsWow64Process'))
roundtrip_feature(capa.features.file.Import('#11'))
def test_freeze_sample(tmpdir, sample_9324d1a8ae37a36ae560c37448c9705a):
# tmpdir fixture handles cleanup
o = tmpdir.mkdir("capa").join("test.frz").strpath
assert capa.features.freeze.main([sample_9324d1a8ae37a36ae560c37448c9705a.path, o, '-v']) == 0
def test_freeze_load_sample(tmpdir, sample_9324d1a8ae37a36ae560c37448c9705a):
o = tmpdir.mkdir("capa").join("test.frz")
viv_extractor = capa.features.extractors.viv.VivisectFeatureExtractor(sample_9324d1a8ae37a36ae560c37448c9705a.vw,
sample_9324d1a8ae37a36ae560c37448c9705a.path)
with open(o.strpath, 'wb') as f:
f.write(capa.features.freeze.dump(viv_extractor))
null_extractor = capa.features.freeze.load(o.open('rb').read())
compare_extractors_viv_null(viv_extractor, null_extractor)

tests/test_helpers.py (new file, 16 lines)
@@ -0,0 +1,16 @@
import codecs
from capa.features.extractors import helpers
def test_all_zeros():
# Python 2: <str>
# Python 3: <bytes>
a = b'\x00\x00\x00\x00'
b = codecs.decode('00000000', 'hex')
c = b'\x01\x00\x00\x00'
d = codecs.decode('01000000', 'hex')
assert helpers.all_zeros(a) is True
assert helpers.all_zeros(b) is True
assert helpers.all_zeros(c) is False
assert helpers.all_zeros(d) is False
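A minimal implementation consistent with these assertions (capa's actual helper may differ, particularly in how it handles Python 2 `str`; this sketch assumes Python 3, where iterating `bytes` yields ints):

```python
def all_zeros(buf):
    # True when every byte in the buffer is 0x00
    return all(b == 0 for b in buf)
```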

tests/test_main.py (new file, 188 lines)
@@ -0,0 +1,188 @@
import textwrap
import capa.main
import capa.rules
import capa.engine
from capa.engine import *
import capa.features
import capa.features.extractors.viv
from fixtures import *
def test_main(sample_9324d1a8ae37a36ae560c37448c9705a):
# tests rules can be loaded successfully
assert capa.main.main([sample_9324d1a8ae37a36ae560c37448c9705a.path, '-v']) == 0
def test_main_shellcode(sample_499c2a85f6e8142c3f48d4251c9c7cd6_raw32):
assert capa.main.main([sample_499c2a85f6e8142c3f48d4251c9c7cd6_raw32.path, '-v', '-f', 'sc32']) == 0
def test_ruleset():
rules = capa.rules.RuleSet([
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: file rule
scope: file
features:
- characteristic(embedded pe): y
''')),
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: function rule
scope: function
features:
- characteristic(switch): y
''')),
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: basic block rule
scope: basic block
features:
- characteristic(nzxor): y
''')),
])
assert len(rules.file_rules) == 1
assert len(rules.function_rules) == 1
assert len(rules.basic_block_rules) == 1
def test_match_across_scopes_file_function(sample_9324d1a8ae37a36ae560c37448c9705a):
rules = capa.rules.RuleSet([
# this rule should match on a function (0x4073F0)
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: install service
scope: function
examples:
- 9324d1a8ae37a36ae560c37448c9705a:0x4073F0
features:
- and:
- api: advapi32.OpenSCManagerA
- api: advapi32.CreateServiceA
- api: advapi32.StartServiceA
''')),
# this rule should match on a file feature
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: .text section
scope: file
examples:
- 9324d1a8ae37a36ae560c37448c9705a
features:
- section: .text
''')),
# this rule should match on earlier rule matches:
# - install service, with function scope
# - .text section, with file scope
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: .text section and install service
scope: file
examples:
- 9324d1a8ae37a36ae560c37448c9705a
features:
- and:
- match: install service
- match: .text section
''')),
])
extractor = capa.features.extractors.viv.VivisectFeatureExtractor(sample_9324d1a8ae37a36ae560c37448c9705a.vw, sample_9324d1a8ae37a36ae560c37448c9705a.path)
capabilities = capa.main.find_capabilities(rules, extractor)
assert 'install service' in capabilities
assert '.text section' in capabilities
assert '.text section and install service' in capabilities
def test_match_across_scopes(sample_9324d1a8ae37a36ae560c37448c9705a):
rules = capa.rules.RuleSet([
# this rule should match on a basic block (including at least 0x403685)
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: tight loop
scope: basic block
examples:
- 9324d1a8ae37a36ae560c37448c9705a:0x403685
features:
- characteristic(tight loop): true
''')),
# this rule should match on a function (0x403660)
# based on API, as well as prior basic block rule match
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: kill thread loop
scope: function
examples:
- 9324d1a8ae37a36ae560c37448c9705a:0x403660
features:
- and:
- api: kernel32.TerminateThread
- api: kernel32.CloseHandle
- match: tight loop
''')),
# this rule should match on a file feature and a prior function rule match
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: kill thread program
scope: file
examples:
- 9324d1a8ae37a36ae560c37448c9705a
features:
- and:
- section: .text
- match: kill thread loop
''')),
])
extractor = capa.features.extractors.viv.VivisectFeatureExtractor(sample_9324d1a8ae37a36ae560c37448c9705a.vw, sample_9324d1a8ae37a36ae560c37448c9705a.path)
capabilities = capa.main.find_capabilities(rules, extractor)
assert 'tight loop' in capabilities
assert 'kill thread loop' in capabilities
assert 'kill thread program' in capabilities
def test_subscope_bb_rules(sample_9324d1a8ae37a36ae560c37448c9705a):
rules = capa.rules.RuleSet([
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
scope: function
features:
- and:
- basic block:
- characteristic(tight loop): true
'''))
])
# tight loop at 0x403685
extractor = capa.features.extractors.viv.VivisectFeatureExtractor(sample_9324d1a8ae37a36ae560c37448c9705a.vw, sample_9324d1a8ae37a36ae560c37448c9705a.path)
capabilities = capa.main.find_capabilities(rules, extractor)
assert 'test rule' in capabilities
def test_byte_matching(sample_9324d1a8ae37a36ae560c37448c9705a):
rules = capa.rules.RuleSet([
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: byte match test
scope: function
features:
- and:
- bytes: ED 24 9E F4 52 A9 07 47 55 8E E1 AB 30 8E 23 61
'''))
])
extractor = capa.features.extractors.viv.VivisectFeatureExtractor(sample_9324d1a8ae37a36ae560c37448c9705a.vw, sample_9324d1a8ae37a36ae560c37448c9705a.path)
capabilities = capa.main.find_capabilities(rules, extractor)
assert 'byte match test' in capabilities

tests/test_rules.py (new file, 455 lines)
@@ -0,0 +1,455 @@
import textwrap
import pytest
import capa.rules
from capa.engine import Element
from capa.features.insn import Number, Offset
def test_rule_ctor():
r = capa.rules.Rule('test rule', capa.rules.FUNCTION_SCOPE, Element(1), {})
assert r.evaluate(set([0])) == False
assert r.evaluate(set([1])) == True
def test_rule_yaml():
rule = textwrap.dedent('''
rule:
meta:
name: test rule
author: user@domain.com
scope: function
examples:
- foo1234
- bar5678
features:
- and:
- element: 1
- element: 2
''')
r = capa.rules.Rule.from_yaml(rule)
assert r.evaluate(set([0])) == False
assert r.evaluate(set([0, 1])) == False
assert r.evaluate(set([0, 1, 2])) == True
assert r.evaluate(set([0, 1, 2, 3])) == True
def test_rule_yaml_complex():
rule = textwrap.dedent('''
rule:
meta:
name: test rule
features:
- or:
- and:
- element: 1
- element: 2
- or:
- element: 3
- 2 or more:
- element: 4
- element: 5
- element: 6
''')
r = capa.rules.Rule.from_yaml(rule)
assert r.evaluate(set([5, 6, 7, 8])) == True
assert r.evaluate(set([6, 7, 8])) == False
def test_rule_yaml_not():
rule = textwrap.dedent('''
rule:
meta:
name: test rule
features:
- and:
- element: 1
- not:
- element: 2
''')
r = capa.rules.Rule.from_yaml(rule)
assert r.evaluate(set([1])) == True
assert r.evaluate(set([1, 2])) == False
def test_rule_yaml_count():
rule = textwrap.dedent('''
rule:
meta:
name: test rule
features:
- count(element(100)): 1
''')
r = capa.rules.Rule.from_yaml(rule)
assert r.evaluate({Element(100): {}}) == False
assert r.evaluate({Element(100): {1}}) == True
assert r.evaluate({Element(100): {1, 2}}) == False
def test_rule_yaml_count_range():
rule = textwrap.dedent('''
rule:
meta:
name: test rule
features:
- count(element(100)): (1, 2)
''')
r = capa.rules.Rule.from_yaml(rule)
assert r.evaluate({Element(100): {}}) == False
assert r.evaluate({Element(100): {1}}) == True
assert r.evaluate({Element(100): {1, 2}}) == True
assert r.evaluate({Element(100): {1, 2, 3}}) == False
def test_invalid_rule_feature():
with pytest.raises(capa.rules.InvalidRule):
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
features:
- foo: true
'''))
with pytest.raises(capa.rules.InvalidRule):
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
scope: file
features:
- characteristic(nzxor): true
'''))
with pytest.raises(capa.rules.InvalidRule):
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
scope: function
features:
- characteristic(embedded pe): true
'''))
with pytest.raises(capa.rules.InvalidRule):
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
scope: basic block
features:
- characteristic(embedded pe): true
'''))
def test_lib_rules():
rules = capa.rules.RuleSet([
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: a lib rule
lib: true
features:
- api: CreateFileA
''')),
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: a standard rule
lib: false
features:
- api: CreateFileW
''')),
])
assert len(rules.function_rules) == 1
def test_subscope_rules():
rules = capa.rules.RuleSet([
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
scope: file
features:
- and:
- characteristic(embedded pe): true
- function:
- and:
- characteristic(nzxor): true
- characteristic(switch): true
'''))
])
# the file rule scope will have one rule:
# - `test rule`
assert len(rules.file_rules) == 1
# the function rule scope will have one rule:
# - the rule on which `test rule` depends
assert len(rules.function_rules) == 1
def test_duplicate_rules():
with pytest.raises(capa.rules.InvalidRule):
rules = capa.rules.RuleSet([
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: rule-name
features:
- api: CreateFileA
''')),
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: rule-name
features:
- api: CreateFileW
''')),
])
def test_missing_dependency():
with pytest.raises(capa.rules.InvalidRule):
rules = capa.rules.RuleSet([
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: dependent rule
features:
- match: missing rule
''')),
])
def test_invalid_rules():
with pytest.raises(capa.rules.InvalidRule):
r = capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
features:
- characteristic(number(1)): True
'''))
with pytest.raises(capa.rules.InvalidRule):
r = capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
features:
- characteristic(count(element(100))): True
'''))
def test_number_symbol():
rule = textwrap.dedent('''
rule:
meta:
name: test rule
features:
- and:
- number: 1
- number: -1
- number: 2 = symbol name
- number: 3 = symbol name
- number: 4 = symbol name = another name
- number: 0x100 = symbol name
- number: 0x11 = (FLAG_A | FLAG_B)
''')
r = capa.rules.Rule.from_yaml(rule)
children = list(r.statement.get_children())
assert Number(1) in children
assert Number(-1) in children
assert Number(2, 'symbol name') in children
assert Number(3, 'symbol name') in children
assert Number(4, 'symbol name = another name') in children
assert Number(0x100, 'symbol name') in children
def test_count_number_symbol():
rule = textwrap.dedent('''
rule:
meta:
name: test rule
features:
- or:
- count(number(2 = symbol name)): 1
- count(number(0x100 = symbol name)): 2 or more
- count(number(0x11 = (FLAG_A | FLAG_B))): 2 or more
''')
r = capa.rules.Rule.from_yaml(rule)
assert r.evaluate({Number(2): set()}) == False
assert r.evaluate({Number(2): {1}}) == True
assert r.evaluate({Number(2): {1, 2}}) == False
assert r.evaluate({Number(0x100, 'symbol name'): {1}}) == False
assert r.evaluate({Number(0x100, 'symbol name'): {1, 2, 3}}) == True
def test_invalid_number():
with pytest.raises(capa.rules.InvalidRule):
r = capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
features:
- number: "this is a string"
'''))
with pytest.raises(capa.rules.InvalidRule):
r = capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
features:
- number: 2=
'''))
with pytest.raises(capa.rules.InvalidRule):
r = capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
features:
- number: symbol name = 2
'''))
def test_offset_symbol():
rule = textwrap.dedent('''
rule:
meta:
name: test rule
features:
- and:
- offset: 1
# what about negative offsets?
- offset: 2 = symbol name
- offset: 3 = symbol name
- offset: 4 = symbol name = another name
- offset: 0x100 = symbol name
''')
r = capa.rules.Rule.from_yaml(rule)
children = list(r.statement.get_children())
assert Offset(1) in children
assert Offset(2, 'symbol name') in children
assert Offset(3, 'symbol name') in children
assert Offset(4, 'symbol name = another name') in children
assert Offset(0x100, 'symbol name') in children
def test_count_offset_symbol():
rule = textwrap.dedent('''
rule:
meta:
name: test rule
features:
- or:
- count(offset(2 = symbol name)): 1
- count(offset(0x100 = symbol name)): 2 or more
- count(offset(0x11 = (FLAG_A | FLAG_B))): 2 or more
''')
r = capa.rules.Rule.from_yaml(rule)
assert r.evaluate({Offset(2): set()}) == False
assert r.evaluate({Offset(2): {1}}) == True
assert r.evaluate({Offset(2): {1, 2}}) == False
assert r.evaluate({Offset(0x100, 'symbol name'): {1}}) == False
assert r.evaluate({Offset(0x100, 'symbol name'): {1, 2, 3}}) == True
def test_invalid_offset():
with pytest.raises(capa.rules.InvalidRule):
r = capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
features:
- offset: "this is a string"
'''))
with pytest.raises(capa.rules.InvalidRule):
r = capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
features:
- offset: 2=
'''))
with pytest.raises(capa.rules.InvalidRule):
r = capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: test rule
features:
- offset: symbol name = 2
'''))
def test_filter_rules():
rules = capa.rules.RuleSet([
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: rule 1
author: joe
features:
- api: CreateFile
''')),
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: rule 2
features:
- string: joe
''')),
])
rules = rules.filter_rules_by_meta('joe')
assert len(rules) == 1
assert ('rule 1' in rules.rules)
def test_filter_rules_dependencies():
rules = capa.rules.RuleSet([
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: rule 1
features:
- match: rule 2
''')),
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: rule 2
features:
- match: rule 3
''')),
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: rule 3
features:
- api: CreateFile
''')),
])
rules = rules.filter_rules_by_meta('rule 1')
assert len(rules.rules) == 3
assert 'rule 1' in rules.rules
assert 'rule 2' in rules.rules
assert 'rule 3' in rules.rules
def test_filter_rules_missing_dependency():
with pytest.raises(capa.rules.InvalidRule):
capa.rules.RuleSet([
capa.rules.Rule.from_yaml(textwrap.dedent('''
rule:
meta:
name: rule 1
author: joe
features:
- match: rule 2
''')),
])

tests/test_viv_features.py (new file, 297 lines)
import collections
import viv_utils
import capa.features
import capa.features.file
import capa.features.function
import capa.features.basicblock
import capa.features.insn
import capa.features.extractors.viv.file
import capa.features.extractors.viv.function
import capa.features.extractors.viv.basicblock
import capa.features.extractors.viv.insn
from fixtures import *
def extract_file_features(vw, path):
features = set([])
for feature, va in capa.features.extractors.viv.file.extract_features(vw, path):
features.add(feature)
return features
def extract_function_features(f):
features = collections.defaultdict(set)
for bb in f.basic_blocks:
for insn in bb.instructions:
for feature, va in capa.features.extractors.viv.insn.extract_features(f, bb, insn):
features[feature].add(va)
for feature, va in capa.features.extractors.viv.basicblock.extract_features(f, bb):
features[feature].add(va)
for feature, va in capa.features.extractors.viv.function.extract_features(f):
features[feature].add(va)
return features
def extract_basic_block_features(f, bb):
features = set()
for insn in bb.instructions:
for feature, _ in capa.features.extractors.viv.insn.extract_features(f, bb, insn):
features.add(feature)
for feature, _ in capa.features.extractors.viv.basicblock.extract_features(f, bb):
features.add(feature)
return features
def test_api_features(mimikatz):
features = extract_function_features(viv_utils.Function(mimikatz.vw, 0x403BAC))
assert capa.features.insn.API('advapi32.CryptAcquireContextW') in features
assert capa.features.insn.API('advapi32.CryptAcquireContext') in features
assert capa.features.insn.API('advapi32.CryptGenKey') in features
assert capa.features.insn.API('advapi32.CryptImportKey') in features
assert capa.features.insn.API('advapi32.CryptDestroyKey') in features
assert capa.features.insn.API('CryptAcquireContextW') in features
assert capa.features.insn.API('CryptAcquireContext') in features
assert capa.features.insn.API('CryptGenKey') in features
assert capa.features.insn.API('CryptImportKey') in features
assert capa.features.insn.API('CryptDestroyKey') in features
def test_api_features_64_bit(sample_a198216798ca38f280dc413f8c57f2c2):
features = extract_function_features(viv_utils.Function(sample_a198216798ca38f280dc413f8c57f2c2.vw, 0x4011B0))
assert capa.features.insn.API('kernel32.GetStringTypeA') in features
assert capa.features.insn.API('kernel32.GetStringType') in features
assert capa.features.insn.API('GetStringTypeA') in features
assert capa.features.insn.API('GetStringType') in features
# call via thunk in IDA Pro
features = extract_function_features(viv_utils.Function(sample_a198216798ca38f280dc413f8c57f2c2.vw, 0x401CB0))
assert capa.features.insn.API('msvcrt.vfprintf') in features
assert capa.features.insn.API('vfprintf') in features
def test_string_features(mimikatz):
features = extract_function_features(viv_utils.Function(mimikatz.vw, 0x40105D))
assert capa.features.String('SCardControl') in features
assert capa.features.String('SCardTransmit') in features
assert capa.features.String('ACR > ') in features
# other strings not in this function
assert capa.features.String('bcrypt.dll') not in features
def test_byte_features(sample_9324d1a8ae37a36ae560c37448c9705a):
features = extract_function_features(viv_utils.Function(sample_9324d1a8ae37a36ae560c37448c9705a.vw, 0x406F60))
wanted = capa.features.Bytes(b"\xED\x24\x9E\xF4\x52\xA9\x07\x47\x55\x8E\xE1\xAB\x30\x8E\x23\x61")
# use `==` rather than `is` because the result is not `True` but a truthy value.
assert wanted.evaluate(features) == True
def test_byte_features64(sample_lab21_01):
features = extract_function_features(viv_utils.Function(sample_lab21_01.vw, 0x1400010C0))
wanted = capa.features.Bytes(b"\x32\xA2\xDF\x2D\x99\x2B\x00\x00")
# use `==` rather than `is` because the result is not `True` but a truthy value.
assert wanted.evaluate(features) == True
def test_number_features(mimikatz):
features = extract_function_features(viv_utils.Function(mimikatz.vw, 0x40105D))
assert capa.features.insn.Number(0xFF) in features
assert capa.features.insn.Number(0x3136B0) in features
# the following are stack adjustments
assert capa.features.insn.Number(0xC) not in features
assert capa.features.insn.Number(0x10) not in features
def test_offset_features(mimikatz):
features = extract_function_features(viv_utils.Function(mimikatz.vw, 0x40105D))
assert capa.features.insn.Offset(0x0) in features
assert capa.features.insn.Offset(0x4) in features
assert capa.features.insn.Offset(0xC) in features
# the following are stack references
assert capa.features.insn.Offset(0x8) not in features
assert capa.features.insn.Offset(0x10) not in features
def test_nzxor_features(mimikatz):
features = extract_function_features(viv_utils.Function(mimikatz.vw, 0x410DFC))
assert capa.features.Characteristic('nzxor', True) in features # 0x0410F0B
def get_bb_insn(f, va):
'''fetch the BasicBlock and Instruction instances for the given VA in the given function.'''
for bb in f.basic_blocks:
for insn in bb.instructions:
if insn.va == va:
return (bb, insn)
raise KeyError(va)
def test_is_security_cookie(mimikatz):
# not a security cookie check
f = viv_utils.Function(mimikatz.vw, 0x410DFC)
for va in [0x0410F0B]:
bb, insn = get_bb_insn(f, va)
assert capa.features.extractors.viv.insn.is_security_cookie(f, bb, insn) == False
# security cookie initial set and final check
f = viv_utils.Function(mimikatz.vw, 0x46C54A)
for va in [0x46C557, 0x46C63A]:
bb, insn = get_bb_insn(f, va)
assert capa.features.extractors.viv.insn.is_security_cookie(f, bb, insn) == True
def test_mnemonic_features(mimikatz):
features = extract_function_features(viv_utils.Function(mimikatz.vw, 0x40105D))
assert capa.features.insn.Mnemonic('push') in features
assert capa.features.insn.Mnemonic('movzx') in features
assert capa.features.insn.Mnemonic('xor') in features
assert capa.features.insn.Mnemonic('in') not in features
assert capa.features.insn.Mnemonic('out') not in features
def test_peb_access_features(sample_a933a1a402775cfa94b6bee0963f4b46):
features = extract_function_features(viv_utils.Function(sample_a933a1a402775cfa94b6bee0963f4b46.vw, 0xABA6FEC))
assert capa.features.Characteristic('peb access', True) in features
def test_file_section_name_features(mimikatz):
features = extract_file_features(mimikatz.vw, mimikatz.path)
assert capa.features.file.Section('.rsrc') in features
assert capa.features.file.Section('.text') in features
assert capa.features.file.Section('.nope') not in features
def test_tight_loop_features(mimikatz):
f = viv_utils.Function(mimikatz.vw, 0x402EC4)
for bb in f.basic_blocks:
if bb.va != 0x402F8E:
continue
features = extract_basic_block_features(f, bb)
assert capa.features.Characteristic('tight loop', True) in features
assert capa.features.basicblock.BasicBlock() in features
def test_tight_loop_bb_features(mimikatz):
f = viv_utils.Function(mimikatz.vw, 0x402EC4)
for bb in f.basic_blocks:
if bb.va != 0x402F8E:
continue
features = extract_basic_block_features(f, bb)
assert capa.features.Characteristic('tight loop', True) in features
assert capa.features.basicblock.BasicBlock() in features
def test_file_export_name_features(kernel32):
features = extract_file_features(kernel32.vw, kernel32.path)
assert capa.features.file.Export('BaseThreadInitThunk') in features
assert capa.features.file.Export('lstrlenW') in features
def test_file_import_name_features(mimikatz):
features = extract_file_features(mimikatz.vw, mimikatz.path)
assert capa.features.file.Import('advapi32.CryptSetHashParam') in features
assert capa.features.file.Import('CryptSetHashParam') in features
assert capa.features.file.Import('kernel32.IsWow64Process') in features
assert capa.features.file.Import('msvcrt.exit') in features
assert capa.features.file.Import('cabinet.#11') in features
assert capa.features.file.Import('#11') not in features
def test_cross_section_flow_features(sample_a198216798ca38f280dc413f8c57f2c2):
features = extract_function_features(viv_utils.Function(sample_a198216798ca38f280dc413f8c57f2c2.vw, 0x4014D0))
assert capa.features.Characteristic('cross section flow', True) in features
# this function has calls to some imports,
# which should not trigger cross-section flow characteristic
features = extract_function_features(viv_utils.Function(sample_a198216798ca38f280dc413f8c57f2c2.vw, 0x401563))
assert capa.features.Characteristic('cross section flow', True) not in features
def test_segment_access_features(sample_a933a1a402775cfa94b6bee0963f4b46):
features = extract_function_features(viv_utils.Function(sample_a933a1a402775cfa94b6bee0963f4b46.vw, 0xABA6FEC))
assert capa.features.Characteristic('fs access', True) in features
def test_thunk_features(sample_9324d1a8ae37a36ae560c37448c9705a):
features = extract_function_features(viv_utils.Function(sample_9324d1a8ae37a36ae560c37448c9705a.vw, 0x407970))
assert capa.features.insn.API('kernel32.CreateToolhelp32Snapshot') in features
assert capa.features.insn.API('CreateToolhelp32Snapshot') in features
def test_file_embedded_pe(pma_lab_12_04):
features = extract_file_features(pma_lab_12_04.vw, pma_lab_12_04.path)
assert capa.features.Characteristic('embedded pe', True) in features
def test_stackstring_features(mimikatz):
features = extract_function_features(viv_utils.Function(mimikatz.vw, 0x4556E5))
assert capa.features.Characteristic('stack string', True) in features
def test_switch_features(mimikatz):
features = extract_function_features(viv_utils.Function(mimikatz.vw, 0x409411))
assert capa.features.Characteristic('switch', True) in features
features = extract_function_features(viv_utils.Function(mimikatz.vw, 0x409393))
assert capa.features.Characteristic('switch', True) not in features
def test_recursive_call_feature(sample_39c05b15e9834ac93f206bc114d0a00c357c888db567ba8f5345da0529cbed41):
features = extract_function_features(viv_utils.Function(sample_39c05b15e9834ac93f206bc114d0a00c357c888db567ba8f5345da0529cbed41.vw, 0x10003100))
assert capa.features.Characteristic('recursive call', True) in features
features = extract_function_features(viv_utils.Function(sample_39c05b15e9834ac93f206bc114d0a00c357c888db567ba8f5345da0529cbed41.vw, 0x10007B00))
assert capa.features.Characteristic('recursive call', True) not in features
def test_loop_feature(sample_39c05b15e9834ac93f206bc114d0a00c357c888db567ba8f5345da0529cbed41):
features = extract_function_features(viv_utils.Function(sample_39c05b15e9834ac93f206bc114d0a00c357c888db567ba8f5345da0529cbed41.vw, 0x10003D30))
assert capa.features.Characteristic('loop', True) in features
features = extract_function_features(viv_utils.Function(sample_39c05b15e9834ac93f206bc114d0a00c357c888db567ba8f5345da0529cbed41.vw, 0x10007250))
assert capa.features.Characteristic('loop', True) not in features
def test_file_string_features(sample_bfb9b5391a13d0afd787e87ab90f14f5):
features = extract_file_features(sample_bfb9b5391a13d0afd787e87ab90f14f5.vw, sample_bfb9b5391a13d0afd787e87ab90f14f5.path)
assert capa.features.String('WarStop') in features # ASCII, offset 0x40EC
assert capa.features.String('cimage/png') in features # UTF-16 LE, offset 0x350E
def test_function_calls_to(sample_9324d1a8ae37a36ae560c37448c9705a):
features = extract_function_features(viv_utils.Function(sample_9324d1a8ae37a36ae560c37448c9705a.vw, 0x406F60))
assert capa.features.Characteristic('calls to', True) in features
assert len(features[capa.features.Characteristic('calls to', True)]) == 1
def test_function_calls_to64(sample_lab21_01):
features = extract_function_features(viv_utils.Function(sample_lab21_01.vw, 0x1400052D0)) # memcpy
assert capa.features.Characteristic('calls to', True) in features
assert len(features[capa.features.Characteristic('calls to', True)]) == 8
def test_function_calls_from(sample_9324d1a8ae37a36ae560c37448c9705a):
features = extract_function_features(viv_utils.Function(sample_9324d1a8ae37a36ae560c37448c9705a.vw, 0x406F60))
assert capa.features.Characteristic('calls from', True) in features
assert len(features[capa.features.Characteristic('calls from', True)]) == 23
def test_basic_block_count(sample_9324d1a8ae37a36ae560c37448c9705a):
features = extract_function_features(viv_utils.Function(sample_9324d1a8ae37a36ae560c37448c9705a.vw, 0x406F60))
assert len(features[capa.features.basicblock.BasicBlock()]) == 26
def test_indirect_call_features(sample_a933a1a402775cfa94b6bee0963f4b46):
features = extract_function_features(viv_utils.Function(sample_a933a1a402775cfa94b6bee0963f4b46.vw, 0xABA68A0))
assert capa.features.Characteristic('indirect call', True) in features
assert len(features[capa.features.Characteristic('indirect call', True)]) == 3
def test_indirect_calls_resolved(sample_c91887d861d9bd4a5872249b641bc9f9):
features = extract_function_features(viv_utils.Function(sample_c91887d861d9bd4a5872249b641bc9f9.vw, 0x401A77))
assert capa.features.insn.API('kernel32.CreatePipe') in features
assert capa.features.insn.API('kernel32.SetHandleInformation') in features
assert capa.features.insn.API('kernel32.CloseHandle') in features
assert capa.features.insn.API('kernel32.WriteFile') in features
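The tests above count feature occurrences by mapping each feature to the set of addresses where it was observed (the `defaultdict(set)` shape built by `extract_function_features`). A minimal standalone sketch of that evaluation model, using hypothetical names rather than capa's actual API:

```python
from collections import defaultdict

def accumulate(observations):
    """Build a feature -> set-of-addresses map, as the extractor helpers above do."""
    features = defaultdict(set)
    for feature, va in observations:
        features[feature].add(va)
    return features

def count_in_range(features, feature, lo, hi):
    """True if `feature` was observed between lo and hi times, inclusive.

    This mirrors the count(...) range semantics exercised by the rule tests:
    an absent feature has zero observations.
    """
    return lo <= len(features.get(feature, set())) <= hi

fs = accumulate([("api:CreateFileA", 0x401000),
                 ("api:CreateFileA", 0x401020),
                 ("mnemonic:xor", 0x401010)])
assert count_in_range(fs, "api:CreateFileA", 1, 2)
assert not count_in_range(fs, "mnemonic:xor", 2, 2)
```

Because occurrences are stored as a set of virtual addresses, the same feature seen twice at one address counts once, which is why the tests assert on `len(features[...])`.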