Compare commits


7 Commits

Author          SHA1        Message                                                    Date
Moritz          ec1ddb506c  Merge pull request #1893 from mrexodia/dex-support         2024-01-31 12:03:23 +01:00
                            (Initial plumbing to support DEX files)
Duncan Ogilvie  e2f655428e  Differentiate between function-name and import for DEX     2023-12-08 01:12:48 +01:00
Duncan Ogilvie  b5a4d766d9  Add string features for DEX and clean up method handling   2023-12-08 00:15:20 +01:00
Duncan Ogilvie  b77103a646  Mark DEX methods without code as library functions         2023-12-08 00:15:20 +01:00
Duncan Ogilvie  036f147df8  Support function-name, class, namespace for DEX            2023-12-08 00:15:20 +01:00
Duncan Ogilvie  52d20d2f46  Combine DEX feature extraction into a single class         2023-12-08 00:15:19 +01:00
Duncan Ogilvie  e90be5a9bb  Initial plumbing to support DEX files                      2023-12-08 00:15:16 +01:00
64 changed files with 2128 additions and 34510 deletions

.github/flake8.ini (vendored)

@@ -10,8 +10,6 @@ extend-ignore =
F811,
# E501 line too long (prefer black)
E501,
# E701 multiple statements on one line (colon) (prefer black, see https://github.com/psf/black/issues/4173)
E701,
# B010 Do not call setattr with a constant attribute value
B010,
# G200 Logging statement uses exception in arguments


@@ -17,6 +17,7 @@ a = Analysis(
# when invoking pyinstaller from the project root,
# this gets invoked from the directory of the spec file,
# i.e. ./.github/pyinstaller
("../../assets", "assets"),
("../../rules", "rules"),
("../../sigs", "sigs"),
("../../cache", "cache"),


@@ -57,15 +57,15 @@ jobs:
- name: Build standalone executable
run: pyinstaller --log-level DEBUG .github/pyinstaller/pyinstaller.spec
- name: Does it run (PE)?
run: dist/capa -d "tests/data/Practical Malware Analysis Lab 01-01.dll_"
run: dist/capa "tests/data/Practical Malware Analysis Lab 01-01.dll_"
- name: Does it run (Shellcode)?
run: dist/capa -d "tests/data/499c2a85f6e8142c3f48d4251c9c7cd6.raw32"
run: dist/capa "tests/data/499c2a85f6e8142c3f48d4251c9c7cd6.raw32"
- name: Does it run (ELF)?
run: dist/capa -d "tests/data/7351f8a40c5450557b24622417fc478d.elf_"
run: dist/capa "tests/data/7351f8a40c5450557b24622417fc478d.elf_"
- name: Does it run (CAPE)?
run: |
7z e "tests/data/dynamic/cape/v2.2/d46900384c78863420fb3e297d0a2f743cd2b6b3f7f82bf64059a168e07aceb7.json.gz"
dist/capa -d "d46900384c78863420fb3e297d0a2f743cd2b6b3f7f82bf64059a168e07aceb7.json"
dist/capa "d46900384c78863420fb3e297d0a2f743cd2b6b3f7f82bf64059a168e07aceb7.json"
- uses: actions/upload-artifact@0b7f8abb1508181956e8e162db84b466c27e18ce # v3.1.2
with:
name: ${{ matrix.asset_name }}


@@ -3,35 +3,7 @@
## master (unreleased)
### New Features
### Breaking Changes
### New Rules (0)
### Bug Fixes
### capa explorer IDA Pro plugin
### Development
### Raw diffs
- [capa v7.0.0...master](https://github.com/mandiant/capa/compare/v7.0.0...master)
- [capa-rules v7.0.0...master](https://github.com/mandiant/capa-rules/compare/v7.0.0...master)
## v7.0.0
This is the v7.0.0 release of capa which was mainly worked on during the Google Summer of Code (GSoC) 2023. A huge
shoutout to our GSoC contributors @colton-gabertan and @yelhamer for their amazing work.
Also, a big thanks to the other contributors: @aaronatp, @Aayush-Goel-04, @bkojusner, @doomedraven, @ruppde, @larchchen, @JCoonradt, and @xusheng6.
### New Features
- add Ghidra backend #1770 #1767 @colton-gabertan @mike-hunhoff
- add Ghidra UI integration #1734 @colton-gabertan @mike-hunhoff
- add dynamic analysis via CAPE sandbox reports #48 #1535 @yelhamer
- add call scope #771 @yelhamer
- add thread scope #1517 @yelhamer
@@ -41,7 +13,6 @@ Also, a big thanks to the other contributors: @aaronatp, @Aayush-Goel-04, @bkoju
- binja: add support for forwarded exports #1646 @xusheng6
- binja: add support for symtab names #1504 @xusheng6
- add com class/interface features #322 @Aayush-goel-04
- dotnet: emit enclosing class information for nested classes #1780 #1913 @bkojusner @mike-hunhoff
### Breaking Changes
@@ -50,11 +21,8 @@ Also, a big thanks to the other contributors: @aaronatp, @Aayush-Goel-04, @bkoju
- protobuf: deprecate `Metadata.analysis` in favor of `Metadata.analysis2` that is dynamic analysis aware @williballenthin
- update freeze format to v3, adding support for dynamic analysis @williballenthin
- extractor: ignore DLL name for api features #1815 @mr-tz
- main: introduce wrapping routines within main for working with CLI args #1813 @williballenthin
- move functions from `capa.main` to new `capa.loader` namespace #1821 @williballenthin
- proto: add `package` declaration #1960 @larchchen
### New Rules (41)
### New Rules (34)
- nursery/get-ntoskrnl-base-address @mr-tz
- host-interaction/network/connectivity/set-tcp-connection-state @johnk3r
@@ -89,53 +57,21 @@ Also, a big thanks to the other contributors: @aaronatp, @Aayush-Goel-04, @bkoju
- data-manipulation/compression/create-cabinet-on-windows michael.hunhoff@mandiant.com jakub.jozwiak@mandiant.com
- data-manipulation/compression/extract-cabinet-on-windows jakub.jozwiak@mandiant.com
- lib/create-file-decompression-interface-context-on-windows jakub.jozwiak@mandiant.com
- nursery/enumerate-files-in-dotnet moritz.raabe@mandiant.com anushka.virgaonkar@mandiant.com
- nursery/get-mac-address-in-dotnet moritz.raabe@mandiant.com michael.hunhoff@mandiant.com echernofsky@google.com
- nursery/get-current-process-command-line william.ballenthin@mandiant.com
- nursery/get-current-process-file-path william.ballenthin@mandiant.com
- nursery/hook-routines-via-dlsym-rtld_next william.ballenthin@mandiant.com
- nursery/linked-against-hp-socket still@teamt5.org
- host-interaction/process/inject/process-ghostly-hollowing sara.rincon@mandiant.com
### Bug Fixes
- ghidra: fix `ints_to_bytes` performance #1761 @mike-hunhoff
- binja: improve function call site detection @xusheng6
- binja: use `binaryninja.load` to open files @xusheng6
- binja: bump binja version to 3.5 #1789 @xusheng6
- elf: better detect ELF OS via GCC .ident directives #1928 @williballenthin
- elf: better detect ELF OS via Android dependencies #1947 @williballenthin
- fix setuptools package discovery #1886 @gmacon @mr-tz
- remove unnecessary scripts/vivisect-py2-vs-py3.sh file #1949 @JCoonradt
### capa explorer IDA Pro plugin
- various integration updates and minor bug fixes
### Development
- update ATT&CK/MBC data for linting #1932 @mr-tz
#### Developer Notes
With this new release, many classes and concepts have been split up into static (mostly identical to the
prior implementations) and dynamic ones. For example, the legacy FeatureExtractor class has been renamed to
StaticFeatureExtractor and the DynamicFeatureExtractor has been added.
Starting from version 7.0, we have moved the component responsible for feature extraction from main to a new capabilities module. Users wishing to utilize capa's feature extraction should use that module instead of importing the relevant logic from the main file.
For sandbox-based feature extractors, we are using Pydantic models. Contributions of more models for other sandboxes
are very welcome!
With this release we've reorganized the logic found in `main()` to localize related logic and to ease readability, changes, and integrations. The new "main routines" are expected to be used only within main functions, either capa's own main or related scripts; these functions should not be invoked from library code.
Beyond moving code around, we've refined the handling of the input file/format/backend: the logic for picking the format and backend is now more consistent, and we've documented that the input file is not necessarily the sample itself (CAPE reports, freeze files, and similar inputs are derived from the sample rather than being the sample).
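As a rough illustration of the reorganized layout described in these notes, a minimal sketch follows; the module and function names here (`capa.rules.get_rules`, `capa.loader.get_extractor`, `capa.capabilities.common.find_capabilities`) are best-effort assumptions based on the notes above, not a documented interface:
```python
# Sketch only: assumes capa v7's capa.rules / capa.loader / capa.capabilities layout.
from pathlib import Path

import capa.rules
import capa.loader
import capa.capabilities.common

rules = capa.rules.get_rules([Path("rules/")])

# loader helpers replace the plumbing formerly housed in capa.main
extractor = capa.loader.get_extractor(
    Path("suspicious.exe_"),   # input file (not necessarily the sample itself)
    "auto",                    # input format
    "auto",                    # OS
    capa.loader.BACKEND_VIV,   # analysis backend (assumed constant name)
    sigpaths=[],
)

# feature extraction and matching now live in the capabilities module
capabilities, counts = capa.capabilities.common.find_capabilities(rules, extractor)
```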
### Raw diffs
- [capa v6.1.0...v7.0.0](https://github.com/mandiant/capa/compare/v6.1.0...v7.0.0)
- [capa-rules v6.1.0...v7.0.0](https://github.com/mandiant/capa-rules/compare/v6.1.0...v7.0.0)
- [capa v6.1.0...master](https://github.com/mandiant/capa/compare/v6.1.0...master)
- [capa-rules v6.1.0...master](https://github.com/mandiant/capa-rules/compare/v6.1.0...master)
## v6.1.0
@@ -1690,4 +1626,4 @@ Download a standalone binary below and checkout the readme [here on GitHub](http
### Raw diffs
- [capa v1.0.0...v1.1.0](https://github.com/mandiant/capa/compare/v1.0.0...v1.1.0)
- [capa-rules v1.0.0...v1.1.0](https://github.com/mandiant/capa-rules/compare/v1.0.0...v1.1.0)
- [capa-rules v1.0.0...v1.1.0](https://github.com/mandiant/capa-rules/compare/v1.0.0...v1.1.0)


@@ -2,7 +2,7 @@
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/flare-capa)](https://pypi.org/project/flare-capa)
[![Last release](https://img.shields.io/github/v/release/mandiant/capa)](https://github.com/mandiant/capa/releases)
[![Number of rules](https://img.shields.io/badge/rules-866-blue.svg)](https://github.com/mandiant/capa-rules)
[![Number of rules](https://img.shields.io/badge/rules-859-blue.svg)](https://github.com/mandiant/capa-rules)
[![CI status](https://github.com/mandiant/capa/workflows/CI/badge.svg)](https://github.com/mandiant/capa/actions?query=workflow%3ACI+event%3Apush+branch%3Amaster)
[![Downloads](https://img.shields.io/github/downloads/mandiant/capa/total)](https://github.com/mandiant/capa/releases)
[![License](https://img.shields.io/badge/license-Apache--2.0-green.svg)](LICENSE.txt)

BIN assets/classes.json.gz (new file; binary not shown)

BIN assets/interfaces.json.gz (new file; binary not shown)


@@ -10,7 +10,8 @@ import abc
class Address(abc.ABC):
@abc.abstractmethod
def __eq__(self, other): ...
def __eq__(self, other):
...
@abc.abstractmethod
def __lt__(self, other):
@@ -176,6 +177,34 @@ class DNTokenOffsetAddress(Address):
return self.token + self.offset
class DexMethodAddress(int, Address):
def __new__(cls, offset: int):
return int.__new__(cls, offset)
def __repr__(self):
return f"DexMethodAddress(offset={hex(self)})"
def __str__(self) -> str:
return repr(self)
def __hash__(self):
return int.__hash__(self)
class DexClassAddress(int, Address):
def __new__(cls, offset: int):
return int.__new__(cls, offset)
def __repr__(self):
return f"DexClassAddress(offset={hex(self)})"
def __str__(self) -> str:
return repr(self)
def __hash__(self):
return int.__hash__(self)
class _NoAddress(Address):
def __eq__(self, other):
return True
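A usage sketch for the int-backed DEX address classes above; because they subclass `int`, they equal, hash, and sort like plain file offsets:
```python
from capa.features.address import DexClassAddress, DexMethodAddress

addr = DexMethodAddress(0x1234)
assert addr == 0x1234                               # plain int equality
assert hash(addr) == hash(0x1234)                   # explicit int.__hash__
assert str(addr) == "DexMethodAddress(offset=0x1234)"
assert min(DexClassAddress(0x20), DexClassAddress(0x10)) == 0x10  # sortable
```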


@@ -1,36 +0,0 @@
# Copyright (C) 2023 Mandiant, Inc. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at: [package root]/LICENSE.txt
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.
from enum import Enum
from typing import Dict, List
from capa.helpers import assert_never
class ComType(Enum):
CLASS = "class"
INTERFACE = "interface"
COM_PREFIXES = {
ComType.CLASS: "CLSID_",
ComType.INTERFACE: "IID_",
}
def load_com_database(com_type: ComType) -> Dict[str, List[str]]:
# lazy load these python files since they are so large.
# that is, don't load them unless a COM feature is being handled.
import capa.features.com.classes
import capa.features.com.interfaces
if com_type == ComType.CLASS:
return capa.features.com.classes.COM_CLASSES
elif com_type == ComType.INTERFACE:
return capa.features.com.interfaces.COM_INTERFACES
else:
assert_never(com_type)
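A small usage sketch for the lazy COM-database loader above, assuming it is importable as `capa.features.com` (that path is implied by the `capa.features.com.classes` import); per the signature, each database maps a COM name to a list of GUID strings:
```python
from capa.features.com import ComType, load_com_database  # assumed import path

com_classes = load_com_database(ComType.CLASS)        # name -> list of CLSID strings
com_interfaces = load_com_database(ComType.INTERFACE)  # name -> list of IID strings
print(len(com_classes), len(com_interfaces))
```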

File diff suppressed because it is too large.

File diff suppressed because it is too large.


@@ -409,7 +409,9 @@ ARCH_I386 = "i386"
ARCH_AMD64 = "amd64"
# dotnet
ARCH_ANY = "any"
VALID_ARCH = (ARCH_I386, ARCH_AMD64, ARCH_ANY)
# dex
ARCH_DALVIK = "dalvik"
VALID_ARCH = (ARCH_I386, ARCH_AMD64, ARCH_ANY, ARCH_DALVIK)
class Arch(Feature):
@@ -421,10 +423,11 @@ class Arch(Feature):
OS_WINDOWS = "windows"
OS_LINUX = "linux"
OS_MACOS = "macos"
OS_ANDROID = "android"
# dotnet
OS_ANY = "any"
VALID_OS = {os.value for os in capa.features.extractors.elf.OS}
VALID_OS.update({OS_WINDOWS, OS_LINUX, OS_MACOS, OS_ANY})
VALID_OS.update({OS_WINDOWS, OS_LINUX, OS_MACOS, OS_ANY, OS_ANDROID})
# internal only, not to be used in rules
OS_AUTO = "auto"
@@ -452,28 +455,26 @@ class OS(Feature):
FORMAT_PE = "pe"
FORMAT_ELF = "elf"
FORMAT_DOTNET = "dotnet"
VALID_FORMAT = (FORMAT_PE, FORMAT_ELF, FORMAT_DOTNET)
FORMAT_DEX = "dex"
VALID_FORMAT = (FORMAT_PE, FORMAT_ELF, FORMAT_DOTNET, FORMAT_DEX)
# internal only, not to be used in rules
FORMAT_AUTO = "auto"
FORMAT_SC32 = "sc32"
FORMAT_SC64 = "sc64"
FORMAT_CAPE = "cape"
FORMAT_FREEZE = "freeze"
FORMAT_RESULT = "result"
STATIC_FORMATS = {
FORMAT_SC32,
FORMAT_SC64,
FORMAT_PE,
FORMAT_ELF,
FORMAT_DOTNET,
FORMAT_FREEZE,
FORMAT_RESULT,
FORMAT_DEX,
}
DYNAMIC_FORMATS = {
FORMAT_CAPE,
FORMAT_FREEZE,
FORMAT_RESULT,
}
FORMAT_FREEZE = "freeze"
FORMAT_RESULT = "result"
FORMAT_UNKNOWN = "unknown"


@@ -128,14 +128,6 @@ class CapeExtractor(DynamicFeatureExtractor):
if cr.info.version not in TESTED_VERSIONS:
logger.warning("CAPE version '%s' not tested/supported yet", cr.info.version)
# TODO(mr-tz): support more file types
# https://github.com/mandiant/capa/issues/1933
if "PE" not in cr.target.file.type:
logger.error(
"capa currently only supports PE target files, this target file's type is: '%s'.\nPlease report this at: https://github.com/mandiant/capa/issues/1933",
cr.target.file.type,
)
# observed in 2.4-CAPE reports from capesandbox.com
if cr.static is None and cr.target.file.pe is not None:
cr.static = Static()


@@ -24,8 +24,11 @@ from capa.features.common import (
OS_AUTO,
ARCH_ANY,
FORMAT_PE,
FORMAT_DEX,
FORMAT_ELF,
OS_ANDROID,
OS_WINDOWS,
ARCH_DALVIK,
FORMAT_FREEZE,
FORMAT_RESULT,
Arch,
@@ -41,11 +44,12 @@ logger = logging.getLogger(__name__)
# match strings for formats
MATCH_PE = b"MZ"
MATCH_ELF = b"\x7fELF"
MATCH_DEX = b"dex\n"
MATCH_RESULT = b'{"meta":'
MATCH_JSON_OBJECT = b'{"'
def extract_file_strings(buf: bytes, **kwargs) -> Iterator[Tuple[String, Address]]:
def extract_file_strings(buf, **kwargs) -> Iterator[Tuple[String, Address]]:
"""
extract ASCII and UTF-16 LE strings from file
"""
@@ -56,11 +60,13 @@ def extract_file_strings(buf: bytes, **kwargs) -> Iterator[Tuple[String, Address
yield String(s.s), FileOffsetAddress(s.offset)
def extract_format(buf: bytes) -> Iterator[Tuple[Feature, Address]]:
def extract_format(buf) -> Iterator[Tuple[Feature, Address]]:
if buf.startswith(MATCH_PE):
yield Format(FORMAT_PE), NO_ADDRESS
elif buf.startswith(MATCH_ELF):
yield Format(FORMAT_ELF), NO_ADDRESS
elif len(buf) > 8 and buf.startswith(MATCH_DEX) and buf[7] == 0x00:
yield Format(FORMAT_DEX), NO_ADDRESS
elif is_freeze(buf):
yield Format(FORMAT_FREEZE), NO_ADDRESS
elif buf.startswith(MATCH_RESULT):
@@ -96,6 +102,9 @@ def extract_arch(buf) -> Iterator[Tuple[Feature, Address]]:
yield Arch(arch), NO_ADDRESS
elif len(buf) > 8 and buf.startswith(MATCH_DEX) and buf[7] == 0x00:
yield Arch(ARCH_DALVIK), NO_ADDRESS
else:
# we likely end up here:
# 1. handling shellcode, or
@@ -129,6 +138,9 @@ def extract_os(buf, os=OS_AUTO) -> Iterator[Tuple[Feature, Address]]:
yield OS(os), NO_ADDRESS
elif len(buf) > 8 and buf.startswith(MATCH_DEX) and buf[7] == 0x00:
yield OS(OS_ANDROID), NO_ADDRESS
else:
# we likely end up here:
# 1. handling shellcode, or
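The sniffing logic above keys on the DEX magic: ASCII `dex\n`, a three-digit version string, and a NUL terminator at byte 7. A standalone sketch of the same check:
```python
MATCH_DEX = b"dex\n"

def looks_like_dex(buf: bytes) -> bool:
    # the full magic is e.g. b"dex\n035\x00": "dex\n", 3-byte version, NUL
    return len(buf) > 8 and buf.startswith(MATCH_DEX) and buf[7] == 0x00

assert looks_like_dex(b"dex\n035\x00" + b"\x00" * 0x70)
assert not looks_like_dex(b"\x7fELF" + b"\x00" * 0x70)
```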


@@ -0,0 +1,421 @@
# Copyright (C) 2023 Mandiant, Inc. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at: [package root]/LICENSE.txt
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.
import struct
import logging
from typing import Set, Dict, List, Tuple, Iterator, Optional, TypedDict
from pathlib import Path
from dataclasses import dataclass
import dexparser.disassembler as disassembler
from dexparser import DEXParser, uleb128_value
from capa.features.file import Import, FunctionName
from capa.features.common import (
OS,
FORMAT_DEX,
OS_ANDROID,
ARCH_DALVIK,
Arch,
Class,
Format,
String,
Feature,
Namespace,
)
from capa.features.address import NO_ADDRESS, Address, DexClassAddress, DexMethodAddress, FileOffsetAddress
from capa.features.extractors.base_extractor import (
BBHandle,
InsnHandle,
SampleHashes,
FunctionHandle,
StaticFeatureExtractor,
)
logger = logging.getLogger(__name__)
# Reference: https://source.android.com/docs/core/runtime/dex-format
class DexProtoId(TypedDict):
shorty_idx: int
return_type_idx: int
param_off: int
class DexMethodId(TypedDict):
class_idx: int
proto_idx: int
name_idx: int
@dataclass
class DexAnalyzedMethod:
class_type: str
name: str
shorty_descriptor: str
return_type: str
parameters: List[str]
id_offset: int = 0
code_offset: int = 0
access_flags: Optional[int] = None
@property
def address(self):
# NOTE: some methods do not have code; in that case we use the method_id offset
if self.has_code:
return self.code_offset
else:
return self.id_offset
@property
def has_code(self):
# NOTE: code_offset is zero if the method is abstract/native or not defined in a class
return self.code_offset != 0
@property
def has_definition(self):
# NOTE: access_flags is only known if the method is defined in a class
return self.access_flags is not None
@property
def qualified_name(self):
return f"{self.class_type}::{self.name}"
class DexFieldId(TypedDict):
class_idx: int
type_idx: int
name_idx: int
class DexClassDef(TypedDict):
class_idx: int
access_flags: int
superclass_idx: int
interfaces_off: int
source_file_idx: int
annotations_off: int
class_data_off: int
static_values_off: int
class DexFieldDef(TypedDict):
diff: int
access_flags: int
class DexMethodDef(TypedDict):
diff: int
access_flags: int
code_off: int
class DexClassData(TypedDict):
static_fields: List[DexFieldDef]
instance_fields: List[DexFieldDef]
direct_methods: List[DexMethodDef]
virtual_methods: List[DexMethodDef]
@dataclass
class DexAnalyzedClass:
offset: int
class_type: str
superclass_type: str
interfaces: List[str]
source_file: str
data: Optional[DexClassData]
class DexAnnotation(TypedDict):
visibility: int
type_idx_diff: int
size_diff: int
name_idx_diff: int
value_type: int
encoded_value: int
class DexAnalysis:
def get_strings(self):
# NOTE: Copied from dexparser, upstream later
strings: List[Tuple[int, bytes]] = []
string_ids_off = self.dex.header_data["string_ids_off"]
for i in range(self.dex.header_data["string_ids_size"]):
offset = struct.unpack("<L", self.dex.data[string_ids_off + (i * 4) : string_ids_off + (i * 4) + 4])[0]
c_size, size_offset = uleb128_value(self.dex.data, offset)
c_char = self.dex.data[offset + size_offset : offset + size_offset + c_size]
strings.append((offset, c_char))
return strings
def __init__(self, dex: DEXParser):
self.dex = dex
self.strings = self.get_strings()
self.strings_utf8: List[str] = []
for _, data in self.strings:
# NOTE: this is technically incorrect: DEX strings are MUTF-8, not UTF-8
# Reference: https://source.android.com/devices/tech/dalvik/dex-format#mutf-8
self.strings_utf8.append(data.decode("utf-8", errors="backslashreplace"))
self.type_ids: List[int] = dex.get_typeids()
self.method_ids: List[DexMethodId] = dex.get_methods()
self.proto_ids: List[DexProtoId] = dex.get_protoids()
self.field_ids: List[DexFieldId] = dex.get_fieldids()
self.class_defs: List[DexClassDef] = dex.get_classdef_data()
self._is_analyzing = True
self.used_classes: Set[str] = set()
self.classes = self._analyze_classes()
self.methods = self._analyze_methods()
self.methods_by_address: Dict[int, DexAnalyzedMethod] = {m.address: m for m in self.methods}
self.namespaces: Set[str] = set()
for class_type in self.used_classes:
idx = class_type.rfind(".")
if idx != -1:
self.namespaces.add(class_type[:idx])
for class_type in self.classes:
self.used_classes.remove(class_type)
# Only available after code analysis
self._is_analyzing = False
def analyze_code(self):
# Loop over the classes and analyze them
# self.classes: List[DexClass] = self.dex.get_class_data(offset=-1)
# self.annotations: List[DexAnnotation] = dex.get_annotations(offset=-1)
# self.static_values: List[int] = dex.get_static_values(offset=-1)
pass
def get_string(self, index: int) -> str:
return self.strings_utf8[index]
def _decode_descriptor(self, descriptor: str) -> str:
first = descriptor[0]
if first == "L":
pretty = descriptor[1:-1].replace("/", ".")
if self._is_analyzing:
self.used_classes.add(pretty)
elif first == "[":
pretty = self._decode_descriptor(descriptor[1:]) + "[]"
else:
pretty = disassembler.type_descriptor[first]
return pretty
def get_pretty_type(self, index: int) -> str:
if index == 0xFFFFFFFF:
return "<NO_INDEX>"
descriptor = self.get_string(self.type_ids[index])
return self._decode_descriptor(descriptor)
def _analyze_classes(self):
classes: Dict[str, DexAnalyzedClass] = {}
offset = self.dex.header_data["class_defs_off"]
for index, clazz in enumerate(self.class_defs):
class_type = self.get_pretty_type(clazz["class_idx"])
# Superclass
superclass_idx = clazz["superclass_idx"]
if superclass_idx != 0xFFFFFFFF:
superclass_type = self.get_pretty_type(superclass_idx)
else:
superclass_type = ""
# Interfaces
interfaces = []
interfaces_offset = clazz["interfaces_off"]
if interfaces_offset != 0:
size = struct.unpack("<L", self.dex.data[interfaces_offset : interfaces_offset + 4])[0]
for i in range(size):
type_idx = struct.unpack(
"<H", self.dex.data[interfaces_offset + 4 + i * 2 : interfaces_offset + 6 + i * 2]
)[0]
interface_type = self.get_pretty_type(type_idx)
interfaces.append(interface_type)
# Source file
source_file_idx = clazz["source_file_idx"]
if source_file_idx != 0xFFFFFFFF:
source_file = self.get_string(source_file_idx)
else:
source_file = ""
# Data
data_offset = clazz["class_data_off"]
if data_offset != 0:
data = self.dex.get_class_data(data_offset)
else:
data = None
classes[class_type] = DexAnalyzedClass(
offset=offset + index * 32,
class_type=class_type,
superclass_type=superclass_type,
interfaces=interfaces,
source_file=source_file,
data=data,
)
return classes
def _analyze_methods(self):
methods: List[DexAnalyzedMethod] = []
for method_id in self.method_ids:
proto = self.proto_ids[method_id["proto_idx"]]
parameters = []
param_off = proto["param_off"]
if param_off != 0:
size = struct.unpack("<L", self.dex.data[param_off : param_off + 4])[0]
for i in range(size):
type_idx = struct.unpack("<H", self.dex.data[param_off + 4 + i * 2 : param_off + 6 + i * 2])[0]
param_type = self.get_pretty_type(type_idx)
parameters.append(param_type)
methods.append(
DexAnalyzedMethod(
class_type=self.get_pretty_type(method_id["class_idx"]),
name=self.get_string(method_id["name_idx"]),
shorty_descriptor=self.get_string(proto["shorty_idx"]),
return_type=self.get_pretty_type(proto["return_type_idx"]),
parameters=parameters,
)
)
# Fill in the missing method data
for clazz in self.classes.values():
if clazz.data is None:
continue
for method_def in clazz.data["direct_methods"]:
diff = method_def["diff"]
methods[diff].access_flags = method_def["access_flags"]
methods[diff].code_offset = method_def["code_off"]
for method_def in clazz.data["virtual_methods"]:
diff = method_def["diff"]
methods[diff].access_flags = method_def["access_flags"]
methods[diff].code_offset = method_def["code_off"]
# Fill in the method_id table offsets (used as fallback addresses for methods without code)
offset = self.dex.header_data["method_ids_off"]
for index, method in enumerate(methods):
method.id_offset = offset + index * 8
return methods
def extract_file_features(self) -> Iterator[Tuple[Feature, Address]]:
yield Format(FORMAT_DEX), NO_ADDRESS
for i in range(len(self.strings)):
yield String(self.strings_utf8[i]), FileOffsetAddress(self.strings[i][0])
for method in self.methods:
if method.has_definition:
yield FunctionName(method.qualified_name), DexMethodAddress(method.address)
else:
yield Import(method.qualified_name), DexMethodAddress(method.address)
for namespace in self.namespaces:
yield Namespace(namespace), NO_ADDRESS
for clazz in self.classes.values():
yield Class(clazz.class_type), DexClassAddress(clazz.offset)
for class_type in self.used_classes:
yield Class(class_type), NO_ADDRESS
class DexFeatureExtractor(StaticFeatureExtractor):
def __init__(self, path: Path, *, code_analysis: bool):
super().__init__(hashes=SampleHashes.from_bytes(path.read_bytes()))
self.path: Path = path
self.code_analysis = code_analysis
self.dex = DEXParser(filedir=str(path))
self.analysis = DexAnalysis(self.dex)
# Perform more expensive code analysis only when requested
if self.code_analysis:
self.analysis.analyze_code()
def todo(self):
import inspect
message = "[DexparserFeatureExtractor:TODO] " + inspect.stack()[1].function
logger.debug(message)
def get_base_address(self):
return NO_ADDRESS
def extract_global_features(self) -> Iterator[Tuple[Feature, Address]]:
# These are hardcoded global features
yield Format(FORMAT_DEX), NO_ADDRESS
yield OS(OS_ANDROID), NO_ADDRESS
yield Arch(ARCH_DALVIK), NO_ADDRESS
def extract_file_features(self) -> Iterator[Tuple[Feature, Address]]:
yield from self.analysis.extract_file_features()
def is_library_function(self, addr: Address) -> bool:
assert isinstance(addr, DexMethodAddress)
method = self.analysis.methods_by_address[addr]
# exclude androidx/kotlin stuff?
return not method.has_definition
def get_function_name(self, addr: Address) -> str:
assert isinstance(addr, DexMethodAddress)
method = self.analysis.methods_by_address[addr]
return method.qualified_name
def get_functions(self) -> Iterator[FunctionHandle]:
if not self.code_analysis:
raise Exception("code analysis is disabled")
for method in self.analysis.methods:
yield FunctionHandle(DexMethodAddress(method.address), method)
def extract_function_features(self, f: FunctionHandle) -> Iterator[Tuple[Feature, Address]]:
if not self.code_analysis:
raise Exception("code analysis is disabled")
method: DexAnalyzedMethod = f.inner
if method.has_code:
return self.todo()
yield
def get_basic_blocks(self, f: FunctionHandle) -> Iterator[BBHandle]:
if not self.code_analysis:
raise Exception("code analysis is disabled")
method: DexAnalyzedMethod = f.inner
if method.has_code:
return self.todo()
yield
def extract_basic_block_features(self, f: FunctionHandle, bb: BBHandle) -> Iterator[Tuple[Feature, Address]]:
if not self.code_analysis:
raise Exception("code analysis is disabled")
return self.todo()
yield
def get_instructions(self, f: FunctionHandle, bb: BBHandle) -> Iterator[InsnHandle]:
if not self.code_analysis:
raise Exception("code analysis is disabled")
return self.todo()
yield
def extract_insn_features(
self, f: FunctionHandle, bb: BBHandle, insn: InsnHandle
) -> Iterator[Tuple[Feature, Address]]:
if not self.code_analysis:
raise Exception("code analysis is disabled")
return self.todo()
yield
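A hedged usage sketch for the extractor defined above, assuming it lives at `capa.features.extractors.dex` (the module path is not shown in this diff) and a `classes.dex` sample on disk; code analysis is still stubbed out, so only global and file features are meaningful:
```python
from pathlib import Path

from capa.features.extractors.dex import DexFeatureExtractor  # assumed module path

extractor = DexFeatureExtractor(Path("classes.dex"), code_analysis=False)
for feature, addr in extractor.extract_global_features():
    print(feature, addr)  # format(dex), os(android), arch(dalvik)
for feature, addr in extractor.extract_file_features():
    print(feature, addr)  # strings, function names, imports, namespaces, classes
```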


@@ -131,14 +131,10 @@ def get_dotnet_managed_imports(pe: dnfile.dnPE) -> Iterator[DnType]:
# remove get_/set_ from MemberRef name
member_ref_name = member_ref_name[4:]
typerefnamespace, typerefname = resolve_nested_typeref_name(
member_ref.Class.row_index, member_ref.Class.row, pe
)
yield DnType(
token,
typerefname,
namespace=typerefnamespace,
member_ref.Class.row.TypeName,
namespace=member_ref.Class.row.TypeNamespace,
member=member_ref_name,
access=access,
)
@@ -192,8 +188,6 @@ def get_dotnet_managed_methods(pe: dnfile.dnPE) -> Iterator[DnType]:
TypeNamespace (index into String heap)
MethodList (index into MethodDef table; it marks the first of a contiguous run of Methods owned by this Type)
"""
nested_class_table = get_dotnet_nested_class_table_index(pe)
accessor_map: Dict[int, str] = {}
for methoddef, methoddef_access in get_dotnet_methoddef_property_accessors(pe):
accessor_map[methoddef] = methoddef_access
@@ -217,9 +211,7 @@ def get_dotnet_managed_methods(pe: dnfile.dnPE) -> Iterator[DnType]:
# remove get_/set_
method_name = method_name[4:]
typedefnamespace, typedefname = resolve_nested_typedef_name(nested_class_table, rid, typedef, pe)
yield DnType(token, typedefname, namespace=typedefnamespace, member=method_name, access=access)
yield DnType(token, typedef.TypeName, namespace=typedef.TypeNamespace, member=method_name, access=access)
def get_dotnet_fields(pe: dnfile.dnPE) -> Iterator[DnType]:
@@ -233,8 +225,6 @@ def get_dotnet_fields(pe: dnfile.dnPE) -> Iterator[DnType]:
TypeNamespace (index into String heap)
FieldList (index into Field table; it marks the first of a contiguous run of Fields owned by this Type)
"""
nested_class_table = get_dotnet_nested_class_table_index(pe)
for rid, typedef in iter_dotnet_table(pe, dnfile.mdtable.TypeDef.number):
assert isinstance(typedef, dnfile.mdtable.TypeDefRow)
@@ -245,11 +235,8 @@ def get_dotnet_fields(pe: dnfile.dnPE) -> Iterator[DnType]:
if field.row is None:
logger.debug("TypeDef[0x%X] FieldList[0x%X] row is None", rid, idx)
continue
typedefnamespace, typedefname = resolve_nested_typedef_name(nested_class_table, rid, typedef, pe)
token: int = calculate_dotnet_token_value(field.table.number, field.row_index)
yield DnType(token, typedefname, namespace=typedefnamespace, member=field.row.Name)
yield DnType(token, typedef.TypeName, namespace=typedef.TypeNamespace, member=field.row.Name)
def get_dotnet_managed_method_bodies(pe: dnfile.dnPE) -> Iterator[Tuple[int, CilMethodBody]]:
@@ -313,119 +300,19 @@ def get_dotnet_unmanaged_imports(pe: dnfile.dnPE) -> Iterator[DnUnmanagedMethod]
yield DnUnmanagedMethod(token, module, method)
def get_dotnet_table_row(pe: dnfile.dnPE, table_index: int, row_index: int) -> Optional[dnfile.base.MDTableRow]:
assert pe.net is not None
assert pe.net.mdtables is not None
if row_index - 1 <= 0:
return None
try:
table = pe.net.mdtables.tables.get(table_index, [])
return table[row_index - 1]
except IndexError:
return None
def resolve_nested_typedef_name(
nested_class_table: dict, index: int, typedef: dnfile.mdtable.TypeDefRow, pe: dnfile.dnPE
) -> Tuple[str, Tuple[str, ...]]:
"""Resolves all nested TypeDef class names. Returns the namespace as a str and the nested TypeRef name as a tuple"""
if index in nested_class_table:
typedef_name = []
name = typedef.TypeName
# Append the current typedef name
typedef_name.append(name)
while nested_class_table[index] in nested_class_table:
# Iterate through the typedef table to resolve the nested name
table_row = get_dotnet_table_row(pe, dnfile.mdtable.TypeDef.number, nested_class_table[index])
if table_row is None:
return typedef.TypeNamespace, tuple(typedef_name[::-1])
name = table_row.TypeName
typedef_name.append(name)
index = nested_class_table[index]
# Document the root enclosing details
table_row = get_dotnet_table_row(pe, dnfile.mdtable.TypeDef.number, nested_class_table[index])
if table_row is None:
return typedef.TypeNamespace, tuple(typedef_name[::-1])
enclosing_name = table_row.TypeName
typedef_name.append(enclosing_name)
return table_row.TypeNamespace, tuple(typedef_name[::-1])
else:
return typedef.TypeNamespace, (typedef.TypeName,)
def resolve_nested_typeref_name(
index: int, typeref: dnfile.mdtable.TypeRefRow, pe: dnfile.dnPE
) -> Tuple[str, Tuple[str, ...]]:
"""Resolves all nested TypeRef class names. Returns the namespace as a str and the nested TypeRef name as a tuple"""
# If the ResolutionScope decodes to a typeRef type then it is nested
if isinstance(typeref.ResolutionScope.table, dnfile.mdtable.TypeRef):
typeref_name = []
name = typeref.TypeName
# Not appending the current typeref name to avoid potential duplicate
# Validate index
table_row = get_dotnet_table_row(pe, dnfile.mdtable.TypeRef.number, index)
if table_row is None:
return typeref.TypeNamespace, (typeref.TypeName,)
while isinstance(table_row.ResolutionScope.table, dnfile.mdtable.TypeRef):
# Iterate through the typeref table to resolve the nested name
typeref_name.append(name)
name = table_row.TypeName
table_row = get_dotnet_table_row(pe, dnfile.mdtable.TypeRef.number, table_row.ResolutionScope.row_index)
if table_row is None:
return typeref.TypeNamespace, tuple(typeref_name[::-1])
# Document the root enclosing details
typeref_name.append(table_row.TypeName)
return table_row.TypeNamespace, tuple(typeref_name[::-1])
else:
return typeref.TypeNamespace, (typeref.TypeName,)
def get_dotnet_nested_class_table_index(pe: dnfile.dnPE) -> Dict[int, int]:
"""Build index for EnclosingClass based off the NestedClass row index in the nestedclass table"""
nested_class_table = {}
# Used to find nested classes in typedef
for _, nestedclass in iter_dotnet_table(pe, dnfile.mdtable.NestedClass.number):
assert isinstance(nestedclass, dnfile.mdtable.NestedClassRow)
nested_class_table[nestedclass.NestedClass.row_index] = nestedclass.EnclosingClass.row_index
return nested_class_table
def get_dotnet_types(pe: dnfile.dnPE) -> Iterator[DnType]:
"""get .NET types from TypeDef and TypeRef tables"""
nested_class_table = get_dotnet_nested_class_table_index(pe)
for rid, typedef in iter_dotnet_table(pe, dnfile.mdtable.TypeDef.number):
assert isinstance(typedef, dnfile.mdtable.TypeDefRow)
typedefnamespace, typedefname = resolve_nested_typedef_name(nested_class_table, rid, typedef, pe)
typedef_token: int = calculate_dotnet_token_value(dnfile.mdtable.TypeDef.number, rid)
yield DnType(typedef_token, typedefname, namespace=typedefnamespace)
yield DnType(typedef_token, typedef.TypeName, namespace=typedef.TypeNamespace)
for rid, typeref in iter_dotnet_table(pe, dnfile.mdtable.TypeRef.number):
assert isinstance(typeref, dnfile.mdtable.TypeRefRow)
typerefnamespace, typerefname = resolve_nested_typeref_name(typeref.ResolutionScope.row_index, typeref, pe)
typeref_token: int = calculate_dotnet_token_value(dnfile.mdtable.TypeRef.number, rid)
yield DnType(typeref_token, typerefname, namespace=typerefnamespace)
yield DnType(typeref_token, typeref.TypeName, namespace=typeref.TypeNamespace)
def calculate_dotnet_token_value(table: int, rid: int) -> int:


@@ -6,17 +6,15 @@
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.
from typing import Tuple, Optional
from typing import Optional
class DnType:
def __init__(
self, token: int, class_: Tuple[str, ...], namespace: str = "", member: str = "", access: Optional[str] = None
):
def __init__(self, token: int, class_: str, namespace: str = "", member: str = "", access: Optional[str] = None):
self.token: int = token
self.access: Optional[str] = access
self.namespace: str = namespace
self.class_: Tuple[str, ...] = class_
self.class_: str = class_
if member == ".ctor":
member = "ctor"
@@ -44,13 +42,9 @@ class DnType:
return str(self)
@staticmethod
def format_name(class_: Tuple[str, ...], namespace: str = "", member: str = ""):
if len(class_) > 1:
class_str = "/".join(class_) # Concat items in tuple, separated by a "/"
else:
class_str = "".join(class_) # Convert tuple to str
def format_name(class_: str, namespace: str = "", member: str = ""):
# like File::OpenRead
name: str = f"{class_str}::{member}" if member else class_str
name: str = f"{class_}::{member}" if member else class_
if namespace:
# like System.IO.File::OpenRead
name = f"{namespace}.{name}"
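A worked example of the two `format_name` variants in this hunk: the tuple form (the nested-class side of this diff) joins enclosing and nested class names with `/`, and both forms prepend the namespace with a dot. A self-contained sketch of the tuple form:
```python
from typing import Tuple

def format_name(class_: Tuple[str, ...], namespace: str = "", member: str = "") -> str:
    class_str = "/".join(class_)  # single-element tuples degrade to the plain name
    name = f"{class_str}::{member}" if member else class_str
    return f"{namespace}.{name}" if namespace else name

assert format_name(("File",), "System.IO", "OpenRead") == "System.IO.File::OpenRead"
assert format_name(("Enclosing", "Nested"), "NS", "Run") == "NS.Enclosing/Nested::Run"
```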


@@ -38,11 +38,8 @@ from capa.features.extractors.dnfile.helpers import (
is_dotnet_mixed_mode,
get_dotnet_managed_imports,
get_dotnet_managed_methods,
resolve_nested_typedef_name,
resolve_nested_typeref_name,
calculate_dotnet_token_value,
get_dotnet_unmanaged_imports,
get_dotnet_nested_class_table_index,
)
logger = logging.getLogger(__name__)
@@ -95,25 +92,19 @@ def extract_file_namespace_features(pe: dnfile.dnPE, **kwargs) -> Iterator[Tuple
def extract_file_class_features(pe: dnfile.dnPE, **kwargs) -> Iterator[Tuple[Class, Address]]:
"""emit class features from TypeRef and TypeDef tables"""
nested_class_table = get_dotnet_nested_class_table_index(pe)
for rid, typedef in iter_dotnet_table(pe, dnfile.mdtable.TypeDef.number):
# emit internal .NET classes
assert isinstance(typedef, dnfile.mdtable.TypeDefRow)
typedefnamespace, typedefname = resolve_nested_typedef_name(nested_class_table, rid, typedef, pe)
token = calculate_dotnet_token_value(dnfile.mdtable.TypeDef.number, rid)
yield Class(DnType.format_name(typedefname, namespace=typedefnamespace)), DNTokenAddress(token)
yield Class(DnType.format_name(typedef.TypeName, namespace=typedef.TypeNamespace)), DNTokenAddress(token)
for rid, typeref in iter_dotnet_table(pe, dnfile.mdtable.TypeRef.number):
# emit external .NET classes
assert isinstance(typeref, dnfile.mdtable.TypeRefRow)
typerefnamespace, typerefname = resolve_nested_typeref_name(typeref.ResolutionScope.row_index, typeref, pe)
token = calculate_dotnet_token_value(dnfile.mdtable.TypeRef.number, rid)
yield Class(DnType.format_name(typerefname, namespace=typerefnamespace)), DNTokenAddress(token)
yield Class(DnType.format_name(typeref.TypeName, namespace=typeref.TypeNamespace)), DNTokenAddress(token)
def extract_file_os(**kwargs) -> Iterator[Tuple[OS, Address]]:


@@ -108,9 +108,6 @@ class Shdr:
buf,
)
def get_name(self, elf: "ELF") -> str:
return elf.shstrtab.buf[self.name :].partition(b"\x00")[0].decode("ascii")
class ELF:
def __init__(self, f: BinaryIO):
@@ -123,7 +120,6 @@ class ELF:
self.e_phnum: int
self.e_shentsize: int
self.e_shnum: int
self.e_shstrndx: int
self.phbuf: bytes
self.shbuf: bytes
@@ -155,15 +151,11 @@ class ELF:
if self.bitness == 32:
e_phoff, e_shoff = struct.unpack_from(self.endian + "II", self.file_header, 0x1C)
self.e_phentsize, self.e_phnum = struct.unpack_from(self.endian + "HH", self.file_header, 0x2A)
self.e_shentsize, self.e_shnum, self.e_shstrndx = struct.unpack_from(
self.endian + "HHH", self.file_header, 0x2E
)
self.e_shentsize, self.e_shnum = struct.unpack_from(self.endian + "HH", self.file_header, 0x2E)
elif self.bitness == 64:
e_phoff, e_shoff = struct.unpack_from(self.endian + "QQ", self.file_header, 0x20)
self.e_phentsize, self.e_phnum = struct.unpack_from(self.endian + "HH", self.file_header, 0x36)
self.e_shentsize, self.e_shnum, self.e_shstrndx = struct.unpack_from(
self.endian + "HHH", self.file_header, 0x3A
)
self.e_shentsize, self.e_shnum = struct.unpack_from(self.endian + "HH", self.file_header, 0x3A)
else:
raise NotImplementedError()
@@ -373,10 +365,6 @@ class ELF:
except ValueError:
continue
@property
def shstrtab(self) -> Shdr:
return self.parse_section_header(self.e_shstrndx)
@property
def linker(self):
PT_INTERP = 0x3
@@ -828,50 +816,6 @@ def guess_os_from_sh_notes(elf: ELF) -> Optional[OS]:
return None
def guess_os_from_ident_directive(elf: ELF) -> Optional[OS]:
# GCC inserts the GNU version via an .ident directive
# that gets stored in a section named ".comment".
# look at the version and recognize common OSes.
#
# assume the GCC version matches the target OS version,
# which I guess could be wrong during cross-compilation?
# therefore, don't rely on this if possible.
#
# https://stackoverflow.com/q/6263425
# https://gcc.gnu.org/onlinedocs/cpp/Other-Directives.html
SHT_PROGBITS = 0x1
for shdr in elf.section_headers:
if shdr.type != SHT_PROGBITS:
continue
if shdr.get_name(elf) != ".comment":
continue
try:
comment = shdr.buf.decode("utf-8")
except ValueError:
continue
if "GCC:" not in comment:
continue
logger.debug(".ident: %s", comment)
# these values come from our testfiles, like:
# rg -a "GCC: " tests/data/
if "Debian" in comment:
return OS.LINUX
elif "Ubuntu" in comment:
return OS.LINUX
elif "Red Hat" in comment:
return OS.LINUX
elif "Android" in comment:
return OS.ANDROID
return None
def guess_os_from_linker(elf: ELF) -> Optional[OS]:
# search for recognizable dynamic linkers (interpreters)
# for example, on linux, we see file paths like: /lib64/ld-linux-x86-64.so.2
@@ -907,10 +851,8 @@ def guess_os_from_abi_versions_needed(elf: ELF) -> Optional[OS]:
return OS.HURD
else:
# in practice, Hurd isn't a common/viable OS,
# so this is almost certain to be Linux,
# so lets just make that guess.
return OS.LINUX
# we don't have any good guesses based on versions needed
pass
return None
@@ -923,8 +865,6 @@ def guess_os_from_needed_dependencies(elf: ELF) -> Optional[OS]:
return OS.HURD
if needed.startswith("libandroid.so"):
return OS.ANDROID
if needed.startswith("liblog.so"):
return OS.ANDROID
return None
@@ -987,13 +927,6 @@ def detect_elf_os(f) -> str:
logger.warning("Error guessing OS from section header notes: %s", e)
sh_notes_guess = None
try:
ident_guess = guess_os_from_ident_directive(elf)
logger.debug("guess: .ident: %s", ident_guess)
except Exception as e:
logger.warning("Error guessing OS from .ident directive: %s", e)
ident_guess = None
try:
linker_guess = guess_os_from_linker(elf)
logger.debug("guess: linker: %s", linker_guess)
@@ -1045,11 +978,6 @@ def detect_elf_os(f) -> str:
elif symtab_guess:
ret = symtab_guess
elif ident_guess:
# at the bottom because we don't trust this too much
# due to potential for bugs with cross-compilation.
ret = ident_guess
return ret.value if ret is not None else "unknown"


@@ -127,10 +127,8 @@ def extract_file_strings() -> Iterator[Tuple[Feature, Address]]:
"""extract ASCII and UTF-16 LE strings"""
for block in currentProgram().getMemory().getBlocks(): # type: ignore [name-defined] # noqa: F821
if not block.isInitialized():
continue
p_bytes = capa.features.extractors.ghidra.helpers.get_block_bytes(block)
if block.isInitialized():
p_bytes = capa.features.extractors.ghidra.helpers.get_block_bytes(block)
for s in capa.features.extractors.strings.extract_ascii_strings(p_bytes):
offset = block.getStart().getOffset() + s.offset


@@ -275,27 +275,3 @@ def dereference_ptr(insn: ghidra.program.database.code.InstructionDB):
return addr
else:
return to_deref
def find_data_references_from_insn(insn, max_depth: int = 10):
"""yield data references from given instruction"""
for reference in insn.getReferencesFrom():
if not reference.getReferenceType().isData():
# only care about data references
continue
to_addr = reference.getToAddress()
for _ in range(max_depth - 1):
data = getDataAt(to_addr) # type: ignore [name-defined] # noqa: F821
if data and data.isPointer():
ptr_value = data.getValue()
if ptr_value is None:
break
to_addr = ptr_value
else:
break
yield to_addr


@@ -23,9 +23,6 @@ from capa.features.extractors.base_extractor import BBHandle, InsnHandle, Functi
SECURITY_COOKIE_BYTES_DELTA = 0x40
OPERAND_TYPE_DYNAMIC_ADDRESS = OperandType.DYNAMIC | OperandType.ADDRESS
def get_imports(ctx: Dict[str, Any]) -> Dict[int, Any]:
"""Populate the import cache for this context"""
if "imports_cache" not in ctx:
@@ -85,7 +82,7 @@ def check_for_api_call(
if not capa.features.extractors.ghidra.helpers.check_addr_for_api(addr_ref, fakes, imports, externs):
return
ref = addr_ref.getOffset()
elif ref_type == OPERAND_TYPE_DYNAMIC_ADDRESS or ref_type == OperandType.DYNAMIC:
elif ref_type == OperandType.DYNAMIC | OperandType.ADDRESS or ref_type == OperandType.DYNAMIC:
return # cannot resolve dynamics statically
else:
# pure address does not need to get dereferenced/ handled
@@ -198,39 +195,46 @@ def extract_insn_offset_features(fh: FunctionHandle, bb: BBHandle, ih: InsnHandl
if insn.getMnemonicString().startswith("LEA"):
return
if capa.features.extractors.ghidra.helpers.is_stack_referenced(insn):
# ignore stack references
return
# Ghidra stores operands in 2D arrays if they contain offsets
for i in range(insn.getNumOperands()):
if insn.getOperandType(i) == OperandType.DYNAMIC: # e.g. [esi + 4]
# manual extraction, since the default api calls only work on the 1st dimension of the array
op_objs = insn.getOpObjects(i)
if not op_objs:
continue
if isinstance(op_objs[-1], ghidra.program.model.scalar.Scalar):
op_off = op_objs[-1].getValue()
else:
op_off = 0
yield Offset(op_off), ih.address
yield OperandOffset(i, op_off), ih.address
# ignore any stack references
if not capa.features.extractors.ghidra.helpers.is_stack_referenced(insn):
# Ghidra stores operands in 2D arrays if they contain offsets
for i in range(insn.getNumOperands()):
if insn.getOperandType(i) == OperandType.DYNAMIC: # e.g. [esi + 4]
# manual extraction, since the default api calls only work on the 1st dimension of the array
op_objs = insn.getOpObjects(i)
if isinstance(op_objs[-1], ghidra.program.model.scalar.Scalar):
op_off = op_objs[-1].getValue()
yield Offset(op_off), ih.address
yield OperandOffset(i, op_off), ih.address
else:
yield Offset(0), ih.address
yield OperandOffset(i, 0), ih.address
def extract_insn_bytes_features(fh: FunctionHandle, bb: BBHandle, ih: InsnHandle) -> Iterator[Tuple[Feature, Address]]:
"""
parse referenced byte sequences
example:
push offset iid_004118d4_IShellLinkA ; riid
"""
for addr in capa.features.extractors.ghidra.helpers.find_data_references_from_insn(ih.inner):
data = getDataAt(addr) # type: ignore [name-defined] # noqa: F821
if data and not data.hasStringValue():
extracted_bytes = capa.features.extractors.ghidra.helpers.get_bytes(addr, MAX_BYTES_FEATURE_SIZE)
insn: ghidra.program.database.code.InstructionDB = ih.inner
if capa.features.extractors.ghidra.helpers.is_call_or_jmp(insn):
return
ref = insn.getAddress() # init to insn addr
for i in range(insn.getNumOperands()):
if OperandType.isAddress(insn.getOperandType(i)):
ref = insn.getAddress(i) # pulls pointer if there is one
if ref != insn.getAddress(): # bail out if there's no pointer
ghidra_dat = getDataAt(ref) # type: ignore [name-defined] # noqa: F821
if (
ghidra_dat and not ghidra_dat.hasStringValue() and not ghidra_dat.isPointer()
): # avoid if the data itself is a pointer
extracted_bytes = capa.features.extractors.ghidra.helpers.get_bytes(ref, MAX_BYTES_FEATURE_SIZE)
if extracted_bytes and not capa.features.extractors.helpers.all_zeros(extracted_bytes):
# don't extract byte features for obvious strings
yield Bytes(extracted_bytes), ih.address
@@ -241,10 +245,24 @@ def extract_insn_string_features(fh: FunctionHandle, bb: BBHandle, ih: InsnHandl
example:
push offset aAcr ; "ACR > "
"""
for addr in capa.features.extractors.ghidra.helpers.find_data_references_from_insn(ih.inner):
data = getDataAt(addr) # type: ignore [name-defined] # noqa: F821
if data and data.hasStringValue():
yield String(data.getValue()), ih.address
insn: ghidra.program.database.code.InstructionDB = ih.inner
dyn_addr = OperandType.DYNAMIC | OperandType.ADDRESS
ref = insn.getAddress()
for i in range(insn.getNumOperands()):
if OperandType.isScalarAsAddress(insn.getOperandType(i)):
ref = insn.getAddress(i)
# strings are also referenced dynamically via pointers & arrays, so we need to deref them
if insn.getOperandType(i) == dyn_addr:
ref = insn.getAddress(i)
dat = getDataAt(ref) # type: ignore [name-defined] # noqa: F821
if dat and dat.isPointer():
ref = dat.getValue()
if ref != insn.getAddress():
ghidra_dat = getDataAt(ref) # type: ignore [name-defined] # noqa: F821
if ghidra_dat and ghidra_dat.hasStringValue():
yield String(ghidra_dat.getValue()), ih.address
def extract_insn_mnemonic_features(
@@ -341,7 +359,7 @@ def extract_insn_cross_section_cflow(
ref = capa.features.extractors.ghidra.helpers.dereference_ptr(insn)
if capa.features.extractors.ghidra.helpers.check_addr_for_api(ref, fakes, imports, externs):
return
elif ref_type == OPERAND_TYPE_DYNAMIC_ADDRESS or ref_type == OperandType.DYNAMIC:
elif ref_type == OperandType.DYNAMIC | OperandType.ADDRESS or ref_type == OperandType.DYNAMIC:
return # cannot resolve dynamics statically
else:
# pure address does not need to get dereferenced/ handled


@@ -9,7 +9,6 @@ Unless required by applicable law or agreed to in writing, software distributed
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
"""
import json
import zlib
import logging
@@ -22,7 +21,6 @@ from pydantic import Field, BaseModel, ConfigDict
# https://github.com/mandiant/capa/issues/1699
from typing_extensions import TypeAlias
import capa.loader
import capa.helpers
import capa.version
import capa.features.file
@@ -55,6 +53,8 @@ class AddressType(str, Enum):
FILE = "file"
DN_TOKEN = "dn token"
DN_TOKEN_OFFSET = "dn token offset"
DEX_METHOD_INDEX = "dex method index"
DEX_CLASS_INDEX = "dex class index"
PROCESS = "process"
THREAD = "thread"
CALL = "call"
@@ -82,6 +82,12 @@ class Address(HashableModel):
elif isinstance(a, capa.features.address.DNTokenOffsetAddress):
return cls(type=AddressType.DN_TOKEN_OFFSET, value=(a.token, a.offset))
elif isinstance(a, capa.features.address.DexMethodAddress):
return cls(type=AddressType.DEX_METHOD_INDEX, value=int(a))
elif isinstance(a, capa.features.address.DexClassAddress):
return cls(type=AddressType.DEX_CLASS_INDEX, value=int(a))
elif isinstance(a, capa.features.address.ProcessAddress):
return cls(type=AddressType.PROCESS, value=(a.ppid, a.pid))
@@ -127,6 +133,14 @@ class Address(HashableModel):
assert isinstance(offset, int)
return capa.features.address.DNTokenOffsetAddress(token, offset)
elif self.type is AddressType.DEX_METHOD_INDEX:
assert isinstance(self.value, int)
return capa.features.address.DexMethodAddress(self.value)
elif self.type is AddressType.DEX_CLASS_INDEX:
assert isinstance(self.value, int)
return capa.features.address.DexClassAddress(self.value)
elif self.type is AddressType.PROCESS:
assert isinstance(self.value, tuple)
ppid, pid = self.value
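A round-trip sketch for the new DEX address variants, assuming this file is the freeze model (`capa.features.freeze`) and the usual `from_capa`/`to_capa` method names (the method names are not shown in this diff):
```python
import capa.features.address
from capa.features.freeze import Address, AddressType  # assumed module path

frozen = Address.from_capa(capa.features.address.DexMethodAddress(0x40))
assert frozen.type is AddressType.DEX_METHOD_INDEX and frozen.value == 0x40
thawed = frozen.to_capa()
assert isinstance(thawed, capa.features.address.DexMethodAddress)
```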
@@ -683,18 +697,14 @@ def main(argv=None):
argv = sys.argv[1:]
parser = argparse.ArgumentParser(description="save capa features to a file")
capa.main.install_common_args(parser, {"input_file", "format", "backend", "os", "signatures"})
capa.main.install_common_args(parser, {"sample", "format", "backend", "os", "signatures"})
parser.add_argument("output", type=str, help="Path to output file")
args = parser.parse_args(args=argv)
capa.main.handle_common_args(args)
try:
capa.main.handle_common_args(args)
capa.main.ensure_input_exists_from_cli(args)
input_format = capa.main.get_input_format_from_cli(args)
backend = capa.main.get_backend_from_cli(args, input_format)
extractor = capa.main.get_extractor_from_cli(args, input_format, backend)
except capa.main.ShouldExitError as e:
return e.status_code
sigpaths = capa.main.get_signatures(args.signatures)
extractor = capa.main.get_extractor(args.sample, args.format, args.os, args.backend, sigpaths, False)
Path(args.output).write_bytes(dump(extractor))


@@ -2,46 +2,23 @@
<img src="/doc/img/ghidra_backend_logo.png" width=300 height=175>
</div>
The Ghidra feature extractor is an application of the FLARE team's open-source project, Ghidrathon, to integrate capa with Ghidra using Python 3. capa is a framework that uses a well-defined collection of rules to identify capabilities in a program. You can run capa against a PE file, ELF file, or shellcode and it tells you what it thinks the program can do. For example, it might suggest that the program is a backdoor, can install services, or relies on HTTP to communicate. The Ghidra feature extractor can be used to run capa analysis on your Ghidra databases without needing access to the original binary file. As a part of this integration, we've developed two scripts, [capa_explorer.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_explorer.py) and [capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py), to display capa results directly in Ghidra.
The Ghidra feature extractor is an application of the FLARE team's open-source project, Ghidrathon, to integrate capa with Ghidra using Python 3. capa is a framework that uses a well-defined collection of rules to identify capabilities in a program. You can run capa against a PE file, ELF file, or shellcode and it tells you what it thinks the program can do. For example, it might suggest that the program is a backdoor, can install services, or relies on HTTP to communicate. The Ghidra feature extractor can be used to run capa analysis on your Ghidra databases without needing access to the original binary file.
### Using `capa_explorer.py`
`capa_explorer.py` integrates capa results directly into Ghidra's UI. In the Symbol Tree Window, under the Namespaces section, you can find the matched rules as well as the corresponding functions that contain the matched features:
![image](https://github.com/mandiant/capa/assets/66766340/eeae33f4-99d4-42dc-a5e8-4c1b8c661492)
Labeled functions may be clicked in the Symbol Tree Window to navigate Ghidra's Disassembly Listing and Decompilation windows to the function locations. A comment listing each matched capa rule is inserted at the beginning of the function and a comment for each matched capa feature is added at the matched address within the function. These comments can be viewed using Ghidra's Disassembly Listing and Decompilation windows:
![image](https://github.com/mandiant/capa/assets/66766340/bb2b4170-7fd4-45fc-8c7b-ff8f2e2f101b)
The script also adds bookmarks for capa matches that are categorized under MITRE ATT&CK and Malware Behavior Catalog. These may be found and navigated using Ghidra's Bookmarks Window:
![image](https://github.com/mandiant/capa/assets/66766340/7f9a66a9-7be7-4223-91c6-4b8fc4651336)
### Using `capa_ghidra.py`
`capa_ghidra.py` displays capa results in Ghidra's Console window and can be executed using Ghidra's Headless Analyzer. The following is an example of running `capa_ghidra.py` using the Ghidra Script Manager:
Selecting capa rules:
<img src="/doc/img/ghidra_script_mngr_rules.png">
Choosing output format:
<img src="/doc/img/ghidra_script_mngr_verbosity.png">
Viewing results in Ghidra Console Window:
<img src="/doc/img/ghidra_script_mngr_output.png">
## Installation
## Getting Started
### Requirements
### Installation
| Tool | Version | Source |
Please ensure that you have the following dependencies installed before continuing:
| Dependency | Version | Source |
|------------|---------|--------|
| Ghidrathon | `>= 3.0.0` | https://github.com/mandiant/Ghidrathon/releases |
| Ghidra | `>= 10.3.2` | https://github.com/NationalSecurityAgency/ghidra/releases |
| Python | `>= 3.8.0` | https://www.python.org/downloads |
| Ghidrathon | `>= 3.0.0` | https://github.com/mandiant/Ghidrathon |
| Python | `>= 3.8` | https://www.python.org/downloads |
| Ghidra | `>= 10.2` | https://ghidra-sre.org |
You can run capa in Ghidra by completing the following steps using the Python 3 interpreter that you have configured for your Ghidrathon installation:
In order to run capa using Ghidra, you must install capa as a library, obtain the official capa rules that match the capa version you have installed, and configure the Python 3 script [capa_ghidra.py](/capa/ghidra/capa_ghidra.py). You can do this by completing the following steps using the Python 3 interpreter that you have configured for your Ghidrathon installation:
1. Install capa and its dependencies from PyPI using the following command:
```bash
@@ -55,52 +32,63 @@ OR
$ capa --version
```
3. Copy [capa_explorer.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_explorer.py) and [capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py) to your `$USER_HOME/ghidra_scripts` directory or manually add the absolute path of each script to the Ghidra Script Manager.
3. Copy [capa_ghidra.py](/capa/ghidra/capa_ghidra.py) to your `$USER_HOME/ghidra_scripts` directory or manually add `</path/to/ghidra_capa.py/>` to the Ghidra Script Manager.
## Usage
After completing the installation steps you can execute [capa_explorer.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_explorer.py) and [capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py) using the Ghidra Script Manager. You can also execute [capa_ghidra.py](https://raw.githubusercontent.com/mandiant/capa/master/capa/ghidra/capa_ghidra.py) using Ghidra's Headless Analyzer.
After completing the installation steps you can execute `capa_ghidra.py` using the Ghidra Script Manager or Headless Analyzer.
### Ghidra Script Manager
Use the following steps to execute `capa_explorer.py` and `capa_ghidra.py` using Ghidra's Script Manager:
1. Open the Ghidra Script Manager by navigating to `Window > Script Manager`
2. Locate `capa_explorer.py` and `capa_ghidra.py` by selecting the `Python 3 > capa` category or using the Ghidra Script Manager search functionality
3. Double-click `capa_explorer.py` or `capa_ghidra.py` to execute the script

If you don't see `capa_explorer.py` and `capa_ghidra.py`, make sure you have copied these scripts to your `$USER_HOME/ghidra_scripts` directory or manually added the absolute path of each script to the Ghidra Script Manager.

Both scripts ask you to provide the path of your capa rules directory. `capa_ghidra.py` also asks you to select the `default`, `verbose`, or `vverbose` output format used when writing output to the Ghidra Console Window.
#### Example
The following is an example of running `capa_ghidra.py` using the Ghidra Script Manager:
Selecting capa rules:
<img src="/doc/img/ghidra_script_mngr_rules.png">
Choosing output format:
<img src="/doc/img/ghidra_script_mngr_verbosity.png">
Viewing results in Ghidra Console Window:
<img src="/doc/img/ghidra_script_mngr_output.png">
### Ghidra Headless Analyzer
To execute `capa_ghidra.py` using the Ghidra Headless Analyzer, you can use the Ghidra `analyzeHeadless` script located in your `<ghidra_install_path>/support` directory. You will need to provide the following arguments to the Ghidra `analyzeHeadless` script:
1. `<ghidra_project_path>`: path to Ghidra project
2. `<ghidra_project_name>`: name of Ghidra project
3. `-process <sample_name>`: name of sample `<sample_name>`
4. `-ScriptPath <capa_ghidra_path>`: OPTIONAL argument specifying the absolute path of `capa_ghidra.py`
5. `-PostScript capa_ghidra.py`: execute `capa_ghidra.py` as a post-analysis script
6. `"<capa_args>"`: single, quoted string containing capa arguments that must specify the capa rules directory and output format, e.g. `"<capa_rules_path> --verbose"`. `capa_ghidra.py` supports `default`, `verbose`, `vverbose`, and `json` formats when executed using the Ghidra Headless Analyzer, and writes output to the console window used to execute the Ghidra `analyzeHeadless` script.
7. `-processor <languageID>`: required ONLY if sample `<sample_name>` is shellcode. More information on specifying the `<languageID>` can be found in the `<ghidra_install_path>/support/analyzeHeadlessREADME.html` documentation.
The following is an example of combining these arguments into a single `analyzeHeadless` script command:
```
<ghidra_install_path>/support/analyzeHeadless <ghidra_project_path> <ghidra_project_name> -process <sample_name> -PostScript capa_ghidra.py "<capa_rules_path> --verbose"
```
You may also want to run capa against a sample that you have not yet imported into your Ghidra project. The following is an example of importing a sample and running `capa_ghidra.py` using a single `analyzeHeadless` script command:
```
<ghidra_install_path>/support/analyzeHeadless <ghidra_project_path> <ghidra_project_name> -Import <sample_path> -PostScript capa_ghidra.py "<capa_rules_path> --verbose"
```
You can also provide `capa_ghidra.py` the single argument `"help"` to view supported arguments when running the script using the Ghidra Headless Analyzer:
```
<ghidra_install_path>/support/analyzeHeadless <ghidra_project_path> <ghidra_project_name> -process <sample_name> -PostScript capa_ghidra.py "help"
```
#### Example
The following is an example of running `capa_ghidra.py` against a shellcode sample using the Ghidra `analyzeHeadless` script:
```
$ analyzeHeadless /home/wumbo/Desktop/ghidra_projects/ capa_test -process 499c2a85f6e8142c3f48d4251c9c7cd6.raw32 -processor x86:LE:32:default -PostScript capa_ghidra.py "/home/wumbo/capa/rules -vv"
[...]

View File

@@ -1,378 +0,0 @@
# Integrate capa results with Ghidra UI
# @author Colton Gabertan (gabertan.colton@gmail.com)
# @category Python 3.capa
# Copyright (C) 2023 Mandiant, Inc. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at: [package root]/LICENSE.txt
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.
import sys
import json
import logging
import pathlib
from typing import Any, Dict, List
from ghidra.app.cmd.label import AddLabelCmd, CreateNamespacesCmd
from ghidra.program.model.symbol import Namespace, SourceType, SymbolType
import capa
import capa.main
import capa.rules
import capa.render.json
import capa.ghidra.helpers
import capa.capabilities.common
import capa.features.extractors.ghidra.extractor
logger = logging.getLogger("capa_explorer")
def add_bookmark(addr, txt, category="CapaExplorer"):
"""create bookmark at addr"""
currentProgram().getBookmarkManager().setBookmark(addr, "Info", category, txt) # type: ignore [name-defined] # noqa: F821
def create_namespace(namespace_str):
"""create new Ghidra namespace for each capa namespace"""
cmd = CreateNamespacesCmd(namespace_str, SourceType.USER_DEFINED)
cmd.applyTo(currentProgram()) # type: ignore [name-defined] # noqa: F821
return cmd.getNamespace()
def create_label(ghidra_addr, name, capa_namespace):
"""custom label cmd to overlay symbols under capa-generated namespaces"""
# prevent duplicate labels under the same capa-generated namespace
symbol_table = currentProgram().getSymbolTable() # type: ignore [name-defined] # noqa: F821
for sym in symbol_table.getSymbols(ghidra_addr):
if sym.getName(True) == capa_namespace.getName(True) + Namespace.DELIMITER + name:
return
# create SymbolType.LABEL at addr
# prioritize capa-generated namespace (duplicate match @ new addr), else put under global Ghidra one (new match)
cmd = AddLabelCmd(ghidra_addr, name, True, SourceType.USER_DEFINED)
cmd.applyTo(currentProgram()) # type: ignore [name-defined] # noqa: F821
# assign new match overlay label to capa-generated namespace
cmd.getSymbol().setNamespace(capa_namespace)
return
class CapaMatchData:
def __init__(
self,
namespace,
scope,
capability,
matches,
attack: List[Dict[Any, Any]],
mbc: List[Dict[Any, Any]],
):
self.namespace = namespace
self.scope = scope
self.capability = capability
self.matches = matches
self.attack = attack
self.mbc = mbc
def bookmark_functions(self):
"""create bookmarks for MITRE ATT&CK & MBC mappings"""
if self.attack == [] and self.mbc == []:
return
for key in self.matches.keys():
addr = toAddr(hex(key)) # type: ignore [name-defined] # noqa: F821
func = getFunctionContaining(addr) # type: ignore [name-defined] # noqa: F821
# bookmark & tag MITRE ATT&CK tactics & MBC @ function scope
if func is not None:
func_addr = func.getEntryPoint()
if self.attack != []:
for item in self.attack:
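# build one bookmark string per ATT&CK entry by joining its parts and ID with the Ghidra namespace delimiter, e.g. Defense Evasion::Obfuscated Files or Information::T1027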
attack_txt = ""
for part in item.get("parts", {}):
attack_txt = attack_txt + part + Namespace.DELIMITER
attack_txt = attack_txt + item.get("id", {})
add_bookmark(func_addr, attack_txt, "CapaExplorer::MITRE ATT&CK")
if self.mbc != []:
for item in self.mbc:
mbc_txt = ""
for part in item.get("parts", {}):
mbc_txt = mbc_txt + part + Namespace.DELIMITER
mbc_txt = mbc_txt + item.get("id", {})
add_bookmark(func_addr, mbc_txt, "CapaExplorer::MBC")
def set_plate_comment(self, ghidra_addr):
"""set plate comments at matched functions"""
comment = getPlateComment(ghidra_addr) # type: ignore [name-defined] # noqa: F821
rule_path = self.namespace.replace(Namespace.DELIMITER, "/")
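# e.g. capa::communication::named-pipe::create becomes capa/communication/named-pipe/create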
# 2 calls to avoid duplicate comments via subsequent script runs
if comment is None:
# first comment @ function
comment = rule_path + "\n"
setPlateComment(ghidra_addr, comment) # type: ignore [name-defined] # noqa: F821
elif rule_path not in comment:
comment = comment + rule_path + "\n"
setPlateComment(ghidra_addr, comment) # type: ignore [name-defined] # noqa: F821
else:
return
def set_pre_comment(self, ghidra_addr, sub_type, description):
"""set pre comments at subscoped matches of main rules"""
comment = getPreComment(ghidra_addr) # type: ignore [name-defined] # noqa: F821
if comment is None:
comment = "capa: " + sub_type + "(" + description + ")" + ' matched in "' + self.capability + '"\n'
setPreComment(ghidra_addr, comment) # type: ignore [name-defined] # noqa: F821
elif self.capability not in comment:
comment = (
comment + "capa: " + sub_type + "(" + description + ")" + ' matched in "' + self.capability + '"\n'
)
setPreComment(ghidra_addr, comment) # type: ignore [name-defined] # noqa: F821
else:
return
def label_matches(self):
"""label findings at function scopes and comment on subscope matches"""
capa_namespace = create_namespace(self.namespace)
symbol_table = currentProgram().getSymbolTable() # type: ignore [name-defined] # noqa: F821
# handle function main scope of matched rule
# these will typically contain further matches within
if self.scope == "function":
for addr in self.matches.keys():
ghidra_addr = toAddr(hex(addr)) # type: ignore [name-defined] # noqa: F821
# classify new function label under capa-generated namespace
sym = symbol_table.getPrimarySymbol(ghidra_addr)
if sym is not None:
if sym.getSymbolType() == SymbolType.FUNCTION:
create_label(ghidra_addr, sym.getName(), capa_namespace)
self.set_plate_comment(ghidra_addr)
# parse the corresponding nodes, and pre-comment subscope matched features
# under the encompassing function(s)
for sub_match in self.matches.get(addr):
for loc, node in sub_match.items():
sub_ghidra_addr = toAddr(hex(loc)) # type: ignore [name-defined] # noqa: F821
if sub_ghidra_addr == ghidra_addr:
# skip duplicates
continue
# precomment subscope matches under the function
if node != {}:
for sub_type, description in parse_node(node):
self.set_pre_comment(sub_ghidra_addr, sub_type, description)
else:
# resolve the encompassing function for the capa namespace
# of non-function scoped main matches
for addr in self.matches.keys():
ghidra_addr = toAddr(hex(addr)) # type: ignore [name-defined] # noqa: F821
# basic block / insn scoped main matches
# Ex. See "Create Process on Windows" Rule
func = getFunctionContaining(ghidra_addr) # type: ignore [name-defined] # noqa: F821
if func is not None:
func_addr = func.getEntryPoint()
create_label(func_addr, func.getName(), capa_namespace)
self.set_plate_comment(func_addr)
# create subscope match precomments
for sub_match in self.matches.get(addr):
for loc, node in sub_match.items():
sub_ghidra_addr = toAddr(hex(loc)) # type: ignore [name-defined] # noqa: F821
if node != {}:
if func is not None:
# basic block/ insn scope under resolved function
for sub_type, description in parse_node(node):
self.set_pre_comment(sub_ghidra_addr, sub_type, description)
else:
# this would be a global/file scoped main match
# try to resolve the encompassing function via the subscope match, instead
# Ex. "run as service" rule
sub_func = getFunctionContaining(sub_ghidra_addr) # type: ignore [name-defined] # noqa: F821
if sub_func is not None:
sub_func_addr = sub_func.getEntryPoint()
# place function in capa namespace & create the subscope match label in Ghidra's global namespace
create_label(sub_func_addr, sub_func.getName(), capa_namespace)
self.set_plate_comment(sub_func_addr)
for sub_type, description in parse_node(node):
self.set_pre_comment(sub_ghidra_addr, sub_type, description)
else:
# addr is in some other file section like .data
# represent this location with a label symbol under the capa namespace
# Ex. See "Reference Base64 String" rule
for sub_type, description in parse_node(node):
# in many cases, these will be ghidra-labeled data, so just add the existing
# label symbol to the capa namespace
for sym in symbol_table.getSymbols(sub_ghidra_addr):
if sym.getSymbolType() == SymbolType.LABEL:
sym.setNamespace(capa_namespace)
self.set_pre_comment(sub_ghidra_addr, sub_type, description)
def get_capabilities():
rules_dir: str = ""
try:
selected_dir = askDirectory("Choose capa rules directory", "Ok") # type: ignore [name-defined] # noqa: F821
if selected_dir:
rules_dir = selected_dir.getPath()
except RuntimeError:
# RuntimeError thrown when user selects "Cancel"
pass
if not rules_dir:
logger.info("You must choose a capa rules directory before running capa.")
return "" # return empty str to avoid handling both int and str types
rules_path: pathlib.Path = pathlib.Path(rules_dir)
logger.info("running capa using rules from %s", str(rules_path))
rules = capa.rules.get_rules([rules_path])
meta = capa.ghidra.helpers.collect_metadata([rules_path])
extractor = capa.features.extractors.ghidra.extractor.GhidraFeatureExtractor()
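# the third argument (True) disables capa's progress indicator while matching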
capabilities, counts = capa.capabilities.common.find_capabilities(rules, extractor, True)
if capa.capabilities.common.has_file_limitation(rules, capabilities, is_standalone=False):
popup("capa explorer encountered warnings during analysis. Please check the console output for more information.") # type: ignore [name-defined] # noqa: F821
logger.info("capa encountered warnings during analysis")
return capa.render.json.render(meta, rules, capabilities)
def get_locations(match_dict):
"""recursively collect match addresses and associated nodes"""
for loc in match_dict.get("locations", {}):
# either an rva (absolute)
# or an offset into a file (file)
if loc.get("type", "") in ("absolute", "file"):
yield loc.get("value"), match_dict.get("node")
for child in match_dict.get("children", {}):
yield from get_locations(child)
def parse_node(node_data):
"""pull match descriptions and sub features by parsing node dicts"""
node = node_data.get(node_data.get("type"))
if "description" in node:
yield "description", node.get("description")
data = node.get(node.get("type"))
if isinstance(data, (str, int)):
feat_type = node.get("type")
if isinstance(data, int):
data = hex(data)
yield feat_type, data
def parse_json(capa_data):
"""Parse json produced by capa"""
for rule, capability in capa_data.get("rules", {}).items():
# structure to contain rule match address & supporting feature data
# {rule match addr:[{feature addr:{node_data}}]}
rule_matches: Dict[Any, List[Any]] = {}
for i in range(len(capability.get("matches"))):
# grab rule match location
match_loc = capability.get("matches")[i][0].get("value")
if match_loc is None:
# Ex. See "Reference Base64 string"
# {'type':'no address'}
match_loc = i
rule_matches[match_loc] = []
# grab extracted feature locations & corresponding node data
# feature[0]: location
# feature[1]: node
features = capability.get("matches")[i][1]
feat_dict = {}
for feature in get_locations(features):
feat_dict[feature[0]] = feature[1]
rule_matches[match_loc].append(feat_dict)
# dict data of currently matched rule
meta = capability["meta"]
# get MITRE ATT&CK and MBC
attack = meta.get("attack")
if attack is None:
attack = []
mbc = meta.get("mbc")
if mbc is None:
mbc = []
# scope match for the rule
scope = meta["scopes"].get("static")
fmt_rule = Namespace.DELIMITER + rule.replace(" ", "-")
if "namespace" in meta:
# split into list to help define child namespaces
# this requires the correct delimiter used by Ghidra
# Ex. 'communication/named-pipe/create/create pipe' -> capa::communication::named-pipe::create::create-pipe
namespace_str = Namespace.DELIMITER.join(meta["namespace"].split("/"))
namespace = "capa" + Namespace.DELIMITER + namespace_str + fmt_rule
else:
# lib rules via the official rules repo will not contain data
# for the "namespaces" key, so format using rule itself
# Ex. 'contain loop' -> capa::lib::contain-loop
namespace = "capa" + Namespace.DELIMITER + "lib" + fmt_rule
yield CapaMatchData(namespace, scope, rule, rule_matches, attack, mbc)
def main():
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
if isRunningHeadless(): # type: ignore [name-defined] # noqa: F821
logger.error("unsupported Ghidra execution mode")
return capa.main.E_UNSUPPORTED_GHIDRA_EXECUTION_MODE
if not capa.ghidra.helpers.is_supported_ghidra_version():
logger.error("unsupported Ghidra version")
return capa.main.E_UNSUPPORTED_GHIDRA_VERSION
if not capa.ghidra.helpers.is_supported_file_type():
logger.error("unsupported file type")
return capa.main.E_INVALID_FILE_TYPE
if not capa.ghidra.helpers.is_supported_arch_type():
logger.error("unsupported file architecture")
return capa.main.E_INVALID_FILE_ARCH
# capa_data will always contain {'meta':..., 'rules':...}
# if the 'rules' key contains no values, then there were no matches
capa_data = json.loads(get_capabilities())
if capa_data.get("rules") is None:
logger.info("capa explorer found no matches")
popup("capa explorer found no matches.") # type: ignore [name-defined] # noqa: F821
return capa.main.E_EMPTY_REPORT
for item in parse_json(capa_data):
item.bookmark_functions()
item.label_matches()
logger.info("capa explorer analysis complete")
popup("capa explorer analysis complete.\nPlease see results in the Bookmarks Window and Namespaces section of the Symbol Tree Window.") # type: ignore [name-defined] # noqa: F821
return 0
if __name__ == "__main__":
if sys.version_info < (3, 8):
from capa.exceptions import UnsupportedRuntimeError
raise UnsupportedRuntimeError("This version of capa can only be used with Python 3.8+")
exit_code = main()
if exit_code != 0:
popup("capa explorer encountered errors during analysis. Please check the console output for more information.") # type: ignore [name-defined] # noqa: F821
sys.exit(exit_code)

View File

@@ -69,7 +69,7 @@ def run_headless():
rules_path = pathlib.Path(args.rules)
logger.debug("rule path: %s", rules_path)
rules = capa.rules.get_rules([rules_path])
rules = capa.main.get_rules([rules_path])
meta = capa.ghidra.helpers.collect_metadata([rules_path])
extractor = capa.features.extractors.ghidra.extractor.GhidraFeatureExtractor()
@@ -78,7 +78,7 @@ def run_headless():
meta.analysis.feature_counts = counts["feature_counts"]
meta.analysis.library_functions = counts["library_functions"]
meta.analysis.layout = capa.loader.compute_layout(rules, extractor, capabilities)
meta.analysis.layout = capa.main.compute_layout(rules, extractor, capabilities)
if capa.capabilities.common.has_file_limitation(rules, capabilities, is_standalone=True):
logger.info("capa encountered warnings during analysis")
@@ -119,7 +119,7 @@ def run_ui():
rules_path: pathlib.Path = pathlib.Path(rules_dir)
logger.info("running capa using rules from %s", str(rules_path))
rules = capa.rules.get_rules([rules_path])
rules = capa.main.get_rules([rules_path])
meta = capa.ghidra.helpers.collect_metadata([rules_path])
extractor = capa.features.extractors.ghidra.extractor.GhidraFeatureExtractor()
@@ -128,7 +128,7 @@ def run_ui():
meta.analysis.feature_counts = counts["feature_counts"]
meta.analysis.library_functions = counts["library_functions"]
meta.analysis.layout = capa.loader.compute_layout(rules, extractor, capabilities)
meta.analysis.layout = capa.main.compute_layout(rules, extractor, capabilities)
if capa.capabilities.common.has_file_limitation(rules, capabilities, is_standalone=False):
logger.info("capa encountered warnings during analysis")

View File

@@ -5,7 +5,6 @@
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.
import sys
import json
import inspect
import logging
@@ -17,22 +16,12 @@ from pathlib import Path
import tqdm
from capa.exceptions import UnsupportedFormatError
from capa.features.common import (
FORMAT_PE,
FORMAT_CAPE,
FORMAT_SC32,
FORMAT_SC64,
FORMAT_DOTNET,
FORMAT_FREEZE,
FORMAT_UNKNOWN,
Format,
)
from capa.features.common import FORMAT_PE, FORMAT_CAPE, FORMAT_SC32, FORMAT_SC64, FORMAT_DOTNET, FORMAT_UNKNOWN, Format
EXTENSIONS_SHELLCODE_32 = ("sc32", "raw32")
EXTENSIONS_SHELLCODE_64 = ("sc64", "raw64")
EXTENSIONS_DYNAMIC = ("json", "json_")
EXTENSIONS_ELF = "elf_"
EXTENSIONS_FREEZE = "frz"
logger = logging.getLogger("capa")
@@ -92,8 +81,6 @@ def get_format_from_extension(sample: Path) -> str:
format_ = FORMAT_SC64
elif sample.name.endswith(EXTENSIONS_DYNAMIC):
format_ = get_format_from_report(sample)
elif sample.name.endswith(EXTENSIONS_FREEZE):
format_ = FORMAT_FREEZE
return format_
@@ -169,7 +156,7 @@ def log_unsupported_format_error():
def log_unsupported_cape_report_error(error: str):
logger.error("-" * 80)
logger.error(" Input file is not a valid CAPE report: %s", error)
logger.error("Input file is not a valid CAPE report: %s", error)
logger.error(" ")
logger.error(" capa currently only supports analyzing standard CAPE reports in JSON format.")
logger.error(
@@ -214,16 +201,3 @@ def log_unsupported_runtime_error():
" If you're seeing this message on the command line, please ensure you're running a supported Python version."
)
logger.error("-" * 80)
def is_running_standalone() -> bool:
"""
are we running from a PyInstaller'd executable?
if so, then we'll be able to access `sys._MEIPASS` for the packaged resources.
"""
# typically we only expect capa.main to be packaged via PyInstaller.
# therefore, this *should* be in capa.main; however,
# the Binary Ninja extractor uses this to resolve the BN API code,
# so we keep this in a common area.
# generally, other library code should not use this function.
return hasattr(sys, "frozen") and hasattr(sys, "_MEIPASS")

View File

@@ -636,7 +636,7 @@ class CapaExplorerForm(idaapi.PluginForm):
if ida_kernwin.user_cancelled():
raise UserCancelledError("user cancelled")
return capa.rules.get_rules([rule_path], on_load_rule=on_load_rule)
return capa.main.get_rules([rule_path], on_load_rule=on_load_rule)
except UserCancelledError:
logger.info("User cancelled analysis.")
return None
@@ -775,7 +775,7 @@ class CapaExplorerForm(idaapi.PluginForm):
meta.analysis.feature_counts = counts["feature_counts"]
meta.analysis.library_functions = counts["library_functions"]
meta.analysis.layout = capa.loader.compute_layout(ruleset, self.feature_extractor, capabilities)
meta.analysis.layout = capa.main.compute_layout(ruleset, self.feature_extractor, capabilities)
except UserCancelledError:
logger.info("User cancelled analysis.")
return False
@@ -932,9 +932,9 @@ class CapaExplorerForm(idaapi.PluginForm):
update_wait_box("verifying cached results")
try:
results: Optional[capa.render.result_document.ResultDocument] = (
capa.ida.helpers.load_and_verify_cached_results()
)
results: Optional[
capa.render.result_document.ResultDocument
] = capa.ida.helpers.load_and_verify_cached_results()
except Exception as e:
capa.ida.helpers.inform_user_ida_ui("Failed to verify cached results, reanalyzing program")
logger.exception("Failed to verify cached results (error: %s)", e)
@@ -1073,7 +1073,9 @@ class CapaExplorerForm(idaapi.PluginForm):
self.view_rulegen_features.load_features(all_file_features, all_function_features)
self.set_view_status_label(f"capa rules: {settings.user[CAPA_SETTINGS_RULE_PATH]}")
self.set_view_status_label(
f"capa rules: {settings.user[CAPA_SETTINGS_RULE_PATH]} ({settings.user[CAPA_SETTINGS_RULE_PATH]} rules)"
)
except Exception as e:
logger.exception("Failed to render views (error: %s)", e)
return False
@@ -1322,17 +1324,10 @@ class CapaExplorerForm(idaapi.PluginForm):
idaapi.info("No rule to save.")
return
rule_file_path = self.ask_user_capa_rule_file()
if not rule_file_path:
# dialog canceled
path = Path(self.ask_user_capa_rule_file())
if not path.exists():
return
path = Path(rule_file_path)
if not path.parent.exists():
logger.warning("Failed to save file: parent directory '%s' does not exist.", path.parent)
return
logger.info("Saving rule to %s.", path)
write_file(path, s)
def slot_checkbox_limit_by_changed(self, state):

View File

@@ -194,17 +194,13 @@ class CapaExplorerRulegenPreview(QtWidgets.QTextEdit):
" namespace: <insert_namespace>",
" authors:",
f" - {author}",
" scopes:",
f" static: {scope}",
" dynamic: unspecified",
f" scope: {scope}",
" references:",
" - <insert_references>",
" examples:",
(
f" - {capa.ida.helpers.get_file_md5().upper()}:{hex(ea)}"
if ea
else f" - {capa.ida.helpers.get_file_md5().upper()}"
),
f" - {capa.ida.helpers.get_file_md5().upper()}:{hex(ea)}"
if ea
else f" - {capa.ida.helpers.get_file_md5().upper()}",
" features:",
]
self.setText("\n".join(metadata_default))

View File

@@ -1,544 +0,0 @@
# Copyright (C) 2023 Mandiant, Inc. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at: [package root]/LICENSE.txt
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.
import sys
import json
import logging
import datetime
from typing import Set, Dict, List, Optional
from pathlib import Path
import halo
from typing_extensions import assert_never
import capa.perf
import capa.rules
import capa.engine
import capa.helpers
import capa.version
import capa.render.json
import capa.rules.cache
import capa.render.default
import capa.render.verbose
import capa.features.common
import capa.features.freeze as frz
import capa.render.vverbose
import capa.features.extractors
import capa.render.result_document
import capa.render.result_document as rdoc
import capa.features.extractors.common
import capa.features.extractors.pefile
import capa.features.extractors.elffile
import capa.features.extractors.dotnetfile
import capa.features.extractors.base_extractor
import capa.features.extractors.cape.extractor
from capa.rules import RuleSet
from capa.engine import MatchResults
from capa.exceptions import UnsupportedOSError, UnsupportedArchError, UnsupportedFormatError
from capa.features.common import (
OS_AUTO,
FORMAT_PE,
FORMAT_ELF,
FORMAT_AUTO,
FORMAT_CAPE,
FORMAT_SC32,
FORMAT_SC64,
FORMAT_DOTNET,
)
from capa.features.address import Address
from capa.features.extractors.base_extractor import (
SampleHashes,
FeatureExtractor,
StaticFeatureExtractor,
DynamicFeatureExtractor,
)
logger = logging.getLogger(__name__)
BACKEND_VIV = "vivisect"
BACKEND_DOTNET = "dotnet"
BACKEND_BINJA = "binja"
BACKEND_PEFILE = "pefile"
BACKEND_CAPE = "cape"
BACKEND_FREEZE = "freeze"
def is_supported_format(sample: Path) -> bool:
"""
Return whether this is a supported file based on magic header values
"""
taste = sample.open("rb").read(0x100)
return len(list(capa.features.extractors.common.extract_format(taste))) == 1
def is_supported_arch(sample: Path) -> bool:
buf = sample.read_bytes()
return len(list(capa.features.extractors.common.extract_arch(buf))) == 1
def get_arch(sample: Path) -> str:
buf = sample.read_bytes()
for feature, _ in capa.features.extractors.common.extract_arch(buf):
assert isinstance(feature.value, str)
return feature.value
return "unknown"
def is_supported_os(sample: Path) -> bool:
buf = sample.read_bytes()
return len(list(capa.features.extractors.common.extract_os(buf))) == 1
def get_os(sample: Path) -> str:
buf = sample.read_bytes()
for feature, _ in capa.features.extractors.common.extract_os(buf):
assert isinstance(feature.value, str)
return feature.value
return "unknown"
def get_meta_str(vw):
"""
Return workspace meta information string
"""
meta = []
for k in ["Format", "Platform", "Architecture"]:
if k in vw.metadata:
meta.append(f"{k.lower()}: {vw.metadata[k]}")
return f"{', '.join(meta)}, number of functions: {len(vw.getFunctions())}"
def get_workspace(path: Path, input_format: str, sigpaths: List[Path]):
"""
load the program at the given path into a vivisect workspace using the given format.
also apply the given FLIRT signatures.
supported formats:
- pe
- elf
- shellcode 32-bit
- shellcode 64-bit
- auto
this creates and analyzes the workspace; however, it does *not* save the workspace.
this is the responsibility of the caller.
"""
# lazy import enables us to not require viv if user wants another backend.
import viv_utils
import viv_utils.flirt
logger.debug("generating vivisect workspace for: %s", path)
if input_format == FORMAT_AUTO:
if not is_supported_format(path):
raise UnsupportedFormatError()
# don't analyze, so that we can add our Flirt function analyzer first.
vw = viv_utils.getWorkspace(str(path), analyze=False, should_save=False)
elif input_format in {FORMAT_PE, FORMAT_ELF}:
vw = viv_utils.getWorkspace(str(path), analyze=False, should_save=False)
elif input_format == FORMAT_SC32:
# these are not analyzed nor saved.
vw = viv_utils.getShellcodeWorkspaceFromFile(str(path), arch="i386", analyze=False)
elif input_format == FORMAT_SC64:
vw = viv_utils.getShellcodeWorkspaceFromFile(str(path), arch="amd64", analyze=False)
else:
raise ValueError("unexpected format: " + input_format)
viv_utils.flirt.register_flirt_signature_analyzers(vw, [str(s) for s in sigpaths])
vw.analyze()
logger.debug("%s", get_meta_str(vw))
return vw
def get_extractor(
input_path: Path,
input_format: str,
os_: str,
backend: str,
sigpaths: List[Path],
should_save_workspace=False,
disable_progress=False,
sample_path: Optional[Path] = None,
) -> FeatureExtractor:
"""
raises:
UnsupportedFormatError
UnsupportedArchError
UnsupportedOSError
"""
if backend == BACKEND_CAPE:
import capa.features.extractors.cape.extractor
report = json.loads(input_path.read_text(encoding="utf-8"))
return capa.features.extractors.cape.extractor.CapeExtractor.from_report(report)
elif backend == BACKEND_DOTNET:
import capa.features.extractors.dnfile.extractor
if input_format not in (FORMAT_PE, FORMAT_DOTNET):
raise UnsupportedFormatError()
return capa.features.extractors.dnfile.extractor.DnfileFeatureExtractor(input_path)
elif backend == BACKEND_BINJA:
import capa.helpers
from capa.features.extractors.binja.find_binja_api import find_binja_path
# When we are running as a standalone executable, we cannot directly import binaryninja
# We need to first find the binja API installation path and add it to sys.path
if capa.helpers.is_running_standalone():
bn_api = find_binja_path()
if bn_api.exists():
sys.path.append(str(bn_api))
try:
import binaryninja
from binaryninja import BinaryView
except ImportError:
raise RuntimeError(
"Cannot import binaryninja module. Please install the Binary Ninja Python API first: "
+ "https://docs.binary.ninja/dev/batch.html#install-the-api)."
)
import capa.features.extractors.binja.extractor
if input_format not in (FORMAT_SC32, FORMAT_SC64):
if not is_supported_format(input_path):
raise UnsupportedFormatError()
if not is_supported_arch(input_path):
raise UnsupportedArchError()
if os_ == OS_AUTO and not is_supported_os(input_path):
raise UnsupportedOSError()
with halo.Halo(text="analyzing program", spinner="simpleDots", stream=sys.stderr, enabled=not disable_progress):
bv: BinaryView = binaryninja.load(str(input_path))
if bv is None:
raise RuntimeError(f"Binary Ninja cannot open file {input_path}")
return capa.features.extractors.binja.extractor.BinjaFeatureExtractor(bv)
elif backend == BACKEND_PEFILE:
import capa.features.extractors.pefile
return capa.features.extractors.pefile.PefileFeatureExtractor(input_path)
elif backend == BACKEND_VIV:
import capa.features.extractors.viv.extractor
if input_format not in (FORMAT_SC32, FORMAT_SC64):
if not is_supported_format(input_path):
raise UnsupportedFormatError()
if not is_supported_arch(input_path):
raise UnsupportedArchError()
if os_ == OS_AUTO and not is_supported_os(input_path):
raise UnsupportedOSError()
with halo.Halo(text="analyzing program", spinner="simpleDots", stream=sys.stderr, enabled=not disable_progress):
vw = get_workspace(input_path, input_format, sigpaths)
if should_save_workspace:
logger.debug("saving workspace")
try:
vw.saveWorkspace()
except IOError:
# see #168 for discussion around how to handle non-writable directories
logger.info("source directory is not writable, won't save intermediate workspace")
else:
logger.debug("CAPA_SAVE_WORKSPACE unset, not saving workspace")
return capa.features.extractors.viv.extractor.VivisectFeatureExtractor(vw, input_path, os_)
elif backend == BACKEND_FREEZE:
return frz.load(input_path.read_bytes())
else:
raise ValueError("unexpected backend: " + backend)
def get_file_extractors(input_file: Path, input_format: str) -> List[FeatureExtractor]:
file_extractors: List[FeatureExtractor] = []
if input_format == FORMAT_PE:
file_extractors.append(capa.features.extractors.pefile.PefileFeatureExtractor(input_file))
elif input_format == FORMAT_DOTNET:
file_extractors.append(capa.features.extractors.pefile.PefileFeatureExtractor(input_file))
file_extractors.append(capa.features.extractors.dotnetfile.DotnetFileFeatureExtractor(input_file))
elif input_format == FORMAT_ELF:
file_extractors.append(capa.features.extractors.elffile.ElfFeatureExtractor(input_file))
elif input_format == FORMAT_CAPE:
report = json.loads(input_file.read_text(encoding="utf-8"))
file_extractors.append(capa.features.extractors.cape.extractor.CapeExtractor.from_report(report))
return file_extractors
def get_signatures(sigs_path: Path) -> List[Path]:
if not sigs_path.exists():
raise IOError(f"signatures path {sigs_path} does not exist or cannot be accessed")
paths: List[Path] = []
if sigs_path.is_file():
paths.append(sigs_path)
elif sigs_path.is_dir():
logger.debug("reading signatures from directory %s", sigs_path.resolve())
for file in sigs_path.rglob("*"):
if file.is_file() and file.suffix.lower() in (".pat", ".pat.gz", ".sig"):
paths.append(file)
# Convert paths to their absolute and normalized forms
paths = [path.resolve().absolute() for path in paths]
# load signatures in deterministic order: the alphabetic sorting of filename.
# this means that `0_sigs.pat` loads before `1_sigs.pat`.
paths = sorted(paths, key=lambda path: path.name)
for path in paths:
logger.debug("found signature file: %s", path)
return paths
def get_sample_analysis(format_, arch, os_, extractor, rules_path, counts):
if isinstance(extractor, StaticFeatureExtractor):
return rdoc.StaticAnalysis(
format=format_,
arch=arch,
os=os_,
extractor=extractor.__class__.__name__,
rules=tuple(rules_path),
base_address=frz.Address.from_capa(extractor.get_base_address()),
layout=rdoc.StaticLayout(
functions=(),
# this is updated after capabilities have been collected.
# will look like:
#
# "functions": { 0x401000: { "matched_basic_blocks": [ 0x401000, 0x401005, ... ] }, ... }
),
feature_counts=counts["feature_counts"],
library_functions=counts["library_functions"],
)
elif isinstance(extractor, DynamicFeatureExtractor):
return rdoc.DynamicAnalysis(
format=format_,
arch=arch,
os=os_,
extractor=extractor.__class__.__name__,
rules=tuple(rules_path),
layout=rdoc.DynamicLayout(
processes=(),
),
feature_counts=counts["feature_counts"],
)
else:
raise ValueError("invalid extractor type")
def collect_metadata(
argv: List[str],
input_path: Path,
input_format: str,
os_: str,
rules_path: List[Path],
extractor: FeatureExtractor,
counts: dict,
) -> rdoc.Metadata:
# if it's a binary sample we hash it, if it's a report
# we fetch the hashes from the report
sample_hashes: SampleHashes = extractor.get_sample_hashes()
md5, sha1, sha256 = sample_hashes.md5, sample_hashes.sha1, sample_hashes.sha256
global_feats = list(extractor.extract_global_features())
extractor_format = [f.value for (f, _) in global_feats if isinstance(f, capa.features.common.Format)]
extractor_arch = [f.value for (f, _) in global_feats if isinstance(f, capa.features.common.Arch)]
extractor_os = [f.value for (f, _) in global_feats if isinstance(f, capa.features.common.OS)]
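# prefer the format/arch/os values reported by the extractor; otherwise fall back to the CLI-provided values, mapping the auto sentinels to "unknown"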
input_format = (
str(extractor_format[0]) if extractor_format else "unknown" if input_format == FORMAT_AUTO else input_format
)
arch = str(extractor_arch[0]) if extractor_arch else "unknown"
os_ = str(extractor_os[0]) if extractor_os else "unknown" if os_ == OS_AUTO else os_
if isinstance(extractor, StaticFeatureExtractor):
meta_class: type = rdoc.StaticMetadata
elif isinstance(extractor, DynamicFeatureExtractor):
meta_class = rdoc.DynamicMetadata
else:
assert_never(extractor)
rules = tuple(r.resolve().absolute().as_posix() for r in rules_path)
return meta_class(
timestamp=datetime.datetime.now(),
version=capa.version.__version__,
argv=tuple(argv) if argv else None,
sample=rdoc.Sample(
md5=md5,
sha1=sha1,
sha256=sha256,
path=input_path.resolve().as_posix(),
),
analysis=get_sample_analysis(
input_format,
arch,
os_,
extractor,
rules,
counts,
),
)
def compute_dynamic_layout(
rules: RuleSet, extractor: DynamicFeatureExtractor, capabilities: MatchResults
) -> rdoc.DynamicLayout:
"""
compute a metadata structure that links threads
to the processes in which they're found.
only collect the threads at which some rule matched.
otherwise, we may pollute the json document with
a large amount of un-referenced data.
"""
assert isinstance(extractor, DynamicFeatureExtractor)
matched_calls: Set[Address] = set()
def result_rec(result: capa.features.common.Result):
for loc in result.locations:
if isinstance(loc, capa.features.address.DynamicCallAddress):
matched_calls.add(loc)
for child in result.children:
result_rec(child)
for matches in capabilities.values():
for _, result in matches:
result_rec(result)
names_by_process: Dict[Address, str] = {}
names_by_call: Dict[Address, str] = {}
matched_processes: Set[Address] = set()
matched_threads: Set[Address] = set()
threads_by_process: Dict[Address, List[Address]] = {}
calls_by_thread: Dict[Address, List[Address]] = {}
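# walk processes -> threads -> calls, recording names and keeping only containers that hold a matched call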
for p in extractor.get_processes():
threads_by_process[p.address] = []
for t in extractor.get_threads(p):
calls_by_thread[t.address] = []
for c in extractor.get_calls(p, t):
if c.address in matched_calls:
names_by_call[c.address] = extractor.get_call_name(p, t, c)
calls_by_thread[t.address].append(c.address)
if calls_by_thread[t.address]:
matched_threads.add(t.address)
threads_by_process[p.address].append(t.address)
if threads_by_process[p.address]:
matched_processes.add(p.address)
names_by_process[p.address] = extractor.get_process_name(p)
layout = rdoc.DynamicLayout(
processes=tuple(
rdoc.ProcessLayout(
address=frz.Address.from_capa(p),
name=names_by_process[p],
matched_threads=tuple(
rdoc.ThreadLayout(
address=frz.Address.from_capa(t),
matched_calls=tuple(
rdoc.CallLayout(
address=frz.Address.from_capa(c),
name=names_by_call[c],
)
for c in calls_by_thread[t]
if c in matched_calls
),
)
for t in threads
if t in matched_threads
), # this object is open to extension in the future,
# such as with the function name, etc.
)
for p, threads in threads_by_process.items()
if p in matched_processes
)
)
return layout
def compute_static_layout(rules: RuleSet, extractor: StaticFeatureExtractor, capabilities) -> rdoc.StaticLayout:
"""
compute a metadata structure that links basic blocks
to the functions in which they're found.
only collect the basic blocks at which some rule matched.
otherwise, we may pollute the json document with
a large amount of un-referenced data.
"""
functions_by_bb: Dict[Address, Address] = {}
bbs_by_function: Dict[Address, List[Address]] = {}
for f in extractor.get_functions():
bbs_by_function[f.address] = []
for bb in extractor.get_basic_blocks(f):
functions_by_bb[bb.address] = f.address
bbs_by_function[f.address].append(bb.address)
matched_bbs = set()
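# collect the addresses of basic blocks at which a basic-block-scoped rule matched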
for rule_name, matches in capabilities.items():
rule = rules[rule_name]
if capa.rules.Scope.BASIC_BLOCK in rule.scopes:
for addr, _ in matches:
assert addr in functions_by_bb
matched_bbs.add(addr)
layout = rdoc.StaticLayout(
functions=tuple(
rdoc.FunctionLayout(
address=frz.Address.from_capa(f),
matched_basic_blocks=tuple(
rdoc.BasicBlockLayout(address=frz.Address.from_capa(bb)) for bb in bbs if bb in matched_bbs
), # this object is open to extension in the future,
# such as with the function name, etc.
)
for f, bbs in bbs_by_function.items()
if len([bb for bb in bbs if bb in matched_bbs]) > 0
)
)
return layout
def compute_layout(rules: RuleSet, extractor, capabilities) -> rdoc.Layout:
if isinstance(extractor, StaticFeatureExtractor):
return compute_static_layout(rules, extractor, capabilities)
elif isinstance(extractor, DynamicFeatureExtractor):
return compute_dynamic_layout(rules, extractor, capabilities)
else:
raise ValueError("extractor must be either a static or dynamic extracotr")

File diff suppressed because it is too large

View File

@@ -33,7 +33,7 @@ def render_meta(doc: rd.ResultDocument, ostream: StringIO):
(width("md5", 22), width(doc.meta.sample.md5, 82)),
("sha1", doc.meta.sample.sha1),
("sha256", doc.meta.sample.sha256),
("analysis", doc.meta.flavor.value),
("analysis", doc.meta.flavor),
("os", doc.meta.analysis.os),
("format", doc.meta.analysis.format),
("arch", doc.meta.analysis.arch),

View File

@@ -1,7 +1,5 @@
syntax = "proto3";
package mandiant.capa;
message APIFeature {
string type = 1;
string api = 2;

File diff suppressed because one or more lines are too long

View File

@@ -160,7 +160,8 @@ class CompoundStatementType:
OPTIONAL = "optional"
class StatementModel(FrozenModel): ...
class StatementModel(FrozenModel):
...
class CompoundStatement(StatementModel):
@@ -649,9 +650,9 @@ class ResultDocument(FrozenModel):
return ResultDocument(meta=meta, rules=rule_matches)
def to_capa(self) -> Tuple[Metadata, Dict]:
capabilities: Dict[str, List[Tuple[capa.features.address.Address, capa.features.common.Result]]] = (
collections.defaultdict(list)
)
capabilities: Dict[
str, List[Tuple[capa.features.address.Address, capa.features.common.Result]]
] = collections.defaultdict(list)
# this doesn't quite work because we don't have the rule source for rules that aren't matched.
rules_by_name = {

View File

@@ -22,7 +22,6 @@ Unless required by applicable law or agreed to in writing, software distributed
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
"""
from typing import cast
import tabulate
@@ -55,6 +54,12 @@ def format_address(address: frz.Address) -> str:
assert isinstance(token, int)
assert isinstance(offset, int)
return f"token({capa.helpers.hex(token)})+{capa.helpers.hex(offset)}"
elif address.type == frz.AddressType.DEX_METHOD_INDEX:
assert isinstance(address.value, int)
return f"method({capa.helpers.hex(address.value)})"
elif address.type == frz.AddressType.DEX_CLASS_INDEX:
assert isinstance(address.value, int)
return f"class({capa.helpers.hex(address.value)})"
elif address.type == frz.AddressType.PROCESS:
assert isinstance(address.value, tuple)
ppid, pid = address.value

View File

@@ -7,8 +7,9 @@
# See the License for the specific language governing permissions and limitations under the License.
import io
import os
import re
import gzip
import json
import uuid
import codecs
import logging
@@ -26,7 +27,7 @@ except ImportError:
# https://github.com/python/mypy/issues/1153
from backports.functools_lru_cache import lru_cache # type: ignore
from typing import Any, Set, Dict, List, Tuple, Union, Callable, Iterator, Optional
from typing import Any, Set, Dict, List, Tuple, Union, Iterator, Optional
from dataclasses import asdict, dataclass
import yaml
@@ -38,13 +39,11 @@ import capa.perf
import capa.engine as ceng
import capa.features
import capa.optimizer
import capa.features.com
import capa.features.file
import capa.features.insn
import capa.features.common
import capa.features.basicblock
from capa.engine import Statement, FeatureSet
from capa.features.com import ComType
from capa.features.common import MAX_BYTES_FEATURE_SIZE, Feature
from capa.features.address import Address
@@ -329,16 +328,42 @@ def ensure_feature_valid_for_scopes(scopes: Scopes, feature: Union[Feature, Stat
raise InvalidRule(f"feature {feature} not supported for scopes {scopes}")
def translate_com_feature(com_name: str, com_type: ComType) -> ceng.Statement:
com_db = capa.features.com.load_com_database(com_type)
guids: Optional[List[str]] = com_db.get(com_name)
if not guids:
class ComType(Enum):
CLASS = "class"
INTERFACE = "interface"
# COM data source https://github.com/stevemk14ebr/COM-Code-Helper/tree/master
VALID_COM_TYPES = {
ComType.CLASS: {"db_path": "assets/classes.json.gz", "prefix": "CLSID_"},
ComType.INTERFACE: {"db_path": "assets/interfaces.json.gz", "prefix": "IID_"},
}
@lru_cache(maxsize=None)
def load_com_database(com_type: ComType) -> Dict[str, List[str]]:
com_db_path: Path = capa.main.get_default_root() / VALID_COM_TYPES[com_type]["db_path"]
if not com_db_path.exists():
raise IOError(f"COM database path '{com_db_path}' does not exist or cannot be accessed")
try:
with gzip.open(com_db_path, "rb") as gzfile:
return json.loads(gzfile.read().decode("utf-8"))
except Exception as e:
raise IOError(f"Error loading COM database from '{com_db_path}'") from e
def translate_com_feature(com_name: str, com_type: ComType) -> ceng.Or:
com_db = load_com_database(com_type)
guid_strings: Optional[List[str]] = com_db.get(com_name)
if guid_strings is None or len(guid_strings) == 0:
logger.error(" %s doesn't exist in COM %s database", com_name, com_type)
raise InvalidRule(f"'{com_name}' doesn't exist in COM {com_type} database")
com_features: List[Feature] = []
for guid in guids:
hex_chars = guid.replace("-", "")
com_features: List = []
for guid_string in guid_strings:
hex_chars = guid_string.replace("-", "")
h = [hex_chars[i : i + 2] for i in range(0, len(hex_chars), 2)]
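# reorder the hex pairs to match the in-memory GUID layout: the first three fields are little-endian, the trailing eight bytes keep their order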
reordered_hex_pairs = [
h[3],
@@ -359,10 +384,9 @@ def translate_com_feature(com_name: str, com_type: ComType) -> ceng.Statement:
h[15],
]
guid_bytes = bytes.fromhex("".join(reordered_hex_pairs))
prefix = capa.features.com.COM_PREFIXES[com_type]
symbol = prefix + com_name
com_features.append(capa.features.common.String(guid, f"{symbol} as GUID string"))
com_features.append(capa.features.common.Bytes(guid_bytes, f"{symbol} as bytes"))
prefix = VALID_COM_TYPES[com_type]["prefix"]
com_features.append(capa.features.common.StringFactory(guid_string, f"{prefix+com_name} as GUID string"))
com_features.append(capa.features.common.Bytes(guid_bytes, f"{prefix+com_name} as bytes"))
return ceng.Or(com_features)
@@ -578,9 +602,7 @@ def trim_dll_part(api: str) -> str:
# kernel32.CreateFileA
if api.count(".") == 1:
if "::" not in api:
# skip System.Convert::FromBase64String
api = api.split(".")[1]
api = api.split(".")[1]
return api
@@ -800,13 +822,11 @@ def build_statements(d, scopes: Scopes):
return feature
elif key.startswith("com/"):
com_type_name = str(key[len("com/") :])
try:
com_type = ComType(com_type_name)
except ValueError:
raise InvalidRule(f"unexpected COM type: {com_type_name}")
com_type = str(key[len("com/") :]).upper()
if com_type not in [item.name for item in ComType]:
raise InvalidRule(f"unexpected COM type: {com_type}")
value, description = parse_description(d[key], key, d.get("description"))
return translate_com_feature(value, com_type)
return translate_com_feature(value, ComType[com_type])
else:
Feature = parse_feature(key)
@@ -1692,105 +1712,3 @@ class RuleSet:
matches.update(hard_matches)
return (features3, matches)
def is_nursery_rule_path(path: Path) -> bool:
"""
The nursery is a spot for rules that have not yet been fully polished.
For example, they may not have references to public example of a technique.
Yet, we still want to capture and report on their matches.
The nursery is currently a subdirectory of the rules directory with that name.
When nursery rules are loaded, their metadata section should be updated with:
`nursery=True`.
"""
return "nursery" in path.parts
def collect_rule_file_paths(rule_paths: List[Path]) -> List[Path]:
"""
collect all rule file paths, including those in subdirectories.
"""
rule_file_paths = []
for rule_path in rule_paths:
if not rule_path.exists():
raise IOError(f"rule path {rule_path} does not exist or cannot be accessed")
if rule_path.is_file():
rule_file_paths.append(rule_path)
elif rule_path.is_dir():
logger.debug("reading rules from directory %s", rule_path)
for root, _, files in os.walk(rule_path):
if ".git" in root:
# the .github directory contains CI config in capa-rules
# this includes some .yml files
# these are not rules
# additionally, .git has files that are not .yml and generate the warning
# skip those too
continue
for file in files:
if not file.endswith(".yml"):
if not (file.startswith(".git") or file.endswith((".git", ".md", ".txt"))):
# expect to see .git* files, readme.md, format.md, and maybe a .git directory
# other things maybe are rules, but are mis-named.
logger.warning("skipping non-.yml file: %s", file)
continue
rule_file_paths.append(Path(root) / file)
return rule_file_paths
# TypeAlias. note: using `foo: TypeAlias = bar` is Python 3.10+
RulePath = Path
def on_load_rule_default(_path: RulePath, i: int, _total: int) -> None:
return
def get_rules(
rule_paths: List[RulePath],
cache_dir=None,
on_load_rule: Callable[[RulePath, int, int], None] = on_load_rule_default,
) -> RuleSet:
"""
args:
rule_paths: list of paths to rules files or directories containing rules files
cache_dir: directory to use for caching rules, or will use the default detected cache directory if None
on_load_rule: callback to invoke before a rule is loaded, use for progress or cancellation
"""
if cache_dir is None:
cache_dir = capa.rules.cache.get_default_cache_directory()
# rule_paths may contain directory paths,
# so search for file paths recursively.
rule_file_paths = collect_rule_file_paths(rule_paths)
# this list is parallel to `rule_file_paths`:
# rule_file_paths[i] corresponds to rule_contents[i].
rule_contents = [file_path.read_bytes() for file_path in rule_file_paths]
ruleset = capa.rules.cache.load_cached_ruleset(cache_dir, rule_contents)
if ruleset is not None:
return ruleset
rules: List[Rule] = []
total_rule_count = len(rule_file_paths)
for i, (path, content) in enumerate(zip(rule_file_paths, rule_contents)):
on_load_rule(path, i, total_rule_count)
try:
rule = capa.rules.Rule.from_yaml(content.decode("utf-8"))
except capa.rules.InvalidRule:
raise
else:
rule.meta["capa/path"] = path.as_posix()
rule.meta["capa/nursery"] = is_nursery_rule_path(path)
rules.append(rule)
logger.debug("loaded rule: '%s' with scope: %s", rule.name, rule.scopes)
ruleset = capa.rules.RuleSet(rules)
capa.rules.cache.cache_ruleset(cache_dir, ruleset)
return ruleset

View File

@@ -5,7 +5,7 @@
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.
__version__ = "7.0.0"
__version__ = "6.1.0"
def get_major_version():

View File

@@ -36,8 +36,8 @@ dependencies = [
"pyyaml==6.0.1",
"tabulate==0.9.0",
"colorama==0.4.6",
"termcolor==2.4.0",
"wcwidth==0.2.13",
"termcolor==2.3.0",
"wcwidth==0.2.12",
"ida-settings==2.1.0",
"viv-utils[flirt]==0.7.9",
"halo==0.0.31",
@@ -50,25 +50,25 @@ dependencies = [
"dncil==1.0.2",
"pydantic==2.4.0",
"protobuf==4.23.4",
"dexparser==1.2.0",
]
dynamic = ["version"]
[tool.setuptools.dynamic]
version = {attr = "capa.version.__version__"}
[tool.setuptools.packages.find]
include = ["capa*"]
namespaces = false
[tool.setuptools]
packages = ["capa"]
[project.optional-dependencies]
dev = [
"pre-commit==3.5.0",
"pytest==8.0.0",
"pytest==7.4.3",
"pytest-sugar==0.9.7",
"pytest-instafail==0.5.0",
"pytest-cov==4.1.0",
"flake8==7.0.0",
"flake8-bugbear==24.1.17",
"flake8==6.1.0",
"flake8-bugbear==23.11.26",
"flake8-encodings==0.5.1",
"flake8-comprehensions==3.14.0",
"flake8-logging-format==0.9.0",
@@ -78,10 +78,10 @@ dev = [
"flake8-simplify==0.21.0",
"flake8-use-pathlib==0.3.0",
"flake8-copyright==0.2.4",
"ruff==0.1.14",
"black==24.1.1",
"isort==5.13.2",
"mypy==1.8.0",
"ruff==0.1.6",
"black==23.11.0",
"isort==5.11.4",
"mypy==1.7.1",
"psutil==5.9.2",
"stix2==3.0.1",
"requests==2.31.0",
@@ -90,15 +90,15 @@ dev = [
"types-backports==0.1.3",
"types-colorama==0.4.15.11",
"types-PyYAML==6.0.8",
"types-tabulate==0.9.0.20240106",
"types-tabulate==0.9.0.3",
"types-termcolor==1.1.4",
"types-psutil==5.8.23",
"types_requests==2.31.0.20240125",
"types_requests==2.31.0.10",
"types-protobuf==4.23.0.3",
]
build = [
"pyinstaller==6.3.0",
"setuptools==69.0.3",
"pyinstaller==6.2.0",
"setuptools==69.0.2",
"build==1.0.3"
]

2
rules

Submodule rules updated: 48dfd001d8...57b3911a72

View File

@@ -36,7 +36,7 @@ example:
usage:
usage: bulk-process.py [-h] [-r RULES] [-d] [-q] [-n PARALLELISM] [--no-mp]
input_directory
input
detect capabilities in programs.
@@ -62,6 +62,7 @@ Unless required by applicable law or agreed to in writing, software distributed
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
"""
import os
import sys
import json
import logging
@@ -73,10 +74,10 @@ from pathlib import Path
import capa
import capa.main
import capa.rules
import capa.loader
import capa.render.json
import capa.capabilities.common
import capa.render.result_document as rd
from capa.features.common import OS_AUTO
logger = logging.getLogger("capa")
@@ -86,8 +87,11 @@ def get_capa_results(args):
run capa against the file at the given path, using the given rules.
args is a tuple, containing:
rules, signatures, format, backend, os, input_file
as provided via the CLI arguments.
rules (capa.rules.RuleSet): the rules to match
signatures (List[str]): list of file system paths to signature files
format (str): the name of the sample file format
os (str): the name of the operating system
path (str): the file system path to the sample to process
args is a tuple because i'm not quite sure how to unpack multiple arguments using `map`.
@@ -102,58 +106,44 @@ def get_capa_results(args):
meta (dict): the meta analysis results
capabilities (dict): the matched capabilities and their result objects
"""
rules, signatures, format_, backend, os_, input_file = args
parser = argparse.ArgumentParser(description="detect capabilities in programs.")
capa.main.install_common_args(parser, wanted={"rules", "signatures", "format", "os", "backend", "input_file"})
argv = [
"--signatures",
signatures,
"--format",
format_,
"--backend",
backend,
"--os",
os_,
input_file,
]
if rules:
argv += ["--rules", rules]
args = parser.parse_args(args=argv)
rules, sigpaths, format, os_, path = args
should_save_workspace = os.environ.get("CAPA_SAVE_WORKSPACE") not in ("0", "no", "NO", "n", None)
logger.info("computing capa results for: %s", path)
try:
capa.main.handle_common_args(args)
capa.main.ensure_input_exists_from_cli(args)
input_format = capa.main.get_input_format_from_cli(args)
rules = capa.main.get_rules_from_cli(args)
backend = capa.main.get_backend_from_cli(args, input_format)
sample_path = capa.main.get_sample_path_from_cli(args, backend)
if sample_path is None:
os_ = "unknown"
else:
os_ = capa.loader.get_os(sample_path)
extractor = capa.main.get_extractor_from_cli(args, input_format, backend)
except capa.main.ShouldExitError as e:
# i'm not 100% sure if multiprocessing will reliably raise exceptions across process boundaries.
extractor = capa.main.get_extractor(
path, format, os_, capa.main.BACKEND_VIV, sigpaths, should_save_workspace, disable_progress=True
)
except capa.exceptions.UnsupportedFormatError:
# i'm not 100% sure if multiprocessing will reliably raise exceptions across process boundaries.
# so instead, return an object with explicit success/failure status.
#
# if success, then status=ok, and results found in property "ok"
# if error, then status=error, and human readable message in property "error"
return {"path": input_file, "status": "error", "error": str(e), "status_code": e.status_code}
return {
"path": path,
"status": "error",
"error": f"input file does not appear to be a PE file: {path}",
}
except capa.exceptions.UnsupportedRuntimeError:
return {
"path": path,
"status": "error",
"error": "unsupported runtime or Python interpreter",
}
except Exception as e:
return {
"path": input_file,
"path": path,
"status": "error",
"error": f"unexpected error: {e}",
}
capabilities, counts = capa.capabilities.common.find_capabilities(rules, extractor, disable_progress=True)
meta = capa.loader.collect_metadata(argv, args.input_file, format_, os_, [], extractor, counts)
meta.analysis.layout = capa.loader.compute_layout(rules, extractor, capabilities)
meta = capa.main.collect_metadata([], path, format, os_, [], extractor, counts)
meta.analysis.layout = capa.main.compute_layout(rules, extractor, capabilities)
doc = rd.ResultDocument.from_capa(meta, rules, capabilities)
return {"path": input_file, "status": "ok", "ok": doc.model_dump()}
return {"path": path, "status": "ok", "ok": doc.model_dump()}
def main(argv=None):
@@ -161,16 +151,30 @@ def main(argv=None):
argv = sys.argv[1:]
parser = argparse.ArgumentParser(description="detect capabilities in programs.")
capa.main.install_common_args(parser, wanted={"rules", "signatures", "format", "os", "backend"})
parser.add_argument("input_directory", type=str, help="Path to directory of files to recursively analyze")
capa.main.install_common_args(parser, wanted={"rules", "signatures", "format", "os"})
parser.add_argument("input", type=str, help="Path to directory of files to recursively analyze")
parser.add_argument(
"-n", "--parallelism", type=int, default=multiprocessing.cpu_count(), help="parallelism factor"
)
parser.add_argument("--no-mp", action="store_true", help="disable subprocesses")
args = parser.parse_args(args=argv)
capa.main.handle_common_args(args)
try:
rules = capa.main.get_rules(args.rules)
logger.info("successfully loaded %s rules", len(rules))
except (IOError, capa.rules.InvalidRule, capa.rules.InvalidRuleSet) as e:
logger.error("%s", str(e))
return -1
try:
sig_paths = capa.main.get_signatures(args.signatures)
except IOError as e:
logger.error("%s", str(e))
return -1
samples = []
for file in Path(args.input_directory).rglob("*"):
for file in Path(args.input).rglob("*"):
samples.append(file)
cpu_count = multiprocessing.cpu_count()
@@ -199,22 +203,18 @@ def main(argv=None):
logger.debug("using process mapper")
mapper = pmap
rules = args.rules
if rules == [capa.main.RULES_PATH_DEFAULT_STRING]:
rules = None
results = {}
for result in mapper(
get_capa_results,
[(rules, args.signatures, args.format, args.backend, args.os, str(sample)) for sample in samples],
[(rules, sig_paths, "pe", OS_AUTO, sample) for sample in samples],
parallelism=args.parallelism,
):
if result["status"] == "error":
logger.warning(result["error"])
elif result["status"] == "ok":
doc = rd.ResultDocument.model_validate(result["ok"]).model_dump_json(exclude_none=True)
results[result["path"]] = json.loads(doc)
results[result["path"].as_posix()] = rd.ResultDocument.model_validate(result["ok"]).model_dump_json(
exclude_none=True
)
else:
raise ValueError(f"unexpected status: {result['status']}")

View File

@@ -15,7 +15,6 @@ Unless required by applicable law or agreed to in writing, software distributed
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
"""
import sys
import logging
import argparse
@@ -37,27 +36,20 @@ def main(argv=None):
parser = argparse.ArgumentParser(description="Cache ruleset.")
capa.main.install_common_args(parser)
parser.add_argument("rules", type=str, help="Path to rules directory")
parser.add_argument("rules", type=str, action="append", help="Path to rules")
parser.add_argument("cache", type=str, help="Path to cache directory")
args = parser.parse_args(args=argv)
capa.main.handle_common_args(args)
# don't use capa.main.handle_common_args
# because it expects a different format for the --rules argument
if args.quiet:
logging.basicConfig(level=logging.WARNING)
logging.getLogger().setLevel(logging.WARNING)
elif args.debug:
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
if args.debug:
logging.getLogger("capa").setLevel(logging.DEBUG)
else:
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
logging.getLogger("capa").setLevel(logging.ERROR)
try:
cache_dir = Path(args.cache)
cache_dir.mkdir(parents=True, exist_ok=True)
rules = capa.rules.get_rules([Path(args.rules)], cache_dir)
rules = capa.main.get_rules(args.rules, cache_dir)
logger.info("successfully loaded %s rules", len(rules))
except (IOError, capa.rules.InvalidRule, capa.rules.InvalidRuleSet) as e:
logger.error("%s", str(e))

View File

@@ -61,22 +61,7 @@ var_names = ["".join(letters) for letters in itertools.product(string.ascii_lowe
# these have to be the internal names used by capa.py, which are sometimes different from the ones written out in the rules, e.g. "2 or more" is "Some", count is Range
unsupported = [
"characteristic",
"mnemonic",
"offset",
"subscope",
"Range",
"os",
"property",
"format",
"class",
"operand[0].number",
"operand[1].number",
"substring",
"arch",
"namespace",
]
unsupported = ["characteristic", "mnemonic", "offset", "subscope", "Range"]
# further idea: shorten this list, possible stuff:
# - 2 or more strings: e.g.
# -- https://github.com/mandiant/capa-rules/blob/master/collection/file-managers/gather-direct-ftp-information.yml
@@ -105,7 +90,8 @@ condition_header = """
condition_rule = """
private rule capa_pe_file : CAPA {
meta:
description = "Match in PE files. Used by other CAPA rules"
description = "match in PE files. used by all further CAPA rules"
author = "Arnim Rupp"
condition:
uint16be(0) == 0x4d5a
or uint16be(0) == 0x558b
@@ -723,33 +709,36 @@ def main(argv=None):
argv = sys.argv[1:]
parser = argparse.ArgumentParser(description="Capa to YARA rule converter")
capa.main.install_common_args(parser, wanted={"tag"})
parser.add_argument("rules", type=str, help="Path to rules")
parser.add_argument("--private", "-p", action="store_true", help="Create private rules", default=False)
parser.add_argument("rules", type=str, help="Path to rules directory")
capa.main.install_common_args(parser, wanted={"tag"})
args = parser.parse_args(args=argv)
make_priv = args.private
# don't use capa.main.handle_common_args
# because it expects a different format for the --rules argument
if args.quiet:
logging.basicConfig(level=logging.WARNING)
logging.getLogger().setLevel(logging.WARNING)
elif args.debug:
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
if args.verbose:
level = logging.DEBUG
elif args.quiet:
level = logging.ERROR
else:
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
level = logging.INFO
logging.basicConfig(level=level)
logging.getLogger("capa2yara").setLevel(level)
try:
rules = capa.rules.get_rules([Path(args.rules)])
logger.info("successfully loaded %s rules", len(rules))
rules = capa.main.get_rules([Path(args.rules)])
namespaces = capa.rules.index_rules_by_namespace(list(rules.rules.values()))
logger.info("successfully loaded %d rules (including subscope rules which will be ignored)", len(rules))
if args.tag:
rules = rules.filter_rules_by_meta(args.tag)
logger.debug("selected %d rules", len(rules))
for i, r in enumerate(rules.rules, 1):
logger.debug(" %d. %s", i, r)
except (IOError, capa.rules.InvalidRule, capa.rules.InvalidRuleSet) as e:
logger.error("%s", str(e))
return -1
namespaces = capa.rules.index_rules_by_namespace(list(rules.rules.values()))
output_yar(
"// Rules from Mandiant's https://github.com/mandiant/capa-rules converted to YARA using https://github.com/mandiant/capa/blob/master/scripts/capa2yara.py by Arnim Rupp"
)
@@ -777,10 +766,10 @@ def main(argv=None):
cround += 1
logger.info("doing convert_rules(), round: %d", cround)
num_rules = len(converted_rules)
count_incomplete += convert_rules(rules, namespaces, cround, args.private)
count_incomplete += convert_rules(rules, namespaces, cround, make_priv)
# one last round to collect all unconverted rules
count_incomplete += convert_rules(rules, namespaces, 9000, args.private)
count_incomplete += convert_rules(rules, namespaces, 9000, make_priv)
stats = "\n// converted rules : " + str(len(converted_rules))
stats += "\n// among those are incomplete : " + str(count_incomplete)
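A small sketch of the tag filtering used here, assuming the RuleSet API shown above; the tag value is hypothetical:

from pathlib import Path

import capa.rules

rules = capa.rules.get_rules([Path("rules")])
# keep only rules whose meta contains the given tag
tagged = rules.filter_rules_by_meta("persistence")
for i, name in enumerate(tagged.rules, 1):
    print(f"{i}. {name}")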

View File

@@ -15,7 +15,6 @@ from pathlib import Path
import capa.main
import capa.rules
import capa.engine
import capa.loader
import capa.features
import capa.render.json
import capa.render.utils as rutils
@@ -169,19 +168,19 @@ def render_dictionary(doc: rd.ResultDocument) -> Dict[str, Any]:
# ==== render dictionary helpers
def capa_details(rules_path: Path, input_file: Path, output_format="dictionary"):
def capa_details(rules_path: Path, file_path: Path, output_format="dictionary"):
# load rules from disk
rules = capa.rules.get_rules([rules_path])
rules = capa.main.get_rules([rules_path])
# extract features and find capabilities
extractor = capa.loader.get_extractor(
input_file, FORMAT_AUTO, OS_AUTO, capa.main.BACKEND_VIV, [], should_save_workspace=False, disable_progress=True
extractor = capa.main.get_extractor(
file_path, FORMAT_AUTO, OS_AUTO, capa.main.BACKEND_VIV, [], False, disable_progress=True
)
capabilities, counts = capa.capabilities.common.find_capabilities(rules, extractor, disable_progress=True)
# collect metadata (used only to make rendering more complete)
meta = capa.loader.collect_metadata([], input_file, FORMAT_AUTO, OS_AUTO, [rules_path], extractor, counts)
meta.analysis.layout = capa.loader.compute_layout(rules, extractor, capabilities)
meta = capa.main.collect_metadata([], file_path, FORMAT_AUTO, OS_AUTO, [rules_path], extractor, counts)
meta.analysis.layout = capa.main.compute_layout(rules, extractor, capabilities)
capa_output: Any = False
@@ -207,7 +206,7 @@ if __name__ == "__main__":
RULES_PATH = capa.main.get_default_root() / "rules"
parser = argparse.ArgumentParser(description="Extract capabilities from a file")
parser.add_argument("input_file", help="file to extract capabilities from")
parser.add_argument("file", help="file to extract capabilities from")
parser.add_argument("--rules", help="path to rules directory", default=RULES_PATH)
parser.add_argument(
"--output", help="output format", choices=["dictionary", "json", "texttable"], default="dictionary"
@@ -215,5 +214,5 @@ if __name__ == "__main__":
args = parser.parse_args()
if args.rules != RULES_PATH:
args.rules = Path(args.rules)
print(capa_details(args.rules, Path(args.input_file), args.output))
print(capa_details(args.rules, Path(args.file), args.output))
sys.exit(0)

View File

@@ -14,13 +14,11 @@ Unless required by applicable law or agreed to in writing, software distributed
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
"""
import sys
import logging
import argparse
from pathlib import Path
import capa.main
import capa.rules
logger = logging.getLogger("capafmt")
@@ -31,7 +29,6 @@ def main(argv=None):
argv = sys.argv[1:]
parser = argparse.ArgumentParser(description="Capa rule formatter.")
capa.main.install_common_args(parser)
parser.add_argument("path", type=str, help="Path to rule to format")
parser.add_argument(
"-i",
@@ -40,6 +37,8 @@ def main(argv=None):
dest="in_place",
help="Format the rule in place, otherwise, write formatted rule to STDOUT",
)
parser.add_argument("-v", "--verbose", action="store_true", help="Enable debug logging")
parser.add_argument("-q", "--quiet", action="store_true", help="Disable all output but errors")
parser.add_argument(
"-c",
"--check",
@@ -48,10 +47,15 @@ def main(argv=None):
)
args = parser.parse_args(args=argv)
try:
capa.main.handle_common_args(args)
except capa.main.ShouldExitError as e:
return e.status_code
if args.verbose:
level = logging.DEBUG
elif args.quiet:
level = logging.ERROR
else:
level = logging.INFO
logging.basicConfig(level=level)
logging.getLogger("capafmt").setLevel(level)
rule = capa.rules.Rule.from_yaml_file(args.path, use_ruamel=True)
reformatted_rule = rule.to_yaml()
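The reformatting round-trip above reduces to a pair of calls; a sketch, with a hypothetical rule path:

import capa.rules

# ruamel preserves comments and key order, so the reformatted output stays reviewable
rule = capa.rules.Rule.from_yaml_file("rules/example.yml", use_ruamel=True)
print(rule.to_yaml())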

View File

@@ -17,8 +17,8 @@ import logging
import argparse
import contextlib
from typing import BinaryIO
from pathlib import Path
import capa.main
import capa.helpers
import capa.features.extractors.elf
@@ -36,16 +36,28 @@ def main(argv=None):
argv = sys.argv[1:]
parser = argparse.ArgumentParser(description="Detect the underlying OS for the given ELF file")
capa.main.install_common_args(parser, wanted={"input_file"})
parser.add_argument("sample", type=str, help="path to ELF file")
logging_group = parser.add_argument_group("logging arguments")
logging_group.add_argument("-d", "--debug", action="store_true", help="enable debugging output on STDERR")
logging_group.add_argument(
"-q", "--quiet", action="store_true", help="disable all status output except fatal errors"
)
args = parser.parse_args(args=argv)
try:
capa.main.handle_common_args(args)
capa.main.ensure_input_exists_from_cli(args)
except capa.main.ShouldExitError as e:
return e.status_code
if args.quiet:
logging.basicConfig(level=logging.WARNING)
logging.getLogger().setLevel(logging.WARNING)
elif args.debug:
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
else:
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
f = args.input_file.open("rb")
f = Path(args.sample).open("rb")
with contextlib.closing(f):
try:

View File

@@ -48,7 +48,7 @@ def find_overlapping_rules(new_rule_path, rules_path):
overlapping_rules = []
# capa.rules.RuleSet stores all rules in given paths
ruleset = capa.rules.get_rules(rules_path)
ruleset = capa.main.get_rules(rules_path)
for rule_name, rule in ruleset.rules.items():
rule_features = rule.extract_all_features()

View File

@@ -28,7 +28,6 @@ Unless required by applicable law or agreed to in writing, software distributed
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
"""
import logging
import binascii
from pathlib import Path

View File

@@ -13,7 +13,6 @@ Unless required by applicable law or agreed to in writing, software distributed
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
"""
import gc
import os
import re
@@ -40,7 +39,6 @@ import tqdm.contrib.logging
import capa.main
import capa.rules
import capa.engine
import capa.loader
import capa.helpers
import capa.features.insn
import capa.capabilities.common
@@ -309,8 +307,9 @@ class InvalidAttckOrMbcTechnique(Lint):
with data_path.open("rb") as fd:
self.data = json.load(fd)
self.enabled_frameworks = self.data.keys()
except (FileNotFoundError, json.decoder.JSONDecodeError):
# linter-data.json missing, or JSON error: log an error and skip this lint
except BaseException:
# if linter-data.json is not present, or if an error happens,
# we log an error and lint nothing.
logger.warning(
"Could not load 'scripts/linter-data.json'. The att&ck and mbc information will not be linted."
)
@@ -356,20 +355,16 @@ def get_sample_capabilities(ctx: Context, path: Path) -> Set[str]:
logger.debug("found cached results: %s: %d capabilities", nice_path, len(ctx.capabilities_by_sample[path]))
return ctx.capabilities_by_sample[path]
if nice_path.name.endswith(capa.helpers.EXTENSIONS_SHELLCODE_32):
format_ = "sc32"
elif nice_path.name.endswith(capa.helpers.EXTENSIONS_SHELLCODE_64):
format_ = "sc64"
else:
format_ = capa.helpers.get_auto_format(nice_path)
logger.debug("analyzing sample: %s", nice_path)
args = argparse.Namespace(input_file=nice_path, format=capa.main.FORMAT_AUTO, backend=capa.main.BACKEND_AUTO)
format_ = capa.main.get_input_format_from_cli(args)
backend = capa.main.get_backend_from_cli(args, format_)
extractor = capa.loader.get_extractor(
nice_path,
format_,
OS_AUTO,
backend,
DEFAULT_SIGNATURES,
should_save_workspace=False,
disable_progress=True,
extractor = capa.main.get_extractor(
nice_path, format_, OS_AUTO, capa.main.BACKEND_VIV, DEFAULT_SIGNATURES, False, disable_progress=True
)
capabilities, _ = capa.capabilities.common.find_capabilities(ctx.rules, extractor, disable_progress=True)
@@ -654,6 +649,16 @@ class FeatureNtdllNtoskrnlApi(Lint):
return False
class FormatLineFeedEOL(Lint):
name = "line(s) end with CRLF (\\r\\n)"
recommendation = "convert line endings to LF (\\n) for example using dos2unix"
def check_rule(self, ctx: Context, rule: Rule):
# report a problem (True) when the rule definition contains CRLF line endings
if "\r\n" in rule.definition:
return True
return False
class FormatSingleEmptyLineEOF(Lint):
name = "EOF format"
recommendation = "end file with a single empty line"
@@ -669,14 +674,16 @@ class FormatIncorrect(Lint):
recommendation_template = "use scripts/capafmt.py or adjust as follows\n{:s}"
def check_rule(self, ctx: Context, rule: Rule):
# EOL depends on Git, and our .gitattributes defines text=auto (Git handles files as it thinks best)
# we prefer LF only, but enforcing across OSs seems tedious and unnecessary
actual = rule.definition.replace("\r\n", "\n")
actual = rule.definition
expected = capa.rules.Rule.from_yaml(rule.definition, use_ruamel=True).to_yaml()
if actual != expected:
diff = difflib.ndiff(actual.splitlines(True), expected.splitlines(True))
recommendation_template = self.recommendation_template
if "\r\n" in actual:
recommendation_template = (
self.recommendation_template + "\nplease make sure that the file uses LF (\\n) line endings only"
)
self.recommendation = recommendation_template.format("".join(diff))
return True
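The recommendation string above is built with difflib; a self-contained sketch of the same ndiff call:

import difflib

actual = "rule:\n  meta:\n    name: demo\n"
expected = "rule:\n    meta:\n        name: demo\n"

# keepends=True so each diff line carries its own newline
diff = difflib.ndiff(actual.splitlines(True), expected.splitlines(True))
print("".join(diff))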
@@ -790,6 +797,7 @@ def lint_features(ctx: Context, rule: Rule):
FORMAT_LINTS = (
FormatLineFeedEOL(),
FormatSingleEmptyLineEOF(),
FormatStringQuotesIncorrect(),
FormatIncorrect(),
@@ -982,11 +990,7 @@ def main(argv=None):
help="Enable thorough linting - takes more time, but does a better job",
)
args = parser.parse_args(args=argv)
try:
capa.main.handle_common_args(args)
except capa.main.ShouldExitError as e:
return e.status_code
capa.main.handle_common_args(args)
if args.debug:
logging.getLogger("capa").setLevel(logging.DEBUG)
@@ -998,9 +1002,16 @@ def main(argv=None):
time0 = time.time()
try:
rules = capa.main.get_rules_from_cli(args)
except capa.main.ShouldExitError as e:
return e.status_code
rules = capa.main.get_rules(args.rules)
logger.info("successfully loaded %s rules", rules.source_rule_count)
if args.tag:
rules = rules.filter_rules_by_meta(args.tag)
logger.debug("selected %s rules", len(rules))
for i, r in enumerate(rules.rules, 1):
logger.debug(" %d. %s", i, r)
except (IOError, capa.rules.InvalidRule, capa.rules.InvalidRuleSet) as e:
logger.error("%s", str(e))
return -1
logger.info("collecting potentially referenced samples")
samples_path = Path(args.samples)

View File

@@ -43,8 +43,7 @@
"T1598": "Phishing for Information",
"T1598.001": "Phishing for Information::Spearphishing Service",
"T1598.002": "Phishing for Information::Spearphishing Attachment",
"T1598.003": "Phishing for Information::Spearphishing Link",
"T1598.004": "Phishing for Information::Spearphishing Voice"
"T1598.003": "Phishing for Information::Spearphishing Link"
},
"Resource Development": {
"T1583": "Acquire Infrastructure",
@@ -112,9 +111,7 @@
"T1566": "Phishing",
"T1566.001": "Phishing::Spearphishing Attachment",
"T1566.002": "Phishing::Spearphishing Link",
"T1566.003": "Phishing::Spearphishing via Service",
"T1566.004": "Phishing::Spearphishing Voice",
"T1659": "Content Injection"
"T1566.003": "Phishing::Spearphishing via Service"
},
"Execution": {
"T1047": "Windows Management Instrumentation",
@@ -178,7 +175,6 @@
"T1098.003": "Account Manipulation::Additional Cloud Roles",
"T1098.004": "Account Manipulation::SSH Authorized Keys",
"T1098.005": "Account Manipulation::Device Registration",
"T1098.006": "Account Manipulation::Additional Container Cluster Roles",
"T1133": "External Remote Services",
"T1136": "Create Account",
"T1136.001": "Create Account::Local Account",
@@ -268,8 +264,7 @@
"T1574.010": "Hijack Execution Flow::Services File Permissions Weakness",
"T1574.011": "Hijack Execution Flow::Services Registry Permissions Weakness",
"T1574.012": "Hijack Execution Flow::COR_PROFILER",
"T1574.013": "Hijack Execution Flow::KernelCallbackTable",
"T1653": "Power Settings"
"T1574.013": "Hijack Execution Flow::KernelCallbackTable"
},
"Privilege Escalation": {
"T1037": "Boot or Logon Initialization Scripts",
@@ -303,13 +298,6 @@
"T1078.002": "Valid Accounts::Domain Accounts",
"T1078.003": "Valid Accounts::Local Accounts",
"T1078.004": "Valid Accounts::Cloud Accounts",
"T1098": "Account Manipulation",
"T1098.001": "Account Manipulation::Additional Cloud Credentials",
"T1098.002": "Account Manipulation::Additional Email Delegate Permissions",
"T1098.003": "Account Manipulation::Additional Cloud Roles",
"T1098.004": "Account Manipulation::SSH Authorized Keys",
"T1098.005": "Account Manipulation::Device Registration",
"T1098.006": "Account Manipulation::Additional Container Cluster Roles",
"T1134": "Access Token Manipulation",
"T1134.001": "Access Token Manipulation::Token Impersonation/Theft",
"T1134.002": "Access Token Manipulation::Create Process with Token",
@@ -361,7 +349,6 @@
"T1548.002": "Abuse Elevation Control Mechanism::Bypass User Account Control",
"T1548.003": "Abuse Elevation Control Mechanism::Sudo and Sudo Caching",
"T1548.004": "Abuse Elevation Control Mechanism::Elevated Execution with Prompt",
"T1548.005": "Abuse Elevation Control Mechanism::Temporary Elevated Cloud Access",
"T1574": "Hijack Execution Flow",
"T1574.001": "Hijack Execution Flow::DLL Search Order Hijacking",
"T1574.002": "Hijack Execution Flow::DLL Side-Loading",
@@ -392,7 +379,6 @@
"T1027.009": "Obfuscated Files or Information::Embedded Payloads",
"T1027.010": "Obfuscated Files or Information::Command Obfuscation",
"T1027.011": "Obfuscated Files or Information::Fileless Storage",
"T1027.012": "Obfuscated Files or Information::LNK Icon Smuggling",
"T1036": "Masquerading",
"T1036.001": "Masquerading::Invalid Code Signature",
"T1036.002": "Masquerading::Right-to-Left Override",
@@ -402,7 +388,6 @@
"T1036.006": "Masquerading::Space after Filename",
"T1036.007": "Masquerading::Double File Extension",
"T1036.008": "Masquerading::Masquerade File Type",
"T1036.009": "Masquerading::Break Process Trees",
"T1055": "Process Injection",
"T1055.001": "Process Injection::Dynamic-link Library Injection",
"T1055.002": "Process Injection::Portable Executable Injection",
@@ -490,7 +475,6 @@
"T1548.002": "Abuse Elevation Control Mechanism::Bypass User Account Control",
"T1548.003": "Abuse Elevation Control Mechanism::Sudo and Sudo Caching",
"T1548.004": "Abuse Elevation Control Mechanism::Elevated Execution with Prompt",
"T1548.005": "Abuse Elevation Control Mechanism::Temporary Elevated Cloud Access",
"T1550": "Use Alternate Authentication Material",
"T1550.001": "Use Alternate Authentication Material::Application Access Token",
"T1550.002": "Use Alternate Authentication Material::Pass the Hash",
@@ -519,11 +503,10 @@
"T1562.004": "Impair Defenses::Disable or Modify System Firewall",
"T1562.006": "Impair Defenses::Indicator Blocking",
"T1562.007": "Impair Defenses::Disable or Modify Cloud Firewall",
"T1562.008": "Impair Defenses::Disable or Modify Cloud Logs",
"T1562.008": "Impair Defenses::Disable Cloud Logs",
"T1562.009": "Impair Defenses::Safe Mode Boot",
"T1562.010": "Impair Defenses::Downgrade Attack",
"T1562.011": "Impair Defenses::Spoof Security Alerting",
"T1562.012": "Impair Defenses::Disable or Modify Linux Audit System",
"T1564": "Hide Artifacts",
"T1564.001": "Hide Artifacts::Hidden Files and Directories",
"T1564.002": "Hide Artifacts::Hidden Users",
@@ -535,7 +518,6 @@
"T1564.008": "Hide Artifacts::Email Hiding Rules",
"T1564.009": "Hide Artifacts::Resource Forking",
"T1564.010": "Hide Artifacts::Process Argument Spoofing",
"T1564.011": "Hide Artifacts::Ignore Process Interrupts",
"T1574": "Hijack Execution Flow",
"T1574.001": "Hijack Execution Flow::DLL Search Order Hijacking",
"T1574.002": "Hijack Execution Flow::DLL Side-Loading",
@@ -554,7 +536,6 @@
"T1578.002": "Modify Cloud Compute Infrastructure::Create Cloud Instance",
"T1578.003": "Modify Cloud Compute Infrastructure::Delete Cloud Instance",
"T1578.004": "Modify Cloud Compute Infrastructure::Revert Cloud Instance",
"T1578.005": "Modify Cloud Compute Infrastructure::Modify Cloud Compute Configurations",
"T1599": "Network Boundary Bridging",
"T1599.001": "Network Boundary Bridging::Network Address Translation Traversal",
"T1600": "Weaken Encryption",
@@ -567,8 +548,7 @@
"T1612": "Build Image on Host",
"T1620": "Reflective Code Loading",
"T1622": "Debugger Evasion",
"T1647": "Plist File Modification",
"T1656": "Impersonation"
"T1647": "Plist File Modification"
},
"Credential Access": {
"T1003": "OS Credential Dumping",
@@ -611,7 +591,6 @@
"T1555.003": "Credentials from Password Stores::Credentials from Web Browsers",
"T1555.004": "Credentials from Password Stores::Windows Credential Manager",
"T1555.005": "Credentials from Password Stores::Password Managers",
"T1555.006": "Credentials from Password Stores::Cloud Secrets Management Stores",
"T1556": "Modify Authentication Process",
"T1556.001": "Modify Authentication Process::Domain Controller Authentication",
"T1556.002": "Modify Authentication Process::Password Filter DLL",
@@ -642,7 +621,6 @@
"T1012": "Query Registry",
"T1016": "System Network Configuration Discovery",
"T1016.001": "System Network Configuration Discovery::Internet Connection Discovery",
"T1016.002": "System Network Configuration Discovery::Wi-Fi Discovery",
"T1018": "Remote System Discovery",
"T1033": "System Owner/User Discovery",
"T1040": "Network Sniffing",
@@ -681,8 +659,7 @@
"T1615": "Group Policy Discovery",
"T1619": "Cloud Storage Object Discovery",
"T1622": "Debugger Evasion",
"T1652": "Device Driver Discovery",
"T1654": "Log Enumeration"
"T1652": "Device Driver Discovery"
},
"Lateral Movement": {
"T1021": "Remote Services",
@@ -693,7 +670,6 @@
"T1021.005": "Remote Services::VNC",
"T1021.006": "Remote Services::Windows Remote Management",
"T1021.007": "Remote Services::Cloud Services",
"T1021.008": "Remote Services::Direct Cloud VM Connections",
"T1072": "Software Deployment Tools",
"T1080": "Taint Shared Content",
"T1091": "Replication Through Removable Media",
@@ -787,8 +763,7 @@
"T1572": "Protocol Tunneling",
"T1573": "Encrypted Channel",
"T1573.001": "Encrypted Channel::Symmetric Cryptography",
"T1573.002": "Encrypted Channel::Asymmetric Cryptography",
"T1659": "Content Injection"
"T1573.002": "Encrypted Channel::Asymmetric Cryptography"
},
"Exfiltration": {
"T1011": "Exfiltration Over Other Network Medium",
@@ -808,8 +783,7 @@
"T1567": "Exfiltration Over Web Service",
"T1567.001": "Exfiltration Over Web Service::Exfiltration to Code Repository",
"T1567.002": "Exfiltration Over Web Service::Exfiltration to Cloud Storage",
"T1567.003": "Exfiltration Over Web Service::Exfiltration to Text Storage Sites",
"T1567.004": "Exfiltration Over Web Service::Exfiltration Over Webhook"
"T1567.003": "Exfiltration Over Web Service::Exfiltration to Text Storage Sites"
},
"Impact": {
"T1485": "Data Destruction",
@@ -837,8 +811,7 @@
"T1565": "Data Manipulation",
"T1565.001": "Data Manipulation::Stored Data Manipulation",
"T1565.002": "Data Manipulation::Transmitted Data Manipulation",
"T1565.003": "Data Manipulation::Runtime Data Manipulation",
"T1657": "Financial Theft"
"T1565.003": "Data Manipulation::Runtime Data Manipulation"
}
},
"mbc": {

View File

@@ -62,7 +62,6 @@ import capa.engine
import capa.helpers
import capa.features
import capa.features.freeze
from capa.loader import BACKEND_VIV
logger = logging.getLogger("capa.match-function-id")
@@ -72,53 +71,61 @@ def main(argv=None):
argv = sys.argv[1:]
parser = argparse.ArgumentParser(description="FLIRT match each function")
capa.main.install_common_args(parser, wanted={"input_file", "signatures", "format"})
parser.add_argument("sample", type=str, help="Path to sample to analyze")
parser.add_argument(
"-F",
"--function",
type=lambda x: int(x, 0x10),
help="match a specific function by VA, rather than add functions",
)
parser.add_argument(
"--signature",
action="append",
dest="signatures",
type=str,
default=[],
help="use the given signatures to identify library functions, file system paths to .sig/.pat files.",
)
parser.add_argument("-d", "--debug", action="store_true", help="Enable debugging output on STDERR")
parser.add_argument("-q", "--quiet", action="store_true", help="Disable all output but errors")
args = parser.parse_args(args=argv)
try:
capa.main.handle_common_args(args)
capa.main.ensure_input_exists_from_cli(args)
input_format = capa.main.get_input_format_from_cli(args)
sig_paths = capa.main.get_signatures_from_cli(args, input_format, BACKEND_VIV)
except capa.main.ShouldExitError as e:
return e.status_code
if args.quiet:
logging.basicConfig(level=logging.ERROR)
logging.getLogger().setLevel(logging.ERROR)
elif args.debug:
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
else:
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
# disable vivisect-related logging, it's verbose and not relevant for capa users
capa.main.set_vivisect_log_level(logging.CRITICAL)
analyzers = []
for sigpath in sig_paths:
sigs = viv_utils.flirt.load_flirt_signature(str(sigpath))
for sigpath in args.signatures:
sigs = viv_utils.flirt.load_flirt_signature(sigpath)
with capa.main.timing("flirt: compiling sigs"):
matcher = flirt.compile(sigs)
analyzer = viv_utils.flirt.FlirtFunctionAnalyzer(matcher, str(sigpath))
analyzer = viv_utils.flirt.FlirtFunctionAnalyzer(matcher, sigpath)
logger.debug("registering viv function analyzer: %s", repr(analyzer))
analyzers.append(analyzer)
vw = viv_utils.getWorkspace(str(args.input_file), analyze=True, should_save=False)
vw = viv_utils.getWorkspace(args.sample, analyze=True, should_save=False)
functions = vw.getFunctions()
if args.function:
functions = [args.function]
seen = set()
for function in functions:
logger.debug("matching function: 0x%04x", function)
for analyzer in analyzers:
viv_utils.flirt.match_function_flirt_signatures(analyzer.matcher, vw, function)
name = viv_utils.get_function_name(vw, function)
name = viv_utils.flirt.match_function_flirt_signatures(analyzer.matcher, vw, function)
if name:
key = (function, name)
if key in seen:
continue
else:
print(f"0x{function:04x}: {name}")
seen.add(key)
print(f"0x{function:04x}: {name}")
return 0
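A trimmed sketch of the FLIRT flow above, using only the calls shown in this script; the signature and sample paths are hypothetical:

import flirt
import viv_utils
import viv_utils.flirt

sigs = viv_utils.flirt.load_flirt_signature("sigs/msvcrt.sig")
matcher = flirt.compile(sigs)

vw = viv_utils.getWorkspace("sample.exe_", analyze=True, should_save=False)
for function in vw.getFunctions():
    # annotate the function with a library name when a signature matches
    viv_utils.flirt.match_function_flirt_signatures(matcher, vw, function)
    name = viv_utils.get_function_name(vw, function)
    if name:
        print(f"0x{function:04x}: {name}")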

View File

@@ -41,6 +41,7 @@ import timeit
import logging
import argparse
import subprocess
from pathlib import Path
import tqdm
import tabulate
@@ -49,7 +50,6 @@ import capa.main
import capa.perf
import capa.rules
import capa.engine
import capa.loader
import capa.helpers
import capa.features
import capa.features.common
@@ -74,22 +74,42 @@ def main(argv=None):
label += " (dirty)"
parser = argparse.ArgumentParser(description="Profile capa performance")
capa.main.install_common_args(parser, wanted={"format", "os", "input_file", "signatures", "rules"})
capa.main.install_common_args(parser, wanted={"format", "os", "sample", "signatures", "rules"})
parser.add_argument("--number", type=int, default=3, help="batch size of profile collection")
parser.add_argument("--repeat", type=int, default=30, help="batch count of profile collection")
parser.add_argument("--label", type=str, default=label, help="description of the profile collection")
args = parser.parse_args(args=argv)
capa.main.handle_common_args(args)
try:
taste = capa.helpers.get_file_taste(Path(args.sample))
except IOError as e:
logger.error("%s", str(e))
return -1
try:
capa.main.handle_common_args(args)
capa.main.ensure_input_exists_from_cli(args)
input_format = capa.main.get_input_format_from_cli(args)
backend = capa.main.get_backend_from_cli(args, input_format)
with capa.main.timing("load rules"):
rules = capa.main.get_rules_from_cli(args)
extractor = capa.main.get_extractor_from_cli(args, input_format, backend)
except capa.main.ShouldExitError as e:
return e.status_code
rules = capa.main.get_rules(args.rules)
except IOError as e:
logger.error("%s", str(e))
return -1
try:
sig_paths = capa.main.get_signatures(args.signatures)
except IOError as e:
logger.error("%s", str(e))
return -1
if (args.format == "freeze") or (
args.format == capa.features.common.FORMAT_AUTO and capa.features.freeze.is_freeze(taste)
):
extractor = capa.features.freeze.load(Path(args.sample).read_bytes())
else:
extractor = capa.main.get_extractor(
args.sample, args.format, args.os, capa.main.BACKEND_VIV, sig_paths, should_save_workspace=False
)
with tqdm.tqdm(total=args.number * args.repeat, leave=False) as pbar:
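The --number/--repeat options map onto stdlib timeit semantics, mirroring the batching above; a minimal sketch with a stand-in workload:

import timeit

# number: iterations per batch; repeat: batch count, matching the CLI defaults above
times = timeit.repeat("sum(i * i for i in range(10_000))", number=3, repeat=30)
print(f"best batch: {min(times):.3f}s")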

View File

@@ -33,7 +33,6 @@ import logging
import argparse
from pathlib import Path
import capa.main
import capa.render.proto
import capa.render.result_document
@@ -45,14 +44,26 @@ def main(argv=None):
argv = sys.argv[1:]
parser = argparse.ArgumentParser(description="Convert a capa JSON result document into the protobuf format")
capa.main.install_common_args(parser)
parser.add_argument("json", type=str, help="path to JSON result document file, produced by `capa --json`")
logging_group = parser.add_argument_group("logging arguments")
logging_group.add_argument("-d", "--debug", action="store_true", help="enable debugging output on STDERR")
logging_group.add_argument(
"-q", "--quiet", action="store_true", help="disable all status output except fatal errors"
)
args = parser.parse_args(args=argv)
try:
capa.main.handle_common_args(args)
except capa.main.ShouldExitError as e:
return e.status_code
if args.quiet:
logging.basicConfig(level=logging.WARNING)
logging.getLogger().setLevel(logging.WARNING)
elif args.debug:
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
else:
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
rd = capa.render.result_document.ResultDocument.from_file(Path(args.json))
pb = capa.render.proto.doc_to_pb2(rd)
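A sketch of the JSON-to-protobuf conversion, using the two calls shown above; paths are hypothetical:

from pathlib import Path

import capa.render.proto
import capa.render.result_document

rd = capa.render.result_document.ResultDocument.from_file(Path("results.json"))
pb = capa.render.proto.doc_to_pb2(rd)
Path("results.pb").write_bytes(pb.SerializeToString())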

View File

@@ -36,7 +36,6 @@ import logging
import argparse
from pathlib import Path
import capa.main
import capa.render.json
import capa.render.proto
import capa.render.proto.capa_pb2
@@ -50,16 +49,28 @@ def main(argv=None):
argv = sys.argv[1:]
parser = argparse.ArgumentParser(description="Convert a capa protobuf result document into the JSON format")
capa.main.install_common_args(parser)
parser.add_argument(
"pb", type=str, help="path to protobuf result document file, produced by `proto-from-results.py`"
)
logging_group = parser.add_argument_group("logging arguments")
logging_group.add_argument("-d", "--debug", action="store_true", help="enable debugging output on STDERR")
logging_group.add_argument(
"-q", "--quiet", action="store_true", help="disable all status output except fatal errors"
)
args = parser.parse_args(args=argv)
try:
capa.main.handle_common_args(args)
except capa.main.ShouldExitError as e:
return e.status_code
if args.quiet:
logging.basicConfig(level=logging.WARNING)
logging.getLogger().setLevel(logging.WARNING)
elif args.debug:
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
else:
logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
pb = Path(args.pb).read_bytes()
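And the reverse direction, assuming capa.render.proto also exposes a doc_from_pb2 helper; the input path is hypothetical:

from pathlib import Path

import capa.render.proto
import capa.render.proto.capa_pb2

pb = capa.render.proto.capa_pb2.ResultDocument()
pb.ParseFromString(Path("results.pb").read_bytes())
doc = capa.render.proto.doc_from_pb2(pb)
print(doc.model_dump_json(exclude_none=True))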

View File

@@ -178,8 +178,11 @@ def main(args: argparse.Namespace) -> None:
data["mbc"] = MbcExtractor().run()
logging.info("Writing results to %s", args.output)
with Path(args.output).open("w", encoding="utf-8") as jf:
json.dump(data, jf, indent=2)
try:
with Path(args.output).open("w", encoding="utf-8") as jf:
json.dump(data, jf, indent=2)
except BaseException as e:
logging.error("Exception encountered when writing results: %s", e)
if __name__ == "__main__":

View File

@@ -55,11 +55,13 @@ Unless required by applicable law or agreed to in writing, software distributed
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
"""
import os
import sys
import logging
import argparse
import collections
from typing import Dict
from pathlib import Path
import colorama
@@ -74,7 +76,10 @@ import capa.render.verbose
import capa.features.freeze
import capa.capabilities.common
import capa.render.result_document as rd
from capa.helpers import get_file_taste
from capa.features.common import FORMAT_AUTO
from capa.features.freeze import Address
from capa.features.extractors.base_extractor import FeatureExtractor, StaticFeatureExtractor
logger = logging.getLogger("capa.show-capabilities-by-function")
@@ -137,37 +142,67 @@ def main(argv=None):
argv = sys.argv[1:]
parser = argparse.ArgumentParser(description="detect capabilities in programs.")
capa.main.install_common_args(
parser, wanted={"format", "os", "backend", "input_file", "signatures", "rules", "tag"}
)
capa.main.install_common_args(parser, wanted={"format", "os", "backend", "sample", "signatures", "rules", "tag"})
args = parser.parse_args(args=argv)
capa.main.handle_common_args(args)
try:
capa.main.handle_common_args(args)
capa.main.ensure_input_exists_from_cli(args)
input_format = capa.main.get_input_format_from_cli(args)
rules = capa.main.get_rules_from_cli(args)
backend = capa.main.get_backend_from_cli(args, input_format)
sample_path = capa.main.get_sample_path_from_cli(args, backend)
if sample_path is None:
os_ = "unknown"
else:
os_ = capa.loader.get_os(sample_path)
extractor = capa.main.get_extractor_from_cli(args, input_format, backend)
except capa.main.ShouldExitError as e:
return e.status_code
taste = get_file_taste(Path(args.sample))
except IOError as e:
logger.error("%s", str(e))
return -1
try:
rules = capa.main.get_rules(args.rules)
logger.info("successfully loaded %s rules", len(rules))
if args.tag:
rules = rules.filter_rules_by_meta(args.tag)
logger.info("selected %s rules", len(rules))
except (IOError, capa.rules.InvalidRule, capa.rules.InvalidRuleSet) as e:
logger.error("%s", str(e))
return -1
try:
sig_paths = capa.main.get_signatures(args.signatures)
except IOError as e:
logger.error("%s", str(e))
return -1
if (args.format == "freeze") or (args.format == FORMAT_AUTO and capa.features.freeze.is_freeze(taste)):
format_ = "freeze"
extractor: FeatureExtractor = capa.features.freeze.load(Path(args.sample).read_bytes())
else:
format_ = args.format
should_save_workspace = os.environ.get("CAPA_SAVE_WORKSPACE") not in ("0", "no", "NO", "n", None)
try:
extractor = capa.main.get_extractor(
args.sample, args.format, args.os, args.backend, sig_paths, should_save_workspace
)
assert isinstance(extractor, StaticFeatureExtractor)
except capa.exceptions.UnsupportedFormatError:
capa.helpers.log_unsupported_format_error()
return -1
except capa.exceptions.UnsupportedRuntimeError:
capa.helpers.log_unsupported_runtime_error()
return -1
capabilities, counts = capa.capabilities.common.find_capabilities(rules, extractor)
meta = capa.loader.collect_metadata(argv, args.input_file, input_format, os_, args.rules, extractor, counts)
meta.analysis.layout = capa.loader.compute_layout(rules, extractor, capabilities)
meta = capa.main.collect_metadata(argv, args.sample, format_, args.os, args.rules, extractor, counts)
meta.analysis.layout = capa.main.compute_layout(rules, extractor, capabilities)
if capa.capabilities.common.has_file_limitation(rules, capabilities):
# bail if capa encountered file limitation e.g. a packed binary
# do show the output in verbose mode, though.
if not (args.verbose or args.vverbose or args.json):
return capa.main.E_FILE_LIMITATION
return -1
# colorama will detect:
# - when on Windows console, and fixup coloring, and
# - when not an interactive session, and disable coloring
# renderers should use coloring and assume it will be stripped out if necessary.
colorama.init()
doc = rd.ResultDocument.from_capa(meta, rules, capabilities)
print(render_matches_by_function(doc))
colorama.deinit()
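The colorama bracketing at the end follows a simple pattern; a self-contained sketch:

import colorama

colorama.init()  # fix up Windows consoles; strip colors when output is not a TTY
try:
    print("rendering with ANSI colors is safe here")
finally:
    colorama.deinit()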

View File

@@ -64,15 +64,16 @@ Example::
insn: 0x10001027: mnemonic(shl)
...
"""
import os
import sys
import logging
import argparse
from typing import Tuple
from pathlib import Path
import capa.main
import capa.rules
import capa.engine
import capa.loader
import capa.helpers
import capa.features
import capa.exceptions
@@ -80,9 +81,17 @@ import capa.render.verbose as v
import capa.features.freeze
import capa.features.address
import capa.features.extractors.pefile
from capa.helpers import assert_never
from capa.helpers import get_auto_format, log_unsupported_runtime_error
from capa.features.insn import API, Number
from capa.features.common import String, Feature, is_global_feature
from capa.features.common import (
FORMAT_AUTO,
FORMAT_CAPE,
FORMAT_FREEZE,
DYNAMIC_FORMATS,
String,
Feature,
is_global_feature,
)
from capa.features.extractors.base_extractor import FunctionHandle, StaticFeatureExtractor, DynamicFeatureExtractor
logger = logging.getLogger("capa.show-features")
@@ -97,33 +106,56 @@ def main(argv=None):
argv = sys.argv[1:]
parser = argparse.ArgumentParser(description="Show the features that capa extracts from the given sample")
capa.main.install_common_args(parser, wanted={"input_file", "format", "os", "signatures", "backend"})
capa.main.install_common_args(parser, wanted={"format", "os", "sample", "signatures", "backend"})
parser.add_argument("-F", "--function", type=str, help="Show features for specific function")
parser.add_argument("-P", "--process", type=str, help="Show features for specific process name")
args = parser.parse_args(args=argv)
capa.main.handle_common_args(args)
if args.function and args.backend == "pefile":
print("pefile backend does not support extracting function features")
return -1
try:
capa.main.handle_common_args(args)
capa.main.ensure_input_exists_from_cli(args)
_ = capa.helpers.get_file_taste(Path(args.sample))
except IOError as e:
logger.error("%s", str(e))
return -1
if args.function and args.backend == "pefile":
print("pefile backend does not support extracting function features")
try:
sig_paths = capa.main.get_signatures(args.signatures)
except IOError as e:
logger.error("%s", str(e))
return -1
format_ = args.format if args.format != FORMAT_AUTO else get_auto_format(args.sample)
if format_ == FORMAT_FREEZE:
# this should be moved above the previous if clause after implementing
# feature freeze for the dynamic analysis flavor
extractor = capa.features.freeze.load(Path(args.sample).read_bytes())
else:
should_save_workspace = os.environ.get("CAPA_SAVE_WORKSPACE") not in ("0", "no", "NO", "n", None)
try:
extractor = capa.main.get_extractor(
args.sample, format_, args.os, args.backend, sig_paths, should_save_workspace
)
except capa.exceptions.UnsupportedFormatError as e:
if format_ == FORMAT_CAPE:
capa.helpers.log_unsupported_cape_report_error(str(e))
else:
capa.helpers.log_unsupported_format_error()
return -1
except capa.exceptions.UnsupportedRuntimeError:
log_unsupported_runtime_error()
return -1
input_format = capa.main.get_input_format_from_cli(args)
backend = capa.main.get_backend_from_cli(args, input_format)
extractor = capa.main.get_extractor_from_cli(args, input_format, backend)
except capa.main.ShouldExitError as e:
return e.status_code
if isinstance(extractor, DynamicFeatureExtractor):
if format_ in DYNAMIC_FORMATS:
assert isinstance(extractor, DynamicFeatureExtractor)
print_dynamic_analysis(extractor, args)
elif isinstance(extractor, StaticFeatureExtractor):
print_static_analysis(extractor, args)
else:
assert_never(extractor)
assert isinstance(extractor, StaticFeatureExtractor)
print_static_analysis(extractor, args)
return 0
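A sketch of the flavor dispatch above: isinstance checks select the rendering path, and assert_never keeps the branching exhaustive for type checkers:

from capa.helpers import assert_never
from capa.features.extractors.base_extractor import (
    StaticFeatureExtractor,
    DynamicFeatureExtractor,
)

def describe(extractor) -> str:
    if isinstance(extractor, DynamicFeatureExtractor):
        return "dynamic flavor: processes, threads, and calls"
    elif isinstance(extractor, StaticFeatureExtractor):
        return "static flavor: functions, basic blocks, and instructions"
    else:
        # unreachable for known extractor types; fails loudly otherwise
        assert_never(extractor)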

View File

@@ -8,11 +8,13 @@ Unless required by applicable law or agreed to in writing, software distributed
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
"""
import os
import sys
import typing
import logging
import argparse
from typing import Set, Tuple
from pathlib import Path
from collections import Counter
import tabulate
@@ -29,7 +31,8 @@ import capa.features.freeze
import capa.features.address
import capa.features.extractors.pefile
import capa.features.extractors.base_extractor
from capa.features.common import FORMAT_FREEZE, Feature
from capa.helpers import log_unsupported_runtime_error
from capa.features.common import Feature
from capa.features.extractors.base_extractor import FunctionHandle, StaticFeatureExtractor
logger = logging.getLogger("show-unused-features")
@@ -39,9 +42,10 @@ def format_address(addr: capa.features.address.Address) -> str:
return v.format_address(capa.features.freeze.Address.from_capa((addr)))
def get_rules_feature_set(rules: capa.rules.RuleSet) -> Set[Feature]:
def get_rules_feature_set(rules_path) -> Set[Feature]:
ruleset = capa.main.get_rules(rules_path)
rules_feature_set: Set[Feature] = set()
for _, rule in rules.rules.items():
for _, rule in ruleset.rules.items():
rules_feature_set.update(rule.extract_all_features())
return rules_feature_set
@@ -102,23 +106,44 @@ def main(argv=None):
argv = sys.argv[1:]
parser = argparse.ArgumentParser(description="Show the features that capa doesn't have rules for yet")
capa.main.install_common_args(parser, wanted={"format", "os", "input_file", "signatures", "backend", "rules"})
capa.main.install_common_args(parser, wanted={"format", "os", "sample", "signatures", "backend", "rules"})
parser.add_argument("-F", "--function", type=str, help="Show features for specific function")
args = parser.parse_args(args=argv)
capa.main.handle_common_args(args)
if args.function and args.backend == "pefile":
print("pefile backend does not support extracting function features")
return -1
try:
capa.main.handle_common_args(args)
capa.main.ensure_input_exists_from_cli(args)
rules = capa.main.get_rules_from_cli(args)
input_format = capa.main.get_input_format_from_cli(args)
backend = capa.main.get_backend_from_cli(args, input_format)
extractor = capa.main.get_extractor_from_cli(args, input_format, backend)
except capa.main.ShouldExitError as e:
return e.status_code
taste = capa.helpers.get_file_taste(Path(args.sample))
except IOError as e:
logger.error("%s", str(e))
return -1
try:
sig_paths = capa.main.get_signatures(args.signatures)
except IOError as e:
logger.error("%s", str(e))
return -1
if (args.format == "freeze") or (
args.format == capa.features.common.FORMAT_AUTO and capa.features.freeze.is_freeze(taste)
):
extractor = capa.features.freeze.load(Path(args.sample).read_bytes())
else:
should_save_workspace = os.environ.get("CAPA_SAVE_WORKSPACE") not in ("0", "no", "NO", "n", None)
try:
extractor = capa.main.get_extractor(
args.sample, args.format, args.os, args.backend, sig_paths, should_save_workspace
)
except capa.exceptions.UnsupportedFormatError:
capa.helpers.log_unsupported_format_error()
return -1
except capa.exceptions.UnsupportedRuntimeError:
log_unsupported_runtime_error()
return -1
assert isinstance(extractor, StaticFeatureExtractor), "only static analysis supported today"
@@ -134,7 +159,7 @@ def main(argv=None):
function_handles = tuple(extractor.get_functions())
if args.function:
if input_format == FORMAT_FREEZE:
if args.format == "freeze":
function_handles = tuple(filter(lambda fh: fh.address == args.function, function_handles))
else:
function_handles = tuple(filter(lambda fh: format_address(fh.address) == args.function, function_handles))
@@ -149,7 +174,7 @@ def main(argv=None):
feature_map.update(get_file_features(function_handles, extractor))
rules_feature_set = get_rules_feature_set(rules)
rules_feature_set = get_rules_feature_set(args.rules)
print_unused_features(feature_map, rules_feature_set)
return 0
@@ -181,8 +206,7 @@ def ida_main():
feature_map.update(get_file_features(function_handles, extractor))
rules_path = capa.main.get_default_root() / "rules"
rules = capa.rules.get_rules([rules_path])
rules_feature_set = get_rules_feature_set(rules)
rules_feature_set = get_rules_feature_set([rules_path])
print_unused_features(feature_map, rules_feature_set)
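The helper above boils down to a set union over extract_all_features(); a sketch with a hypothetical rules path:

from pathlib import Path
from typing import Set

import capa.rules
from capa.features.common import Feature

rules = capa.rules.get_rules([Path("rules")])
rules_feature_set: Set[Feature] = set()
for _, rule in rules.rules.items():
    rules_feature_set.update(rule.extract_all_features())
print(f"rules reference {len(rules_feature_set)} distinct features")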

69
scripts/vivisect-py2-vs-py3.sh Executable file
View File

@@ -0,0 +1,69 @@
#!/usr/bin/env bash
int() {
int=$(bc <<< "scale=0; ($1 + 0.5)/1")
}
export TIMEFORMAT='%3R'
threshold_time=90
threshold_py3_time=60 # Do not warn if it doesn't take at least 1 minute to run
rm tests/data/*.viv 2>/dev/null
mkdir results
for file in tests/data/*
do
file=$(printf %q "$file") # Handle names with white spaces
file_name=$(basename $file)
echo $file_name
rm "$file.viv" 2>/dev/null
py3_time=$(sh -c "time python3 scripts/show-features.py $file >> results/p3-$file_name.out 2>/dev/null" 2>&1)
rm "$file.viv" 2>/dev/null
py2_time=$(sh -c "time python2 scripts/show-features.py $file >> results/p2-$file_name.out 2>/dev/null" 2>&1)
int $py3_time
if (($int > $threshold_py3_time))
then
percentage=$(bc <<< "scale=3; $py2_time/$py3_time*100 + 0.5")
int $percentage
if (($int < $threshold_time))
then
echo -n " SLOWER ($percentage): "
fi
fi
echo " PY2($py2_time) PY3($py3_time)"
done
threshold_features=98
count=0
average=0
results_for() {
py3=$(cat "results/p3-$file_name.out" | grep "$1" | wc -l)
py2=$(cat "results/p2-$file_name.out" | grep "$1" | wc -l)
if (($py2 > 0))
then
percentage=$(bc <<< "scale=2; 100*$py3/$py2")
average=$(bc <<< "scale=2; $percentage + $average")
count=$(($count + 1))
int $percentage
if (($int < $threshold_features))
then
echo -e "$1: py2($py2) py3($py3) $percentage% - $file_name"
fi
fi
}
rm tests/data/*.viv 2>/dev/null
echo -e '\nRESULTS:'
for file in tests/data/*
do
file_name=$(basename $file)
if test -f "results/p2-$file_name.out"; then
results_for 'insn'
results_for 'file'
results_for 'func'
results_for 'bb'
fi
done
average=$(bc <<< "scale=2; $average/$count")
echo "TOTAL: $average"

View File

@@ -106,11 +106,11 @@ def get_viv_extractor(path: Path):
]
if "raw32" in path.name:
vw = capa.loader.get_workspace(path, "sc32", sigpaths=sigpaths)
vw = capa.main.get_workspace(path, "sc32", sigpaths=sigpaths)
elif "raw64" in path.name:
vw = capa.loader.get_workspace(path, "sc64", sigpaths=sigpaths)
vw = capa.main.get_workspace(path, "sc64", sigpaths=sigpaths)
else:
vw = capa.loader.get_workspace(path, FORMAT_AUTO, sigpaths=sigpaths)
vw = capa.main.get_workspace(path, FORMAT_AUTO, sigpaths=sigpaths)
vw.saveWorkspace()
extractor = capa.features.extractors.viv.extractor.VivisectFeatureExtractor(vw, path, OS_AUTO)
fixup_viv(path, extractor)
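A sketch of the extractor construction this fixture performs, using the capa.loader calls shown above; the shellcode path is hypothetical:

from pathlib import Path

import capa.loader
import capa.features.extractors.viv.extractor
from capa.features.common import OS_AUTO

path = Path("shellcode.raw32")
# "sc32" tells vivisect to treat the input as 32-bit shellcode
vw = capa.loader.get_workspace(path, "sc32", sigpaths=[])
extractor = capa.features.extractors.viv.extractor.VivisectFeatureExtractor(vw, path, OS_AUTO)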
@@ -393,10 +393,6 @@ def get_data_path_by_name(name) -> Path:
return CD / "data" / "ea2876e9175410b6f6719f80ee44b9553960758c7d0f7bed73c0fe9a78d8e669.dll_"
elif name.startswith("1038a2"):
return CD / "data" / "1038a23daad86042c66bfe6c9d052d27048de9653bde5750dc0f240c792d9ac8.elf_"
elif name.startswith("nested_typedef"):
return CD / "data" / "dotnet" / "dd9098ff91717f4906afe9dafdfa2f52.exe_"
elif name.startswith("nested_typeref"):
return CD / "data" / "dotnet" / "2c7d60f77812607dec5085973ff76cea.dll_"
else:
raise ValueError(f"unexpected sample fixture: {name}")
@@ -1278,114 +1274,6 @@ FEATURE_PRESENCE_TESTS_DOTNET = sorted(
), # MemberRef method
False,
),
(
"nested_typedef",
"file",
capa.features.common.Class("mynamespace.myclass_outer0"),
True,
),
(
"nested_typedef",
"file",
capa.features.common.Class("mynamespace.myclass_outer1"),
True,
),
(
"nested_typedef",
"file",
capa.features.common.Class("mynamespace.myclass_outer0/myclass_inner0_0"),
True,
),
(
"nested_typedef",
"file",
capa.features.common.Class("mynamespace.myclass_outer0/myclass_inner0_1"),
True,
),
(
"nested_typedef",
"file",
capa.features.common.Class("mynamespace.myclass_outer1/myclass_inner1_0"),
True,
),
(
"nested_typedef",
"file",
capa.features.common.Class("mynamespace.myclass_outer1/myclass_inner1_1"),
True,
),
(
"nested_typedef",
"file",
capa.features.common.Class("mynamespace.myclass_outer1/myclass_inner1_0/myclass_inner_inner"),
True,
),
(
"nested_typedef",
"file",
capa.features.common.Class("myclass_inner_inner"),
False,
),
(
"nested_typedef",
"file",
capa.features.common.Class("myclass_inner1_0"),
False,
),
(
"nested_typedef",
"file",
capa.features.common.Class("myclass_inner1_1"),
False,
),
(
"nested_typedef",
"file",
capa.features.common.Class("myclass_inner0_0"),
False,
),
(
"nested_typedef",
"file",
capa.features.common.Class("myclass_inner0_1"),
False,
),
(
"nested_typeref",
"file",
capa.features.file.Import("Android.OS.Build/VERSION::SdkInt"),
True,
),
(
"nested_typeref",
"file",
capa.features.file.Import("Android.Media.Image/Plane::Buffer"),
True,
),
(
"nested_typeref",
"file",
capa.features.file.Import("Android.Provider.Telephony/Sent/Sent::ContentUri"),
True,
),
(
"nested_typeref",
"file",
capa.features.file.Import("Android.OS.Build::SdkInt"),
False,
),
(
"nested_typeref",
"file",
capa.features.file.Import("Plane::Buffer"),
False,
),
(
"nested_typeref",
"file",
capa.features.file.Import("Sent::ContentUri"),
False,
),
],
# order tests by (file, item)
# so that our LRU cache is most effective.

View File

@@ -6,13 +6,10 @@
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.
import gzip
from typing import Type
from pathlib import Path
import pytest
import fixtures
from capa.exceptions import EmptyReportError, UnsupportedFormatError
from capa.features.extractors.cape.models import Call, CapeReport
CD = Path(__file__).resolve().parent
@@ -44,35 +41,6 @@ def test_cape_model_can_load(version: str, filename: str):
assert report is not None
@fixtures.parametrize(
"version,filename,exception",
[
("v2.2", "0000a65749f5902c4d82ffa701198038f0b4870b00a27cfca109f8f933476d82.json.gz", None),
("v2.2", "55dcd38773f4104b95589acc87d93bf8b4a264b4a6d823b73fb6a7ab8144c08b.json.gz", None),
("v2.2", "77c961050aa252d6d595ec5120981abf02068c968f4a5be5958d10e87aa6f0e8.json.gz", EmptyReportError),
("v2.2", "d46900384c78863420fb3e297d0a2f743cd2b6b3f7f82bf64059a168e07aceb7.json.gz", None),
("v2.4", "36d218f384010cce9f58b8193b7d8cc855d1dff23f80d16e13a883e152d07921.json.gz", UnsupportedFormatError),
("v2.4", "41ce492f04accef7931b84b8548a6ca717ffabb9bedc4f624de2d37a5345036c.json.gz", UnsupportedFormatError),
("v2.4", "515a6269965ccdf1005008e017ec87fafb97fd2464af1c393ad93b438f6f33fe.json.gz", UnsupportedFormatError),
("v2.4", "5d61700feabba201e1ba98df3c8210a3090c8c9f9adbf16cb3d1da3aaa2a9d96.json.gz", UnsupportedFormatError),
("v2.4", "5effaf6795932d8b36755f89f99ce7436421ea2bd1ed5bc55476530c1a22009f.json.gz", UnsupportedFormatError),
("v2.4", "873275144af88e9b95ea2c59ece39b8ce5a9d7fe09774b683050098ac965054d.json.gz", UnsupportedFormatError),
("v2.4", "8b9aaf4fad227cde7a7dabce7ba187b0b923301718d9d40de04bdd15c9b22905.json.gz", UnsupportedFormatError),
("v2.4", "b1c4aa078880c579961dc5ec899b2c2e08ae5db80b4263e4ca9607a68e2faef9.json.gz", UnsupportedFormatError),
("v2.4", "fb7ade52dc5a1d6128b9c217114a46d0089147610f99f5122face29e429a1e74.json.gz", None),
],
)
def test_cape_extractor(version: str, filename: str, exception: Type[BaseException]):
path = CAPE_DIR / version / filename
if exception:
with pytest.raises(exception):
_ = fixtures.get_cape_extractor(path)
else:
cr = fixtures.get_cape_extractor(path)
assert cr is not None
def test_cape_model_argument():
call = Call.model_validate_json(
"""

View File

@@ -949,7 +949,6 @@ def test_count_api():
features:
- or:
- count(api(kernel32.CreateFileA)): 1
- count(api(System.Convert::FromBase64String)): 1
"""
)
r = capa.rules.Rule.from_yaml(rule)
@@ -958,7 +957,6 @@ def test_count_api():
assert bool(r.evaluate({API("kernel32.CreateFile"): set()})) is False
assert bool(r.evaluate({API("CreateFile"): {ADDR1}})) is False
assert bool(r.evaluate({API("CreateFileA"): {ADDR1}})) is True
assert bool(r.evaluate({API("System.Convert::FromBase64String"): {ADDR1}})) is True
def test_invalid_number():

View File

@@ -40,10 +40,7 @@ def get_rule_path():
[
pytest.param("capa2yara.py", [get_rules_path()]),
pytest.param("capafmt.py", [get_rule_path()]),
# testing some variations of linter script
pytest.param("lint.py", ["-t", "create directory", get_rules_path()]),
# `create directory` rule has native and .NET example PEs
pytest.param("lint.py", ["--thorough", "-t", "create directory", get_rules_path()]),
# not testing lint.py as it runs regularly anyway
pytest.param("match-function-id.py", [get_file_path()]),
pytest.param("show-capabilities-by-function.py", [get_file_path()]),
pytest.param("show-features.py", [get_file_path()]),