From 7043aae69bd463a5f05e6300efe84987c45973ea Mon Sep 17 00:00:00 2001 From: Willi Ballenthin Date: Tue, 17 Mar 2026 09:58:08 +0100 Subject: [PATCH] plans: add chat sessions and transcripts exploring mapa problem statements --- doc/plans/cu-boundary-eval.md | 105 +++++++++++++ doc/plans/mapa-cu-ideas.md | 256 ++++++++++++++++++++++++++++++++ doc/plans/problem-transcript.md | 1 + 3 files changed, 362 insertions(+) create mode 100644 doc/plans/cu-boundary-eval.md create mode 100644 doc/plans/mapa-cu-ideas.md create mode 100644 doc/plans/problem-transcript.md diff --git a/doc/plans/cu-boundary-eval.md b/doc/plans/cu-boundary-eval.md new file mode 100644 index 00000000..2feb13c1 --- /dev/null +++ b/doc/plans/cu-boundary-eval.md @@ -0,0 +1,105 @@ +# Compilation Unit Boundary Evaluation + +Ground truth data from assemblage: ~500K binaries with PDB-derived source file mappings, ~1000 functions each. This document describes how to turn that data into a repeatable evaluation pipeline for CU boundary detection. + +## Ground truth definition + +Each function in a binary has a source file path from debug info. Source files are either **impl files** (`.c`, `.cpp`, `.cc`, `.cxx`) or **headers** (`.h`, `.hpp`, extensionless STL headers like `vector`, `xmemory`, etc.). A compilation unit is defined by its impl file — header-sourced functions are template instantiations or inline expansions compiled into whichever CU includes them. + +To assign each function to a CU: + +1. Sort functions by start address. +2. For each function, if its source is an impl file, that's its CU. +3. Otherwise, forward-fill: inherit the CU of the most recent preceding impl function. If none exists (header functions at the start of the binary), backward-fill from the first impl file. +4. Ground truth boundaries = the set of gap indices where `CU[i] != CU[i-1]`. + +This assignment was validated on 10 binaries (21K functions). 
Every ground truth boundary has an impl file on at least one side — there are zero boundaries where both flanking functions come from headers. Forward-fill produces CU fragmentation ratios of 1.0-1.2x (nearly perfectly contiguous CUs). + +## What the data looks like + +Source file paths fall into classifiable categories based on filesystem structure. User code lives under `c:\assemblage\builds\HASH\PROJECT\*.cpp` (not under `external/`). Vendored libraries are at `...\PROJECT\external\LIBNAME\*`. MSVC STL headers are under `C:\Program Files\...\MSVC\...\include\` (e.g., `vector`, `xmemory`). CRT sources come from `D:\a\_work\...\vctools\crt\...` or `d:\th\minkernel\crts\...`. + +Library identity is derivable from the path prefix. The assemblage build system makes external libraries explicit via the `external/LIBNAME/` convention. + +### Template/STL interleaving + +Functions from STL headers appear heavily interspersed within CUs. In SmxDecompiler.exe, 72% of functions come from MSVC STL headers, appearing in runs of up to 258 functions within a single CU. These are per-CU template instantiation copies (the same `std::_Verify_range` appears in 4 different CUs), not independent entities. + +This means "source file changed" is a terrible boundary signal because it fires on every header-to-impl or header-to-header transition within a CU. On 3DSIFT.exe (heavy Eigen template use), source-file-change achieves only 3% precision. The actual CU boundary rate is low: only 5.9% of all inter-function gaps are real CU boundaries. + +### CRT fragmentation + +CRT code (from the ucrt static library) shows heavy fragmentation: dozens of tiny 1-3 function CUs interleaved by the linker. These are the hardest boundaries to detect and arguably the least useful to distinguish (separating `delete_scalar.cpp` from `new_array.cpp` has limited practical value). Consider collapsing all CRT functions into a single "CRT" meta-CU for scoring, or reporting CRT and non-CRT scores separately. 
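One way to realize the CRT-collapsing option, building on the forward-fill assignment from the ground truth definition, is sketched below. This is a minimal illustration, not the pipeline's actual code: the function names (`assign_cus`, `boundary_gaps`), the `(start_address, source_path)` input shape, and the idea of recognizing CRT sources by the two path fragments quoted in this document are all assumptions.

```python
from typing import List, Optional, Sequence, Set, Tuple

IMPL_EXTENSIONS = (".c", ".cpp", ".cc", ".cxx")
# Path fragments taken from the CRT examples in this document. Treating them
# as sufficient CRT markers is an assumption made for illustration only.
CRT_MARKERS = ("\\vctools\\crt\\", "\\minkernel\\crts\\")


def is_impl_file(path: str) -> bool:
    """True for impl files; headers and extensionless STL headers are False."""
    return path.lower().endswith(IMPL_EXTENSIONS)


def is_crt_path(path: str) -> bool:
    lowered = path.lower()
    return any(marker in lowered for marker in CRT_MARKERS)


def assign_cus(
    functions: Sequence[Tuple[int, str]], collapse_crt: bool = False
) -> List[Optional[str]]:
    """Assign a CU label to each function, in start-address order.

    `functions` holds (start_address, source_path) pairs. Header-sourced
    functions inherit the most recent preceding impl file (forward-fill);
    leading header functions take the first impl file (backward-fill).
    If the binary has no impl files at all, every label is None.
    """
    ordered = sorted(functions, key=lambda f: f[0])
    labels: List[Optional[str]] = []
    current: Optional[str] = None
    for _, path in ordered:
        if is_impl_file(path):
            current = "CRT" if collapse_crt and is_crt_path(path) else path
        labels.append(current)
    first = next((label for label in labels if label is not None), None)
    return [label if label is not None else first for label in labels]


def boundary_gaps(labels: Sequence[Optional[str]]) -> Set[int]:
    """Gap indices i where CU[i] != CU[i-1]."""
    return {i for i in range(1, len(labels)) if labels[i] != labels[i - 1]}
```

Calling `boundary_gaps(assign_cus(funcs))` and `boundary_gaps(assign_cus(funcs, collapse_crt=True))` on the same binary yields the two ground truth sets needed to report CRT-inclusive and CRT-collapsed scores side by side.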
+ +### Multi-fragment CUs + +Some impl files appear in non-contiguous blocks (e.g., `inflate.c` interrupted by `infback.c`/`inffast.c` in zlib). This is uncommon in application code but frequent in CRT. For boundary scoring, treat each contiguous fragment as its own segment — the metric measures where transitions occur regardless of whether distant segments share a label. + +## Metrics + +### Primary: Boundary F1 + +For a predicted set of boundary gap indices vs the ground truth set, compute precision (fraction of predicted boundaries that are real), recall (fraction of real boundaries that are predicted), and F1 (harmonic mean). This directly measures the "where are the seams?" question. It's strict: a boundary predicted one function off from the true position scores as one FP + one FN. + +### Secondary: WindowDiff + +Standard text segmentation metric, more forgiving of near-misses. Slides a window of size `k` across the sequence and counts positions where the number of boundaries within the window differs between prediction and ground truth. Lower is better. Use `k = avg_segment_length / 2`. + +### Library-level metrics + +| Problem | Ground truth | Metric | +|---------|-------------|--------| +| CU boundaries | Impl-file assignment (above) | Boundary F1 + WindowDiff | +| Library clustering | Path-derived category (user-code, external:zlib, crt, msvc-stdlib, ...) | Adjusted Rand Index | +| User vs library | Binary classification from category | Per-function accuracy | + +These are separate evaluations that can be run independently and solved in order: CU boundaries first, then library grouping, then user/library classification. 
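The Boundary F1 and WindowDiff metrics described above can be sketched as follows. This is a hedged illustration under stated assumptions: boundaries are sets of gap indices `i` (the gap between function `i-1` and function `i`), the names `boundary_prf` and `window_diff` are hypothetical, precision and recall default to 0.0 when undefined, and `k` is rounded to an integer of at least 1.

```python
from typing import Optional, Set, Tuple


def boundary_prf(predicted: Set[int], truth: Set[int]) -> Tuple[float, float, float]:
    """Strict boundary precision, recall, and F1 over gap indices.

    Convention assumed here: an undefined ratio (empty set) is reported as 0.0.
    """
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def window_diff(
    predicted: Set[int], truth: Set[int], n_functions: int, k: Optional[int] = None
) -> float:
    """Fraction of length-k windows whose boundary counts disagree (lower is better)."""
    n_gaps = n_functions - 1
    if n_gaps <= 0:
        return 0.0
    if k is None:
        # k = avg_segment_length / 2, where the sequence has len(truth) + 1 segments.
        avg_segment_length = n_functions / (len(truth) + 1)
        k = max(1, round(avg_segment_length / 2))
    k = min(k, n_gaps)
    # Index g - 1 holds the flag for gap g (gaps are numbered 1..n_functions-1).
    pred_flags = [1 if g in predicted else 0 for g in range(1, n_functions)]
    true_flags = [1 if g in truth else 0 for g in range(1, n_functions)]
    windows = n_gaps - k + 1
    errors = sum(
        1
        for start in range(windows)
        if sum(pred_flags[start:start + k]) != sum(true_flags[start:start + k])
    )
    return errors / windows
```

A prediction that is off by one function scores as one false positive plus one false negative under `boundary_prf`, while `window_diff` only penalizes the windows that straddle exactly one of the two positions, which is why the two metrics are reported together.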
+ +## Baseline results + +Tested on the 10-binary sample (21K functions, 1238 CU boundaries): + +| Heuristic | Precision | Recall | F1 range | +|-----------|-----------|--------|----------| +| No boundaries | - | 0.00 | 0.00 | +| Every source-file change | 0.03-0.79 | 1.00 | 0.06-0.88 | +| Category change | 0.11-0.50 | 0.02-0.28 | 0.04-0.29 | +| Impl-file stream tracking | 1.00 | 1.00 | 1.00 | + +The impl-file oracle (skip headers, flag when the .c/.cpp changes) achieves perfect F1. This is the ceiling; it confirms that CU boundaries are exactly the impl-file transitions in the source data. The gap between the naive baselines and this oracle is the space we're trying to close using binary-level features. + +## Evaluation pipeline + +### Step 1: Ground truth extraction + +From debug info, produce a normalized table per binary: + +``` +binary_hash | func_index | func_start | func_end | cu_id | library_category +``` + +`cu_id` is the impl file path (or a hash of it). `library_category` is derived from the path prefix. Store as Parquet or SQLite since the CSV format won't scale to 500K binaries. + +Pre-compute per-binary boundary sets as arrays of gap indices. + +### Step 2: Predictor interface + +A predictor takes a binary's function list (start addresses and sizes only, no source info, no debug info) and returns a set of predicted boundary indices. The predictor can use any features extractable from the raw binary: inter-function gap sizes and padding patterns, call graph edge density across each gap, data reference patterns, string attribution, function naming (if symbols present), alignment byte patterns, etc. + +### Step 3: Scoring + +For each binary, compute boundary F1 between prediction and ground truth. Aggregate across the dataset: + +Report both macro-average (mean F1 across binaries, treating each binary equally) and micro-average (pool all gaps, compute F1 once, biased toward larger binaries). Add per-category breakdowns (user code only, CRT only, etc.) 
to understand where the predictor succeeds and fails. + +### Step 4: Iterate + +Try different feature combinations and scoring strategies, compare F1. Per mapa-cu-ideas.md, the boundary scorer is the spine of the pipeline and every algorithm becomes a feature generator for gap scoring. + +## Design decisions + +Boundary F1 was chosen over clustering metrics (ARI, NMI) because those conflate boundary detection with label assignment. We want to decouple the two: find boundaries first, then label segments. + +Forward-fill for CU assignment matches how compilation works. The compiler processes the impl file's own functions first, then template instantiations from included headers. Header functions belong to the most recent impl file. The 1.0-1.2x fragmentation ratio validates this model. + +Header-to-header transitions are never counted as boundaries because the data shows zero CU boundaries where both flanking functions come from headers. Every CU has at least one impl file as its root. Template instantiations from different headers within the same CU are not separate CUs. diff --git a/doc/plans/mapa-cu-ideas.md b/doc/plans/mapa-cu-ideas.md new file mode 100644 index 00000000..19ce88a4 --- /dev/null +++ b/doc/plans/mapa-cu-ideas.md @@ -0,0 +1,256 @@ +> read problem-transcript.md and digest it, summarize it, restructure it and explain it. +> +> ok. now ultrathink. consider the problem deeply. try to give some insights, some novel perspective, ideas for techniques and algorithms. what strikes you? + +I've been sitting with this for a while. Here's what strikes me, organized from the structural observations that feel most fundamental down to concrete algorithmic ideas. + +------ + +## The Deep Structure: A Graph Pinned to a Line + +The most mathematically distinctive thing about this problem is that you have a *graph* (the call graph, the data xref graph) whose nodes are *pinned to a one-dimensional manifold* (the address space). 
This is a very specific structure that most generic graph algorithms ignore entirely. It means you're not just doing community detection — you're doing community detection where the communities are *intervals*. That's a massive constraint that dramatically shrinks the search space. + +Think about it: if you have 10,000 functions, the number of arbitrary partitions is astronomically large. But the number of ways to partition 10,000 linearly-ordered nodes into *contiguous groups* is only on the order of 2^10,000 in the worst case, and if you're looking for, say, 20 boundaries, it's just C(10000, 20) — large, but tractable with dynamic programming. The contiguity assumption turns a combinatorially explosive problem into something closer to 1D segmentation, which is a much better-understood and more tractable class of problems. + +## The Call Edge Length Distribution — A Fingerprint of Structure + +Here's an idea I find really compelling. For every edge in the call graph, compute the *address distance* between caller and callee: `|addr(A) - addr(B)|`. Now plot the histogram of these edge lengths across the whole binary. + +I'd expect this histogram to be multimodal. There should be a peak near zero representing intra-compilation-unit calls (functions calling their neighbors). There should be a broader hump at medium range representing intra-library cross-CU calls. And there should be long-range outliers representing cross-boundary calls — where user code calls into a library at a distant address. + +The valleys between these modes are telling you the *characteristic scales* of the hierarchy. The first valley gives you the typical size of a compilation unit. The second gives you the typical size of a library. This is essentially a multi-scale analysis of the binary's structure, and you get it almost for free before running any clustering algorithm. You could use those characteristic distances as thresholds for later steps. 
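The edge-length histogram idea above can be sketched in a few lines. This is an illustrative sketch only: the log2 bucketing, the strict-local-minimum valley rule, and the names `edge_length_histogram` and `find_valleys` are assumptions, and a real histogram would likely need smoothing before valleys are read off it.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def edge_length_histogram(edges: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Bucket |addr(caller) - addr(callee)| by floor(log2(distance)).

    `edges` holds (caller_address, callee_address) pairs. Log-scale buckets
    are an assumption; they keep short-range and long-range modes comparable.
    """
    hist: Counter = Counter()
    for caller, callee in edges:
        distance = abs(caller - callee)
        if distance == 0:
            continue  # self-recursive call, carries no layout information
        hist[distance.bit_length() - 1] += 1  # floor(log2(distance))
    return dict(hist)


def find_valleys(hist: Dict[int, int]) -> List[int]:
    """Return buckets that are strict local minima between their neighbours.

    Plateaus (runs of equal counts) are ignored in this sketch.
    """
    if not hist:
        return []
    lo, hi = min(hist), max(hist)
    counts = [hist.get(bucket, 0) for bucket in range(lo, hi + 1)]
    return [
        lo + i
        for i in range(1, len(counts) - 1)
        if counts[i] < counts[i - 1] and counts[i] < counts[i + 1]
    ]
```

Each valley bucket `b` corresponds to a candidate characteristic distance of roughly `2**b` bytes, which could then serve as a threshold (for example, an estimate of typical CU size) in later stages.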
+ +## Min-Cut / Max-Flow for Boundary Detection + +The transcript describes a "tension" along paths between known first-party anchors (like `main`) and known library anchors (like FLIRT-matched or string-anchored functions). This maps almost perfectly to a **minimum cut problem**. + +Set it up like this: create a super-source *S* connected to all known first-party nodes, and a super-sink *T* connected to all known library nodes. The edges in the call graph get capacities. What should the capacity be? This is where you encode your priors — you could weight by *inverse address distance* (so cutting a long-range edge is cheap, cutting a short-range edge between neighbors is expensive), or by the number of shared data references, or both. Then the min-cut between S and T partitions the graph into first-party and third-party, and the *cut edges themselves are literally the API boundary* — the places where user code transitions into library code. + +This is elegant because it naturally handles ambiguity. You don't have to hand-define a propagation rule; the flow algorithm figures out where the cheapest place to draw the line is. And you can add new anchors (from FLIRT, strings, human input) simply by connecting them to S or T, and the cut re-adjusts. + +## The Genome Annotation Analogy + +This problem is strikingly similar to **genome annotation** in bioinformatics. You have a long linear sequence (the binary). It contains functional regions (genes = functions). These regions cluster into operons and gene families (compilation units, libraries). You have some known markers (homolog matches ≈ FLIRT signatures). You want to segment the sequence into regions of different character, exploiting the fact that related elements are co-located. + +The classic tool for this is the **Hidden Markov Model**. Imagine an HMM where the hidden states represent "this function belongs to first-party code," "this function belongs to library A, CU 1," etc. 
The observations at each state are the features of that function — its string references, its call graph characteristics, its instruction patterns. The transition probabilities encode the contiguity assumption: staying in the same CU is high probability, transitioning between CUs within a library is medium, jumping between libraries is low. + +You'd train this in a semi-supervised way, anchoring some states with your ground truth, and letting Viterbi or forward-backward propagate through the rest. The HMM framework is natural here because the binary *is* a sequence, and the Markov property (what matters is what group you're currently in, not the full history) matches the linker's behavior well. + +## Personalized PageRank as a Soft Classification + +Here's a technique that handles the graph propagation cleanly. Run **personalized PageRank** (PPR) seeded from each anchor set. One PPR vector seeded from `main` gives you a "reachability-from-first-party" score for every node. Another seeded from known zlib functions gives a "zlib affinity" score. A third from OpenSSL anchors gives an "OpenSSL affinity" score. + +Each function is then softly classified by which PPR vector dominates at that node. The beauty is that PPR naturally decays with graph distance and respects the connectivity structure — a tightly-knit library will "trap" the random walk, giving high PPR scores internally and low scores outside. The boundary between two regions is where the competing PPR scores cross over. + +You could even visualize this as a heat map over the address space, and the boundaries would pop out visually. + +## Exploiting the Bipartite Structure: Code ↔ Data + +The transcript mentions that data locality mirrors code locality, but I think this signal is underexploited if you treat it as a separate, independent confirmation channel. Consider the *bipartite graph* between functions and the data items they reference. 
Two functions that reference overlapping or nearby data are likely from the same compilation unit. This gives you a similarity measure between functions that's *completely independent* of the call graph. + +You could build a function-to-function similarity matrix where similarity = how much their referenced data overlaps or how close their referenced data is in the data section. Then cluster on this matrix. The resulting clusters should correspond to compilation units, and they give you a second opinion that you can reconcile with the call-graph-derived clusters. + +When both signals agree, you have high confidence. Where they disagree, something interesting is happening — maybe a utility function shared across CUs, or a data section that got reordered. + +## Directed Graph Asymmetry as a Library Detector + +Here's a subtlety the transcript alludes to but doesn't fully develop. Call graphs are *directed*, and the directionality matters. Library code has a specific shape: it gets **called from far away** (user code invokes the library's public API) but **calls to nearby functions** (internal library routines). So a library entry point has high *in-degree from distant addresses* but low *out-degree to distant addresses*. + +You could compute, for each function, two metrics: the average address distance of its callers, and the average address distance of its callees. Plot every function on this 2D plane. I'd expect clusters to emerge — first-party code has short-range callers and a mix of callees (some local, some long-range calls into libraries), while library internals have short-range everything. Library *entry points* are the distinctive ones: long-range callers, short-range callees. They're the "necks" between the communities. + +## The Bisection Oracle Strategy + +The "git bisect" idea from the transcript is really interesting and I think it deserves formal treatment. 
If you have an expensive oracle (heavy analysis, LLM query, human analyst) and you want to minimize queries, this is an **active learning** problem. + +Concretely: you have a sequence of functions ordered by address, and you know that somewhere in this sequence there's a boundary between first-party and library code. You believe the sequence is monotonic (once you cross from first-party to library, you don't come back — at least within a region). Binary search finds the boundary in O(log n) queries. But real binaries might have multiple interleaved regions, so you'd need a slightly more sophisticated approach — maybe something like the **multi-armed bandit** formulation, where each candidate boundary location has an uncertainty estimate, and you query the point that maximally reduces global uncertainty. + +## Strongly Connected Components as Ground Truth CU Detectors + +One more concrete trick: compute the **strongly connected components** of the call graph. Mutual recursion (A calls B, B calls A) almost always happens within a single compilation unit. So SCCs give you "for free" sets of functions that definitely belong together. These become seeds for clustering — merge each SCC into a single super-node, then run your boundary detection on the reduced graph. + +------ + +## Synthesis + +If I were designing the pipeline, I'd layer these techniques: + +Start with **cheap deterministic anchors** — FLIRT matches, string attribution, `main` identification. Then use the **edge-length histogram** to estimate the characteristic scales. Run **SCC detection** to collapse definite-same-CU functions. Build edge weights from address proximity and shared data references, then use **min-cut** from known-first-party to known-library to find the primary boundary. Within each side, use **agglomerative clustering** on the address axis, breaking only at points where the call graph edge density drops off, to find compilation unit boundaries. 
Use **personalized PageRank** from each library's anchors for soft assignment. Finally, reconcile against the **data xref bipartite clustering** as independent validation. + +The result would be a hierarchical segmentation of the address space with confidence scores, and the boundaries would be exactly the API transition points that a reverse engineer cares about. + + + +--- + +These are good, and the interesting thing is where they overlap with what I already proposed — and more importantly, where the *gaps between them* reveal something deeper. Let me think through what these trigger. + +------ + +> here are some additional ideas. do they cause you to have any new insights or ideas? +> +> To transition MAPA from a good heuristic tool to a cutting-edge analytical engine, here are four algorithms that fit perfectly into their architecture: Algorithm 1: Spatial-Graph Label Propagation (Solving the "Tension") Instead of manually traversing graphs, use Label Propagation. +> +> 1. Build an adjacency matrix of the functions. +> 2. Weight the edges by both call frequency AND physical memory proximity. (e.g., Function A and B are strongly linked if A calls B and they are 100 bytes apart). +> 3. Lock the "Anchor" nodes (Ground Truth). +> 4. Run the propagation algorithm. It will mathematically resolve the "tension" the speaker mentioned, assigning a probability score to every single function (e.g., "Function 0x40100 is 88% likely to be zlib"). Algorithm 2: Spectral Clustering for "Compilation Units" The speaker wants to identify individual compilation units (source files). Because linkers lay out source files contiguously, compilation units form ultra-dense "cliques" in the graph. +> +> - Use Spectral Graph Partitioning. By calculating the Laplacian of the call graph, the algorithm will naturally shatter the binary into discrete chunks. 
+> - Because compilation units are sequential in memory, you can penalize the clustering algorithm for grouping functions that are physically far apart. The output will be distinct, colored blocks perfectly mapping to the original .c or .cpp files. Algorithm 3: Minimum Cut / Maximum Flow for API Boundaries To find the exact boundary between user code and library code, treat the graph like a pipe network carrying water from main to the library strings. +> - Run a Min-Cut/Max-Flow algorithm. The algorithm will identify the "bottleneck" edges where water struggles to pass from the malware to the library. +> - Those bottlenecks are, almost by definition, the public API functions of the statically linked library. + + + +## The Three Algorithms Are Solving Three Different Sub-Problems + +The most important realization is that these three aren't competing approaches — they're a **layered pipeline** attacking different levels of the hierarchy. Min-cut finds the macro boundary (first-party vs. library). Spectral clustering finds the meso-structure (compilation units). Label propagation assigns soft identities (which library, what confidence). That ordering matters, and each stage's output feeds the next. + +But there's something missing from this pipeline: none of them address **how to discover the right number of groups**. Min-cut gives you a bisection, but a real binary might have first-party code plus six libraries. Spectral clustering needs you to pick *k*. Label propagation needs seed labels. So the first real question is: **how many libraries are we even looking for?** + +## The Eigengap Tells You How Many Libraries Exist + +This is where the spectral approach becomes more powerful than the proposal suggests. Don't just use spectral clustering to *partition* — first use the **eigenvalue spectrum of the graph Laplacian** to *discover* the structure. 
Specifically, compute the first, say, 50 eigenvalues of the address-weighted call graph Laplacian and look for the **eigengap** — the largest jump between consecutive eigenvalues. If the first three eigenvalues are near zero and then the fourth jumps up, that's telling you the graph has roughly three natural communities. You've learned the number of libraries (plus first-party code) before doing any clustering at all. + +Even better: the **eigenvectors** themselves, when plotted against function address, should look like step functions. Within a compilation unit, all functions have similar eigenvector values (because they're tightly coupled). At a boundary, the values jump. So you can literally do **change-point detection on the Fiedler vector** (the eigenvector corresponding to the second-smallest eigenvalue) plotted over the address axis, and the change points are your compilation unit boundaries. This fuses the spectral and spatial approaches in a way that's more principled than penalizing distance in the clustering objective. + +## The Weighting Scheme Deserves More Thought + +Algorithm 1 proposes weighting edges by call frequency AND memory proximity. This sounds right but the *combination function* matters enormously. Additive weighting (`w = α·calls + β·proximity`) treats them as interchangeable signals. Multiplicative weighting (`w = calls · f(proximity)`) treats proximity as a **prior** that modulates graph evidence. I think the multiplicative form is more correct. + +Here's why: a call edge between two functions 50 bytes apart is *qualitatively different* from a call edge between two functions 500KB apart. The first is almost certainly intra-CU. The second is almost certainly cross-boundary. The call exists in both cases, but its *meaning* changes based on distance. A multiplicative weight captures this — a long-range call gets a small weight, so cutting it is cheap (it's probably a cross-boundary call), while cutting a short-range call is expensive. 
+ +Concretely, something like `w(e) = calls(e) · exp(-distance(e) / λ)` where λ is the characteristic CU size (which you estimated from the edge-length histogram I described earlier). This exponential decay naturally encodes "nearby calls bind tightly, distant calls bind weakly." + +## A Critique of Min-Cut That Leads Somewhere Interesting + +The min-cut formulation as described has a subtle problem: it's **binary**. It partitions into first-party and library. But what if there are three libraries? You'd need to run it multiple times, or use multi-terminal cut, which gets expensive. + +But here's what the min-cut framing *does* give you that's uniquely valuable: the **cut edges are the API boundary**, and the **capacity of the cut is a confidence measure**. A small min-cut value means the boundary is well-defined (only a few weakly weighted edges had to be severed). A large min-cut means the separation is tenuous — maybe you've drawn the line in the wrong place, or maybe the "library" is actually tightly integrated with user code. + +This leads to a practical idea: **iterate**. Run min-cut to get a first-party/library partition. Then *within the library side*, run min-cut again between the two most distant anchor sets (say, zlib anchors vs. OpenSSL anchors). This recursive bisection via min-cut naturally produces the hierarchy. And at each level, the cut capacity tells you how confident you should be. + +## A New Idea These Three Together Trigger: Diffusion Distance as a Universal Metric + +All three algorithms are really measuring the same underlying thing from different angles: **how easily does information flow between two nodes?** Label propagation is a diffusion process. Spectral clustering uses the eigenvectors of the diffusion operator. Min-cut measures the bottleneck in flow. + +This suggests a unifying abstraction: compute the **diffusion distance** between every pair of functions. 
Diffusion distance is the L2 distance between the rows of the matrix `exp(-t·L)` where L is the graph Laplacian and t is a time parameter. At small t, only very close (tightly connected, nearby) functions are similar. At large t, entire libraries blur into single points. + +The beauty is that **different values of t reveal different levels of the hierarchy**. Small t resolves compilation units. Medium t resolves libraries. Large t resolves the first-party/library split. You get the entire hierarchical segmentation from a single mathematical object just by tuning a parameter. + +You could visualize this as a dendrogram: at what "diffusion time" do two functions merge into the same cluster? Functions in the same CU merge almost immediately. Functions in the same library merge at medium time. Everything merges eventually. The dendrogram *is* the hierarchical classification you're after. + +## One Final Practical Insight: Robustness to Bad Disassembly + +All of these techniques assume the call graph is accurate, but in real binaries — especially malware — disassembly is imperfect. Indirect calls get missed. Tail calls look like jumps rather than calls. Obfuscated code deliberately breaks analysis. + +This is where the **spatial signal acts as a safety net**. Even when the graph is incomplete or noisy, the address layout is always there and always reliable. If your spectral clustering says two functions belong in different groups but they're 20 bytes apart with no other functions between them, the spatial evidence should override. The address axis is your highest-confidence signal; the graph is the informative but noisy supplement. Any algorithm you build should be designed to degrade gracefully — if the graph were *completely* absent, you should still be able to do a passable job purely on address-range heuristics and data xref locality. The graph makes it better, but the 1D structure is the backbone. + +--- + +> here are some further considerations. 
how should we integrate them into our list of strategies? +> +> 1. Features that seem underexploited. +> +> - 23a. Shared constant pools and tables. +> - 23b. Exception/unwind metadata neighborhoods. +> - 23c. Import-family fingerprints. +> - 23d. Repeated alignment and gap patterns between code blocks. +> - 23e. Data-section locality matching code-section locality. +> - 23f. Ego-network shape: not just who a function calls, but what its 1-hop and 2-hop neighborhood looks like. +> - 23g. Call-direction asymmetry across a candidate boundary. +> +> 1. A specific novel tactic: score gaps, not just nodes. +> +> - 24a. For each gap between adjacent functions in address order, compute “boundary likelihood.” +> - 24b. Features could include: +> - sudden change in seed evidence +> - reduced internal xrefs across the gap +> - padding/alignment pattern +> - change in referenced data neighborhood +> - change in import/API usage style +> - change in graph embedding neighborhood +> - 24c. Then solve for a sparse set of boundaries. +> - 24d. This is more direct than asking “what label is each function?” +> +> 1. Another promising tactic: recover “data islands” first. +> +> - 25a. Cluster strings/tables/rodata/data by address locality and xref neighborhoods. +> - 25b. Label those islands with higher confidence than code. +> - 25c. Then attach functions to the islands they reference. +> - 25d. Libraries often expose themselves through data long before their code is obvious. +> +> 1. A third tactic: active refinement near high-entropy frontiers. +> +> - 26a. Most of the binary may be easy. +> - 26b. The cost is concentrated near a few uncertain boundaries. +> - 26c. So only there should MAPA use expensive methods: +> - deeper semantic analysis +> - more complex embeddings +> - decompiler-derived features +> - analyst interaction +> - 26d. This is much better than spending equal effort on every function. 
+ +These four ideas aren't just additions to the list — they restructure how I think the whole pipeline should work. Let me explain why. + +## The Paradigm Shift: Score Boundaries, Not Functions + +Item 24 is the single most important idea in this entire conversation so far, and I think it should become the **primary framing** of the problem rather than a tactic alongside others. + +Every algorithm we've discussed — label propagation, spectral clustering, min-cut, PPR, HMMs — has been asking "what label does each function get?" But that's the indirect question. The direct question is "where are the seams?" And those seams live *between* functions, in the gaps. This reframing matters practically because the number of gaps (N-1 for N functions) is fixed, the features per gap are rich and computable, and the output you actually want — a segmentation — is literally a sparse subset of those gaps marked as boundaries. + +This turns the whole problem into **supervised or semi-supervised boundary classification on a 1D sequence of gaps**, which is vastly simpler and more tractable than graph community detection. You compute a feature vector for each gap, you have some gaps where you know a boundary exists (between a FLIRT-matched library function and an adjacent non-library function, for instance), and you train a classifier. Or you don't even need training — you define a scoring function and then solve for the *sparsest* set of boundaries that explains your evidence, which is essentially LASSO or L1-penalized segmentation. + +The gap features listed in 24b are excellent, and I'd add a few more that fall out of the earlier discussion: the diffusion distance between the functions flanking the gap, the min-cut capacity if you sever the graph at that point, and the eigenvector discontinuity at that gap in the Fiedler vector. Every algorithm we previously discussed becomes a **feature generator for the gap scorer** rather than a standalone solution. 
That's the architectural insight — the boundary scorer is the spine, and everything else feeds into it. + +## Data Islands as the Foundation Layer + +Item 25 is the second most important idea, and it reorders the pipeline in a way I didn't previously consider. The argument is subtle and correct: data is *more reliably clusterable than code* for several reasons. Strings are human-readable and often uniquely attributable to a library. Data items don't have the ambiguity of indirect calls. The data section layout reflects compilation unit structure just as faithfully as the code section, but with less noise. And critically, data doesn't get obfuscated nearly as often as code does in malware — the strings still need to be there at runtime. + +So the pipeline should be: cluster the data section first into "islands" using address locality and cross-reference structure. Label those islands using string content, constant patterns, and any other attributable features. Then project those labels back onto functions via data references. A function that references three strings in the "zlib island" is almost certainly zlib code, even if you've never seen its bytes before and FLIRT doesn't match it. + +This is powerful because it provides anchors that are **denser and more confident** than what you get from FLIRT or code-side analysis alone. It also means the data-island clustering becomes the *first* stage of the pipeline, before any code-side graph analysis. You start from data, anchor the code, then refine. + +## The Underexploited Features Suggest a Richer Feature Space for Gaps + +Items 23a through 23g aren't just a wishlist — they fill specific blind spots in the gap scoring framework: + +**23a (shared constant pools/tables)** and **23e (data-section locality)** feed directly into the data-island strategy. But they also provide a gap feature: if the functions on either side of a gap reference data from the *same* data island, the gap is unlikely to be a boundary. 
If they reference data from *different* islands, it probably is. + +**23b (exception/unwind metadata)** is a particularly clever signal I hadn't considered. Compilers emit unwind metadata per object file, so its layout can preserve traces of CU structure, but how much survives depends on the format and the linker. On Linux, each object file contributes its own CIE to `.eh_frame`, followed by the FDEs that reference it, so a run of FDEs sharing one CIE hints at a single CU; linkers commonly merge identical CIEs, which erases that hint. On Windows x64, `.pdata` is a flat table of per-function `RUNTIME_FUNCTION` entries sorted by address, each pointing at an `UNWIND_INFO` record in `.xdata`; there is no explicit per-CU grouping, and identical records can end up shared between unrelated functions. So you should parse `.pdata`/`.xdata` (on Windows) or `.eh_frame` (on Linux), but treat co-membership in the same unwind group as a strong gap feature rather than a hard constraint. + +**23c (import-family fingerprints)** is interesting because it captures *style*. First-party code that does file I/O will call a different set of APIs than zlib's compression internals. You're not just looking at individual API calls; you're looking at the *distribution* of API families a function touches. Two functions that both use memory allocation + string manipulation + crypto APIs probably belong to the same logical unit (maybe an OpenSSL subsystem), while a function that only touches file I/O is probably a different module. This becomes a gap feature: does the API usage style change across this gap? + +**23d (alignment and padding patterns)** is a beautiful low-level signal. Compilers emit characteristic alignment between functions (16-byte alignment, `int3` padding on MSVC, `nop` sleds on GCC). But *linkers* may pad differently between compilation units, often with larger gaps or different padding bytes. So the gap's physical characteristics (how many bytes of padding, what the padding looks like) are evidence of whether it's an intra-CU gap or an inter-CU gap. This is a nearly free feature that requires no graph analysis at all. + +**23f (ego-network shape)** is where things get really interesting algorithmically.
Two functions might both have degree 5 in the call graph, but one's ego network is a tree (it calls 5 leaf functions that call nothing else — probably a dispatcher or API facade) while the other's ego network is a dense clique (its 5 callees all call each other — probably deep library internals). The *shape* of the local neighborhood, not just its size, is a fingerprint of where you are in the program's structure. You could compute small graph motif counts or local clustering coefficients and use those as features. + +**23g (call-direction asymmetry)** completes the picture for boundary detection specifically. At a real API boundary, you expect calls going *into* the library (from the user-code side) but not calls going *out* of the library back into user code (libraries don't call their callers). So at a candidate boundary gap, you count directed edges crossing it in each direction. A strong asymmetry — many calls going deeper, few or none coming back — is a strong boundary signal. A gap where calls flow freely in both directions is probably internal to a single module. + +## Active Refinement as the Final Stage, Not a Separate Strategy + +Item 26 should be understood as the **execution strategy** for the whole pipeline, not a separate algorithm. The core insight is that uncertainty is spatially concentrated. After running the cheap, fast stages (data island clustering, gap feature computation, label propagation from anchors), most of the binary is probably classified with high confidence. What remains is a handful of ambiguous frontier zones — maybe 5-10% of the address space where two competing labels are close in score. + +This is where you focus your expensive techniques: run the decompiler only on the ambiguous functions. Compute embeddings only for the frontier. Query the analyst only about the uncertain gaps. 
And do it in the "bisect" order — query the gap in the middle of the largest remaining uncertain region first, because its answer splits the ambiguity in half. + +This means the pipeline architecture should explicitly support a **confidence map** over the address space. Every stage produces not just labels but uncertainty estimates. The uncertainty map guides where to spend further effort. And there's a natural stopping criterion: stop when the maximum remaining uncertainty drops below a threshold, or when the analyst says "good enough." + +## The Revised Pipeline + +Integrating everything, here's how I'd now structure it: + +**Stage 0 — Feature extraction.** Parse the binary. Extract functions, call graph, data xrefs, strings, exception metadata, alignment patterns, import calls. Cheap and mechanical. + +**Stage 1 — Data island clustering.** Cluster the data section by address locality and xref structure. Label islands using string content and constant pool attribution. This gives you the densest, most confident set of anchors. + +**Stage 2 — Anchor projection.** Project data-island labels onto functions via xrefs. Combine with FLIRT matches and known entry points (`main`, etc.). Every function now has either a confident label, a tentative label, or no label. + +**Stage 3 — Gap scoring.** For every inter-function gap, compute the full feature vector: padding pattern, data-island change, API style change, call-graph edge density across the gap, call-direction asymmetry, ego-network similarity of flanking functions, exception metadata boundary, eigenvector discontinuity. Score each gap's boundary likelihood. + +**Stage 4 — Segmentation.** Solve for the sparse set of boundaries that best explains the gap scores and the anchor constraints. This could be L1-penalized changepoint detection, or an HMM where the gap scores inform transition probabilities. 
+ +**Stage 5 — Label assignment.** Within each segment, assign a library/CU label using the anchors, propagated labels, and graph community structure. This is where label propagation or spectral methods work *within* the already-established boundaries rather than trying to find them. + +**Stage 6 — Active refinement.** Identify the highest-uncertainty boundaries. Apply expensive techniques — decompilation, semantic analysis, analyst queries — specifically at those points. Update the segmentation. Repeat until confidence is sufficient. + +The key design principle is that boundary detection (stages 3-4) and label assignment (stage 5) are **decoupled**. You find where the seams are first, then figure out what's on each side. And the most reliable signals (data, metadata, physical layout) come first, with the noisier signals (graph topology, semantic analysis) used for refinement rather than discovery. diff --git a/doc/plans/problem-transcript.md b/doc/plans/problem-transcript.md new file mode 100644 index 00000000..604c0d15 --- /dev/null +++ b/doc/plans/problem-transcript.md @@ -0,0 +1 @@ +In this project called MAPA, we are Finding ways to render the context of executable programs. We're primarily doing this at the function level. Then we have a couple different ways of rendering the results. Rendering the list of functions, some context, like their strings and API calls. And outputting this to the terminal, ordering the functions by address. Or we have an HTML view which shows the functions as blocks laid out left to right, top to bottom um as well as the strings that are tagged and some Call graph analysis that allows us to identify the neighborhood of a given function or set of functions. And all of this, the whole purpose of this project, this map-up project, is to give us Insights and views into the layouts of compiled programs for the purpose of enabling reverse engineers and malware analysts. 
Specifically, we frequently encounter programs that have been statically linked with library code, whether this is from the MSVC runtime or open source libraries like zlib, OpenSSL, PNG parsers, or other arbitrary libraries. Many malware samples are linked against this library code. The first order of business is differentiating first-party code from third-party code and identifying whether a function belongs to one or the other. And then the next question is, you know, within the library code, which library is which? Or even to go deeper than that, which compilation unit is which? So for example, something like zlib might have multiple source files, each one a compilation unit. They're all part of a single open source library, which is zlib, but one might be the compressor and the other the decompressor, or something like that. And so a function could be in one of these compilation units, which is within a library, and then ultimately something that's part of zlib is not first-party code, it's library code. And so we want to be able to mark a function at each one of those levels. That's our ultimate goal. And the reason we want to do that is because when a user begins reverse engineering from, for example, the entry point, we want to be able to kind of bound them in so that they don't go reverse engineering something that isn't first-party code. Ideally we are able to say: oh, everything over here is zlib, zlib is for compression, so what's happening here is compression; focus on maybe the API boundary, and don't try to reverse engineer all the logic of this library. So that's what we're trying to do. But when we're analyzing a compiled program, we don't have many indicators for what we're dealing with. We're able to disassemble the program and reconstruct the functions and the relationships among functions and data, the cross-references. And so we're able to say one function calls another function.
We're able to say a function references some data. Possibly some data references a function, if it's a function pointer or a vtable, for example. And so it's kind of this network relationship, a graph relationship among the different entities in the program. And then we sometimes have some contextual information where we can recognize certain things. For example, we might use something like FLIRT, a pretty well-trusted algorithm that looks at the bytes that make up a function and matches them, to identify a function as maybe part of a particular library or as having a particular name. And we can also recognize strings, strings that come from open source libraries that are unique. We can start saying, ah, this is a string that's definitely a part of zlib. Okay, so therefore the code that references the string is zlib. This is our goal for this project: we want to find a way to recover where a function belongs within those three categories. First party or third party, which library, and then which compilation unit. And one final thing that we can exploit, and that we are expecting to exploit here in whatever fuzzy or deterministic algorithm we come up with, is the fact that linkers, unless they're going above and beyond and using kind of a hardcore optimization, typically want to lay things out in a way that exploits data locality. Specifically, a compilation unit is almost always laid out sequentially and contiguously within the compiled program. And compilation units that are related to one another, such as within the same library, are often laid out close to one another or contiguously. And so we want to exploit this. This means if two functions come from the same compilation unit, they're probably found right next to one another, in the same region of the binary. And this also seems to be the case for data as well.
Now it's not that the data and the code are interspersed, but within the data section, data that is used by the same compilation unit tends to be close together. This is kind of the key insight that we want to exploit as well. And so we're going to be imagining how we can traverse the graph and create algorithms over the graph of these nodes and entities in such a way that we can identify where the boundaries between the groups are. And also, if we consider this graph but also imagine that the nodes in the graph are kind of pinned to physical locations within an address space, is there a characteristic that we can use to, again, help us differentiate the groups? For example, we kind of imagine a library may have a lot of calls within itself, and then there's probably a call into the library via its public interface from some other part of the program, which is probably located in a different physical part of the address space. And so there's maybe one transition in the call graph into the more tightly connected library code, or some such thing. Or likewise, we kind of think the callees of a library function are probably also part of the library, or of other lower-level library routines like the MSVC runtime or something like that. And so once we identify a library function, probably all of its callees are also library code. And so this is, you know, a kind of simple algorithm that we suspect probably works pretty well. And so if we can use strings, for example, to identify all the strings related to a library, and then we traverse all the callees downwards, we can maybe mark up a lot of the program as library code already.
There's also, at least the way I think about it, this idea that if we represent the program as a graph of nodes, you know, via the call graph, there are certain points that we might have more or less ground truth on. Like the main routine is almost certainly first-party code, so we can mark it as such. And then maybe we have a couple of functions that reference strings that are known to be part of open source libraries, and now those are anchored: okay, those are library functions. And in between there's this path through the call graph between the user code and the library code. And there's a tension here. It's ambiguous initially which of the nodes along the path are user code versus library code. And as we make further deductions or use algorithms to clarify which point is which, the boundary between these two regions becomes more distinct and better fitted, until ultimately we can clearly say this is the transition from user code into the public interface of the library code, that sort of thing. And so the job is to, from these initial starting points, explore outwards, maybe using algorithms or other techniques. Maybe this is one way to think about it. And maybe we can even guide a human to make judgments, to say, well, can you tell us about this function? What do you think? Which side is it on? And now once we know that, what about this function? And we can almost do like a git bisect to figure out, in a minimum number of steps, where that boundary is. Now we probably don't actually want to ask a human to do that, because it's kind of a hard thing for them to answer. But if there's an expensive technique that takes a lot of CPU or memory, we could ask a limited number of questions.
And if we do it in this kind of intelligent way, we could quickly arrive at this boundary without, you know, doing a brute-force attack against all the different nodes in the graph to find boundaries. So again, let me pause here. This is what we're trying to do. This is the goal of the project. And this is ultimately what we're trying to solve here. What is the second motive? \ No newline at end of file