As you've reverse engineered software, you've likely asked the following questions:
- Which libraries were statically linked into this executable?
- Does this executable share some code with another executable that I've analyzed?
- What are the differences between version 1 and version 2 of a given executable?
- Does this executable share code with another executable in a large collection of binaries?
- Was this function pulled from an open-source library?
BSim is intended to help with these questions (and others) by providing a way to search collections of binaries for similar, but not necessarily identical, functions.
The idea behind BSim is to generate a feature vector for each function in a binary. The vectors are generated by Ghidra's decompiler. Each feature represents a small piece of data flow and/or control flow of the associated function. The decompiler normalizes the feature vector representation so that different, but functionally equivalent, pieces of code often produce the same features. Certain attributes, such as values of constants, names of registers, and data types, are intentionally not incorporated into the features.
BSim vectors are compared using cosine similarity.
When two functions, `foo` and `bar`, differ only because of differences in compilers, target architectures, and/or small changes to the source code, their vectors are typically close but not identical.
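To make the comparison concrete, here is a minimal sketch of cosine similarity over sparse feature vectors. It assumes each vector is a mapping from a feature hash to a count. The hash values are made up for illustration, and any feature weighting that BSim itself applies is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors (feature hash -> count)."""
    dot = sum(count * b.get(feature, 0) for feature, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical feature hashes: foo and bar share two of their three features.
foo = {0x1A2B: 2, 0x3C4D: 1, 0x5E6F: 1}
bar = {0x1A2B: 2, 0x3C4D: 1, 0x7081: 1}

similarity = cosine_similarity(foo, bar)  # 5/6, about 0.833
```

A value of 1.0 means the two vectors point in the same direction, and the value falls as the functions share fewer features.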
BSim vectors can be stored in a dedicated database. BSim databases intended to hold large[^1] numbers of vectors maintain an index based on locality-sensitive hashing. The index drastically reduces the number of vector comparisons needed and allows for rapid retrieval of results.
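The text does not describe BSim's actual hashing scheme, so the following sketch substitutes the generic random-hyperplane technique for cosine similarity to show why such an index saves comparisons. All names here (`make_hyperplanes`, `lsh_key`, `build_index`, `candidates`) are hypothetical, and vectors are assumed to be dense lists of floats for simplicity.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (as normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_key(vector, hyperplanes):
    """One bit per hyperplane, recording which side the vector falls on."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group function names into buckets keyed by their LSH key."""
    index = defaultdict(list)
    for name, vector in vectors.items():
        index[lsh_key(vector, hyperplanes)].append(name)
    return index


def candidates(query, index, hyperplanes):
    """Return only the functions in the query's bucket."""
    return index.get(lsh_key(query, hyperplanes), [])
```

Vectors separated by a small angle are likely to land on the same side of every hyperplane, so a query only needs a full comparison against the functions in its own bucket. Practical schemes usually build several such tables so that near matches falling into different buckets are not missed.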
Querying `foo` against a BSim database typically yields a number of potential matches. Each individual match for `foo` can be compared to `foo` in a side-by-side view, and certain information (such as function name) can be quickly copied from a match to `foo`.
We frequently call a function's BSim vector its BSim signature, or just its signature when the context is clear.
We can think of each feature as representing a small piece of the behavior of a function, analogous to a snippet of source code. Functions whose BSim vectors are close typically have many features in common, that is, they have similar behavior. Hence the name "BSim": Behavioral Similarity.
Using BSim involves the following components:
- A BSim Client, i.e., an instance of Ghidra with the BSim plugin enabled.
  - This is where the reverse engineering happens.
- A BSim Database, which stores the BSim signatures.
  - Also stores some metadata about each function and its containing executable.
  - In particular, stores the ghidra:// URL of the associated Ghidra program.
  - Does not store disassembly or decompiled functions.
- A Ghidra Project, which stores the analyzed programs used to populate the BSim database.
  - Given a BSim match, the BSim client can use the ghidra:// URL to retrieve a program from a Ghidra project for side-by-side comparisons.
  - Note that a single BSim database can reference multiple Ghidra projects.
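As a rough illustration of how these components fit together, the sketch below models what a database entry carries and how a client could use the stored URL to locate the program. The `SignatureRecord` class, its field names, the `program_location` helper, and the example URL are all hypothetical. This is not BSim's actual schema.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class SignatureRecord:
    """Hypothetical shape of one database entry: signature plus metadata only."""
    function_name: str
    executable_name: str
    program_url: str   # ghidra:// URL of the associated Ghidra program
    signature: dict    # feature hash -> count; no disassembly or decompiled code


def program_location(record):
    """Split the stored URL into the host and the path within it."""
    parts = urlsplit(record.program_url)
    return parts.hostname, parts.path


# Example entry with a made-up URL and made-up feature hashes.
record = SignatureRecord(
    function_name="foo",
    executable_name="libexample.so",
    program_url="ghidra://ghidra.example.org/bsim_repo/libs/libexample.so",
    signature={0x1A2B: 2, 0x3C4D: 1},
)
host, path = program_location(record)
```

Because the entry holds only a reference to the program, a side-by-side comparison depends on the client being able to reach the Ghidra project that the URL points to.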
There are three supported database backends for BSim:
- PostgreSQL
  - The Ghidra distribution includes the source for PostgreSQL, a PostgreSQL plugin for BSim, and a build script.
  - Populated from shared Ghidra projects (i.e., requires a Ghidra server).
  - Server not supported on Windows (no restriction on clients).
- Elasticsearch
  - The `BSimElasticPlugin` extension contains an Elasticsearch plugin for BSim.
  - This plugin must be installed into an existing Elasticsearch database.
  - Populated from shared Ghidra projects.
- H2
  - Simplest way to use BSim:
    - Backed by files on the user's machine (don't need to install database server).
    - Can be created and populated quickly.
    - Supported on all platforms.
  - Does not support large collections of binaries or multiple users.
  - Can be populated from non-shared (local) or shared Ghidra projects.
Next Section: Starting Ghidra and Enabling BSim
[^1]: Creating a database requires a database template, which determines the specifics of the index. Currently, Ghidra provides a medium template, intended for databases holding up to 10 million unique vectors, and a large template, intended for databases holding up to 100 million unique vectors.