Skip to content

How GenRex works

Dominika Regéciová edited this page Jan 29, 2024 · 2 revisions

How GenRex works 🦖

GenRex accepts lists of strings and their type. The output is the list of regular expressions with additional information. These contain statistical information about the number of sample string occurrences, which is vital for creating YARA rules; original strings included in the regular expressions, or how many unique strings are in the cluster.

The whole process of evaluation in GenRex can be separated into several phases, which are described in detail following the model in Figure below:

In the first phase, GenRex collects input and preprocesses strings, and the length of n-gram is calculated as a half of the average lengths. Then, the strings are divided into clusters based on common n-grams. Statistical information (such as the number of unique strings) is calculated during clustering.

In the second phase, GenRex creates a prefix trie automaton for each cluster, using n-gram to detect not only common prefixes but also common subparts within the strings. The trie is then translated into equations and solved to create a regular expression.

In the last phase, we apply a series of domain-specific optimizations on the regular expression to provide optimal results.

genrex

Domain-specific Preprocessing

With preprocessing, the objective is to increase the chances that we detect the right similarities in named objects, not just standard notations based on a specific operating system or report from a sandbox.

Named objects can contain common prefixes that can hide the similarity based on the objects themselves, such as Global\ or Local. These prefixes indicate the namespace on Windows systems. There is a global namespace used primarily by services in client/server applications. Each client session has a separate namespace, so we can have multiple clients running the same applications without interfering with each other. These should be removed because they can lead to misguided results in the clustering phase and create a regular expression that would not detect named objects without these specific prefixes.

Other examples are cases where reports can contain a username in a path like C:\Users\Susan. In practical situations, we can expect any username imaginable. For this reason, Susan should be replaced by [^]+, representing any username.

The last case that we would like to explain in more detail is GUID. GUIDs, globally unique identifiers, are strings in format [A-Za-z0-9]{8}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{12}, sometimes represented with surrounding braces. GUIDs are often connected to software created by Microsoft, but their usage is more widespread - sometimes called UUID, universally unique identifier. With them, we can identify one of each installation of a program or hardware. A change of one byte similarly changes the meaning of GUID as in hashes, so it does not make sense to create a regular expression that detects partial similarities. In GenRex, strings containing only GUIDs are removed and strings containing other information are kept, but the GUID part is later substituted for a regular expression.

Clustering Phase

The clustering phase accepts the input strings and two boolean flags that influence the algorithm.

The first flag is called the cuckoo flag. If this flag is set up, an additional step substitutes characters from their hexadecimal format \xDD to their Latin-1 representation.

The second flag is called store_original_strings. This will save strings in their original format. Otherwise, only pre-processed versions of strings are stored.

Because the strings can be in input multiple times, GenRex creates a set of unique strings based on input, so clustering is not processing the same string more than once.

The length of ngram is calculated from a set of unique strings. All strings that are shorter than one ngram, are removed. Clustering itself is based on ngrams acquired from strings with fixed length. After clustering, the average length of string is calculated. If some string differs in length, the cluster is split into two. This will minimize the situation when most strings are short, so ngram is short and longer strings are clustered based on small similarities.

We are also saving statistical information about strings, such as the minimal number of string occurrences in the sample and other information. The ngram that created the cluster is also stored as it will be used in the next step.

Regular Expression Phase

This phase creates a regular expression for each cluster. The process is based on prefix trees, equations, and heuristics.

Firstly, we create a small prefix tree that contains ngram from the cluster. Then, we make the tree with all strings in cluster append we are interleaving them with prefix tree. This detects not only common prefixes but also detects common substrings in the middle parts of strings. This was shown to be more effective than the minimization process because suffixes are detected in the equation part.

Note that before strings are added to a tree, a selection of a few patterns, such as GUID-like patterns, are detected and substituted. These were selected based on the domain-specific nature of strings. We detect cases that can vary, but they have a common theme.

In the next step, the tree is rewritten into a set of equations. A solution is a regular expression obtained by Brzozowski’s algebraic method. During the step, two main operations are done - union and concatenation. The common suffix is detected during this step.

The result is indeed a regular expression but in a very general format. In order to generate even better-looking results, heuristics are taking place.

Domain-specific Optimizations for Regular Expressions

The heuristics check if we can simplify and, in some cases, even generalize the regular expression. For example, if we have a regular expression hello(1|2|3|4), we generalize it into hello[0-9]. The heuristics are still a work in progress, and if you have some suggestions, feel free to contact us.