Skip to content

Commit

Permalink
Add Predicting CVSS Vectors with text embeddings and random forests
Browse files Browse the repository at this point in the history
  • Loading branch information
dguerri committed Aug 13, 2024
1 parent 41a0f07 commit 341a6b5
Showing 1 changed file with 399 additions and 0 deletions.
399 changes: 399 additions & 0 deletions _posts/2024-08-13-cvss-vectors-with-embeddings-and-random-forests.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,399 @@
---
layout: post
description: Predicting CVSS Vectors with text embeddings and random forests
comments: true
date: 2024-08-13
last-update: 22024-08-13
---

NIST's Common Vulnerability Scoring System (CVSS) assigns a severity score to vulnerabilities, but they are facing a backlog due to the increasing number of vulnerabilities and changes in how they are analyzed. This makes it difficult for security professionals to assess security risk and prioritize which vulnerabilities to fix first.
This post explores how AI can help with this problem.

# Predicting CVSS Vectors with text embeddings and random forests

## The problem

If you work in information security, you are probably aware of the delays in enriching or classifying CVE impacts that we are experiencing in 2024.

One of the value-added by the NIST CVE analysis is the CVSS vector. In short a CVSS vector is a concise string that summarises the characteristics and severity of a software vulnerability based on the Common Vulnerability Scoring System (CVSS). It consists of several metrics like Attack Vector, Attack Complexity, and Impact scores, each with a specific value. This vector allows for a standardised way to assess and communicate the risk associated with a vulnerability, enabling security professionals to prioritise their response efforts.

According to their own website, the National Vulnerability Database (NVD) is facing a growing backlog of vulnerabilities due to increased software and changes in interagency support.

While NIST is prioritising the analysis of significant vulnerabilities, and while they are exploring long-term solutions like a consortium of stakeholders to collaborate on improving the NVD, as of today, companies involved in vulnerability management have to come up with alternative solutions to estimate the risk they are facing and prioritise vulnerability remediation.

![Image.png](https://res.craft.do/user/full/0c177b17-813d-3fbd-b831-96c793c08936/doc/3F0A08E2-24DE-4280-80C9-F59CB31DA699/3D7385E5-39CE-458A-BC4B-C097851F4CCE_2/YQ9qpWKFN1xCNYmYO0BzFZvvyMravq65UHsEL9xTO40z/Image.png)

AI is a big, hot topic today, and many think it will steal our jobs and be applied to virtually everything, in the long term.

While that sounds quite inaccurate to me, I also believe that the value of (text) embeddings is something that didn't have the right attention so far 🙂. So let's see how can we do something useful with AI.

### What are text embeddings?

Text embeddings are a way to represent words or phrases as dense vectors (lists of numbers) in a high-dimensional space.

In simple terms, these vectors ***capture the semantic meaning*** of the text***,*** with respect to the training data, positioning words or phrases with similar meanings closer to each other in this space. This allows computers to perform various natural language processing tasks by measuring the distance or similarity between these vectors.

Embeddings are basically calculated as the activation of one of the latest layers of a trained neural network. Currently, the most popular way to generate an embedding model is by utilising pre-trained Large Language Models (LLMs) like BERT or GPT variants. These models have revolutionised natural language processing due to their ability to capture contextual relationships between words and generate high-quality embeddings.

It should be noted that LLMs are trained on gargantuan amount of data to be generally useful. That means that embedding models derived from LLMs are, generally, capable of finding and encoding meaning for many domains.

As you probably know, or as you can imagine, more specialised (or fine-tuned) LLMs can perform much better on the domain they have been optimised for.

In this post, I will use a general purpose text embedding model, namely [text-embedding-gecko](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings-api#model_versions) from Google VertexAI. More precisely, I will use `text-embedding-004`, which is a mono-modal embedder trained on data written in English language.

### Okay, cool, but what does all this have to do with CVEs?

In order to assign a CVSS vector to a CVE, a security analyst would carefully read and understand the vulnerability description. Then, using their security knowledge, they would create a meaningful vector, allowing the attribution of a base risk score to the vulnerability.

Can AI do that? It certainly can, to some extent.

You probably cannot just feed the description to a large language model and hope to get a super accurate CVSS vector. At least in 2024.

But fear not, AI is not just LLMs and sharks.

Getting text and embeddings is a great way to extract meaning from words. That meaning, encoded with a vector in a highly-dimensional space, is a perfect candidate for classifying machine learning models.

Let's see how to combine these two things.

## The juicy part

![Image.png](https://res.craft.do/user/full/0c177b17-813d-3fbd-b831-96c793c08936/doc/3F0A08E2-24DE-4280-80C9-F59CB31DA699/FBD728AE-0348-4AC8-BEE2-7C55912D65BA_2/aASyZaZuPgl0UNqWcJXyDziphpMZdcxa7JykCKcxKfMz/Image.png)

### Step 1. Getting the data

This is the easiest part.

NIST allows downloading CVE feeds straight from their site.

The data has been incrementally improving over the years, but it's still not great as historically there isn't much structure on some CVE attributes and each CNA has its own way of filing CVEs...

Anyway, NIST has been doing a decent job at assigning and improving the CVSS vector during vulnerabilities lifecycle. At least until 2024.

For our little experiment, we will download CVE data for 2022 and 2023, saving them in a handy [Google BigQuery](https://cloud.google.com/bigquery?hl=en) table.

```python
feed_format = 'https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-{}.json.gz'
years = (2022, 2023)

[...]
skipped = 0
processed = 0
for year in years:
year_cves = []
nvd_data = nvd_feed_url_to_json(feed_format.format(year))

print(f"Processing data for year {year}: "
f"{len(nvd_data['CVE_Items'])} total CVEs")
for item in nvd_data['CVE_Items']:
relevant_data = extract_cve_from_item(item)
if relevant_data is None:
skipped += 1
else:
year_cves.append(relevant_data)
processed += 1

print("Saving to bigquery")
save_to_bigquery(
project_id=PROJECT_ID, dataset_id=DATASET_ID, table_id=TABLE_ID,
schema=BIGQUERY_CVES_SCHEMA, data_list=year_cves)

print(f"Processed CVEs: {processed}\nSkipped CVEs {skipped}")
```

`extract_cve_from_item` is a simple function extracting from "raw" CVE data the information we need to train our model. We only handle CVEs with `baseMetricV3` as we are only interested in CVSS v3.

```python
def extract_cve_from_item(item: Dict[str, Any]) -> Optional[Dict[str, Any]]:
if 'baseMetricV3' not in item['impact']:
return None

en_text = next(
(item['value'] for item in item['cve']['description']['description_data']
if item['lang'] == 'en'), None)
if en_text is None
return None

return {
'cve_id': item['cve']['CVE_data_meta']['ID'],
'assigner': item['cve']['CVE_data_meta']['ASSIGNER'],
'cvssv3_vector_string': item['impact']['baseMetricV3']['cvssV3']['vectorString'],
'cvssv3_attack_vector': item['impact']['baseMetricV3']['cvssV3']['attackVector'],
'cvssv3_attack_complexity': item['impact']['baseMetricV3']['cvssV3']['attackComplexity'],
'cvssv3_privileges_required': item['impact']['baseMetricV3']['cvssV3']['privilegesRequired'],
'cvssv3_user_interaction': item['impact']['baseMetricV3']['cvssV3']['userInteraction'],
'cvssv3_scope': item['impact']['baseMetricV3']['cvssV3']['scope'],
'cvssv3_confidentiality_impact': item['impact']['baseMetricV3']['cvssV3']['confidentialityImpact'],
'cvssv3_integrity_impact': item['impact']['baseMetricV3']['cvssV3']['integrityImpact'],
'cvssv3_availability_impact': item['impact']['baseMetricV3']['cvssV3']['availabilityImpact'],
'cvssv3_base_score': item['impact']['baseMetricV3']['cvssV3']['baseScore'],
'cvssv3_base_severity': item['impact']['baseMetricV3']['cvssV3']['baseSeverity'],
'english_description': en_text,
}
```

Sample output for the above snippet:

```other
Processing data for year 2022: 24961 total CVEs
Saving to bigquery
Loaded 23746 rows to BigQuery table cves
Processing data for year 2023: 28462 total CVEs
Saving to bigquery
Loaded 25378 rows to BigQuery table cves
Processed CVEs: 70406
Skipped CVEs 5895
```

### Step 2. Getting the embeddings

Now that we have a BigQuery table with all the relevant pieces of information, we need to translate `english_description` into features we can use to classify the CVE into multiple classes.

To do so, we will be using the Google Vertex AI python API. Note that in order to deal with to the rate limiting on that API, we are using an helper function splitting the data into batches and making sure we don't submit more than 100 requests per second.

I load the data set from BigQuery into a Pandas DataFrame

```python
from google.cloud import bigquery
import pandas as pd

def run_bq_query(sql: str) -> pd.DataFrame:
# Create BQ client
bq_client = bigquery.Client(project=PROJECT_ID)
# Try dry run before executing query to catch any errors
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
# This will raise an exception if there's a query error
bq_client.query(sql, job_config=job_config)
# If dry run succeeds without errors, proceed to run query
job_config = bigquery.QueryJobConfig()
client_result = bq_client.query(sql, job_config=job_config)
job_id = client_result.job_id
# Wait for query/job to finish running, then get & return data frame
df = client_result.result().to_arrow().to_pandas()
print(f"Finished job_id: {job_id}")
return df

cve_df = run_bq_query("SELECT * FROM nvd.cves")
```

Below I calculate the embeddings for the whole table. For the model used here, each embedding is a 768-dimensional vector.

```python
from vertexai.language_models import TextEmbeddingModel

# Initialize Vertex AI SDK
[...]
# Load the TextEmbeddingModel
model = TextEmbeddingModel.from_pretrained("text-embedding-004")

embeddings = encode_text_to_embedding_batched(
sentences=new_cve_df["english_description"].tolist(),
model=model,
api_calls_per_second = 95/60,
batch_size = 20)
```

And finally add the new column with the embeddings to the DataFrame

```python
cve_df.insert(
loc=len(cve_df.columns), column='description_embeddings',
value=embeddings.tolist())
```

## Step 3. Training the model

Now that we have numeric vectors describing out CVEs, we can try to classify them.

We could use a single classifier, trained to predict the whole CVSS Vector, but that would be inaccurate given the possible combinations of 8 different dimensions of that vector.

So, we will be training 8 different classifiers, one per each dimension of the CVSS Vector.

```python
# Dimensions we are interested in predicting
categories = [
'cvssv3_attack_vector',
'cvssv3_attack_complexity',
'cvssv3_privileges_required',
'cvssv3_user_interaction',
'cvssv3_scope',
'cvssv3_confidentiality_impact',
'cvssv3_integrity_impact',
'cvssv3_availability_impact',
]
```

For each Random Forest classifier, we will split the data in training and test set, also calculating the accuracy of each one:

```python
estimators = 300
X = new_cve_df['description_embeddings'].to_list()
classifiers = {}
for category in categories:
print(f"Training random forest classifier for {category}...")
y = new_cve_df[category].values
# Define training and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size = 0.2, random_state = 2)
# Train the classifier
clf = RandomForestClassifier(n_estimators=estimators,n_jobs=-1)
clf.fit(X_train, y_train)
classifiers[category] = {'clf': clf}
# Calculate the classifier accuracy
y_pred = clf.predict(X_test)
classifiers[category]['acc'] = accuracy_score(y_test, y_pred)
```

Let's see how the classifiers performed on the test set:

```python
for k, v in classifiers.items():
print(f"{k:30} - accuracy: {v['acc']:>6.3f}")
```

Which produced:

```plaintext
cvssv3_attack_vector - accuracy: 0.901
cvssv3_attack_complexity - accuracy: 0.964
cvssv3_privileges_required - accuracy: 0.753
cvssv3_user_interaction - accuracy: 0.924
cvssv3_scope - accuracy: 0.958
cvssv3_confidentiality_impact - accuracy: 0.831
cvssv3_integrity_impact - accuracy: 0.833
cvssv3_availability_impact - accuracy: 0.868
```

Not bad for a quick and dirty model!

To summarise, "just" looking at the CVE description we are able to predict

- the attack vector, attack complexity, need of user interaction, scope with an accuracy of over 90%
- the impacts with an accuracy of over 83%
- the need of privileges with an accuracy of 75%

### 'nuff test data, how do we perform on 2024 data?

Let's try a couple of unclassified CVEs.

#### CVE-2024-39243

For [this](https://nvd.nist.gov/vuln/detail/CVE-2024-39243)[ CVE](https://nvd.nist.gov/vuln/detail/CVE-2024-26646), we do have a CVSS 3.1 vector on the NIST website:

`CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H`

The CVE description recites:

```plaintext
An issue discovered in skycaiji 2.8 allows attackers to run arbitrary code
via crafted POST request to /index.php?s=/admin/develop/editor_save.
```

Let's see what happens:

```plaintext
# Test classifiers
en_desc = 'An issue discovered in skycaiji 2.8 allows attackers to run arbitrary code via crafted POST request to /index.php?s=/admin/develop/editor_save.'
# Get the embedding of this desc and predict its category
desc_embedding = model.get_embeddings([en_desc])[0].values
# Predict each dimension of the CVSS Vector
for category in categories:
pred = classifiers[category]['clf'].predict([desc_embedding])
print(f"Predicted {category:>30}: {pred[0]}")
```

```plaintext
Predicted cvssv3_attack_vector: NETWORK
Predicted cvssv3_attack_complexity: LOW
Predicted cvssv3_privileges_required: NONE
Predicted cvssv3_user_interaction: NONE
Predicted cvssv3_scope: UNCHANGED
Predicted cvssv3_confidentiality_impact: HIGH
Predicted cvssv3_integrity_impact: HIGH
Predicted cvssv3_availability_impact: HIGH
```

bullseye!

#### CVE-2024-27981

For [this](https://nvd.nist.gov/vuln/detail/CVE-2024-39243)[ ](https://nvd.nist.gov/vuln/detail/CVE-2024-26646)[CVE](https://nvd.nist.gov/vuln/detail/CVE-2024-27981), we do not have a CVSS 3.1 vector on the NIST website.

The CVE description says:

```plaintext
A Command Injection vulnerability found in a Self-Hosted UniFi Network Servers
(Linux) with UniFi Network Application (Version 8.0.28 and earlier) allows a
malicious actor with UniFi Network Application Administrator credentials to
escalate privileges to root on the host device. Affected Products: UniFi
Network Application (Version 8.0.28 and earlier).
Mitigation: Update UniFi Network Application to Version 8.1.113 or later.
```

Using the same code as before, we get:

```plaintext
Predicted cvssv3_attack_vector: NETWORK
Predicted cvssv3_attack_complexity: LOW
Predicted cvssv3_privileges_required: LOW
Predicted cvssv3_user_interaction: NONE
Predicted cvssv3_scope: UNCHANGED
Predicted cvssv3_confidentiality_impact: HIGH
Predicted cvssv3_integrity_impact: HIGH
Predicted cvssv3_availability_impact: HIGH
```

The vector roughly make sense to me. Maybe with some caveats.

Impact dimensions seem accurate, and so does the attack vector. User interaction is not required indeed, as long as we have some (limited?) privileges: i.e., admin on the appliance.

Things that make less sense to me are:

- Scope: the description doesn't help much, as I don't know what I can do with a UniFi Network Application Administrator access.
- Attack complexity is not clear. Unless we know how to exploit this, we can't really tell...

#### CVE-2024-4764

We have a vector for this one:

`CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H`

Let's see what our prediction says, just looking at the following description:

```plaintext
Multiple WebRTC threads could have claimed a newly connected audio input
leading to use-after-free. This vulnerability affects Firefox < 126.
```

```plaintext
Predicted cvssv3_attack_vector: NETWORK
Predicted cvssv3_attack_complexity: LOW
Predicted cvssv3_privileges_required: NONE
Predicted cvssv3_user_interaction: NONE
Predicted cvssv3_scope: UNCHANGED
Predicted cvssv3_confidentiality_impact: HIGH
Predicted cvssv3_integrity_impact: HIGH
Predicted cvssv3_availability_impact: HIGH
```

Another one bites the dust :)

## How can we improve?

75%, for the worst case, might not be enough in some scenarios. And, in general, we might want to improve that 83%+ too, to avoid a false sense of security (or insecurity) while calculating the overall vulnerability management risk for our domain.

There are several ways to improve the model. To name some:

1. Add more unstructured/structured data from the CVE when calculating the embeddings.
2. Add some information about the vulnerable software or intelligence data we have from different security feeds.
3. Test different classifiers or experiment with classifiers parameters.
4. Fine-tune the text embedding model for vulnerability management and cybersecurity.
5. Train a model from scratch on cybersecurity data.

## Conclusion (written by AI!)

In conclusion, this exploration into leveraging text embeddings and machine learning for CVSS vector prediction has showcased the potential of AI to address the critical challenge of timely vulnerability assessment. The ability to automatically estimate CVSS vectors based on vulnerability descriptions, even before official assignments, empowers organisations to proactively manage their security risks. While the initial results demonstrate promise, there remains room for further refinement and enhancement through the integration of domain-specific knowledge and the exploration of more advanced language models and classification algorithms.

As the cybersecurity landscape continues to evolve, the integration of AI-driven solutions like this will become increasingly vital in ensuring the resilience of our digital infrastructure.

By automating and accelerating the process of CVSS vector prediction, we can help organisations stay ahead of emerging threats and safeguard their critical assets against exploitation. The path forward involves continued research and collaboration to develop even more sophisticated models capable of accurately capturing the nuances and complexities of vulnerabilities, ultimately contributing to a more secure and resilient digital future.

0 comments on commit 341a6b5

Please sign in to comment.