Fix formatting issues
profvjreddi committed May 7, 2024
1 parent 78e7c44 commit 72a934f
Showing 1 changed file with 11 additions and 11 deletions.
22 changes: 11 additions & 11 deletions contents/robust_ai/robust_ai.qmd
@@ -295,9 +295,9 @@ Table XX provides an extensive comparative analysis of transient, permanent, and

#### Definition and Characteristics

Adversarial attacks are methods that aim to trick models into making incorrect predictions by providing them with specially crafted, deceptive inputs (called adversarial examples) [@parrish2023adversarial]. By adding slight perturbations to input data, adversaries can "hack" a model's pattern recognition and deceive it. These are sophisticated techniques where slight, often imperceptible alterations to input data can trick an ML model into making a wrong prediction.

In text-to-image models like DALLE [@ramesh2021zero] or Stable Diffusion [@rombach2022highresolution], one can generate prompts that lead to unsafe images. In the vision domain, by altering the pixel values of an image, attackers can deceive a facial recognition system into identifying a face as a different person.

Adversarial attacks exploit the way ML models learn and make decisions during inference. These models work on the principle of recognizing patterns in data. An adversary crafts special inputs with perturbations to mislead the model's pattern recognition, essentially 'hacking' the model's perceptions.
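
To make the mechanism concrete, here is a minimal sketch of a single-step, gradient-based perturbation in the spirit of the fast gradient sign method (FGSM). It assumes a PyTorch image classifier and a correctly labeled input batch; the function name and the `epsilon` budget are illustrative choices rather than part of any particular attack described above.

```python
import torch.nn.functional as F

def fgsm_attack(model, x, labels, epsilon=0.03):
    """Craft adversarial examples with one fast-gradient-sign step."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), labels)
    loss.backward()
    # Nudge each input value slightly in the direction that increases the loss.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0, 1).detach()  # keep pixel values in a valid range
```

Even a perturbation of a few percent of the input range, imperceptible to a human viewer, is often enough to flip the prediction of an undefended classifier.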

@@ -311,9 +311,9 @@ Adversarial attacks fall under different scenarios:

The landscape of machine learning models is both complex and broad, especially given their relatively recent integration into commercial applications. This rapid adoption, while transformative, has brought to light numerous vulnerabilities within these models. Consequently, a diverse array of adversarial attack methods has emerged, each strategically exploiting different aspects of different models. Below, we highlight a subset of these methods, showcasing the multifaceted nature of adversarial attacks on machine learning models:

* **Generative Adversarial Networks (GANs)** are deep learning models that consist of two networks competing against each other: a generator and a discriminator [@goodfellow2020generative]. The generator tries to synthesize realistic data, while the discriminator evaluates whether that data is real or fake. GANs can be used to craft adversarial examples (a minimal sketch follows this list). The generator network is trained to produce inputs that are misclassified by the target model. These GAN-generated images can then be used to attack a target classifier or detection model. The generator and the target model are engaged in a competitive process, with the generator continually improving its ability to create deceptive examples, and the target model enhancing its resistance to such examples. GANs provide a powerful framework for crafting complex and diverse adversarial inputs, illustrating the adaptability of generative models in the adversarial landscape.

* **Transfer Learning Adversarial Attacks** exploit the knowledge transferred from a pre-trained model to a target model, enabling the creation of adversarial examples that can deceive both models. These attacks pose a growing concern, particularly when adversaries have knowledge of the feature extractor but lack access to the classification head (the part or layer that is responsible for making the final classifications). Referred to as "headless attacks," these transferable adversarial strategies leverage the expressive capabilities of feature extractors to craft perturbations while being oblivious to the label space or training data. The existence of such attacks underscores the importance of developing robust defenses for transfer learning applications, especially since pre-trained models are commonly used [@ahmed2020headless].
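
To illustrate the GAN-style approach in the first bullet above, the sketch below trains a small generator to emit bounded perturbations that push a target classifier away from the correct labels, with only the generator's parameters being updated. The architecture, the 0.1 perturbation scale, and the 28x28 grayscale input size are illustrative assumptions, not a reference implementation.

```python
import torch.nn as nn

class PerturbationGenerator(nn.Module):
    """Toy generator that adds a small perturbation to a 28x28 grayscale image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 784), nn.Tanh(),  # raw perturbation in [-1, 1]
        )

    def forward(self, x):
        delta = 0.1 * self.net(x).view_as(x)  # scale down to keep the change subtle
        return (x + delta).clamp(0, 1)

def generator_attack_step(generator, target_model, x, labels, optimizer):
    """One update: increase the target model's loss on the true labels."""
    x_adv = generator(x)
    loss = -nn.functional.cross_entropy(target_model(x_adv), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # the optimizer holds only the generator's parameters
    return -loss.item()  # report the target model's loss on the adversarial batch
```

Repeating this step over many batches steadily improves the generator's deceptive examples against the (here fixed) target model.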

#### Mechanisms of Adversarial Attacks

@@ -373,7 +373,7 @@ As the field of adversarial machine learning evolves, researchers continue to ex

Adversarial attacks on machine learning systems have emerged as a significant concern in recent years, highlighting the potential vulnerabilities and risks associated with the widespread adoption of ML technologies. These attacks involve carefully crafted perturbations to input data that can deceive or mislead ML models, leading to incorrect predictions or misclassifications. The impact of adversarial attacks on ML systems is far-reaching and can have serious consequences in various domains.

One striking example of the impact of adversarial attacks was demonstrated by researchers in 2017. They experimented with small black and white stickers on stop signs [@eykholt2018robust]. To the human eye, these stickers did not obscure the sign or prevent its interpretability. However, when images of the sticker-modified stop signs were fed into standard traffic sign classification ML models, a shocking result emerged. The models misclassified the stop signs as speed limit signs over 85% of the time.

This demonstration shed light on the alarming potential of simple adversarial stickers to trick ML systems into misreading critical road signs. The implications of such attacks in the real world are significant, particularly in the context of autonomous vehicles. If deployed on actual roads, these adversarial stickers could cause self-driving cars to misinterpret stop signs as speed limits, leading to dangerous situations. Researchers warned that this could result in rolling stops or unintended acceleration into intersections, endangering public safety.

@@ -397,7 +397,7 @@ The impact of adversarial attacks on ML systems is significant and multifaceted.

#### Definition and Characteristics

Data poisoning is an attack where the training data is tampered with, leading to a compromised model [@biggio2012poisoning]. Attackers can modify existing training examples, insert new malicious data points, or influence the data collection process. The poisoned data is labeled in such a way as to skew the model's learned behavior. This can be particularly damaging in applications where ML models make automated decisions based on learned patterns. Beyond training sets, poisoning test and validation data can allow adversaries to artificially boost reported model performance.
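
As a deliberately simple illustration of training-data tampering, the sketch below flips a fraction of one class's labels before training, a basic form of the label-flipping attack discussed below. The function name and parameters are hypothetical; real poisoning campaigns are usually far more subtle.

```python
import random

def flip_labels(dataset, target_class, new_class, fraction=0.05, seed=0):
    """Return (x, y) pairs with a fraction of `target_class` labels flipped."""
    rng = random.Random(seed)
    poisoned = []
    for x, y in dataset:
        if y == target_class and rng.random() < fraction:
            poisoned.append((x, new_class))  # silently mislabeled sample
        else:
            poisoned.append((x, y))
    return poisoned
```

A model trained on the returned list inherits the skewed decision boundary without any change to its architecture or training code.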

![NightShade's poisoning effects on Stable Diffusion (Source: [TOMÉ](https://telefonicatech.com/en/blog/attacks-on-artificial-intelligence-iii-data-poisoning))](./images/png/poisoning_example.png){#fig-poisoning-example}

@@ -411,7 +411,7 @@ The process usually involves the following steps:

The impacts of data poisoning extend beyond just classification errors or accuracy drops. In critical applications like healthcare, such alterations can lead to significant trust and safety issues [@marulli2022sensitivity]. Later on we will discuss a few case studies of these issues.

There are six main categories of data poisoning [@oprea2022poisoning]:

* **Availability Attacks**: these attacks aim to compromise the overall functionality of a model. They cause it to misclassify most testing samples, rendering the model unusable for practical applications. An example is label flipping, where labels of a specific, targeted class are replaced with labels from a different one.

@@ -483,7 +483,7 @@ Addressing the impact of data poisoning requires a proactive approach to data se

##### Case Study 1

In 2017, researchers demonstrated a data poisoning attack against a popular toxicity classification model called Perspective [@hosseini2017deceiving]. This ML model is used to detect toxic comments online.

The researchers added synthetically generated toxic comments with slight misspellings and grammatical errors to the model's training data. This slowly corrupted the model, causing it to misclassify increasing numbers of severely toxic inputs as non-toxic over time.
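
A toy version of this kind of poisoned-data generation is sketched below: toxic comments receive a small character swap and are mislabeled as non-toxic before being mixed into a training set. This is only a schematic illustration of the idea described above, not the researchers' actual pipeline, and all names are hypothetical.

```python
import random

def misspell(text, rng):
    """Swap two adjacent characters to mimic a small typo."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def craft_poison_comments(toxic_comments, variants_per_comment=3, seed=0):
    """Produce misspelled toxic comments mislabeled as non-toxic (label 0)."""
    rng = random.Random(seed)
    poison = []
    for comment in toxic_comments:
        for _ in range(variants_per_comment):
            poison.append((misspell(comment, rng), 0))  # 0 = "non-toxic" label
    return poison
```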

@@ -495,17 +495,17 @@ This case highlights how data poisoning can degrade model accuracy and reliabili

![Samples of dirty-label poison data regarding mismatched text/image pairs (Source: [Shan](https://arxiv.org/pdf/2310.13828))](./images/png/dirty_label_example.png){#fig-dirty-label-example}

Interestingly enough, data poisoning attacks are not always malicious [@shan2023prompt]. Nightshade, a tool developed by a team led by Professor Ben Zhao at the University of Chicago, utilizes data poisoning to help artists protect their art against scraping and copyright violations by generative AI models. Artists can use the tool to make subtle modifications to their images before uploading them online.

While these changes are indiscernible to the human eye, they can significantly disrupt the performance of generative AI models when incorporated into the training data. Generative models can be manipulated into generating hallucinations and distorted images. For example, with only 300 poisoned images, the University of Chicago researchers were able to trick the latest Stable Diffusion model into generating images of dogs that look like cats or images of cows when prompted for cars.

As the number of poisoned images on the internet increases, the performance of the models that use scraped data will deteriorate exponentially. First, the poisoned data is hard to detect, and would require a manual elimination process. Second, the "poison" spreads quickly to other labels because generative models rely on connections between words and concepts as they generate images. So a poisoned image of a "car" could spread into generated images associated with words like "truck", "train", "bus", etc.

On the flip side, this tool can be used maliciously and can affect legitimate applications of the generative models. This goes to show the very challenging and novel nature of machine learning attacks.

@fig-poisoning demonstrates the effects of different levels of data poisoning (50 samples, 100 samples, and 300 samples of poisoned images) on generating images in different categories. Notice how the images start deforming and deviating from the desired category. For example, after 300 poison samples, a car prompt generates a cow.

![Data poisoning (Credit: @shan2023prompt)](images/png/image14.png){#fig-poisoning}

### Distribution Shifts

