diff --git a/contents/benchmarking/benchmarking.qmd b/contents/benchmarking/benchmarking.qmd
index acd04e57..2a149fbd 100644
--- a/contents/benchmarking/benchmarking.qmd
+++ b/contents/benchmarking/benchmarking.qmd
@@ -563,7 +563,7 @@ The [Common Objects in Context (COCO) dataset][https://cocodataset.org/](@lin201
 
 #### GPT-3 (2020)
 
-While the above examples primarily focus on image datasets, there have been significant developments in text datasets as well. One notable example is GPT-3[@brown2020language], developed by OpenAI. GPT-3 is a language model trained on a diverse range of internet text. Although the dataset used to train GPT-3 is not publicly available, the model itself, consisting of 175 billion parameters, is a testament to the scale and complexity of modern machine learning datasets and models.
+While the above examples primarily focus on image datasets, there have been significant developments in text datasets as well. One notable example is GPT-3 [@brown2020language], developed by OpenAI. GPT-3 is a language model trained on a diverse range of internet text. Although the dataset used to train GPT-3 is not publicly available, the model itself, consisting of 175 billion parameters, is a testament to the scale and complexity of modern machine learning datasets and models.
 
 #### Present and Future
 
@@ -667,7 +667,7 @@ As machine learning models become more sophisticated, so do the benchmarks requi
 
 **Out-of-Distribution Generalization**: Testing how well models perform on data that is different from the original training distribution. This evaluates the model's ability to generalize to new, unseen data. Example benchmarks are Wilds [@koh2021wilds], RxRx, and ANC-Bench.
 
-**Adversarial Robustness:** Evaluating model performance under adversarial attacks or perturbations to the input data. This tests the model's robustness. Example benchmarks are ImageNet-A[@hendrycks2021natural], ImageNet-C[@xie2020adversarial], and CIFAR-10.1.
+**Adversarial Robustness:** Evaluating model performance under adversarial attacks or perturbations to the input data. This tests the model's robustness. Example benchmarks are ImageNet-A [@hendrycks2021natural], ImageNet-C [@xie2020adversarial], and CIFAR-10.1.
 
 **Real-World Performance:** Testing models on real-world datasets that closely match end tasks, rather than just canned benchmark datasets. Examples are medical imaging datasets for healthcare tasks or actual customer support chat logs for dialogue systems.