Melanoma task #310

Open · wants to merge 39 commits into base: main

39 commits
1186639
skeleton based on Sebastian's description and the MNIST task
cxzhang4 Oct 20, 2024
ae5709c
added initial test file, script where I will interactively try out th…
cxzhang4 Oct 20, 2024
bb37f7b
added more skeleton files
cxzhang4 Oct 22, 2024
b59e3e2
there exists code that downloads and unzips
cxzhang4 Oct 25, 2024
996780b
extra comment
cxzhang4 Oct 25, 2024
72c535e
dataset constructs
cxzhang4 Oct 25, 2024
1c0e12e
benchmark code for image loaders
cxzhang4 Oct 25, 2024
e950423
idrk
cxzhang4 Oct 25, 2024
d3ceff2
added resize script for melanoma dataset
cxzhang4 Oct 25, 2024
9566b52
faijweoif
cxzhang4 Oct 25, 2024
47b090b
magick 2 times as slow
cxzhang4 Oct 25, 2024
9794efd
jwaoeifajwoeij
cxzhang4 Oct 27, 2024
5f71fe9
added my local cache dirs to gitignore
cxzhang4 Oct 27, 2024
dd8da0c
finished resizing images in hard-coded cache dir
cxzhang4 Oct 29, 2024
9b1c240
code to generate hf dataset, still need to check for full reproducibi…
cxzhang4 Nov 5, 2024
9cff991
looks ok with hard-coded cache, still need to test properly
cxzhang4 Nov 8, 2024
46707fe
caching does not seem to work
cxzhang4 Nov 8, 2024
6ec3b21
caching does not work
cxzhang4 Nov 8, 2024
9e8ebb4
manually set a different cache dir
cxzhang4 Nov 10, 2024
a7c655d
using extrasmall version for testing
cxzhang4 Nov 12, 2024
0f9f547
looks like caching works
cxzhang4 Nov 12, 2024
10f5da9
looks ok but download is slooow
cxzhang4 Nov 26, 2024
1252718
looks ok
cxzhang4 Nov 28, 2024
681779e
removed manual cache dirs, references to irrelevant files in gitignor…
cxzhang4 Nov 28, 2024
a4908c0
updated description. using curl, not hfhub
cxzhang4 Nov 28, 2024
3527acd
enable byte compilation
sebffischer Oct 29, 2024
3ead917
Magick to base loader (#299)
cxzhang4 Nov 8, 2024
968c035
update news
sebffischer Nov 8, 2024
ab7bc64
Bump JamesIves/github-pages-deploy-action from 4.6.8 to 4.6.9 (#302)
dependabot[bot] Nov 21, 2024
394de72
improve docs
sebffischer Nov 21, 2024
30697db
feat(tab resnet): allow numeric values for multiplier param
sebffischer Nov 25, 2024
3ce80de
feat(mlp): add n_layers parameter (#307)
sebffischer Nov 26, 2024
e6d1c9f
fix leanification (#306)
sebffischer Nov 26, 2024
d6bbca2
resolved merge conflict'
cxzhang4 Nov 28, 2024
439762c
Merge branch 'main' into melanoma_task
sebffischer Nov 29, 2024
0c259ca
cleanup
cxzhang4 Nov 29, 2024
cf1148e
deleted hfhub testS
cxzhang4 Nov 29, 2024
bc0f7c8
TODO: move lazy tensor construction outside of the cache. Look at tin…
cxzhang4 Nov 29, 2024
07d3fea
tests not working
cxzhang4 Dec 2, 2024
2 changes: 1 addition & 1 deletion .gitignore
@@ -14,4 +14,4 @@ inst/doc
/doc/
/Meta/
CRAN-SUBMISSION
paper/data
paper/data
2 changes: 2 additions & 0 deletions DESCRIPTION
@@ -56,6 +56,7 @@ Imports:
withr
Suggests:
callr,
curl,
future,
ggplot2,
igraph,
@@ -125,6 +126,7 @@ Collate:
'PipeOpTorchReshape.R'
'PipeOpTorchSoftmax.R'
'TaskClassif_lazy_iris.R'
'TaskClassif_melanoma.R'
'TaskClassif_mnist.R'
'TaskClassif_tiny_imagenet.R'
'TorchDescriptor.R'
127 changes: 127 additions & 0 deletions R/TaskClassif_melanoma.R
@@ -0,0 +1,127 @@
#' @title Melanoma Image classification
#' @name mlr_tasks_melanoma
#' @description
#' Classification of melanoma tumor images.
#'
#' The data comes from the 2020 SIIM-ISIC challenge.
#'
#' @section Construction:
#' ```
#' tsk("melanoma")
#' ```
#'
#' @template task_download
#'
#' @source
#' \url{https://challenge2020.isic-archive.com/}
#'
#' @section Properties:
#' `r rd_info_task_torch("melanoma", missings = FALSE)`
#'
#' @references
#' `r format_bib("melanoma2021")`
#' @examples
#' task = tsk("melanoma")
#' task
NULL

# @param path (`character(1)`)\cr
# The cache_dir/datasets/melanoma folder
constructor_melanoma = function(path) {
require_namespaces("curl")

base_url = "https://huggingface.co/datasets/carsonzhang/ISIC_2020_small/resolve/main/"

compressed_tarball_file_name = "hf_ISIC_2020_small.tar.gz"
compressed_tarball_path = file.path(path, compressed_tarball_file_name)
curl::curl_download(paste0(base_url, compressed_tarball_file_name), compressed_tarball_path)
Member:
because curl is in suggests, we should run mlr3misc::require_namespaces("curl") before so users get a good error message when they don't have it installed.

cxzhang4 (Collaborator, Author), Nov 29, 2024:
But we should just write require_namespaces() without the mlr3misc:: right?

Member:
yes!
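For context, the Suggests-guard pattern under discussion can be sketched in base R. `require_ns` below is a hypothetical stand-in name, not the actual `mlr3misc::require_namespaces()` implementation; it only illustrates the failure mode the reviewer wants to avoid.

```r
# Hypothetical minimal version of a Suggests guard: fail early with an
# informative message when an optional package is not installed.
require_ns = function(pkgs) {
  ok = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  if (!all(ok)) {
    stop(sprintf("The following packages could not be loaded: %s",
      paste(pkgs[!ok], collapse = ", ")), call. = FALSE)
  }
  invisible(TRUE)
}

require_ns("utils")  # base package: passes silently
```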

on.exit(file.remove(compressed_tarball_path), add = TRUE)
utils::untar(compressed_tarball_path, exdir = path)

training_metadata_file_name = "ISIC_2020_Training_GroundTruth_v2.csv"
training_metadata = data.table::fread(file.path(path, training_metadata_file_name))

test_metadata_file_name = "ISIC_2020_Test_Metadata.csv"
test_metadata = data.table::fread(file.path(path, test_metadata_file_name))

training_metadata = training_metadata[, split := "train"]
test_metadata = setnames(test_metadata,
old = c("image", "patient", "anatom_site_general"),
new = c("image_name", "patient_id", "anatom_site_general_challenge")
)[, split := "test"]
# response column needs to be filled for the test data
metadata = rbind(training_metadata, test_metadata, fill = TRUE)
Member:
what is being filled here?
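To illustrate what `fill = TRUE` does at this call site (toy tables with hypothetical values, not the real metadata): columns present in only one of the two tables, such as the ground-truth columns missing from the test metadata, are padded with `NA` for the rows that lack them.

```r
library(data.table)

# Toy stand-ins for the two metadata tables: the test table has no
# outcome column, so rbind(fill = TRUE) fills it with NA for that row.
train = data.table(image_name = c("a", "b"), outcome = c("benign", "malignant"), split = "train")
test  = data.table(image_name = "c", split = "test")

combined = rbind(train, test, fill = TRUE)
combined$outcome  # "benign" "malignant" NA
```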

metadata[, image_name := NULL]
metadata[, target := NULL]
metadata = setnames(metadata, old = "benign_malignant", new = "outcome")

melanoma_ds_generator = torch::dataset(
initialize = function() {
self$.metadata = metadata
self$.path = path
},
.getitem = function(idx) {
force(idx)

x = torchvision::base_loader(file.path(self$.path, paste0(self$.metadata[idx, ]$file_name)))
x = torchvision::transform_to_tensor(x)

return(list(x = x))
},
.length = function() {
nrow(self$.metadata)
}
)

melanoma_ds = melanoma_ds_generator()

dd = as_data_descriptor(melanoma_ds, list(x = c(NA, 3, 128, 128)))
lt = lazy_tensor(dd)

return(cbind(metadata, data.table(image = lt)))
Member:
The return value of this is cached. I don't think we should cache the lazy_tensor itself, as we might change some things in the implementation in newer versions. I would only use standard R types as a return value here. The lazy tensor can then be created after the caching. It's not so expensive.

Member:
you can check tiny imagenet for how it's done
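The split the reviewer suggests can be sketched as follows. All names here are hypothetical illustrations, not the actual mlr3torch internals (the tiny imagenet task shows the real pattern): the function whose result is cached returns only plain R types, and the implementation-dependent column is rebuilt after the cache lookup.

```r
# (1) The part whose result gets cached: plain R types only, so the
#     serialized cache stays valid across package versions.
constructor_metadata = function() {
  data.frame(
    file_name = c("img_001.jpg", "img_002.jpg"),
    outcome = c("benign", "malignant")
  )
}

# (2) The part that runs after the cache lookup: rebuild the
#     expensive-to-serialize column (lazy_tensor in the real task) each time.
attach_image_column = function(metadata, build_column) {
  metadata$image = build_column(metadata$file_name)
  metadata
}

# Toy stand-in for lazy_tensor construction, just to keep the sketch runnable.
dt = attach_image_column(constructor_metadata(), function(files) as.list(files))
names(dt)  # "file_name" "outcome" "image"
```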

}

load_task_melanoma = function(id = "melanoma") {
cached_constructor = function(backend) {
data = cached(constructor_melanoma, "datasets", "melanoma")$data

data[, outcome := factor(outcome, levels = c("benign", "malignant"))]

char_features = c("sex", "anatom_site_general_challenge")
data[, (char_features) := lapply(.SD, factor), .SDcols = char_features]

dt = cbind(
data,
data.table(
..row_id = seq_len(nrow(data))
)
)

DataBackendDataTable$new(data = dt, primary_key = "..row_id")
}

backend = DataBackendLazy$new(
constructor = cached_constructor,
rownames = seq_len(32701 + 10982),
col_info = load_col_info("melanoma"),
primary_key = "..row_id"
)

task = TaskClassif$new(
backend = backend,
id = "melanoma",
target = "outcome",
label = "Melanoma Classification"
)

task$set_col_roles("patient_id", "group")
task$col_roles$feature = c("sex", "anatom_site_general_challenge", "age_approx", "image")

backend$hash = task$man = "mlr3torch::mlr_tasks_melanoma"

task$filter(1:32701)

return(task)
}

register_task("melanoma", load_task_melanoma)
9 changes: 9 additions & 0 deletions R/bibentries.R
@@ -112,6 +112,15 @@ bibentries = c(# nolint start
booktitle = "Proceedings of the IEEE conference on computer vision and pattern recognition ",
pages = "2818--2826 ",
year = "2016 "
),
melanoma2021 = bibentry("article",
title = "A patient-centric dataset of images and metadata for identifying melanomas using clinical context",
author = "Rotemberg, V. and Kurtansky, N. and Betz-Stablein, B. and Caffery, L. and Chousakos, E. and Codella, N. and Combalia, M. and Dusza, S. and Guitera, P. and Gutman, D. and Halpern, A. and Helba, B. and Kittler, H. and Kose, K. and Langer, S. and Lioprys, K. and Malvehy, J. and Musthaq, S. and Nanda, J. and Reiter, O. and Shih, G. and Stratigos, A. and Tschandl, P. and Weber, J. and Soyer, P.",
journal = "Scientific Data",
volume = "8",
pages = "34",
year = "2021",
doi = "10.1038/s41597-021-00815-z"
)
) # nolint end

11 changes: 0 additions & 11 deletions benchmarks/dataset.R

This file was deleted.

63 changes: 63 additions & 0 deletions benchmarks/image_loaders/benchmark_image_loaders.R
@@ -0,0 +1,63 @@
library(torch)
library(torchvision)
library(mlr3torch)
library(here)

library(data.table)
setDTthreads(threads = 1)

training_metadata = fread(here::here("cache", "ISIC_2020_Training_GroundTruth.csv"))

# hard-coded cache directory that I use locally
cache_dir = here("cache")

ds_base_loader = torch::dataset(
initialize = function(n_images) {
self$.metadata = fread(here(cache_dir, "ISIC_2020_Training_GroundTruth.csv"))[1:n_images, ]
self$.path = file.path(here(cache_dir), "train")
},
.getitem = function(idx) {
force(idx)

x = torchvision::base_loader(file.path(self$.path, paste0(self$.metadata[idx, ]$image_name, ".jpg")))
x = torchvision::transform_to_tensor(x)

return(list(x = x))
},
.length = function() {
nrow(self$.metadata)
}
)

ds_magick_loader = torch::dataset(
initialize = function(n_images) {
self$.metadata = fread(here(cache_dir, "ISIC_2020_Training_GroundTruth.csv"))[1:n_images, ]
self$.path = file.path(here(cache_dir), "train")
},
.getitem = function(idx) {
force(idx)

image_name = self$.metadata[idx, ]$image_name

x = magick::image_read(file.path(self$.path, paste0(image_name, ".jpg")))
x = torchvision::transform_to_tensor(x)

return(list(x = x, image_name = image_name))
},
.length = function() {
nrow(self$.metadata)
}
)

n_images = 10

ds_base = ds_base_loader(n_images)
ds_magick = ds_magick_loader(n_images)

bmr = bench::mark(
for (i in 1:n_images) ds_base$.getitem(i),
for (i in 1:n_images) ds_magick$.getitem(i),
memory = FALSE
)

print(bmr)
30 changes: 0 additions & 30 deletions benchmarks/merge.R

This file was deleted.

80 changes: 80 additions & 0 deletions data-raw/melanoma.R
@@ -0,0 +1,80 @@
devtools::load_all()

# manually construct the task once
# library(here)
# library(data.table)
library(data.table)
withr::local_options(mlr3torch.cache = TRUE)

constructor_melanoma = function(path) {
require_namespaces("curl")

base_url = "https://huggingface.co/datasets/carsonzhang/ISIC_2020_small/resolve/main/"

compressed_tarball_file_name = "hf_ISIC_2020_small.tar.gz"
compressed_tarball_path = file.path(path, compressed_tarball_file_name)
curl::curl_download(paste0(base_url, compressed_tarball_file_name), compressed_tarball_path)
on.exit(file.remove(compressed_tarball_path), add = TRUE)
utils::untar(compressed_tarball_path, exdir = path)

training_metadata_file_name = "ISIC_2020_Training_GroundTruth_v2.csv"
training_metadata = fread(file.path(path, training_metadata_file_name))

test_metadata_file_name = "ISIC_2020_Test_Metadata.csv"
test_metadata = fread(file.path(path, test_metadata_file_name))

training_metadata = training_metadata[, split := "train"]
test_metadata = setnames(test_metadata,
old = c("image", "patient", "anatom_site_general"),
new = c("image_name", "patient_id", "anatom_site_general_challenge")
)[, split := "test"]
# response column needs to be filled for the test data
metadata = rbind(training_metadata, test_metadata, fill = TRUE)
metadata[, image_name := NULL]
metadata[, target := NULL]
metadata = setnames(metadata, old = "benign_malignant", new = "outcome")

melanoma_ds_generator = torch::dataset(
initialize = function() {
self$.metadata = metadata
self$.path = path
},
.getitem = function(idx) {
force(idx)

x = torchvision::base_loader(file.path(self$.path, paste0(self$.metadata[idx, ]$file_name)))
x = torchvision::transform_to_tensor(x)

return(list(x = x))
},
.length = function() {
nrow(self$.metadata)
}
)

melanoma_ds = melanoma_ds_generator()

dd = as_data_descriptor(melanoma_ds, list(x = c(NA, 3, 128, 128)))
lt = lazy_tensor(dd)

return(cbind(metadata, data.table(image = lt)))
}

bench::system_time(melanoma_dt <- constructor_melanoma(file.path(get_cache_dir(), "datasets", "melanoma")))
# melanoma_dt = constructor_melanoma(file.path(get_cache_dir(), "datasets", "melanoma"))

# change the encodings of variables: diagnosis, outcome
melanoma_dt[, outcome := factor(outcome, levels = c("benign", "malignant"))]

char_features = c("sex", "anatom_site_general_challenge")
melanoma_dt[, (char_features) := lapply(.SD, factor), .SDcols = char_features]

tsk_melanoma = as_task_classif(melanoma_dt, target = "outcome", id = "melanoma")
tsk_melanoma$set_col_roles("patient_id", "group")
tsk_melanoma$col_roles$feature = c(char_features, "age_approx", "image")

tsk_melanoma$label = "Melanoma Classification"

ci = col_info(tsk_melanoma$backend)

saveRDS(ci, here::here("inst/col_info/melanoma.rds"))
10 changes: 1 addition & 9 deletions data-raw/tiny_imagenet.R
@@ -2,12 +2,4 @@ devtools::load_all()

ci = col_info(get_private(tsk("tiny_imagenet")$backend)$.constructor())

saveRDS(ci, here::here("inst/col_info/tiny_imagenet.rds"))

mlr3:::DataBackendCbind$new(c)


split = factor(rep(c("train", "valid", "test"), times = c(100000, 10000, 10000)))

ci = rbind(ci, data.table(id = "split", type = "factor", levels = levels(split)))
setkeyv(ci)
saveRDS(ci, here::here("inst/col_info/tiny_imagenet.rds"))
Binary file added inst/col_info/melanoma.rds
Binary file not shown.