90-ccl.Rmd

# Conclusions {#sec-ccl}

In this course, we have seen the importance of structured data. Good
data structure starts with simple, tidy tabular data, whether it is
manually encoded in spreadsheet, or handled in R as dataframes or
tibbles. More complex data, that doesn't fit in tabular data, can be
modelled into dedicated objects that display specialised
behaviour. Structured data allows us to reason on that data, without
having to look it at. Reasoning on and generalisation of data in turn
allows to manipulate and visualise it, i.e. to explore, analyse and
understand it. The cherry on top of the data analysis cake is to be
able to reproduce an analysis, either oneself or share it in a way
that others can.

As mentioned in the preamble, the goal of this course is obviously not
for students that take it to qualify as bioinformaticians at the
end. However, what is important is to appreciate the importance of
data and their analysis, and to become fluent in exploring, discussing
and communicating around data. A shared appreciation of data and their
complexity will hopefully reduce the distinction between
bioinformaticians and experimental scientists. Indeed, at the end of
the day, it's useful to remember that

> We are all biologists, in that we study biology. Some use wet lab
> experiments, others dry lab techniques.

## Next steps

- Statistics and machine learning (see your statistics courses and the
  follow course
  [WSBIM1322](https://github.com/UCLouvain-CBIO/WSBIM1322)).
- Getting better at programming and data analysis. See
  [@r4ds:2017] and [@advancedR].
- Evolving scripts into tools/packages [@rpkgs:2015].
- Other tools: unix command line and git/GitHub [@Perez-Riverol:2016].
  See also this [short
  tutorial](https://lgatto.github.io/github-intro/).
- Omics data analysis (see upcoming
  [WSBIM2122](https://github.com/UCLouvain-CBIO/WSBIM2122) course).


## Additional exercises

To answer the following exercises, you'll need to resort to what you
have learnt in various chapters.

`r msmbstyle::question_begin()`

Make sure you have `rWSBIM1207` version >= 0.1.16 and load the 2022
Belgian road accidents statistics and the associated metadata,
describing the variables. The path to the former as an `rds` file is
available with `road_accidents_be_2022.rds()`. The
`road_accidents_be_meta.csv()` returns the path to the metadata csv
file.

The data provides the Number of killed, seriously injured, slightly
injured and uninjured victims of road accidents, by age group, type of
user, sex and various characteristics of the accident in Belgium in
20222.

- Using the appropriate functions, load both files into R and
  familiarise yourself with the data.

- Visualise the numbers for men and women over the hours of the day
  for all age classes. Ignore any unknown information. Do you see a
  difference between man and women?

- Visualise the number of victims in the different provinces. Do this
  comparison for the different type of victims. Ignore any unknown
  information. Use lines and points for this visualisation.

- Come up with additional visualisations that you could produce with
  these data. Use bar plots for this visualisation.

`r msmbstyle::question_end()`

```{r accidents, echo=FALSE, include=FALSE}
library(rWSBIM1207)
library(tidyverse)
x <- readRDS(road_accidents_be_2022.rds())

## Compare the numbers for man and women over the hours of the day for
## all age classes.
dplyr::count(x, TX_SEX_DESCR_FR, DT_HOUR, TX_AGE_CLS_DESCR_FR) |>
    filter(TX_SEX_DESCR_FR != "Inconnu",
           TX_AGE_CLS_DESCR_FR != "Inconnu") |>
    ggplot(aes(x = DT_HOUR, y = n, colour = TX_SEX_DESCR_FR)) +
    geom_point() +
    geom_line() +
    facet_wrap(~ TX_AGE_CLS_DESCR_FR)


## Compare the number of victims in the different provinces. Do this
## comparison for the different type of victims.
dplyr::count(x, TX_VICT_TYPE_DESCR_FR, TX_PROV_DESCR_FR) |>
    filter(TX_VICT_TYPE_DESCR_FR != "Inconnu",
           TX_PROV_DESCR_FR != " ") |>
    ggplot(aes(x = TX_VICT_TYPE_DESCR_FR, y = n)) +
    geom_bar(stat = "identity") +
    facet_wrap(~ TX_PROV_DESCR_FR) +
    theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5))


res <- filter(x, TX_VICT_TYPE_DESCR_FR != "Inconnu", TX_PROV_DESCR_FR != " ") |>
    group_by(TX_VICT_TYPE_DESCR_FR, TX_PROV_DESCR_FR) |>
    summarise(m = sum(MS_VICT), n = n())

## Dans mon code, j'ai réalisé un group_by et un summarize (sum())
## parce que si on regarde le nombre de victimes, on peut observer
## qu'il y a parfois 2 victimes pour le même "type de victime" pour la
## même date.  Je suppose qu'avec un summarize(n()), on risque de
## perdre ces informations? Ou alors, j'ai mal compris les données, ou
## la question?

pm <- res |>
    ggplot(aes(x = TX_PROV_DESCR_FR, y = m)) +
    geom_point() + geom_line(aes(group = TX_VICT_TYPE_DESCR_FR)) +
    facet_wrap(~ TX_VICT_TYPE_DESCR_FR, scale = "free_y") +
    ggtitle("sum()") +
    theme(axis.text.x = element_text(angle = 90))

pn <- res |>
    ggplot(aes(x = TX_PROV_DESCR_FR, y = n)) +
    geom_point() + geom_line(aes(group = TX_VICT_TYPE_DESCR_FR)) +
    facet_wrap(~ TX_VICT_TYPE_DESCR_FR, scale = "free_y") +
    ggtitle("sum()") +
    theme(axis.text.x = element_text(angle = 90))

pn + pm

```

`r msmbstyle::question_begin()`

For this exercise, make sure you have `rWSBIM1207` version >=
0.1.17. Using the `population_be.csv()` function, get the path to
`r length(rWSBIM1207::population_be.csv())` files with the population
numbers across multiple regions of Belgium up to 2023.

**Tip:** given that all files contain equivalent data (i.e. for the
same variables), you can use `read_csv()` to load all files as once
into a long table.

- How many regions have been survey over the years?

- What differences are there in terms of region?

Focusing on Belgium and the Brussels, Walloon and Flemish regions, and
from 1991 on, generate one figure (possibly with multiple facets) that
allows to answer the following questions:

- Has the population increased since 1991?

- Are there more women or men living in these regions? Have this
  changed since 1991?

- Have the changes in population been driven by men, women or both
  equally?

- What region has the biggest (lowest) population?

`r msmbstyle::question_end()`

```{r pop, echo=FALSE, include=FALSE}
library(tidyverse)
library(rWSBIM1207)

x <- read_csv(population_be.csv())


## How many regions have been survey over the years?
dplyr::count(x, annee)

## What differences are there in terms of region?
dplyr::count(x, annee) |>
   pull(n) |>
   unique()

x2023 <- filter(x, annee == 2023) |>
  select(lieu_de_residence, total) |>
  dplyr::rename("total2023" = "total")

x2018 <- filter(x, annee == 2018) |>
  select(lieu_de_residence, total) |>
  dplyr::rename("total2018" = "total")

## unique 2023
full_join(x2023, x2018) |> filter(is.na(total2018))

## unique 2018
full_join(x2023, x2018) |> filter(is.na(total2023))

## Focusing on Belgium and the Brussels, Walloon and Flemish regions, and
## from 1991 on, generate one figure (possibly with multiple facets) that
## allows to answer the following questions:

lieux <- x[[2]][1:4]

x |>
    pivot_longer(names_to = "variable",
                 values_to = "value",
                 3:5) |>
    filter(lieu_de_residence %in% lieux) |>
    filter(annee > 1990) |>
    ggplot(aes(x = annee, y = value,
               colour = variable)) +
    geom_line() +
    facet_grid(. ~ lieu_de_residence)

x |>
    pivot_longer(names_to = "variable",
                 values_to = "value",
                 3:5) |>
    filter(lieu_de_residence %in% lieux) |>
    filter(annee > 1990) |>
    ggplot(aes(x = annee, y = value,
               colour = variable)) +
    geom_line() +
    facet_wrap(~ lieu_de_residence, scale = "free_y")
```

`r msmbstyle::question_begin()`

Here, we are going to examining the evolution of bankruptcies in
different region in Belgium between 2005 and 2023. The [original
data](https://statbel.fgov.be/fr/open-data/evolution-mensuelle-des-faillites-par-nace)
are available on the Belgian office for statistics, statbel, webpage.

Two files are needed:

-  The fill dataset, `TF_BANKRUPTCIES.zip`, contains over 175000
   records and is relatively large. Instead, you can use
   `TF_BANKRUPTCIES_subset.txt` that contains just over 10000 records,
   and made available through the `rWSBIM1207` package (see below).

- The `Method_BANKRUPTCIES.xlsx` file describes the variables in the
   bankruptcies dataset. It can be either downloaded from the [statbel
   page](https://statbel.fgov.be/fr/open-data/evolution-mensuelle-des-faillites-par-nace),
   or read from the `rWSBIM1207` package (see below). Note that you
   are not asked to load the xlsx file in R (although you can with
   `readxl::read_xlsx()`); feel free to open it with your favourite
   spreadsheet programme.

You will see that there are some minor discrepancies between the
metadata file and the data variables, such as the `CD_WEEK` variable,
that is documented in the metadata file, but absent from the actual
data. The other, non-documented, time-related variables are
self-explanatory.

The (compressed) `TF_BANKRUPTCIES_subset.txt` and
`Method_BANKRUPTCIES.xlsx` are available in the `rWSBIM1207` package
(version >= 0.1.19). Their paths is available with the
`faillites_be()` function.

You are asked to:

- Visualise the number of bankruptcies over time in the different
  regions. Make sure to remove observations that don't have any
  region. Note that the dates are recorded in a different format as
  what we have seen so far, but the usual conversion functions still
  apply.

- The improve the readability of these busy plots, computer the number
  of bankruptcies per year and repeat the above visualisation.

- We expect that bankruptcies of large companies will lead to more
  (full-time) unemployment. Check this hypothesis by (1) compute and
  display a table of average job losses by company size, and (2)
  produce a figure that uses the distribution of job losses (as
  opposed as only the averages). To facilitate the interpretation,
  order the companies by their size, in the table and figure.


`r msmbstyle::question_end()`

```{r faillites, include=FALSE, echo=FALSE}
suppressPackageStartupMessages({
    library(tidyverse)
    library(rWSBIM1207)
})

x <- read_delim(faillites_be()[1], delim = "|") |>
    mutate(time = dmy(CD_WEEK_START))

## Q1 fig
filter(x, !is.na(TX_PROV_DESCR_FR)) |>
    ggplot(aes(x = time,
               y = MS_COUNTOF_BANKRUPTCIES)) +
    geom_point() +
    facet_wrap(~ TX_PROV_DESCR_FR)


## Q2 fig
filter(x, !is.na(TX_PROV_DESCR_FR)) |>
    group_by(CD_YEAR, TX_PROV_DESCR_FR) |>
    summarise(faillites = sum(MS_COUNTOF_BANKRUPTCIES)) |>
    ggplot(aes(x = CD_YEAR,
               y = faillites)) +
    geom_point() +
    facet_wrap(~ TX_PROV_DESCR_FR)

## Q3 table
group_by(x, TX_EMPLOYMENT_CLASS_DESCR_FR) |>
    summarise(n = mean(MS_COUNTOF_FULL_TIME_WORKERS)) |>
    arrange(n)

## Q3 fig
ggplot(x, aes(x = MS_COUNTOF_FULL_TIME_WORKERS,
              y = TX_EMPLOYMENT_CLASS_DESCR_FR)) +
    geom_boxplot()

ggplot(x, aes(x = MS_COUNTOF_FULL_TIME_WORKERS,
              y = factor(CD_EMPLOYMENT_CLASS))) +
    geom_boxplot()

k <- select(x, CD_EMPLOYMENT_CLASS,
       TX_EMPLOYMENT_CLASS_DESCR_FR) |>
    unique() |>
    arrange(CD_EMPLOYMENT_CLASS)
k


x[["TX_EMPLOYMENT_CLASS_DESCR_FR"]] <-
    factor(x[["TX_EMPLOYMENT_CLASS_DESCR_FR"]],
       levels = k[[2]])

ggplot(x, aes(y = TX_EMPLOYMENT_CLASS_DESCR_FR,
              x = MS_COUNTOF_FULL_TIME_WORKERS)) +
    geom_boxplot()
```

`r msmbstyle::question_begin()`

This question was contributed by Mr William Vanloo in August 2024.

We will explore the road accidents that occurred in Belgium in 2023,
as available from the
[Statbel
website](https://statbel.fgov.be/fr/open-data/accidents-de-la-circulation-2023): the `.txt` file contains the dataset and the `.xlsx` file
contains the metadata describing the different variables.

- Among this data set, visualise the number of accidents occurring in
  different light conditions by severity of injury in the province of
  Liège over time. Be sure to remove the 'not available' data.


- Visualise the number of accidents in the different provinces during
  the month of January according to the different light conditions and
  only for fatal accidents.

`r msmbstyle::question_end()`


```{r accidents2023, include=FALSE, echo=FALSE}
tdir <- tempdir()
url <- "https://statbel.fgov.be/sites/default/files/files/opendata/Verkeersongevallen/TF_ACCIDENTS_2023.zip"
dest <- file.path(tdir, "TF_ACCIDENTS_2023.zip")
download.file(url, dest)

acc <- read_delim(dest)

acc |>
  select(DT_DAY,TX_LIGHT_COND_DESCR_FR,TX_PROV_DESCR_FR,MS_ACCT_WITH_MORY_INJ,
                       MS_ACCT_WITH_SERLY_INJ,MS_ACCT_WITH_SLY_INJ) |>
  filter(TX_PROV_DESCR_FR== "Province de Liège",
         !TX_LIGHT_COND_DESCR_FR == "Non disponible" ) |>
  pivot_longer(names_to = "injury_type",
               values_to = "ncase",
               -c(1:3)) |>
  group_by(DT_DAY,TX_LIGHT_COND_DESCR_FR,injury_type) |>
  summarise(sumcase=sum(ncase)) |>
  arrange(sumcase) |>
  ggplot(aes(x = DT_DAY,
             y = sumcase,
             color= injury_type)) +
  geom_line() +
  facet_wrap(~TX_LIGHT_COND_DESCR_FR, scales = "free_y") +
  theme(axis.text.x = element_text(angle = 90,hjust = 0.5, vjust = 0.5))


acc |>
  select(DT_DAY,TX_LIGHT_COND_DESCR_FR,TX_PROV_DESCR_FR,
         MS_ACCT_WITH_DEAD) |>
  filter(!TX_LIGHT_COND_DESCR_FR == "Non disponible") |>
  group_by(DT_DAY,TX_LIGHT_COND_DESCR_FR,TX_PROV_DESCR_FR) |>
  summarise(sumcase=sum(MS_ACCT_WITH_DEAD)) |>
  arrange(sumcase) |>
  filter(grepl("2023-01",DT_DAY)) |>
  ggplot(aes( x= DT_DAY,
             y = sumcase,
             color = TX_LIGHT_COND_DESCR_FR)) +
  geom_line()+
  facet_wrap(~TX_PROV_DESCR_FR, scales = "free_y") +
    theme(axis.text.x = element_text(angle = 90,hjust = 0.5, vjust = 0.5))
```