library(targets)
library(myPackage)
diff --git a/Building-Reproducible-Analytical-Pipelines.epub b/Building-Reproducible-Analytical-Pipelines.epub
index 90bb242..33156db 100644
Binary files a/Building-Reproducible-Analytical-Pipelines.epub and b/Building-Reproducible-Analytical-Pipelines.epub differ
diff --git a/Building-Reproducible-Analytical-Pipelines.pdf b/Building-Reproducible-Analytical-Pipelines.pdf
index beb9934..cb30e2c 100644
Binary files a/Building-Reproducible-Analytical-Pipelines.pdf and b/Building-Reproducible-Analytical-Pipelines.pdf differ
diff --git a/search.json b/search.json
index 79bce2c..21296aa 100644
--- a/search.json
+++ b/search.json
@@ -494,7 +494,7 @@
"href": "08-products.html#interactive-web-applications-with-shiny",
"title": "7 Data products",
"section": "7.4 Interactive web applications with {shiny}",
- "text": "7.4 Interactive web applications with {shiny}\n{shiny} is a package developed by Posit to build interactive web applications. These apps can be quite “simple” (for example, an app that shows a graph but in which the user can choose the variable to plot), but can be arbitrarily complex. Some people even go as far as making games with {shiny}. A version for Python is also in alpha, and you can already experiment with it.\nIn this section, I will give a very, very short introduction to {shiny}. This is because {shiny} is so feature-rich that I could spend 20 hours teaching you and even then we would not have seen everything. That being said, with only some cursory knowledge we can build some useful apps. These apps can run locally on your machine, but they’re really only useful if you deploy them on a server, so that users can then use these web apps in their browsers.\n\n7.4.1 The basic structure of a Shiny app\nShiny apps are always made of at least 2 parts: a server and a ui. In general, each of these parts is in a separate script, called server.R and ui.R. It is possible to have another script, called global.R, where you can define variables that you want to be available for both the server and the ui, and to every user of your app.\nLet’s start by building a very basic app. This app will allow users to visualize unemployment data for Luxembourg. For now, let’s say that we want users only to be able to select communes, but not variables. The example code below is based on this official example (this is how I recommend you learn, by the way. Take a look at the different examples there are and adapt them to suit your needs! You can find the examples here). Create a folder called something like my_app and then create three scripts in it:\n\nglobal.R\nserver.R\nui.R\n\nLet’s start with global.R:\n\nlibrary(myPackage)\nlibrary(dplyr)\nlibrary(ggplot2)\n\ndata(\"unemp\")\n\nIn the global.R file, we load the required packages and data. 
This is now available everywhere. Let’s continue with the server.R script:\n\nserver <- function(session, input, output) {\n\n filtered_data <- reactive(\n unemp %>%\n filter(place_name %in% input$place_name_selected)\n )\n\n output$unemp_plot <- renderPlot({\n\n ggplot(data = filtered_data()) +\n theme_minimal() +\n geom_line(aes(year, unemployment_rate_in_percent, color = place_name)) +\n labs(title = paste(\"Unemployment in\", paste(input$place_name_selected, collapse = \", \")))\n\n\n })\n}\n\nSeveral things need to be commented on here: first, the script contains a single function, called server(). This function takes three arguments, session, input and output. I won’t go into details here, but you should know that you will never call the server() function yourself, and that these arguments are required so the function can… function. I will leave a reference at the end of this section with more details. The next important thing is that we defined an object called filtered_data. This is a reactive object. What this means is that this object should get recomputed every time the user interacts with it. But how does the user interact with it? By choosing the place_name he or she wants to see! The predicate inside filter() is place_name %in% input$place_name_selected. Where does that input$place_name_selected come from? This comes from the ui (that we have not written yet). But the idea is that the user will be able to choose place names from a list, and this list will be called place_name_selected and will contain the place names that the user wants to see.\nFinally, we define a new object called output$unemp_plot. The goal of the server() function is to compute things that will be part of the output list. This list, and the objects it contains, then get rendered in the ui. unemp_plot is a ggplot graph that uses the reactive data set we defined first. Notice the () after filtered_data inside the ggplot call. 
These are required; this is how we say that the reactive object must be recomputed. If the plot does not get rendered, the reactive data set does not get computed, since it never gets called.\nOk so now to the ui. Let’s take inspiration from the same example again:\n\nui <- function(request){\n fluidPage(\n\n titlePanel(\"Unemployment in Luxembourg\"),\n\n sidebarLayout(\n\n sidebarPanel(\n selectizeInput(\"place_name_selected\", \"Select place:\",\n choices=unique(unemp$place_name),\n multiple = TRUE,\n selected = c(\"Rumelange\", \"Dudelange\"),\n options = list(\n plugins = list(\"remove_button\"),\n create = TRUE,\n persist = FALSE # keep created choices in dropdown\n )\n ),\n hr(),\n helpText(\"Original data from STATEC\")\n ),\n\n mainPanel(\n plotOutput(\"unemp_plot\")\n )\n )\n )\n\n}\n\nI’ve added some useful things to the ui. First of all, I made it a function of an argument, request. This is useful for bookmarking the state of the app. We’ll add a bookmark button later. The ui is divided into two parts, a sidebar panel, and a main panel. The sidebar panel is where you will typically add dropdown menus, checkboxes, radio buttons, etc, for the users to make various selections. In the main panel, you will show the result of their selections. In the sidebar panel I add a selectizeInput() to create a dynamic dropdown list using the selectize JS library, included with {shiny}. The available choices are all the unique place names contained in our data; I allow users to select multiple place names; by default two communes are selected; and using the options argument I add little “remove” buttons to the selected commune names. Finally, in the main panel I use the plotOutput() function to render the plot. Notice that I use the name of the plot defined in the server, “unemp_plot”. 
Finally, to run this, add a new script, called app.R and add the following line in it:\n\nshiny::runApp(\".\")\n\nYou can now run this script in RStudio, or from any R console, and this should open a web browser with your app.\n\n\n\n\n\nBelieve it or not, but this app contains almost every ingredient you need to know to build shiny apps. But of course, there are many, many other widgets that you can use to give your users even more ways to interact with applications.\n\n\n7.4.2 Slightly more advanced shiny\nLet’s take a look at another, more complex example. Because this second example is much more complex, let’s first take a look at a video of the app in action:\n\n\n\n\n\nThe global file will be almost the same as before:\n\nlibrary(myPackage)\nlibrary(dplyr)\nlibrary(ggplot2)\nlibrary(g2r)\n\ndata(\"unemp\")\n\nenableBookmarking(store = \"url\")\n\nThe only difference is that I load the {g2r} package to create a nice interactive plot, and enable bookmarking of the state of the app using enableBookmarking(store = \"url\"). 
Let’s move on to the ui:\n\nui <- function(request){\n fluidPage(\n\n titlePanel(\"Unemployment in Luxembourg\"),\n\n sidebarLayout(\n\n sidebarPanel(\n selectizeInput(\"place_name_selected\", \"Select place:\",\n choices=unique(unemp$place_name),\n multiple = TRUE,\n selected = c(\"Rumelange\", \"Dudelange\"),\n options = list(\n plugins = list(\"remove_button\"),\n create = TRUE,\n persist = FALSE # keep created choices in dropdown\n )\n ),\n hr(),\n # To allow users to select the variable, we add a selectInput\n # (not selectizeInput, like above)\n # don’t forget to use input$variable_selected in the\n # server function later!\n selectInput(\"variable_selected\", \"Select variable to plot:\",\n choices = setdiff(unique(colnames(unemp)), c(\"year\", \"place_name\", \"level\")),\n multiple = FALSE,\n selected = \"unemployment_rate_in_percent\",\n ),\n hr(),\n # Just for illustration purposes, these radioButtons will not be bound\n # to the actionButton.\n radioButtons(inputId = \"legend_position\",\n label = \"Choose legend position\",\n choices = c(\"top\", \"bottom\", \"left\", \"right\"),\n selected = \"right\",\n inline = TRUE),\n hr(),\n actionButton(inputId = \"render_plot\",\n label = \"Click here to generate plot\"),\n hr(),\n helpText(\"Original data from STATEC\"),\n hr(),\n bookmarkButton()\n ),\n\n mainPanel(\n # We add a tabsetPanel with two tabs. The first tab show\n # the plot made using ggplot the second tab\n # shows the plot using g2r\n tabsetPanel(\n tabPanel(\"ggplot version\", plotOutput(\"unemp_plot\")),\n tabPanel(\"g2r version\", g2Output(\"unemp_plot_g2r\"))\n )\n )\n )\n )\n\n}\n\nThere are many new things. Everything is explained in the comments within the script itself so take a look at them. What’s important to notice, is that I now added two buttons, an action button, and a bookmark button. The action button will be used to draw the plots. 
This means that the user will choose the options for the plot, and then the plot will only appear once the user clicks on the button. This is quite useful in cases where computations take time to run, and you don’t want every reactive object to get recomputed as soon as the user interacts with the app. This way, only once every selection has been made can the user give the green light to the app to compute everything.\nAt the bottom of the ui you’ll see that I’ve added a tabsetPanel() with some tabPanel()s. This is where the graphs “live”. Let’s move on to the server script:\n\nserver <- function(session, input, output) {\n\n # Because I want the plots to only render once the user clicks the \n # actionButton, I need to move every interactive, or reactive, element into\n # an eventReactive() function. eventReactive() waits for something to \"happen\"\n # in order to let the reactive variables run. If you don’t do that, then\n # when the user interacts with the app, these reactive variables will run\n # which we do not want.\n\n # Data only gets filtered once the user clicks on the actionButton\n filtered_data <- eventReactive(input$render_plot, {\n unemp %>%\n filter(place_name %in% input$place_name_selected)\n })\n\n\n # The variable the user selects gets passed down to the plot only once the user\n # clicks on the actionButton.\n # If you don’t do this, what will happen is that the variable will then update the plot\n # even when the user does not click on the actionButton\n variable_selected <- eventReactive(input$render_plot, {\n input$variable_selected\n })\n\n # The plot title only gets generated once the user clicks on the actionButton\n # If you don’t do this, what will happen is that the title of the plot will get\n # updated even when the user does not click on the actionButton\n plot_title <- eventReactive(input$render_plot, {\n paste(variable_selected(), \"for\", paste(input$place_name_selected, collapse = \", \"))\n })\n\n output$unemp_plot <- 
renderPlot({\n\n ggplot(data = filtered_data()) +\n theme_minimal() +\n # Because the selected variable is a string, we need to convert it to a symbol\n # using rlang::sym and evaluate it using !!. This is because the aes() function\n # expects bare variable names, and not strings.\n # Because this is something that developers have to use often in shiny apps,\n # there is a version of aes() that works with strings, called aes_string()\n # You can use both approaches interchangeably.\n #geom_line(aes(year, !!rlang::sym(variable_selected()), color = place_name)) +\n geom_line(aes_string(\"year\", variable_selected(), color = \"place_name\")) +\n labs(title = plot_title()) +\n theme(legend.position = input$legend_position)\n\n\n })\n\n output$unemp_plot_g2r <- renderG2({\n\n g2(data = filtered_data()) %>%\n # g2r’s asp() requires bare variable names\n fig_line(asp(year, !!rlang::sym(variable_selected()), color = place_name)) %>%\n # For some reason, the title does not show...\n subject(plot_title()) %>%\n legend_color(position = input$legend_position)\n\n })\n}\n\nWhat’s new here is that I now must redefine the reactive objects in such a way that they only get run once the user clicks the button. This is why every reactive object (but one, the position of the legend) is now wrapped by eventReactive(). eventReactive() waits for a trigger, in this case the clicking of the action button, to run the reactive object. eventReactive() takes the action button ID as an input. I’ve also defined the plot title as a reactive value, not only the dataset as before, because if I didn’t, the title of the plot would get updated as soon as the user chose other communes, but the contents of the plot, which depend on the data, would not. To keep the title and the plot from getting out of sync, I also need to wrap the title in eventReactive(). You can see the difference in behaviour by changing the legend position. 
The legend position gets updated without the user needing to click the button. This is because I have not wrapped the legend position inside eventReactive().\nFinally, I keep the {ggplot2} graph, but also remake it using {g2r}, to illustrate how it works inside a Shiny app.\nTo conclude this section, we will take a look at one last app. This app will allow users to do data aggregation on a relatively large dataset, so computations will take some time. The app will illustrate how to best deal with this.\n\n\n7.4.3 Basic optimization of Shiny apps\nThe app we will build now requires what is sometimes referred to as medium-size data. Medium-size data is data that is far from being big data, but already big enough that handling it requires some thought, especially in this scenario. What we want to do is build an app that will allow users to do some aggregations on this data. Because the size of the data is not trivial, these computations will take some time to run. So we need to think about certain strategies to avoid frustrating our users. The file we will be using can be downloaded from here. We’re not going to use the exact same data set though, I have prepared a smaller version that will be more than enough for our purposes. But the strategies that we are going to implement here will also work for the original, much larger, dataset. You can get the smaller version here. Uncompressed it’ll be a 2.4GB file. Not big data in any sense, but big enough to be annoying to handle without the use of some optimization strategies.\nOne such strategy is only letting the computations run once the user gives the green light by clicking on an action button. This is what we have seen in the previous example. The next obvious strategy is to use packages that are optimized for speed. It turns out that the functions we have seen until now, from packages like {dplyr} and the like, are not the fastest. Their ease of use and expressiveness come at a speed cost. 
So we will need to switch to something faster. We will do the same to read in the data.\nThis faster solution is the {arrow} package, which is an interface to the Arrow software developed by Apache.\nThe final strategy is to enable caching in the app.\nSo first, install the {arrow} package by running install.packages(\"arrow\"). This will compile libarrow from source on Linux and might take some time, so perhaps go grab a coffee.\nBefore building the app, let me perform a very simple benchmark. The script below reads in the data, then performs some aggregations. This is done using standard {tidyverse} functions, but also using {arrow}:\n\nstart_tidy <- Sys.time()\n # {vroom} is able to read in larger files than {readr}\n # I could not get this file into R using readr::read_csv\n # my RAM would get maxed out\n air <- vroom::vroom(\"data/combined\")\n\n mean_dep_delay <- air |>\n dplyr::group_by(Year, Month, DayofMonth) |>\n dplyr::summarise(mean_delay = mean(DepDelay, na.rm = TRUE))\nend_tidy <- Sys.time()\n\ntime_tidy <- end_tidy - start_tidy\n\n\nstart_arrow <- Sys.time()\n air <- arrow::open_dataset(\"data/combined\", format = \"csv\")\n\n mean_dep_delay <- air |>\n dplyr::group_by(Year, Month, DayofMonth) |>\n dplyr::summarise(mean_delay = mean(DepDelay, na.rm = TRUE))\nend_arrow <- Sys.time()\n\nend_tidy - start_tidy\nend_arrow - start_arrow\n\nThe “tidy” approach took 17 seconds, while the arrow approach took 6 seconds. This is an impressive improvement, but put yourself in the shoes of a user who has to wait 6 seconds for each query. That would get very annoying, very quickly. So the other strategy that we will use is to provide some visual cue that computations are running, and then we will go one step further and use caching of results in the Shiny app.\nBut before we continue, you may be confused by the code above. 
After all, I told you before that functions from {dplyr} and the like were not the fastest, and yet, I am using them in the arrow approach as well, and they now run almost 3 times as fast. What’s going on? What’s happening here is that the air object that we read using arrow::open_dataset is not a dataframe, but an arrow dataset. These are special, and work in a different way. But that’s not what’s important: what’s important is that the {dplyr} API can be used to work with these arrow datasets. This means that functions from {dplyr} change the way they work depending on the type of the object they’re dealing with. If it’s a good old regular data frame, some C++ code gets called to perform the computations. If it’s an arrow dataset, libarrow and its black magic get called instead to perform the computations. If you’re familiar with the concept of polymorphism this is it (think of + in Python: 1+1 returns 2, \"a\"+\"b\" returns \"ab\". A different computation gets performed depending on the type of the function’s inputs).\nLet’s now build a basic version of the app, only using {arrow} functions for speed. 
This is the global file:\n\nlibrary(arrow)\nlibrary(dplyr)\nlibrary(rlang)\nlibrary(DT)\n\nair <- arrow::open_dataset(\"data/combined\", format = \"csv\")\n\nThe ui will be quite simple:\n\nui <- function(request){\n fluidPage(\n\n titlePanel(\"Air On Time data\"),\n\n sidebarLayout(\n\n sidebarPanel(\n selectizeInput(\"group_by_selected\", \"Variables to group by:\",\n choices = c(\"Year\", \"Month\", \"DayofMonth\", \"Origin\", \"Dest\"),\n multiple = TRUE,\n selected = c(\"Year\", \"Month\"),\n options = list(\n plugins = list(\"remove_button\"),\n create = TRUE,\n persist = FALSE # keep created choices in dropdown\n )\n ),\n hr(),\n selectizeInput(\"var_to_average\", \"Select variable to average by groups:\",\n choices = c(\"ArrDelay\", \"DepDelay\", \"Distance\"),\n multiple = FALSE,\n selected = \"DepDelay\",\n ),\n hr(),\n actionButton(inputId = \"run_aggregation\",\n label = \"Click here to run aggregation\"),\n hr(),\n bookmarkButton()\n ),\n\n mainPanel(\n DTOutput(\"result\")\n )\n )\n )\n\n}\n\nAnd finally the server:\n\nserver <- function(session, input, output) {\n\n # Numbers get crunched only when the user clicks on the action button\n grouped_data <- eventReactive(input$run_aggregation, {\n air %>%\n group_by(!!!syms(input$group_by_selected)) %>%\n summarise(result = mean(!!sym(input$var_to_average),\n na.rm = TRUE)) %>%\n as.data.frame()\n })\n\n output$result <- renderDT({\n grouped_data()\n })\n\n}\n\nBecause group_by() and mean() expect bare variable names, I convert them from strings to symbols using rlang::syms() and rlang::sym(). The difference between the two is that rlang::syms() is required when a list of strings gets passed down to the function (remember that the user must select several variables to group by), and this is also why !!! are needed (to unquote the list of symbols). Finally, the computed data must be converted back to a data frame using as.data.frame(). This is actually when the computations happen. 
{arrow} collects all the aggregations but does not perform anything until absolutely required. Let’s see the app in action:\n\n\n\n\n\nAs you can see, in terms of User Experience (UX) this is quite poor. When the user clicks on the button nothing seems to be going on for several seconds, until the table appears. Then, when the user changes some options and clicks again on the action button, it looks like the app is crashing.\nLet’s add some visual cues to indicate to the user that something is happening when the button gets clicked. For this, we are going to use the {shinycssloaders} package:\n\ninstall.packages(\"shinycssloaders\")\n\nand simply change the ui to this (and don’t forget to load {shinycssloaders} in the global script!):\n\nui <- function(request){\n fluidPage(\n\n titlePanel(\"Air On Time data\"),\n\n sidebarLayout(\n\n sidebarPanel(\n selectizeInput(\"group_by_selected\", \"Variables to group by:\",\n choices = c(\"Year\", \"Month\", \"DayofMonth\", \"Origin\", \"Dest\"),\n multiple = TRUE,\n selected = c(\"Year\", \"Month\"),\n options = list(\n plugins = list(\"remove_button\"),\n create = TRUE,\n persist = FALSE # keep created choices in dropdown\n )\n ),\n hr(),\n selectizeInput(\"var_to_average\", \"Select variable to average by groups:\",\n choices = c(\"ArrDelay\", \"DepDelay\", \"Distance\"),\n multiple = FALSE,\n selected = \"DepDelay\",\n ),\n hr(),\n actionButton(inputId = \"run_aggregation\",\n label = \"Click here to run aggregation\"),\n hr(),\n bookmarkButton()\n ),\n\n mainPanel(\n # We pass the table output to withSpinner(), so that a spinner\n # is shown while the result is being computed\n DTOutput(\"result\") |>\n withSpinner()\n )\n )\n )\n\n}\n\nThe only difference with before is that now the DTOutput() right at the end gets passed down to withSpinner(). There are several spinners that you can choose, but let’s simply use the default one. 
This is how the app looks now:\n\n\n\n\n\nNow the user gets a visual cue that something is happening. This makes waiting more bearable, but even better than waiting with a spinner is no waiting at all. For this, we are going to enable caching of results. There are several ways that you can cache results inside your app. You can enable the cache on a per-user and per-session basis, or only on a per-user basis. But I think that in our case here, the ideal caching strategy is to keep the cache persistent, and available across sessions. This means that each computation done by any user will get cached and available to any other user. In order to achieve this, you simply have to install the {cachem} package and add the following lines to the global script:\n\nshinyOptions(cache = cachem::cache_disk(\"./app-cache\",\n max_age = Inf))\n\nBy setting the max_age argument to Inf, the cache will never get pruned. By default, the maximum size of the cache is 1GB. You can of course increase it.\nNow, you must also edit the server file like so:\n\nserver <- function(session, input, output) {\n\n # Numbers get crunched only when the user clicks on the action button\n grouped_data <- reactive({\n air %>%\n group_by(!!!syms(input$group_by_selected)) %>%\n summarise(result = mean(!!sym(input$var_to_average),\n na.rm = TRUE)) %>%\n as.data.frame()\n }) %>%\n bindCache(input$group_by_selected,\n input$var_to_average) %>%\n bindEvent(input$run_aggregation)\n\n output$result <- renderDT({\n grouped_data()\n })\n\n}\n\nWe’ve had to change eventReactive() to reactive(), just like in the app where we don’t use an action button to run computations. Then, we pass the reactive object to bindCache(). bindCache() also takes the inputs as arguments. These are used to generate cache keys to retrieve the correct objects from cache. Finally, we pass all this to bindEvent(). This function takes the input referencing the action button. 
This is how we can now bind the computations to the button once again. Let’s test our app now. You will notice that the first time we choose certain options, the computations will take time, as before. But if we perform the same computations again, then the results will be shown instantly:\n\n\n\n\n\nAs you can see, once I go back to a computation that was done in the past, the table appears instantly. At the end of the video I open a terminal and navigate to the directory of the app, and show you the cache. There are several .Rds objects, these are the final data frames that get computed by the app. If the user wants to rerun a previous computation, the correct data frame gets retrieved, making it look like the computation happened instantly, and with another added benefit: as discussed above, the cache is persistent between sessions, so even if the user closes the browser and comes back later, the cache is still there, and other users will also benefit from the cache.\n\n\n7.4.4 Deploying your shiny app\nThe easiest way is certainly to use shinyapps.io. I won’t go into details, but you can read more about it here. You could also get a Virtual Private Server on a website like Vultr or DigitalOcean. When signing up with these services you get some free credit to test things out. If you use my Vultr referral link you get 100USD to test the platform. This is more than enough to get a basic VPS with Ubuntu on it. You can then try to install everything needed to deploy Shiny apps from your VPS. You could follow this guide to deploy from DigitalOcean, which should generalize well to other services like Vultr. Doing this will teach you a lot, and I would highly recommend you do it.\n\n\n7.4.5 References\n\nThe server function\nUsing caching in Shiny to maximize performance\nEngineering Production-Grade Shiny Apps",
+ "text": "7.4 Interactive web applications with {shiny}\n{shiny} is a package developed by Posit to build interactive web applications. These apps can be quite “simple” (for example, an app that shows a graph but in which the user can choose the variable to plot), but can be arbitrarily complex. Some people even go as far as making games with {shiny}. A version for Python is also in alpha, and you can already experiment with it.\nIn this section, I will give a very, very short introduction to {shiny}. This is because {shiny} is so feature-rich that I could spend 20 hours teaching you and even then we would not have seen everything. That being said, with only some cursory knowledge we can build some useful apps. These apps can run locally on your machine, but they’re really only useful if you deploy them on a server, so that users can then use these web apps in their browsers.\n\n7.4.1 The basic structure of a Shiny app\nShiny apps are always made of at least 2 parts: a server and a ui. In general, each of these parts is in a separate script, called server.R and ui.R. It is possible to have another script, called global.R, where you can define variables that you want to be available for both the server and the ui, and to every user of your app.\nLet’s start by building a very basic app. This app will allow users to visualize unemployment data for Luxembourg. For now, let’s say that we want users only to be able to select communes, but not variables. The example code below is based on this official example (this is how I recommend you learn, by the way. Take a look at the different examples there are and adapt them to suit your needs! You can find the examples here). Create a folder called something like my_app and then create three scripts in it:\n\nglobal.R\nserver.R\nui.R\n\nLet’s start with global.R:\n\nlibrary(myPackage)\nlibrary(dplyr)\nlibrary(ggplot2)\n\ndata(\"unemp\")\n\nIn the global.R file, we load the required packages and data. 
This is now available everywhere. Let’s continue with the server.R script:\n\nserver <- function(session, input, output) {\n\n filtered_data <- reactive(\n unemp %>%\n filter(place_name %in% input$place_name_selected)\n )\n\n output$unemp_plot <- renderPlot({\n\n ggplot(data = filtered_data()) +\n theme_minimal() +\n geom_line(aes(year, unemployment_rate_in_percent, color = place_name)) +\n labs(title = paste(\"Unemployment in\", paste(input$place_name_selected, collapse = \", \")))\n\n\n })\n}\n\nSeveral things need to be commented on here: first, the script contains a single function, called server(). This function takes three arguments, session, input and output. I won’t go into details here, but you should know that you will never call the server() function yourself, and that these arguments are required so the function can… function. I will leave a reference at the end of this section with more details. The next important thing is that we defined an object called filtered_data. This is a reactive object. What this means is that this object should get recomputed every time the user interacts with it. But how does the user interact with it? By choosing the place_name he or she wants to see! The predicate inside filter() is place_name %in% input$place_name_selected. Where does that input$place_name_selected come from? This comes from the ui (that we have not written yet). But the idea is that the user will be able to choose place names from a list, and this list will be called place_name_selected and will contain the place names that the user wants to see.\nFinally, we define a new object called output$unemp_plot. The goal of the server() function is to compute things that will be part of the output list. This list, and the objects it contains, then get rendered in the ui. unemp_plot is a ggplot graph that uses the reactive data set we defined first. Notice the () after filtered_data inside the ggplot call. 
These are required; this is how we say that the reactive object must be recomputed. If the plot does not get rendered, the reactive data set does not get computed, since it never gets called.\nOk so now to the ui. Let’s take inspiration from the same example again:\n\nui <- function(request){\n fluidPage(\n\n titlePanel(\"Unemployment in Luxembourg\"),\n\n sidebarLayout(\n\n sidebarPanel(\n selectizeInput(\"place_name_selected\", \"Select place:\",\n choices=unique(unemp$place_name),\n multiple = TRUE,\n selected = c(\"Rumelange\", \"Dudelange\"),\n options = list(\n plugins = list(\"remove_button\"),\n create = TRUE,\n persist = FALSE # keep created choices in dropdown\n )\n ),\n hr(),\n helpText(\"Original data from STATEC\")\n ),\n\n mainPanel(\n plotOutput(\"unemp_plot\")\n )\n )\n )\n\n}\n\nI’ve added some useful things to the ui. First of all, I made it a function of an argument, request. This is useful for bookmarking the state of the app. We’ll add a bookmark button later. The ui is divided into two parts, a sidebar panel, and a main panel. The sidebar panel is where you will typically add dropdown menus, checkboxes, radio buttons, etc, for the users to make various selections. In the main panel, you will show the result of their selections. In the sidebar panel I add a selectizeInput() to create a dynamic dropdown list using the selectize JS library, included with {shiny}. The available choices are all the unique place names contained in our data; I allow users to select multiple place names; by default two communes are selected; and using the options argument I add little “remove” buttons to the selected commune names. Finally, in the main panel I use the plotOutput() function to render the plot. Notice that I use the name of the plot defined in the server, “unemp_plot”. 
Finally, to run this, add a new script, called app.R and add the following line in it:\n\nshiny::runApp(\".\")\n\nYou can now run this script in RStudio, or from any R console, and this should open a web browser with your app.\n\n\n\n\n\nBelieve it or not, but this app contains almost every ingredient you need to know to build shiny apps. But of course, there are many, many other widgets that you can use to give your users even more ways to interact with applications.\n\n\n7.4.2 Slightly more advanced shiny\nLet’s take a look at another, more complex example. Because this second example is much more complex, let’s first take a look at a video of the app in action:\n\n\n\n\n\nThe global file will be almost the same as before:\n\nlibrary(myPackage)\nlibrary(dplyr)\nlibrary(ggplot2)\nlibrary(g2r)\n\ndata(\"unemp\")\n\nenableBookmarking(store = \"url\")\n\nThe only difference is that I load the {g2r} package to create a nice interactive plot, and enable bookmarking of the state of the app using enableBookmarking(store = \"url\"). 
Let’s move on to the ui:\n\nui <- function(request){\n fluidPage(\n\n titlePanel(\"Unemployment in Luxembourg\"),\n\n sidebarLayout(\n\n sidebarPanel(\n selectizeInput(\"place_name_selected\", \"Select place:\",\n choices=unique(unemp$place_name),\n multiple = TRUE,\n selected = c(\"Rumelange\", \"Dudelange\"),\n options = list(\n plugins = list(\"remove_button\"),\n create = TRUE,\n persist = FALSE # keep created choices in dropdown\n )\n ),\n hr(),\n # To allow users to select the variable, we add a selectInput\n # (not selectizeInput, like above)\n # don’t forget to use input$variable_selected in the\n # server function later!\n selectInput(\"variable_selected\", \"Select variable to plot:\",\n choices = setdiff(unique(colnames(unemp)), c(\"year\", \"place_name\", \"level\")),\n multiple = FALSE,\n selected = \"unemployment_rate_in_percent\",\n ),\n hr(),\n # Just for illustration purposes, these radioButtons will not be bound\n # to the actionButton.\n radioButtons(inputId = \"legend_position\",\n label = \"Choose legend position\",\n choices = c(\"top\", \"bottom\", \"left\", \"right\"),\n selected = \"right\",\n inline = TRUE),\n hr(),\n actionButton(inputId = \"render_plot\",\n label = \"Click here to generate plot\"),\n hr(),\n helpText(\"Original data from STATEC\"),\n hr(),\n bookmarkButton()\n ),\n\n mainPanel(\n # We add a tabsetPanel with two tabs. The first tab show\n # the plot made using ggplot the second tab\n # shows the plot using g2r\n tabsetPanel(\n tabPanel(\"ggplot version\", plotOutput(\"unemp_plot\")),\n tabPanel(\"g2r version\", g2Output(\"unemp_plot_g2r\"))\n )\n )\n )\n )\n\n}\n\nThere are many new things. Everything is explained in the comments within the script itself so take a look at them. What’s important to notice, is that I now added two buttons, an action button, and a bookmark button. The action button will be used to draw the plots. 
This means that the user will choose the options for the plot, and then the plot will only appear once the user clicks on the button. This is quite useful in cases where computations take time to run, and you don’t want every reactive object to get recomputed as soon as the user interacts with the app. This way, only once every selection has been made can the user give the green light to the app to compute everything.\nAt the bottom of the ui you’ll see that I’ve added a tabsetPanel() with some tabPanel()s. This is where the graphs “live”. Let’s move on to the server script:\n\nserver <- function(session, input, output) {\n\n # Because I want the plots to only render once the user clicks the \n # actionButton, I need to move every interactive, or reactive, element into\n # an eventReactive() function. eventReactive() waits for something to \"happen\"\n # in order to let the reactive variables run. If you don’t do that, then\n # when the user interacts with the app, these reactive variables will run\n # which we do not want.\n\n # Data only gets filtered once the user clicks on the actionButton\n filtered_data <- eventReactive(input$render_plot, {\n unemp %>%\n filter(place_name %in% input$place_name_selected)\n })\n\n\n # The variable the user selects gets passed down to the plot only once the user\n # clicks on the actionButton.\n # If you don’t do this, what will happen is that the variable will then update the plot\n # even when the user does not click on the actionButton\n variable_selected <- eventReactive(input$render_plot, {\n input$variable_selected\n })\n\n # The plot title only gets generated once the user clicks on the actionButton\n # If you don’t do this, what will happen is that the title of the plot will get\n # updated even when the user does not click on the actionButton\n plot_title <- eventReactive(input$render_plot, {\n paste(variable_selected(), \"for\", paste(input$place_name_selected, collapse = \", \"))\n })\n\n output$unemp_plot <- 
renderPlot({\n\n ggplot(data = filtered_data()) +\n theme_minimal() +\n # Because the selected variable is a string, we need to convert it to a symbol\n # using rlang::sym and evaluate it using !!. This is because the aes() function\n # expects bare variable names, and not strings.\n # Because this is something that developers have to use often in shiny apps,\n # there is a version of aes() that works with strings, called aes_string()\n # You can use both approaches interchangeably.\n #geom_line(aes(year, !!rlang::sym(variable_selected()), color = place_name)) +\n geom_line(aes_string(\"year\", variable_selected(), color = \"place_name\")) +\n labs(title = plot_title()) +\n theme(legend.position = input$legend_position)\n\n\n })\n\n output$unemp_plot_g2r <- renderG2({\n\n g2(data = filtered_data()) %>%\n # g2r’s asp() requires bare variable names\n fig_line(asp(year, !!rlang::sym(variable_selected()), color = place_name)) %>%\n # For some reason, the title does not show...\n subject(plot_title()) %>%\n legend_color(position = input$legend_position)\n\n })\n}\n\nWhat’s new here, is that I now must redefine the reactive objects in such a way that they only get run once the user clicks the button. This is why every reactive object (but one, the position of the legend) is now wrapped by eventReactive(). eventReactive() waits for a trigger, in this case the clicking of the action button, to run the reactive object. eventReactive() takes the action button ID as an input. I’ve also defined the plot title as a reactive value, not only the dataset as before, because if I didn’t do it, then the title of the plot would get updated as the user would choose other communes, but the contents of the plot, that depend on the data, would not get updated. To avoid the title and the plot to get desynched, I need to also wrap it around eventReactive(). You can see this behaviour by changing the legend position. 
The legend position gets updated without the user needing to click the button. This is because I have not wrapped the legend position inside eventReactive().\nFinally, I keep the {ggplot2} graph, but also remake it using {g2r}, to illustrate how it works inside a Shiny app.\nTo conclude this section, we will take a look at one last app. This app will allow users to do data aggregation on a relatively large dataset, so computations will take some time. The app will illustrate how to best deal with this.\n\n\n7.4.3 Basic optimization of Shiny apps\nThe app we will build now requires what is sometimes referred to as medium size data. Medium size data is data that is far from being big data, but already big enough that handling it requires some thought, especially in this scenario. What we want to do is build an app that will allow users to do some aggregations on this data. Because the size of the data is not trivial, these computations will take some time to run. So we need to think about certain strategies to avoid frustrating our users. The file we will be using can be downloaded from here. We’re not going to use the exact same data set though; I have prepared a smaller version that will be more than enough for our purposes. But the strategies that we are going to implement here will also work for the original, much larger, dataset. You can get the smaller version here. Uncompressed it’ll be a 2.4GB file. Not big data in any sense, but big enough to be annoying to handle without the use of some optimization strategies.\nOne such strategy is only letting the computations run once the user gives the green light by clicking on an action button. This is what we have seen in the previous example. The next obvious strategy is to use packages that are optimized for speed. It turns out that the functions we have seen until now, from packages like {dplyr} and the like, are not the fastest. Their ease of use and expressiveness come at a speed cost. 
So we will need to switch to something faster. We will do the same to read in the data.\nThis faster solution is the {arrow} package, which is an interface to the Arrow software developed by Apache.\nThe final strategy is to enable caching in the app.\nSo first, install the {arrow} package by running install.packages(\"arrow\"). This will compile libarrow from source on Linux and might take some time, so perhaps go grab a coffee.\nBefore building the app, let me perform a very simple benchmark. The script below reads in the data, then performs some aggregations. This is done once using standard {tidyverse} functions and once using {arrow}:\n\nstart_tidy <- Sys.time()\n # {vroom} is able to read in larger files than {readr}\n # I could not get this file into R using readr::read_csv\n # my RAM would get maxed out\n air <- vroom::vroom(\"data/combined\")\n\n mean_dep_delay <- air |>\n dplyr::group_by(Year, Month, DayofMonth) |>\n dplyr::summarise(mean_delay = mean(DepDelay, na.rm = TRUE))\nend_tidy <- Sys.time()\n\ntime_tidy <- end_tidy - start_tidy\n\n\nstart_arrow <- Sys.time()\n air <- arrow::open_dataset(\"data/combined\", format = \"csv\")\n\n mean_dep_delay <- air |>\n dplyr::group_by(Year, Month, DayofMonth) |>\n dplyr::summarise(mean_delay = mean(DepDelay, na.rm = TRUE)) |>\n # arrow queries are lazy: collect() triggers the actual computation\n dplyr::collect()\nend_arrow <- Sys.time()\n\ntime_arrow <- end_arrow - start_arrow\n\ntime_tidy\ntime_arrow\n\nThe “tidy” approach took 17 seconds, while the arrow approach took 6 seconds. This is an impressive improvement, but put yourself in the shoes of a user who has to wait 6 seconds for each query. That would get very annoying, very quickly. So the other strategy that we will use is to provide some visual cue that computations are running, and then we will go one step further and use caching of results in the Shiny app.\nBut before we continue, you may be confused by the code above. 
After all, I told you before that functions from {dplyr} and the like were not the fastest, and yet, I am using them in the arrow approach as well, and they now run almost 3 times as fast. What’s going on? What’s happening here is that the air object that we read using arrow::open_dataset is not a data frame, but an arrow dataset. These are special, and work in a different way. But that’s not what’s important: what’s important is that the {dplyr} API can be used to work with these arrow datasets. This means that functions from {dplyr} change the way they work depending on the type of the object they’re dealing with. If it’s a good old regular data frame, some C++ code gets called to perform the computations. If it’s an arrow dataset, libarrow and its black magic get called instead to perform the computations. If you’re familiar with the concept of polymorphism, this is it (think of + in Python: 1+1 returns 2, \"a\"+\"b\" returns \"ab\". A different computation gets performed depending on the type of the function’s inputs).\nLet’s now build a basic version of the app, only using {arrow} functions for speed. 
This is the global file:\n\nlibrary(arrow)\nlibrary(dplyr)\nlibrary(rlang)\nlibrary(DT)\n\nair <- arrow::open_dataset(\"data/combined\", format = \"csv\")\n\nThe ui will be quite simple:\n\nui <- function(request){\n fluidPage(\n\n titlePanel(\"Air On Time data\"),\n\n sidebarLayout(\n\n sidebarPanel(\n selectizeInput(\"group_by_selected\", \"Variables to group by:\",\n choices = c(\"Year\", \"Month\", \"DayofMonth\", \"Origin\", \"Dest\"),\n multiple = TRUE,\n selected = c(\"Year\", \"Month\"),\n options = list(\n plugins = list(\"remove_button\"),\n create = TRUE,\n persist = FALSE # keep created choices in dropdown\n )\n ),\n hr(),\n selectizeInput(\"var_to_average\", \"Select variable to average by groups:\",\n choices = c(\"ArrDelay\", \"DepDelay\", \"Distance\"),\n multiple = FALSE,\n selected = \"DepDelay\",\n ),\n hr(),\n actionButton(inputId = \"run_aggregation\",\n label = \"Click here to run aggregation\"),\n hr(),\n bookmarkButton()\n ),\n\n mainPanel(\n DTOutput(\"result\")\n )\n )\n )\n\n}\n\nAnd finally the server:\n\nserver <- function(session, input, output) {\n\n # Numbers get crunched only when the user clicks on the action button\n grouped_data <- eventReactive(input$run_aggregation, {\n air %>%\n group_by(!!!syms(input$group_by_selected)) %>%\n summarise(result = mean(!!sym(input$var_to_average),\n na.rm = TRUE)) %>%\n as.data.frame()\n })\n\n output$result <- renderDT({\n grouped_data()\n })\n\n}\n\nBecause group_by() and mean() expect bare variable names, I convert them from strings to symbols using rlang::syms() and rlang::sym(). The difference between the two is that rlang::syms() is required when a list of strings gets passed down to the function (remember that the user must select several variables to group by), and this is also why !!! are needed (to unquote the list of symbols). Finally, the computed data must be converted back to a data frame using as.data.frame(). This is actually when the computations happen. 
{arrow} collects all the aggregations but does not perform anything until absolutely required. Let’s see the app in action:\n\n\n\n\n\nAs you can see, in terms of User Experience (UX) this is quite poor. When the user clicks on the button nothing seems to be going on for several seconds, until the table appears. Then, when the user changes some options and clicks again on the action button, it looks like the app is crashing.\nLet’s add some visual cues to indicate to the user that something is happening when the button gets clicked. For this, we are going to use the {shinycssloaders} package:\n\ninstall.packages(\"shinycssloaders\")\n\nand simply change the ui to this (and don’t forget to load {shinycssloaders} in the global script!):\n\nui <- function(request){\n fluidPage(\n\n titlePanel(\"Air On Time data\"),\n\n sidebarLayout(\n\n sidebarPanel(\n selectizeInput(\"group_by_selected\", \"Variables to group by:\",\n choices = c(\"Year\", \"Month\", \"DayofMonth\", \"Origin\", \"Dest\"),\n multiple = TRUE,\n selected = c(\"Year\", \"Month\"),\n options = list(\n plugins = list(\"remove_button\"),\n create = TRUE,\n persist = FALSE # keep created choices in dropdown\n )\n ),\n hr(),\n selectizeInput(\"var_to_average\", \"Select variable to average by groups:\",\n choices = c(\"ArrDelay\", \"DepDelay\", \"Distance\"),\n multiple = FALSE,\n selected = \"DepDelay\",\n ),\n hr(),\n actionButton(inputId = \"run_aggregation\",\n label = \"Click here to run aggregation\"),\n hr(),\n bookmarkButton()\n ),\n\n mainPanel(\n # The table output gets piped into withSpinner(), which shows\n # a spinner while the table is being computed\n DTOutput(\"result\") |>\n withSpinner()\n )\n )\n )\n\n}\n\nThe only difference with before is that now the DTOutput() right at the end gets passed down to withSpinner(). There are several spinners that you can choose, but let’s simply use the default one. 
This is how the app looks now:\n\n\n\n\n\nNow the user gets a visual cue that something is happening. This makes waiting more bearable, but even better than waiting with a spinner is no waiting at all. For this, we are going to enable caching of results. There are several ways that you can cache results inside your app. You can enable the cache on a per-user and per-session basis, or only on a per-user basis. But I think that in our case here, the ideal caching strategy is to keep the cache persistent, and available across sessions. This means that each computation done by any user will get cached and available to any other user. In order to achieve this, you simply have to install the {cachem} package and add the following lines to the global script:\n\nshinyOptions(cache = cachem::cache_disk(\"./app-cache\",\n max_age = Inf))\n\nBy setting the max_age argument to Inf, cached objects never expire because of their age. By default, the maximum size of the cache is 1GB. You can of course increase it.\nNow, you must also edit the server file like so:\n\nserver <- function(session, input, output) {\n\n # Numbers get crunched only when the user clicks on the action button\n grouped_data <- reactive({\n air %>%\n group_by(!!!syms(input$group_by_selected)) %>%\n summarise(result = mean(!!sym(input$var_to_average),\n na.rm = TRUE)) %>%\n as.data.frame()\n }) %>%\n bindCache(input$group_by_selected,\n input$var_to_average) %>%\n bindEvent(input$run_aggregation)\n\n output$result <- renderDT({\n grouped_data()\n })\n\n}\n\nWe’ve had to change eventReactive() to reactive(), just like in the app where we don’t use an action button to run computations. Then, we pass the reactive object to bindCache(). bindCache() also takes the inputs as arguments. These are used to generate cache keys to retrieve the correct objects from cache. Finally, we pass all this to bindEvent(). This function takes the input referencing the action button. 
This is how we can now bind the computations to the button once again. Let’s test our app now. You will notice that the first time we choose certain options, the computations will take time, as before. But if we perform the same computations again, then the results will be shown instantly:\n\n\n\n\n\nAs you can see, once I go back to a computation that was done in the past, the table appears instantly. At the end of the video I open a terminal and navigate to the directory of the app, and show you the cache. There are several .Rds objects, these are the final data frames that get computed by the app. If the user wants to rerun a previous computation, the correct data frame gets retrieved, making it look like the computation happened instantly, and with another added benefit: as discussed above, the cache is persistent between sessions, so even if the user closes the browser and comes back later, the cache is still there, and other users will also benefit from the cache.\n\n\n7.4.4 Deploying your shiny app\nThe easiest way is certainly to use shinyapps.io. I won’t go into details, but you can read more about it here. You could also get a Virtual Private Server on a website like Vultr or DigitalOcean. When signing up with these services you get some free credit to test things out. If you use my Digital Ocean referral link you get 200USD to test the platform. This is more than enough to get a basic VPS with Ubuntu on it. You can then try to install everything needed to deploy Shiny apps from your VPS. You could follow this guide to deploy from DigitalOcean, which should generalize well to other services like Vultr. Doing this will teach you a lot, and I would highly recommend you do it.\n\n\n7.4.5 References\n\nThe server function\nUsing caching in Shiny to maximize performance\nEngineering Production-Grade Shiny Apps",
"crumbs": [
"7 Data products"
]
@@ -504,7 +504,7 @@
"href": "08-products.html#how-to-build-data-products-using-targets",
"title": "7 Data products",
"section": "7.5 How to build data products using {targets}",
- "text": "7.5 How to build data products using {targets}\nWe will now put everything together and create a {targets} pipeline to build a data product from start to finish. Let’s go back to one of the pipelines we wrote in Chapter 7. If you’re using RStudio, start a new project and make it renv-enabled by checking the required checkbox. If you’re using another editor, start with an empty folder and run renv::init(). Now create a new script with the following code:\n\nlibrary(targets)\nlibrary(myPackage)\nlibrary(dplyr)\nlibrary(ggplot2)\nsource(\"functions.R\")\n\nlist(\n tar_target(\n unemp_data,\n get_data()\n ),\n\n tar_target(\n lux_data,\n clean_unemp(unemp_data,\n place_name_of_interest = \"Luxembourg\",\n level_of_interest = \"Country\",\n col_of_interest = active_population)\n ),\n\n tar_target(\n canton_data,\n clean_unemp(unemp_data,\n level_of_interest = \"Canton\",\n col_of_interest = active_population)\n ),\n\n tar_target(\n commune_data,\n clean_unemp(unemp_data,\n place_name_of_interest = c(\"Luxembourg\", \"Dippach\", \"Wiltz\", \"Esch/Alzette\", \"Mersch\"),\n col_of_interest = active_population)\n ),\n\n tar_target(\n lux_plot,\n make_plot(lux_data)\n ),\n\n tar_target(\n canton_plot,\n make_plot(canton_data)\n ),\n\n tar_target(\n commune_plot,\n make_plot(commune_data)\n )\n\n)\n\nThis pipeline reads in data, then filters data and produces some plots. In another version of this pipeline we wrote the plots to disk. Now we will add them to a Quarto document, using the tar_quarto() function that can be found in the {tarchetypes} packages (so install it if this is not the case yet). {tarchetypes} provides functions to define further types of targets, such as tar_quarto() which makes it possible to render Quarto documents from a {targets} pipeline. But before rendering a document, we need to write this document. 
This is what the document could look like:\n---\ntitle: \"Reading objects from a targets pipeline\"\nauthor: \"Bruno Rodrigues\"\ndate: today\n---\n\nThis document loads three plots that were made using a `{targets}` pipeline.\n\n```{r}\ntargets::tar_read(lux_plot)\n```\n\n```{r}\ntargets::tar_read(canton_plot)\n```\n\n```{r}\ntargets::tar_read(commune_plot)\n```\nHere is what the final pipeline would look like (notice that I’ve added library(tarchetypes) and library(quarto) to the list of packages getting called):\n\nlibrary(targets)\nlibrary(tarchetypes)\nlibrary(myPackage)\nlibrary(dplyr)\nlibrary(ggplot2)\nlibrary(quarto)\nsource(\"functions.R\")\n\nlist(\n\n tar_target(\n unemp_data,\n get_data()\n ),\n\n tar_target(\n lux_data,\n clean_unemp(unemp_data,\n place_name_of_interest = \"Luxembourg\",\n level_of_interest = \"Country\",\n col_of_interest = active_population)\n ),\n\n tar_target(\n canton_data,\n clean_unemp(unemp_data,\n level_of_interest = \"Canton\",\n col_of_interest = active_population)\n ),\n\n tar_target(\n commune_data,\n clean_unemp(unemp_data,\n place_name_of_interest = c(\"Luxembourg\", \"Dippach\", \"Wiltz\", \"Esch/Alzette\", \"Mersch\"),\n col_of_interest = active_population)\n ),\n\n tar_target(\n lux_plot,\n make_plot(lux_data)\n ),\n\n tar_target(\n canton_plot,\n make_plot(canton_data)\n ),\n\n tar_target(\n commune_plot,\n make_plot(commune_data)\n ),\n\n tar_quarto(\n my_doc,\n path = \"my_doc.qmd\"\n )\n\n)\n\nMake sure that this pipeline runs using tar_make(). If it does, and you’re done with it, don’t forget to run renv::snapshot() to save the project’s dependencies in a lock file. Again, take a look at the lock file to make extra sure that your package is correctly being versioned. 
As a reminder, you should see something like this:\n\"myPackage\": {\n \"Package\": \"myPackage\",\n \"Version\": \"0.1.0\",\n \"Source\": \"GitHub\",\n \"RemoteType\": \"github\",\n \"RemoteHost\": \"api.github.com\",\n \"RemoteRepo\": \"myPackage\",\n \"RemoteUsername\": \"b-rodrigues\",\n \"RemoteRef\": \"e9d9129de3047c1ecce26d09dff429ec078d4dae\",\n \"RemoteSha\": \"e9d9129de3047c1ecce26d09dff429ec078d4dae\",\n \"Hash\": \"4740b43847e10e012bad2b8a1a533433\",\n \"Requirements\": [\n \"dplyr\",\n \"janitor\",\n \"rlang\"\n ]\n},\nWhat’s really important is that you find the “RemoteXXXX” fields. We are now ready to push this project to github.com. Don’t forget to first edit the .gitignore file and add the renv folder to it. This is the folder that contains the downloaded packages, and it can get quite big. It is better to not push it. We are now done with building an almost 100% reproducible pipeline! If your product is a Shiny app, you may want to put as many calculations as possible in the {targets} pipeline. You can then use tar_load() or tar_read() inside the global.R file.",
+ "text": "7.5 How to build data products using {targets}\nWe will now put everything together and create a {targets} pipeline to build a data product from start to finish. Let’s go back to one of the pipelines we wrote in Chapter 7. If you’re using RStudio, start a new project and make it renv-enabled by checking the required checkbox. If you’re using another editor, start with an empty folder and run renv::init(). Now create a new script with the following code (create the script functions.R and put the get_data() function in it, as described here):\n\nlibrary(targets)\nlibrary(myPackage)\nlibrary(dplyr)\nlibrary(ggplot2)\nsource(\"functions.R\")\n\nlist(\n tar_target(\n unemp_data,\n get_data()\n ),\n\n tar_target(\n lux_data,\n clean_unemp(unemp_data,\n place_name_of_interest = \"Luxembourg\",\n level_of_interest = \"Country\",\n col_of_interest = active_population)\n ),\n\n tar_target(\n canton_data,\n clean_unemp(unemp_data,\n level_of_interest = \"Canton\",\n col_of_interest = active_population)\n ),\n\n tar_target(\n commune_data,\n clean_unemp(unemp_data,\n place_name_of_interest = c(\"Luxembourg\", \"Dippach\", \"Wiltz\", \"Esch/Alzette\", \"Mersch\"),\n col_of_interest = active_population)\n ),\n\n tar_target(\n lux_plot,\n make_plot(lux_data)\n ),\n\n tar_target(\n canton_plot,\n make_plot(canton_data)\n ),\n\n tar_target(\n commune_plot,\n make_plot(commune_data)\n )\n\n)\n\nThis pipeline reads in data, then filters data and produces some plots. In another version of this pipeline we wrote the plots to disk. Now we will add them to a Quarto document, using the tar_quarto() function that can be found in the {tarchetypes} packages (so install it if this is not the case yet). {tarchetypes} provides functions to define further types of targets, such as tar_quarto() which makes it possible to render Quarto documents from a {targets} pipeline. But before rendering a document, we need to write this document. 
This is what the document could look like:\n---\ntitle: \"Reading objects from a targets pipeline\"\nauthor: \"Bruno Rodrigues\"\ndate: today\n---\n\nThis document loads three plots that were made using a `{targets}` pipeline.\n\n```{r}\ntargets::tar_read(lux_plot)\n```\n\n```{r}\ntargets::tar_read(canton_plot)\n```\n\n```{r}\ntargets::tar_read(commune_plot)\n```\nHere is what the final pipeline would look like (notice that I’ve added library(tarchetypes) and library(quarto) to the list of packages getting called):\n\nlibrary(targets)\nlibrary(tarchetypes)\nlibrary(myPackage)\nlibrary(dplyr)\nlibrary(ggplot2)\nlibrary(quarto)\nsource(\"functions.R\")\n\nlist(\n\n tar_target(\n unemp_data,\n get_data()\n ),\n\n tar_target(\n lux_data,\n clean_unemp(unemp_data,\n place_name_of_interest = \"Luxembourg\",\n level_of_interest = \"Country\",\n col_of_interest = active_population)\n ),\n\n tar_target(\n canton_data,\n clean_unemp(unemp_data,\n level_of_interest = \"Canton\",\n col_of_interest = active_population)\n ),\n\n tar_target(\n commune_data,\n clean_unemp(unemp_data,\n place_name_of_interest = c(\"Luxembourg\", \"Dippach\", \"Wiltz\", \"Esch/Alzette\", \"Mersch\"),\n col_of_interest = active_population)\n ),\n\n tar_target(\n lux_plot,\n make_plot(lux_data)\n ),\n\n tar_target(\n canton_plot,\n make_plot(canton_data)\n ),\n\n tar_target(\n commune_plot,\n make_plot(commune_data)\n ),\n\n tar_quarto(\n my_doc,\n path = \"my_doc.qmd\"\n )\n\n)\n\nMake sure that this pipeline runs using tar_make(). If it does, and you’re done with it, don’t forget to run renv::snapshot() to save the project’s dependencies in a lock file. Again, take a look at the lock file to make extra sure that your package is correctly being versioned. 
As a reminder, you should see something like this:\n\"myPackage\": {\n \"Package\": \"myPackage\",\n \"Version\": \"0.1.0\",\n \"Source\": \"GitHub\",\n \"RemoteType\": \"github\",\n \"RemoteHost\": \"api.github.com\",\n \"RemoteRepo\": \"myPackage\",\n \"RemoteUsername\": \"b-rodrigues\",\n \"RemoteRef\": \"e9d9129de3047c1ecce26d09dff429ec078d4dae\",\n \"RemoteSha\": \"e9d9129de3047c1ecce26d09dff429ec078d4dae\",\n \"Hash\": \"4740b43847e10e012bad2b8a1a533433\",\n \"Requirements\": [\n \"dplyr\",\n \"janitor\",\n \"rlang\"\n ]\n},\nWhat’s really important is that you find the “RemoteXXXX” fields. We are now ready to push this project to github.com. Don’t forget to first edit the .gitignore file and add the renv folder to it. This is the folder that contains the downloaded packages, and it can get quite big. It is better to not push it. We are now done with building an almost 100% reproducible pipeline! If your product is a Shiny app, you may want to put as many calculations as possible in the {targets} pipeline. You can then use tar_load() or tar_read() inside the global.R file.",
"crumbs": [
"7 Data products"
]