Shiny by RStudio

Using dplyr and pool to query a database

INSTALLATION NOTE: Make sure you have the most current development version of both the pool and the shiny packages:

devtools::install_github("rstudio/pool")
devtools::install_github("rstudio/shiny")

Motivation

Many of you might be using dplyr to connect to your database, especially because it allows you to query it using R code, rather than the more cumbersome SQL equivalent. For this reason and for performance, we’d actually recommend users to use dplyr by default, unless they actually need more functionality and must turn to DBI instead (when it comes to databases, dplyr only allows you to do queries that can be transformed into a SELECT SQL statement). If you’re not familiar with dplyr applied to databases, make sure to read the section about this on the first article of this series.

Here’s why dplyr tends to perform better than DBI (from dplyr’s vignette about databases):

When working with databases, dplyr tries to be as lazy as possible. It’s lazy in two ways:

  • It never pulls data back to R unless you explicitly ask for it.
  • It delays doing any work until the last possible minute, collecting together everything you want to do then sending that to the database in one step.

But on top of this, it’s possible to get even better performance by using a pool. Most importantly, however, you should use a pool for its connection management benefits. To do so, you only need to do minor changes to your existing code.

Combining dplyr and pool

If you’re used to using dplyr to query databases, then you’re familiar with the src_* functions (e.g. src_mysql). In those functions, you typically have to specify the drv (database driver) and all the authorization arguments required to connect to the database. If you’re using pool, however, you’ve already did all of this when you created the Pool object with dbPool. Therefore, the pool package introduces a new src_pool function that allows you to simply pass in a Pool object and get back a dplyr-compatible connection. Then, you can use the resulting object just like in dplyr (see the examples below).

Comparison to dplyr code

First, let’s consider how you’d connect to and query a MySQL database using only dplyr:

my_db <- src_mysql(
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  user = "guest",
  password = "guest"
)
# get the first 5 rows:
my_db %>% tbl("City") %>% head(5)
##   ID           Name CountryCode      District Population
## 1  1          Kabul         AFG         Kabol    1780000
## 2  2       Qandahar         AFG      Qandahar     237500
## 3  3          Herat         AFG         Herat     186800
## 4  4 Mazar-e-Sharif         AFG         Balkh     127800
## 5  5      Amsterdam         NLD Noord-Holland     731200

Now, let’s do the same thing using a Pool object:

pool <- dbPool(
  drv = RMySQL::MySQL(),
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  username = "guest",
  password = "guest"
)
# get the first 5 rows:
src_pool(pool) %>% tbl("City") %>% head(5)
##   ID           Name CountryCode      District Population
## 1  1          Kabul         AFG         Kabol    1780000
## 2  2       Qandahar         AFG      Qandahar     237500
## 3  3          Herat         AFG         Herat     186800
## 4  4 Mazar-e-Sharif         AFG         Balkh     127800
## 5  5      Amsterdam         NLD Noord-Holland     731200

Shiny apps

What’s the advantage of using pool with dplyr, rather than just using dplyr to query a database? As usual with pool, the answer is performance and connection management. In order to reap these benefits within a Shiny app, however, you need to be careful about where you create your pool and where you use src_pool. The same idea that we used for combining DBI and pool, also applies here. First, you should create your pool at the top of server.R (or in global.R), but outside the actual server function. Then, for each query, you should use src_pool. In a Shiny app, this means that each reactive or function that queries the database has its own call to src_pool. For example, here’s the same app used in the pool basics article, but written with dplyr instead of DBI:

library(shiny)
library(DBI)
library(pool)

pool <- dbPool(
  drv = RMySQL::MySQL(),
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  username = "guest",
  password = "guest"
)

ui <- fluidPage(
  textInput("ID", "Enter your ID:", "5"),
  tableOutput("tbl"),
  numericInput("nrows", "How many cities to show?", 10),
  plotOutput("popPlot")
)

server <- function(input, output, session) {
  output$tbl <- renderTable({
    src_pool(pool) %>% tbl("City") %>%
      filter(ID == input$ID)
  })
  output$popPlot <- renderPlot({
    df <- src_pool(pool) %>% tbl("City") %>%
      head(as.integer(input$nrows)[1])
    pop <- df$Population
    names(pop) <- df$Name
    barplot(pop)
  })
}

shinyApp(ui, server)

You’ll note that there is no need to do our own input sanitizing for SQL injection prevention (i.e. no need to call a function like DBI’s sqlInterpolate). This is because dplyr acts a middleman between you and the actual SQL query sent to the database; as that middleman, it also sanitizes your inputs for you.



We love it when R users help each other, but RStudio does not monitor or answer the comments in this thread. If you'd like to get specific help, we recommend the Shiny Discussion Forum for in depth discussion of Shiny related questions and How to get help for a list of the best ways to get help with R code.

comments powered by Disqus