Database basics - dplyr and DBI

As Shiny apps grow and become more complex, a recurring issue has been that of integrating an external database into an app. While this is already possible, so far it’s been mostly up to the app authors to figure out the appropriate database driver for R and how to manage the database connections within the app itself. The goal of this series of articles is to demystify this process and to introduce some brand new features that should make it easier.

Author

Bárbara Borges Ribeiro

Published

June 14, 2017

Note (6/22/17): pool is now compatible with dplyr 0.7.0 and the brand new dbplyr. If you want to continue using pool with an older version of dplyr, please install pool@dplyr-pre-0.7.0-compat (devtools::install_github("rstudio/pool@dplyr-pre-0.7.0-compat).

Installation

There are four packages that you need throughout this series of articles. Here are the installation instructions, so your code runs smoothly:

# get shiny, DBI, dplyr and dbplyr from CRAN
install.packages("shiny")
install.packages("DBI")
install.packages("dplyr")
install.packages("dbplyr")

# get pool from GitHub, since it's not yet on CRAN
devtools::install_github("rstudio/pool")

Overview

In particular, we will cover:

how to use the dplyr package to read data from an external database;
how to use the DBI package to hook up to an external database;
how to prevent SQL injections;
how to manage connections, prevent leaks and ensure the best performance using the pool package;
how to integrate the pool package with dplyr.

Note that it isn’t always ideal to link up to an external database, as it may break and it is certainly more computationally expensive than dealing with local data. To quote Hadley on when to use dplyr with databases vs in-memory data:

As well as working with local in-memory data stored in data frames, dplyr also works with remote on-disk data stored in databases. This is particularly useful in two scenarios:

Your data is already in a database.

You have so much data that it does not all fit into memory simultaneously and you need to use some external storage engine.

(If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating).

It is, however, outside of the scope of these articles to explain further when you do need to connect to a database; we will just assume that you’ve thought through your options and concluded that this is the best way to go.

Finally, this series will only cover relational databases – in particular, relational databases whose drivers are DBI-compliant (which means that the R interface must have followed these steps these steps): SQLite, MySQL and PostgreSQL.

`dplyr` package

The easiest way to hook up to an external database from within your Shiny app is to use dplyr. The dplyr package is a very popular data manipulation package that aims to provide a function for each basic verb of data manipulation:

filter() (and slice())
arrange()
select() (and rename())
distinct()
mutate() (and transmute())
summarise()
sample_n() (and sample_frac())

If you’re not familiar with the basics of dplyr (how to apply these verbs to data frames in local storage), you should take the time to do so through this introductory vignette). But in addition to in-memory data, dplyr also works with remote databases. And by translating your R code into the appropriate SQL, it allows you to work with both types of data using the same set of tools (which can be especially useful if you’re not familiar with SQL). Hadley has written a vignette that explains in detail how to use dplyr in relation to databases. You should really check that out if you want a more comprehensive overview of this subject (in here, we will just provide a brief example). Here’s an excerpt about the rationale to support databases in dplyr (from here):

The motivation for supporting databases in dplyr is that you never pull down the right subset or aggregate from the database the first time, and usually you have to iterate between R and SQL many times before you get the perfect dataset. Switching between languages is cognitively challenging (especially because R and SQL are so perilously similar), so dplyr allows you to write R code that is automatically translated to SQL. The goal of dplyr is not to replace every SQL function with an R function: that would be difficult and error prone. Instead, dplyr only generates SELECT statements, the SQL you write most often as an analyst.

Sample usage

Here’s how to read the first five rows of a table from a remote database:

library(pool)
library(dplyr)

my_db <- dbPool(
  RMySQL::MySQL(), 
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  username = "guest",
  password = "guest"
)

# get the first 5 rows:
my_db %>% tbl("City") %>% head(5)
## # Source:   lazy query [?? x 5]
## # Database: mysql 10.0.17-MariaDB [guest@shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com:/shinydemo]
##      ID           Name CountryCode      District Population
##   <dbl>          <chr>       <chr>         <chr>      <dbl>
## 1     1          Kabul         AFG         Kabol    1780000
## 2     2       Qandahar         AFG      Qandahar     237500
## 3     3          Herat         AFG         Herat     186800
## 4     4 Mazar-e-Sharif         AFG         Balkh     127800
## 5     5      Amsterdam         NLD Noord-Holland     731200

As you can see, it’s pretty straightforward. With very few changes, you can make this more efficient, and behave a lot better within a Shiny app, if you combine it with the pool package (see this article on the subject). But to do so, you must understand the general idea behind pooling connections, so make sure you check this article about pool basics first.

Note

Throughout this series of articles, all the examples provided will use the “shinydemo” toy database above (a MariaDB – which is just a MySQL fork – database, hosted on Amazon Web Services). This was created especially for learning purposes and you should feel free to play with it using the guest credentials above (these will allow you to read whatever you want from the table, but you obviously won’t be able to write into it or modify it in any way). The “shinydemo” database is simply the well-known “world” database, available for download from here. Some general info:

The data set is available as a set of three tables:

Country: Information about countries of the world.

City: Information about some of the cities in those countries.

CountryLanguage: Languages spoken in each country.

Note: The Country table was modified to remove all rows that contained non-ASCII characters, so its total number of rows (3427) is smaller than the original (4079). This is to ensure that you can access the entire table without running into encoding problems, given that the default encoding is different depending on the platform you’re using (Unix or Windows), which can be an issue for non-ASCII characters. The other two tables were left unchanged.

`DBI` package

If you need to do anything more elaborate than fairly simple SELECT queries, dplyr won’t be able to help you. In that case, we heartily recommend that you use DBI to connect to your database if there is a suitable driver. Here’s a tidbit about DBI from its github page:

The DBI package defines a common interface between the R and database management systems (DBMS). The interface defines a small set of classes and methods similar in spirit to Perl’s DBI, Java’s JDBC, Python’s DB-API, and Microsoft’s ODBC. It defines a set of classes and methods defines what operations are possible and how they are performed:

connect/disconnect to the DBMS

create and execute statements in the DBMS

extract results/output from statements

error/exception handling

information (meta-data) from database objects

transaction management (optional)

Here’s a basic usage example that highlights some of the most common DBI capabilities – we establish a connection to a database, query it, fetch the result set, close the result set and disconnect when we’re done:

library(DBI)
conn <- dbConnect(
    drv = RMySQL::MySQL(),
    dbname = "shinydemo",
    host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
    username = "guest",
    password = "guest")
rs <- dbSendQuery(conn, "SELECT * FROM City LIMIT 5;")
dbFetch(rs)
##   ID           Name CountryCode      District Population
## 1  1          Kabul         AFG         Kabol    1780000
## 2  2       Qandahar         AFG      Qandahar     237500
## 3  3          Herat         AFG         Herat     186800
## 4  4 Mazar-e-Sharif         AFG         Balkh     127800
## 5  5      Amsterdam         NLD Noord-Holland     731200
dbClearResult(rs)
dbDisconnect(conn)

`dbGetQuery`

In the example above, we use DBI to query the database and retrieve the first 5 rows from the City table. As you can see, there is some boilerplate required by DBI: every connection you create using dbConnect() must at some point be destroyed using dbDisconnect() (or you’ll get a leaked connection, which will slow everything down unnecessarily). If you use dbSendQuery(), this will return a result set, here named rs, that stays open (so that you can gradually fetch data using dbFetch()) until you close it using dbClearResult(rs). In this particular case, since you’re fetching the data all at once (5 rows of data fit comfortably in R memory), you could actually achieve the same result in an easier way:

library(DBI)
conn <- dbConnect(
    drv = RMySQL::MySQL(),
    dbname = "shinydemo",
    host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
    username = "guest",
    password = "guest")
dbGetQuery(conn, "SELECT * FROM City LIMIT 5;")
##   ID           Name CountryCode      District Population
## 1  1          Kabul         AFG         Kabol    1780000
## 2  2       Qandahar         AFG      Qandahar     237500
## 3  3          Herat         AFG         Herat     186800
## 4  4 Mazar-e-Sharif         AFG         Balkh     127800
## 5  5      Amsterdam         NLD Noord-Holland     731200
dbDisconnect(conn)

The dbGetQuery() function basically calls dbSendQuery(), dbFetch() and dbClearResult() for you, so it makes your code a lot simpler. However, your query may be a lot bigger than 5 rows, so this is not always feasible. You may also have noticed that it can take a bit of time for the connection to be established (the second line of code). If you’re unfamiliar with DBI, it may be a good idea to explore some its additional functionality. That notwithstanding, just knowing the commands above (along with how to construct SQL queries) is enough to get you using DBI.

Here’s a very simple Shiny app that does exactly what is demoed above, but allows the user to specify how many rows they want to fetch:

library(shiny)
library(DBI)

ui <- fluidPage(
  numericInput("nrows", "Enter the number of rows to display:", 5),
  tableOutput("tbl")
)

server <- function(input, output, session) {
  output$tbl <- renderTable({
    conn <- dbConnect(
      drv = RMySQL::MySQL(),
      dbname = "shinydemo",
      host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
      username = "guest",
      password = "guest")
    on.exit(dbDisconnect(conn), add = TRUE)
    dbGetQuery(conn, paste0(
      "SELECT * FROM City LIMIT ", input$nrows, ";"))
  })
}

shinyApp(ui, server)

VERY IMPORTANT NOTE: The app above has a major security vulnerability because it allows app users to enter data that is directly used in the SQL query. For this particular case, this may not matter because guests have no write access to the database. But often this is not the case. And using the construct above, you’ve just opened yourself up to nasty vulnerability of SQL injections. Thankfully DBI just added support for SQL injection prevention, so read on to the next article!