Dplyr Join Cheat Sheet



2020/05/04

Motivation

I use R to extract data held in Microsoft SQL Server databases on a daily basis.

When I first started, I was confused by all the different ways to accomplish this task. I was a bit overwhelmed trying to choose the “best” option for the specific job at hand.

I want to share what approaches I’ve landed on to help others who may want a simple list of options to get started with.

Scope

This post is about reading data from a database, not writing to one.

I prefer to use packages in the tidyverse so I’ll focus on those packages.

While it’s possible to generalize many of the concepts I write about here to other database systems, I will focus exclusively on Microsoft SQL Server. I hope this will provide simple, prescriptive guidance for those working in a similar configuration.

The data for these examples is stored in Microsoft SQL Server Express, which is available as a free download.

One last thing - these are a few options I populated my toolbox with. They have served me well over the past two years as an analyst in an enterprise environment, but are definitely not the only options available.

Setup

Connect to the server

I use the keyring package to keep my credentials out of my R code. You can use the great documentation available from RStudio to learn how to do the same.
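
Here is a minimal connection sketch, assuming the DBI, odbc, and keyring packages are installed and the password was stored beforehand with `keyring::key_set()`. The helper name `connect_sql_server()`, the driver name, the server address, the database, the user, and the keyring service name are all placeholders to adapt, not settings from a real environment.

```r
# Hypothetical helper: every default below is a placeholder to adapt.
connect_sql_server <- function(server = "localhost\\SQLEXPRESS",
                               database = "master",
                               user = "my_user",
                               service = "sql-server") {
  DBI::dbConnect(
    odbc::odbc(),
    Driver   = "SQL Server",  # assumed driver name; check odbc::odbcListDrivers()
    Server   = server,
    Database = database,
    UID      = user,
    # Look the password up in the OS credential store instead of hard-coding it
    PWD      = keyring::key_get(service, username = user)
  )
}

# Usage (needs a reachable SQL Server instance):
# con <- connect_sql_server()
```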

Write some sample data
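
A sketch of this step, assuming the sample data is the `flights` and `planes` tables from the nycflights13 package (an assumption based on the table names used in the examples) and that `con` is an open connection to SQL Server. The helper name `write_sample_data()` is hypothetical.

```r
# Hypothetical helper: `con` is an open DBI connection to SQL Server.
write_sample_data <- function(con) {
  # temporary = TRUE asks for temporary tables rather than permanent ones
  dplyr::copy_to(con, nycflights13::flights, name = "flights", temporary = TRUE)
  dplyr::copy_to(con, nycflights13::planes,  name = "planes",  temporary = TRUE)
  invisible(con)
}

# Usage:
# write_sample_data(con)
```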

Note that I set the temporary argument to TRUE so that the data is written to tempdb on SQL Server, which means it is deleted on disconnection.

This results in dplyr prefixing the table name with “##”.

SOURCE: https://db.rstudio.com/dplyr/#connecting-to-the-database

Option 1: Use dplyr syntax and let dbplyr handle the rest

When I use this option

This is my default option.

I do almost all of my analysis in R and this avoids fragmenting my work and thoughts across different tools.

Examples

Example 1: filter rows, and retrieve selected columns
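
A minimal sketch of this kind of query, assuming `con` is an open connection and the sample table is reachable as `##flights` (following the “##” prefix noted in the setup). The helper name, the filter values, and the chosen columns are illustrative only.

```r
library(dplyr)

# Hypothetical helper: `con` is an open connection holding the ##flights table.
flights_on_new_years_day <- function(con) {
  tbl(con, "##flights") %>%           # lazy reference, nothing is read yet
    filter(month == 1, day == 1) %>%  # translated to a WHERE clause
    select(carrier, flight, tailnum, origin, dest, dep_delay) %>%
    collect()                         # run the query and return a tibble
}
```

Until `collect()` is called, dbplyr only builds the SQL. Swapping `collect()` for `show_query()` prints the generated statement instead of running it.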

Example 2: join across tables and retrieve selected columns
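
One way this could look, assuming both sample tables are available as `##flights` and `##planes` and share a `tailnum` column. I picked a left join for illustration, and the helper name and column choices are hypothetical.

```r
library(dplyr)

# Hypothetical helper: `con` is an open connection holding both temp tables.
flights_with_planes <- function(con) {
  flights <- tbl(con, "##flights")
  planes  <- tbl(con, "##planes") %>%
    select(tailnum, manufacturer, model)   # keep only the columns needed from planes

  flights %>%
    left_join(planes, by = "tailnum") %>%  # the join runs on the server
    select(carrier, flight, tailnum, manufacturer, model) %>%
    collect()
}
```

Trimming `planes` before the join avoids the duplicate `year` column that both tables would otherwise contribute.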

Example 3: Summarize and count
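
A sketch of one possible summary, assuming the same `##flights` and `##planes` tables: it counts the flights, and the distinct `tailnum` values, that have no matching row in `planes`. The helper name and output column names are hypothetical.

```r
library(dplyr)

# Hypothetical helper: `con` is an open connection holding both temp tables.
unmatched_tailnums <- function(con) {
  flights <- tbl(con, "##flights")
  planes  <- tbl(con, "##planes")

  flights %>%
    anti_join(planes, by = "tailnum") %>%  # flights rows with no matching plane
    summarise(
      n_flights  = n(),                    # number of unmatched flights
      n_tailnums = n_distinct(tailnum)     # number of distinct unmatched tail numbers
    ) %>%
    collect()
}
```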

Quite a few tailnum values in flights are not present in planes. Interesting!

Option 2: Write SQL syntax and have dplyr and dbplyr run the query

When I use this option

I use this option when I am reusing a fairly short, existing SQL query with minor modifications.

Example 1: Simple selection of records using SQL syntax
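
A minimal sketch, assuming `con` is an open connection and the sample table is reachable as `##flights`. The query text and helper name are placeholders.

```r
library(dplyr)

# Hypothetical helper: `con` is an open connection holding the ##flights table.
select_with_sql <- function(con) {
  # sql() marks the string as raw SQL so it is not quoted as a table name
  query <- sql("SELECT carrier, flight, tailnum FROM ##flights WHERE month = 1")

  tbl(con, query) %>%
    collect()
}
```

Because dbplyr wraps the raw query in a subquery, it is safer to leave out a trailing semicolon and any ORDER BY clause, which SQL Server rejects inside a subquery unless TOP or OFFSET is also specified.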

Example 2: Use dplyr syntax to enhance a raw SQL query
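
A sketch of the idea, under the same assumptions about `con` and `##flights`: the raw SQL does the initial selection, and dplyr verbs layered on top are translated and run on the server as part of the same query. The query, helper name, and summary column are illustrative.

```r
library(dplyr)

# Hypothetical helper: `con` is an open connection holding the ##flights table.
delays_by_carrier <- function(con) {
  tbl(con, sql("SELECT carrier, dep_delay FROM ##flights WHERE month = 1")) %>%
    group_by(carrier) %>%                                          # added with dplyr
    summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%  # aggregated on the server
    arrange(desc(mean_dep_delay)) %>%
    collect()
}
```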

Option 3: Store the SQL query in a text file and have dplyr and dbplyr run the query

When I use this option

I use this approach under the following conditions:

  1. I’m reusing existing SQL code or collaborating with someone who will be writing new code in SQL
  2. The SQL code is longer than a line or two

I prefer to “modularize” my R code. Having an extremely long SQL statement in my R code doesn’t abstract away the complexity of the SQL query. Putting the query into its own file helps achieve my desired level of abstraction.

In conjunction with source control, it makes tracking changes to the definition of a data set simple.

More importantly, it’s a really useful way to collaborate with others who are comfortable with SQL but don’t use R. For example, I recently used this approach on a project involving aggregation of multiple data sets. Another team member focused on building out the data collection logic for some of the data sets in SQL. Once he had them built and validated, he handed off the query to me and I pasted it into a text file.

Step 1: Put your SQL code into a text file

Here is some example SQL code that might be in a file

Let’s say that SQL code was stored in a text file called flights.sql.

Step 2: Use the SQL code in the file to execute the query and retrieve the data.
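
The two steps can be sketched together as follows. The query text is a placeholder rather than real project code, the file is written to a temporary directory only so the example is self-contained, and `run_sql_file()` is a hypothetical helper that expects an open connection `con`.

```r
library(dplyr)

# Step 1: a stand-in for flights.sql (placeholder query, written to a temp dir)
sql_path <- file.path(tempdir(), "flights.sql")
writeLines(c(
  "SELECT carrier, flight, tailnum, dep_delay",
  "FROM ##flights",
  "WHERE month = 1"
), sql_path)

# Step 2: read the file into one string and hand it to dbplyr as raw SQL
# Hypothetical helper: `con` is an open DBI connection to SQL Server.
run_sql_file <- function(con, path) {
  query <- paste(readLines(path), collapse = "\n")
  tbl(con, sql(query)) %>%
    collect()
}

# Usage:
# flights_jan <- run_sql_file(con, sql_path)
```

As with any raw SQL passed to `tbl()`, the file should hold a single SELECT statement without a trailing semicolon, since the query is wrapped in a subquery before it runs.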