Published on Data Blog

From frustration to JOYn: Introducing joyn for R, a tool for smarter (but not harder) data joins

April 25, 2024

Joining data is an inescapable and essential component of data work. Whether you are an economist or data analyst, a data engineer or data scientist, you regularly need to combine information from different data frames. However, complex joins can be computationally intensive, and mistakes in them are difficult to detect. The {joyn} R package addresses both problems by offering efficient, flexible joins together with user-friendly checks and join validation features.

Stata users will find {joyn} particularly intuitive. Economists and data analysts accustomed to Stata's `merge` command often find joining data in R frustrating. While base R's `merge()` function offers basic joining capabilities, it lacks the intuitiveness and comprehensive features that Stata users appreciate in `merge`. Existing R packages like {data.table}, {dplyr}, and {collapse} provide powerful alternatives, some even surpassing Stata with unique functionalities. However, a crucial gap remains: data join validation.

{joyn} fills this gap by bridging the two worlds of R and Stata and joining the strengths of each. With the recent release of {joyn} version 0.2.0 for R, users can benefit from its new features regardless of their proficiency in, or enthusiasm for, R. By offering intuitive join handling tools, validation, and informative reports, {joyn} helps produce precise and well-informed results, improving data joins in R while remaining easy for Stata users to navigate.

The real beauty of JOYn

Flexibility in Join Types:
{joyn} offers users the flexibility to select their preferred join type ("left", "right", "full", or "inner"). By default, {joyn} performs a full join to ensure inclusion of all observations.

Easy Variable Handling:
{joyn} facilitates variable handling from both data frames, addressing issues such as duplicate variable names. Users can choose to update values automatically, retain both variables with unique suffixes, or selectively include specific variables from one data frame only.
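
The two options above can be sketched with two small made-up tables. This is only an illustration: it assumes the join type is chosen through the `keep` argument of `joyn()` and that a shared non-key column is handled through `update_NAs`, `keep_common_vars`, and `y_vars_to_keep`, as described in the package documentation. Check `?joyn` for the version you have installed.

```r
library(data.table)
library(joyn)

# Made-up example data: `score` appears in both tables
left  <- data.table(id = c(1, 2, 3), score = c(10, NA, 30))
right <- data.table(id = c(2, 3, 4), score = c(22, 33, 44),
                    grp = c("a", "b", "c"))

# Join type: a full join by default, narrowed with `keep`
full_res  <- joyn(left, right, by = "id", verbose = FALSE)
inner_res <- joyn(left, right, by = "id", keep = "inner", verbose = FALSE)
left_res  <- joyn(left, right, by = "id", keep = "left",  verbose = FALSE)

# Variable handling for the shared column `score`
# (a) fill missing values in `left` with values from `right`
filled <- joyn(left, right, by = "id", update_NAs = TRUE, verbose = FALSE)
# (b) keep both versions of the column, distinguished by suffixes
both   <- joyn(left, right, by = "id", keep_common_vars = TRUE, verbose = FALSE)
# (c) bring only `grp` from `right`
only_g <- joyn(left, right, by = "id", y_vars_to_keep = "grp", verbose = FALSE)
```

Setting `verbose = FALSE` only silences the messages; the report variable is still added to each result.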

Match Type Awareness:
Users can specify multiple keys as well as the match type: one-to-one, one-to-many, many-to-one, or many-to-many. {joyn} checks whether the specified match type is consistent with the given keys and reports information that helps users choose the right one. In contrast to other R packages, {joyn} performs a one-to-one join by default. This is the most restrictive match type, and it protects users from the unexpected results that can arise from the many-to-many matching other R packages use by default.
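
A minimal sketch of the match type check, using made-up tables in which `id` repeats on one side. It assumes that `joyn()` stops with an error when the keys do not satisfy the requested match type and that `is_id()` returns a logical; both reflect the documented behaviour as we understand it rather than a guarantee for every version.

```r
library(data.table)
library(joyn)

# Made-up data: `id` repeats in `visits` but is unique in `people`
visits <- data.table(id = c(1, 1, 2), visit = c("a", "b", "c"))
people <- data.table(id = c(1, 2), name = c("Ana", "Ben"))

# Check whether `id` uniquely identifies each table
visits_unique <- is_id(visits, by = "id", verbose = FALSE)
people_unique <- is_id(people, by = "id", verbose = FALSE)

# The default 1:1 match type is rejected because `id` repeats in `visits`;
# tryCatch() captures the error so the script keeps running
strict <- tryCatch(
  joyn(visits, people, by = "id", verbose = FALSE),
  error = function(e) e
)

# Declaring the relationship explicitly lets the join go through
m1 <- joyn(visits, people, by = "id", match_type = "m:1", verbose = FALSE)
```

Stating `match_type = "m:1"` documents the expected relationship in the code itself, so a later change in the data that breaks it is caught at join time.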

Instant Feedback:
{joyn} improves the join process with instant feedback: a summary table describing the merge, a reporting variable that tracks the status of each row, and several types of messages. For example, depending on the options selected, the report variable (see the example output below) identifies each row's origin (the left input data frame, the right one, or both) and flags any values in columns of the left data frame that were updated with values from the right. Its messaging system is both preventive (e.g., flagging issues like unmatched observations or missing variables) and informative (e.g., time spent in execution).
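
One way to put the report variable to work is to filter on it after the join. The sketch below uses made-up tables and assumes the default report column is named `.joyn` with the labels `x`, `y`, and `x & y`, and that `reportvar` renames it; the name `origin` is arbitrary.

```r
library(data.table)
library(joyn)

# Made-up example data
left  <- data.table(id = c(1, 2, 3), a = c("p", "q", "r"))
right <- data.table(id = c(2, 3, 4), b = c(10, 20, 30))

res <- joyn(left, right, by = "id", verbose = FALSE)

# Count rows by origin, then keep only rows found in both tables
origin_counts <- table(res$.joyn)
matched <- res[.joyn == "x & y"]

# Store the report under a different column name
res2 <- joyn(left, right, by = "id", reportvar = "origin", verbose = FALSE)
```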

Familiar Syntax:
{joyn} also acts as a wrapper: it includes functions that mirror the interface of base R, {data.table}, and {dplyr} while adding the features that characterize {joyn}.
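
As a hedged sketch of these wrappers, the calls below use a {dplyr}-style verb and a {data.table}-style `merge()` through the `joyn::` prefix, so they cannot be confused with same-named functions from other packages. The tables are made up, and the arguments shown (`relationship`, `all.x`, `match_type`) follow the package documentation as we understand it; defaults may differ across versions.

```r
library(data.table)
library(joyn)

# Made-up example data
left  <- data.table(id = c(1, 2, 3), a = c("p", "q", "r"))
right <- data.table(id = c(2, 3, 4), b = c(10, 20, 30))

# dplyr-style verb, with the expected relationship stated explicitly
lj <- joyn::left_join(left, right, by = "id", relationship = "one-to-one")

# data.table/base-style merge; all.x = TRUE requests a left join
mg <- joyn::merge(left, right, by = "id", all.x = TRUE, match_type = "1:1")
```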

``` r 
library(data.table) 
library(joyn) 
#> Attaching package: 'joyn' 
#> The following object is masked from 'package:base': 
#>  
#>     merge 

x = data.table(id = c(1, 4, 2, 3, NA), 
               t  = c(1L, 2L, 1L, 2L, NA_integer_), 
               x  = c(16, 12, NA, NA, 15)) 
 
y = data.table(id = c(1, 2, 5, 6, 3), 
               yd = c(1, 2, 5, 6, 3), 
               y  = c(11L, 15L, 20L, 13L, 10L), 
               x  = c(16:20)) 

joyn(x, y, by = "id", match_type = "m:m", verbose = TRUE) 
#>  
#> ── JOYn Report ── 
#>  
#>     .joyn     n percent 
#>        
#> 1:      x     2   28.6% 
#> 2:  x & y     3   42.9% 
#> 3:      y     2   28.6% 
#> 4:  total     7    100% 
#> ────────────────────────────────────────────────────────── End of JOYn report ── 

#> ℹ ❯ Joyn's report available in variable .joyn 
#> ℹ ❯ Removing key variables id from id, yd, y, and x 
#> ⚠ Warning: The keys supplied uniquely identify both x and y therefore a 1:1 
#> join is executed. 
#> Key:  
#>       id     t     x    yd     y  .joyn 
#>          
#> 1:    NA    NA    15    NA    NA      x 
#> 2:     1     1    16     1    11  x & y 
#> 3:     2     1    NA     2    15  x & y 
#> 4:     3     2    NA     3    10  x & y 
#> 5:     4     2    12    NA    NA      x 
#> 6:     5    NA    NA     5    20      y 
#> 7:     6    NA    NA     6    13      y 

joyn_msg("timing") 
#> ● Timing: The full joyn is executed in 0.006649 seconds 
#> ● Timing: The entire joyn function, including checks, is executed in 0.047987 
#> seconds
```

An important caveat

While {joyn} does strive for efficiency, it does not prioritize speed above all else. Its comprehensive join checks and detailed reporting slow down performance slightly. To offset this, {joyn} uses the fastest joining alternatives available in the R community, namely {data.table} and {collapse}. As a result, {joyn} sets itself apart as a tool that allows users to integrate data frames confidently and effectively: the benefits of error prevention and valuable insights make {joyn} a reliable choice for joining tasks.

To get started

To start using {joyn} version 0.2.0, install it directly from CRAN.

Run `install.packages("joyn")`, then see the package website (https://randrescastaneda.github.io/joyn/) for more information on its functionality.

The authors gratefully acknowledge financial support from the UK Government through the Data and Evidence for Tackling Extreme Poverty (DEEP) Research Program.

Authors: Rossana Tatulli, R. Andres Castaneda Aguilar, Zander Prinsloo
