Published on Data Blog

From frustration to JOYn: Introducing joyn for R, a tool for smarter (but not harder) data joins

Joining data is an inescapable and essential component of data work. Whether you are an economist or data analyst, a data engineer or data scientist, you regularly need to combine information from different data frames. However, more complex joins can result in computationally intensive operations where mistakes are difficult to detect. The {joyn} R package solves these problems by allowing efficient and flexible data joins as well as user-friendly checks and join validation features.

Stata users will find {joyn} particularly intuitive. Economists and data analysts accustomed to Stata's robust `merge` command often find joining data in R frustrating. While R's base `merge()` function offers basic joining capabilities, it lacks the intuitive interface and comprehensive features that Stata users appreciate. Existing R packages like {data.table}, {dplyr}, and {collapse} provide powerful alternatives, some even surpassing Stata with unique functionalities. However, a crucial gap remains: data join validation.

{joyn} fills this gap by bridging the two worlds of R and Stata and joining the strengths of each. With the recent release of {joyn} version 0.2.0 for R, users, regardless of their proficiency or enthusiasm for R, can now access and benefit from its new features. By offering intuitive join handling tools, validation, and informative reports, {joyn} ensures precise and well-informed results that enhance data joining in R while also remaining intuitive for Stata users to navigate.

 

The real beauty of JOYn

  • Flexibility in Join Types:
    {joyn} offers users the flexibility to select their preferred join type ("left", "right", "full", or "inner"). By default, {joyn} performs a full join to ensure inclusion of all observations.

  • Easy Variable Handling:
    {joyn} facilitates variable handling from both data frames, addressing issues such as duplicate variable names. Users can choose to update values automatically, retain both variables with unique suffixes, or selectively include specific variables from one data frame only.

  • Match Type Awareness:
    Users can specify multiple keys as well as the match type: one-to-one, one-to-many, many-to-one, or many-to-many. Moreover, {joyn} checks whether the specified match type is appropriate for the given keys and returns information to guide the choice of match type. In contrast to other R packages, which default to many-to-many matching, {joyn} performs a one-to-one join by default. This is the most restrictive match type, and it protects users from the unexpected results that a many-to-many join can produce.

  • Instant Feedback:
    {joyn} improves the join process with instant feedback, providing a summary table detailing the merge, a reporting variable tracking individual row statuses, and various types of messages. For example, depending on the options the user selects, the report variable (see the example output below) identifies each row's origin (the left data frame, the right data frame, or both) and highlights any updates made to the values of the columns from the left data frame by those from the right. Additionally, its messaging system is both preventive (e.g., flagging issues like unmatched observations or missing variables) and informative (e.g., time spent in execution).

  • Familiar Syntax:
    {joyn} also acts as a wrapper: it includes functions that mirror the syntax of base R, {data.table}, and {dplyr} while also incorporating the additional features that characterize {joyn}.
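The features listed above can be sketched with a small, self-contained example. The data frames `survey` and `lookup` are hypothetical and exist only for illustration. The arguments used (`keep`, `match_type`, `update_NAs`, `verbose`), the helper `is_id()`, and the {dplyr}-style `left_join()` wrapper with its `relationship` argument are written as we understand the {joyn} 0.2.0 interface; please check the package documentation before relying on them.

``` r
library(data.table)
library(joyn)

# Hypothetical toy data: several survey rows per country,
# and a lookup table with one row per country
survey <- data.table(country = c("A", "A", "B", "C"),
                     year    = c(2020L, 2021L, 2020L, 2020L),
                     income  = c(10, 12, NA, 7))

lookup <- data.table(country = c("A", "B", "D"),
                     region  = c("North", "South", "East"),
                     income  = c(11, 9, 5))

# Check whether `country` uniquely identifies the rows of each table
key_ok_lookup <- is_id(lookup, by = "country", verbose = FALSE)
key_ok_survey <- is_id(survey, by = "country", verbose = FALSE)

# Many-to-one left join: keep every survey row and add `region`;
# the origin of each row is recorded in the reporting variable `.joyn`
left <- joyn(survey, lookup,
             by         = "country",
             match_type = "m:1",
             keep       = "left",
             verbose    = FALSE)

# Same join, but missing values of `income` in `survey` are filled
# with the value of the same-named column in `lookup`
filled <- joyn(survey, lookup,
               by         = "country",
               match_type = "m:1",
               keep       = "left",
               update_NAs = TRUE,
               verbose    = FALSE)

# The same left join written with the {dplyr}-style wrapper
left_dplyr <- joyn::left_join(survey, lookup,
                              by           = "country",
                              relationship = "many-to-one",
                              verbose      = FALSE)
```

Declaring `match_type = "m:1"` explicitly is the point of the exercise: if `lookup` were to contain a duplicated country, the declared match type would no longer hold and the problem would surface at the join rather than later in the analysis.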

``` r 
library(data.table) 
library(joyn) 
#> Attaching package: 'joyn' 
#> The following object is masked from 'package:base': 
#>  
#>     merge 

x = data.table(id = c(1, 4, 2, 3, NA), 
               t  = c(1L, 2L, 1L, 2L, NA_integer_), 
               x  = c(16, 12, NA, NA, 15)) 
 
y = data.table(id = c(1, 2, 5, 6, 3), 
               yd = c(1, 2, 5, 6, 3), 
               y  = c(11L, 15L, 20L, 13L, 10L), 
               x  = c(16:20)) 

joyn(x, y, by = "id", match_type = "m:m", verbose = TRUE) 
#>  
#> ── JOYn Report ── 
#>  
#>     .joyn     n percent 
#>      
#> 1:      x     2   28.6% 
#> 2:  x & y     3   42.9% 
#> 3:      y     2   28.6% 
#> 4:  total     7    100% 
#> ────────────────────────────────────────────────────────── End of JOYn report ── 

#> ℹ ❯ Joyn's report available in variable .joyn 
#> ℹ ❯ Removing key variables id from id, yd, y, and x 
#> ⚠ Warning: The keys supplied uniquely identify both x and y therefore a 1:1 
#> join is executed. 
#> Key:  
#>       id     t     x    yd     y  .joyn 
#>          
#> 1:    NA    NA    15    NA    NA      x 
#> 2:     1     1    16     1    11  x & y 
#> 3:     2     1    NA     2    15  x & y 
#> 4:     3     2    NA     3    10  x & y 
#> 5:     4     2    12    NA    NA      x 
#> 6:     5    NA    NA     5    20      y 
#> 7:     6    NA    NA     6    13      y 
joyn_msg("timing") 

#> ● Timing: The full joyn is executed in 0.006649 seconds 
#> ● Timing: The entire joyn function, including checks, is executed in 0.047987 
#> seconds
```

An important caveat

While {joyn} does strive for efficiency, it does not prioritize speed above all else. Its comprehensive join checks and detailed reporting slow down performance slightly, but to offset this, {joyn} relies on the fastest joining alternatives available in the R community, namely {data.table} and {collapse}. As a result, {joyn} sets itself apart as a tool that allows users to integrate data frames confidently and effectively: the benefits of error prevention and valuable insights make {joyn} a reliable choice for joining tasks.
 

To get started

Take the first step towards leveraging {joyn} version 0.2.0 by installing it directly from CRAN. 

Use the command `install.packages("joyn")`, and then refer to its website (https://randrescastaneda.github.io/joyn/) for further information on its functionalities.
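As a first check after installation, a minimal sketch like the one below runs a join with the default settings described earlier (a full, one-to-one join). The data frames `x` and `y` are hypothetical, and the reporting variable is assumed to carry its default name, `.joyn`.

``` r
# install.packages("joyn")  # run once to install from CRAN
library(joyn)

# Hypothetical toy data frames sharing the key `id`
x <- data.frame(id = c(1, 2, 3), a = c("p", "q", "r"))
y <- data.frame(id = c(2, 3, 4), b = c(TRUE, FALSE, TRUE))

# Defaults: full join, one-to-one match, report printed to the console
res <- joyn(x, y, by = "id")

# Tabulate where each row of the result came from
table(res$.joyn)
```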



The authors gratefully acknowledge financial support from the UK Government through the Data and Evidence for Tackling Extreme Poverty (DEEP) Research Program.


Rossana Tatulli

Consultant, Development Data Group, World Bank

R. Andres Castaneda Aguilar

Economist, Development Data Group, World Bank

Zander Prinsloo

Consultant, Development Data Group, World Bank
