Fast Table Joins
join.Rd
Join two data frame like objects x
and y
on
columns. Inspired by polars and by default uses a vectorized hash join algorithm (workhorse function fmatch
).
Usage
join(x, y,
on = NULL,
how = "left",
suffix = NULL,
validate = "m:m",
multiple = FALSE,
sort = FALSE,
keep.col.order = TRUE,
drop.dup.cols = FALSE,
verbose = .op[["verbose"]],
column = NULL,
attr = NULL,
...
)
Arguments
- x
a data frame-like object. The result will inherit the attributes of this object.
- y
a data frame-like object to join with
x
.- on
character. vector of columns to join on.
NULL
usesunion(names(x), names(y))
. Use a named vector to match columns named differently inx
andy
, e.g.c("x_id" = "y_id")
.- how
character. Join type:
"left"
,"right"
,"inner"
,"full"
,"semi"
or"anti"
. The first letter suffices.- suffix
character(1 or 2). Suffix to add to duplicate column names.
NULL
renames duplicatey
columns aspaste(col, y_name, sep = "_")
, wherey_name = as.character(substitute(y))
i.e. the name of the data frame as passed into the function. In general, passingsuffix
length 1 will only renamey
, whereas a length 2 suffix will rename bothx
andy
, respectively. Ifverbose > 0
a message will be printed.- validate
character. (Optional) check if join is of specified type. One of
"1:1"
,"1:m"
,"m:1"
or"m:m"
. The default"m:m"
does not perform any checks. Checks are done before the actual join step and failure results in an error. Note that this argument does not affect the result, it only triggers a check.- multiple
logical. Handling of rows in
x
with multiple matches iny
. The defaultFALSE
takes the first match iny
.TRUE
returns every match iny
, increasing the size of the joined table.- sort
logical.
TRUE
implements a sort-merge-join: a completely separate join algorithm that sorts both datasets on the join columns usingradixorder
and then matches the rows without hashing. Note that in this case the result will be sorted by the join columns, whereassort = FALSE
preserves the order of rows inx
.- keep.col.order
logical. Keep order of columns in
x
?FALSE
places theon
columns in front.- drop.dup.cols
instead of renaming duplicate columns in
x
andy
usingsuffix
, this option simply drops them:TRUE
or"y"
drops them fromy
,"x"
fromx
.- verbose
integer. Prints information about the join. One of 0 (off), 1 (default) or 2 (additionally prints the classes of the
on
columns). Note:verbose > 0
orvalidate != "m:m"
invoke thecount
argument tofmatch
, soverbose = 0
is slightly more efficient.- column
(optional) name for an extra column to generate in the output indicating which dataset a record came from.
TRUE
calls this column".join"
(inspired by STATA's '_merge' column). By default this column is generated as the last column, but, ifkeep.col.order = FALSE
, it is placed after the 'on' columns. The column is a factor variable with levels corresponding to the dataset names (inferred from the input) or"matched"
for matched records. Alternatively, it is possible to specify a list of 2, where the first element is the column name, and the second a length 3 (!) vector of levels e.g.column = list("joined", c("x", "y", "x_y"))
, where"x_y"
replaces"matched"
. The column has an additional attribute"on.cols"
giving the join columns corresponding to the factor levels. See Examples.- attr
(optional) name for attribute providing information about the join performed (including the output of
fmatch
) to the result.TRUE
calls this attribute"join.match"
.- ...
further arguments to
fmatch
(ifsort = FALSE
). Notably,overid
can bet set to 0 or 2 (default 1) to control the matching process if the join condition more than identifies the records.
Value
A data frame-like object of the same type and attributes as x
. "row.names"
of x
are only preserved in left-join operations.
Examples
df1 <- data.frame(
id1 = c(1, 1, 2, 3),
id2 = c("a", "b", "b", "c"),
name = c("John", "Jane", "Bob", "Carl"),
age = c(35, 28, 42, 50)
)
df2 <- data.frame(
id1 = c(1, 2, 3, 3),
id2 = c("a", "b", "c", "e"),
salary = c(60000, 55000, 70000, 80000),
dept = c("IT", "Marketing", "Sales", "IT")
)
# Different types of joins
for(i in c("l","i","r","f","s","a"))
join(df1, df2, how = i) |> print()
#> left join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
#> id1 id2 name age salary dept
#> 1 1 a John 35 60000 IT
#> 2 1 b Jane 28 NA <NA>
#> 3 2 b Bob 42 55000 Marketing
#> 4 3 c Carl 50 70000 Sales
#> inner join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
#> id1 id2 name age salary dept
#> 1 1 a John 35 60000 IT
#> 2 2 b Bob 42 55000 Marketing
#> 3 3 c Carl 50 70000 Sales
#> right join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
#> id1 id2 name age salary dept
#> 1 1 a John 35 60000 IT
#> 2 2 b Bob 42 55000 Marketing
#> 3 3 c Carl 50 70000 Sales
#> 4 3 e <NA> NA 80000 IT
#> full join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
#> id1 id2 name age salary dept
#> 1 1 a John 35 60000 IT
#> 2 1 b Jane 28 NA <NA>
#> 3 2 b Bob 42 55000 Marketing
#> 4 3 c Carl 50 70000 Sales
#> 5 3 e <NA> NA 80000 IT
#> semi join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
#> id1 id2 name age
#> 1 1 a John 35
#> 2 2 b Bob 42
#> 3 3 c Carl 50
#> anti join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
#> id1 id2 name age
#> 1 1 b Jane 28
# Adding join column: useful esp. for full join
join(df1, df2, how = "f", column = TRUE)
#> full join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
#> id1 id2 name age salary dept .join
#> 1 1 a John 35 60000 IT matched
#> 2 1 b Jane 28 NA <NA> df1
#> 3 2 b Bob 42 55000 Marketing matched
#> 4 3 c Carl 50 70000 Sales matched
#> 5 3 e <NA> NA 80000 IT df2
# Custom column + rearranging
join(df1, df2, how = "f", column = list("join", c("x", "y", "x_y")), keep = FALSE)
#> full join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
#> id1 id2 join name age salary dept
#> 1 1 a x_y John 35 60000 IT
#> 2 1 b x Jane 28 NA <NA>
#> 3 2 b x_y Bob 42 55000 Marketing
#> 4 3 c x_y Carl 50 70000 Sales
#> 5 3 e y <NA> NA 80000 IT
# Attaching match attribute
str(join(df1, df2, attr = TRUE))
#> left join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
#> 'data.frame': 4 obs. of 6 variables:
#> $ id1 : num 1 1 2 3
#> $ id2 : chr "a" "b" "b" "c"
#> $ name : chr "John" "Jane" "Bob" "Carl"
#> $ age : num 35 28 42 50
#> $ salary: num 60000 NA 55000 70000
#> $ dept : chr "IT" NA "Marketing" "Sales"
#> - attr(*, "join.match")=List of 3
#> ..$ call : language join(x = df1, y = df2, attr = TRUE)
#> ..$ on.cols:List of 2
#> .. ..$ x: chr [1:2] "id1" "id2"
#> .. ..$ y: chr [1:2] "id1" "id2"
#> ..$ match : 'qG' int [1:4] 1 NA 2 3
#> .. ..- attr(*, "N.nomatch")= int 1
#> .. ..- attr(*, "N.groups")= int 4
#> .. ..- attr(*, "N.distinct")= int 3