Fast Hash-Based Grouping
group.Rd
group()
scans the rows of a data frame (or atomic vector / list of atomic vectors), assigning to each unique row an integer id - starting with 1 and proceeding in first-appearance order of the rows. The function is written in C and optimized for R's data structures. It is the workhorse behind functions like GRP
/ fgroup_by
, collap
, qF
, qG
, finteraction
and funique
, when called with argument sort = FALSE
.
Arguments
- x
an atomic vector or data frame / list of equal-length atomic vectors.
- starts
logical. If
TRUE
, an additional attribute"starts"
is attached giving a vector of group starts (= index of first-occurrence of unique rows).- group.sizes
logical. If
TRUE
, an additional attribute"group.sizes"
is attached giving the size of each group.
Details
A data frame is grouped on a column-by-column basis, starting from the leftmost column. For each new column the grouping vector obtained after the previous column is also fed back into the hash function so that unique values are determined on a running basis. The algorithm terminates as soon as the number of unique rows reaches the size of the data frame. Missing values are also grouped just like any other values. Invoking arguments starts
and/or group.sizes
requires an additional pass through the final grouping vector.
Value
An object is of class 'qG' see qG
.
Author
The Hash Function and inspiration was taken from the excellent kit package by Morgan Jacob, the algorithm was developed by Sebastian Krantz.
Examples
# Let's replicate what funique does
g <- group(wlddev, starts = TRUE)
if(attr(g, "N.groups") == fnrow(wlddev)) wlddev else
ss(wlddev, attr(g, "starts"))
#> country iso3c date year decade region income OECD PCGDP
#> 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA
#> 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA
#> 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA
#> 4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA
#> 5 Afghanistan AFG 1965-01-01 1964 1960 South Asia Low income FALSE NA
#> LIFEEX GINI ODA POP
#> 1 32.446 NA 116769997 8996973
#> 2 32.962 NA 232080002 9169410
#> 3 33.471 NA 112839996 9351441
#> 4 33.971 NA 237720001 9543205
#> 5 34.463 NA 295920013 9744781
#> [ reached 'max' / getOption("max.print") -- omitted 13171 rows ]