fscale is a generic function to efficiently standardize (scale and center) data. STD is a wrapper around fscale representing the 'standardization operator', with more options than fscale when applied to matrices and data frames. Standardization can be simple or groupwise, ordinary or weighted. Arbitrary target means and standard deviations can be set, with special options for grouped scaling and centering. It is also possible to scale data without centering i.e. perform mean-preserving scaling.

Note: For centering without scaling see fwithin/W. For simple scaling that does not preserve the mean, use fsd(..., TRA = "/"). To sweep pre-computed means and scale factors out of data see TRA.

fscale(x, ...)
STD(x, ...)

# S3 method for default
fscale(x, g = NULL, w = NULL, na.rm = TRUE, mean = 0, sd = 1, ...)
# S3 method for default
STD(x, g = NULL, w = NULL, na.rm = TRUE, mean = 0, sd = 1, ...)

# S3 method for matrix
fscale(x, g = NULL, w = NULL, na.rm = TRUE, mean = 0, sd = 1, ...)
# S3 method for matrix
STD(x, g = NULL, w = NULL, na.rm = TRUE, mean = 0, sd = 1,
    stub = "STD.", ...)

# S3 method for data.frame
fscale(x, g = NULL, w = NULL, na.rm = TRUE, mean = 0, sd = 1, ...)
# S3 method for data.frame
STD(x, by = NULL, w = NULL, cols = is.numeric, na.rm = TRUE,
    mean = 0, sd = 1, stub = "STD.", keep.by = TRUE, keep.w = TRUE, ...)

# Methods for compatibility with plm:

# S3 method for pseries
fscale(x, effect = 1L, w = NULL, na.rm = TRUE, mean = 0, sd = 1, ...)
# S3 method for pseries
STD(x, effect = 1L, w = NULL, na.rm = TRUE, mean = 0, sd = 1, ...)

# S3 method for pdata.frame
fscale(x, effect = 1L, w = NULL, na.rm = TRUE, mean = 0, sd = 1, ...)
# S3 method for pdata.frame
STD(x, effect = 1L, w = NULL, cols = is.numeric, na.rm = TRUE,
    mean = 0, sd = 1, stub = "STD.", keep.ids = TRUE, keep.w = TRUE, ...)

# Methods for grouped data frame / compatibility with dplyr:

# S3 method for grouped_df
fscale(x, w = NULL, na.rm = TRUE, mean = 0, sd = 1,
       keep.group_vars = TRUE, keep.w = TRUE, ...)
# S3 method for grouped_df
STD(x, w = NULL, na.rm = TRUE, mean = 0, sd = 1,
    stub = "STD.", keep.group_vars = TRUE, keep.w = TRUE, ...)

Arguments

x

a numeric vector, matrix, data frame, panel series (plm::pseries), panel data frame (plm::pdata.frame) or grouped data frame (class 'grouped_df').

g

a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.

by

STD data.frame method: Same as g, but also allows one- or two-sided formulas i.e. ~ group1 or var1 + var2 ~ group1 + group2. See Examples.

cols

data.frame method: Select columns to scale using a function, column names, indices or a logical vector. Default: All numeric variables. Note: cols is ignored if a two-sided formula is passed to by.

w

a numeric vector of (non-negative) weights. STD data frame and pdata.frame methods also allow a one-sided formula i.e. ~ weightcol. The grouped_df (dplyr) method supports lazy-evaluation. See Examples.

na.rm

logical. Skip missing values in x or w when computing means and sd's.

effect

plm methods: Select which panel identifier should be used as group-id. 1L takes the first variable in the plm::index, 2L the second, etc. Index variables can also be called by name using a character string. More than one variable can be supplied.

stub

a prefix or stub to rename all transformed columns. FALSE will not rename columns.

mean

the mean to center on (default is 0). If mean = FALSE, no centering will be performed, so the scaling is mean-preserving. A numeric value different from 0 (e.g. mean = 5) will be added to the data after subtracting out the mean(s), such that the data will have a mean of 5. A special option when performing grouped scaling and centering is mean = "overall.mean". In that case the overall mean of the data will be added after subtracting out group means.

sd

the standard deviation to scale the data to (default is 1). Any other positive numeric value (e.g. sd = 3) will scale the data to have that standard deviation (here 3). A special option when performing grouped scaling is sd = "within.sd". In that case the within standard deviation (= the standard deviation of the group-centered series) will be calculated and applied to each group. The result is that the variance of the data within each group is harmonized without forcing a certain variance (such as 1).

keep.by, keep.ids, keep.group_vars

data.frame, pdata.frame and grouped_df methods: Logical. Retain grouping / panel-identifier columns in the output. For STD.data.frame this only works if grouping variables were passed in a formula.

keep.w

data.frame, pdata.frame and grouped_df methods: Logical. Retain column containing the weights in the output. Only works if w is passed as formula / lazy-expression.

...

arguments to be passed to or from other methods.

Details

If g = NULL, fscale by default (column-wise) subtracts the mean or weighted mean (if w is supplied) from all data points in x, and then divides this difference by the standard deviation or frequency-weighted standard deviation (if w is supplied). The result is that all columns in x will have mean 0 and standard deviation 1. Alternatively, data can be scaled to have a mean of mean and a standard deviation of sd. If mean = FALSE the data is only scaled (not centered) such that the mean of the data is preserved.
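The ungrouped, unweighted arithmetic described above can be sketched in base R. The helper standardize is hypothetical (it is not part of the package) and says nothing about how fscale is implemented internally; it only mirrors the documented meaning of mean and sd, including mean = FALSE.

```r
# Hypothetical helper for illustration only; not a collapse function.
standardize <- function(x, mean = 0, sd = 1, na.rm = TRUE) {
  m <- base::mean(x, na.rm = na.rm)       # sample mean
  s <- stats::sd(x, na.rm = na.rm)        # sample standard deviation
  z <- (x - m) / s * sd                   # center, then scale to the target sd
  if (isFALSE(mean)) z + m else z + mean  # mean = FALSE restores the original mean
}

x  <- c(2, 4, 4, 4, 5, 5, 7, 9)
z0 <- standardize(x)                      # mean 0, sd 1
z1 <- standardize(x, mean = 5, sd = 3)    # mean 5, sd 3
z2 <- standardize(x, mean = FALSE)        # sd 1, mean of x preserved
```

With mean = FALSE the centered and scaled values are shifted back by the original mean, which is why the Mean column is unchanged in the qsu(STD(mtcars, mean = FALSE)) example further down.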

Means and standard deviations are computed using Welford's numerically stable online algorithm.
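For readers unfamiliar with it, here is a plain R sketch of Welford's one-pass update for the unweighted case. The function name welford is made up for this illustration, and the loop is deliberately simple; it is not the package's own code and skips missing values in the spirit of na.rm = TRUE.

```r
# Illustrative one-pass mean / sd (Welford's update); hypothetical helper.
welford <- function(x) {
  n <- 0; m <- 0; M2 <- 0
  for (xi in x) {
    if (is.na(xi)) next        # skip missing values
    n  <- n + 1
    d  <- xi - m               # deviation from the old running mean
    m  <- m + d / n            # update the running mean
    M2 <- M2 + d * (xi - m)    # accumulate the sum of squared deviations
  }
  c(n = n, mean = m, sd = sqrt(M2 / (n - 1)))
}

x   <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)
res <- welford(x)
```

The running update avoids forming the difference of two large sums, which is what makes the naive sum(x^2) - n * mean^2 formula numerically fragile.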

With groups supplied to g, this standardizing becomes groupwise, so that in each group (in each column) the data points will have mean mean and standard deviation sd. Naturally if mean = FALSE then each group is just scaled and the mean is preserved. For centering without scaling see fwithin.
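A small base R sketch of what groupwise standardization means, using ave() on made-up data. This is only an illustration of the result described above, not the package's implementation, and it ignores weights.

```r
# Toy data: two groups with very different levels and spreads.
x <- c(1, 2, 3, 10, 20, 30)
g <- c("a", "a", "a", "b", "b", "b")

# Standardize within each group: every group ends up with mean 0 and sd 1.
z <- ave(x, g, FUN = function(v) (v - mean(v)) / sd(v))

# Groupwise mean-preserving scaling (the analogue of mean = FALSE):
# each group keeps its own mean but gets sd 1.
z_keep <- ave(x, g, FUN = function(v) (v - mean(v)) / sd(v) + mean(v))
```

Here z is -1, 0, 1 in both groups, while z_keep is 1, 2, 3 in group "a" and 19, 20, 21 in group "b".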

If na.rm = FALSE and an NA or NaN is encountered, the mean and sd for that group will be NA, and all data points belonging to that group will also be NA in the output.
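The propagation of missing values can be imitated in base R on made-up data. The sketch below only reproduces the documented behaviour with ave(); it does not call the package.

```r
x <- c(1, NA, 3, 10, 20, 30)
g <- c("a", "a", "a", "b", "b", "b")

# Analogue of na.rm = FALSE: the NA poisons the statistics of group "a",
# so every value in that group becomes NA; group "b" is unaffected.
z_keepna <- ave(x, g, FUN = function(v) (v - mean(v)) / sd(v))

# Analogue of na.rm = TRUE: statistics use the available values only,
# and just the missing observation itself stays NA.
z_rm <- ave(x, g, FUN = function(v)
  (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE))
```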

If na.rm = TRUE, means and sd's are computed (column-wise) on the available data points, and the weight vector can also have missing values. In that case, the weighted mean and sd are computed on (column-wise) complete.cases(x, w), and x is scaled using these statistics. Note that fscale will not insert a missing value in x if the weight for that value is missing; rather, that value will be scaled using a weighted mean and standard deviation computed without it. (The intention here is that a few (randomly) missing weights shouldn't break the computation when na.rm = TRUE, but it is not meant for weight vectors with many missing values. If you don't like this behavior, you should prepare your data using x[is.na(w), ] <- NA, or impute your weight vector for non-missing x.)
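The treatment of missing weights can be made concrete with a base R sketch on made-up data. The frequency-weighted variance formula used here, sum(w * (x - m)^2) / (sum(w) - 1), is an assumption about what "frequency-weighted" means in this context; the sketch is not the package's code.

```r
x <- c(1, 2, 3, 4, 5)
w <- c(1, 2, NA, 2, 1)          # the third weight is missing

ok  <- complete.cases(x, w)     # observations entering the statistics
wm  <- sum(w[ok] * x[ok]) / sum(w[ok])                         # weighted mean
wsd <- sqrt(sum(w[ok] * (x[ok] - wm)^2) / (sum(w[ok]) - 1))    # assumed frequency-weighted sd

# All of x is scaled, including the value whose weight is missing.
z <- (x - wm) / wsd

# To get NA for that observation instead, blank it out beforehand.
x_na <- x
x_na[is.na(w)] <- NA
z_na <- (x_na - wm) / wsd
```

The third element of z is a regular number even though its weight is NA, whereas z_na carries an NA in that position.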

Special options for grouped scaling are mean = "overall.mean" and sd = "within.sd". The former group-centers vectors on the overall mean of the data (see fwithin for more details) and the latter scales the data in each group to have the within-group standard deviation (= the standard deviation of the group-centered data). Thus scaling a grouped vector with options mean = "overall.mean" and sd = "within.sd" amounts to removing all differences in the mean and standard deviations between these groups. In weighted computations, mean = "overall.mean" will subtract weighted group means from the data and add the overall weighted mean of the data, whereas sd = "within.sd" will compute the weighted within-group standard deviation and apply it to each group.
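A base R sketch of the two special options for the unweighted case, on made-up data. It follows the definitions given above (overall mean of the data, standard deviation of the group-centered series) and is an illustration only, not the package's implementation.

```r
x <- c(1, 2, 3, 10, 20, 30)
g <- c("a", "a", "a", "b", "b", "b")

overall_mean <- mean(x)
within_sd    <- sd(x - ave(x, g))   # sd of the group-centered series

# Standardize within groups, then rescale to the within sd
# and re-center every group on the overall mean.
z <- ave(x, g, FUN = function(v) (v - mean(v)) / sd(v)) * within_sd + overall_mean

tapply(z, g, mean)   # every group now has the overall mean
tapply(z, g, sd)     # every group now has the within sd
```

After this transformation the groups no longer differ in their first two moments, which matches the "data harmonized in the first 2 moments" comment in the Examples.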

Value

x standardized (mean = mean, standard deviation = sd), grouped by g/by, weighted with w. See Details.

See also

Examples

## Simple Scaling & Centering / Standardizing
head(fscale(mtcars))               # Doesn't rename columns
#>                          mpg        cyl        disp         hp       drat
#> Mazda RX4          0.1508848 -0.1049878 -0.57061982 -0.5350928  0.5675137
#> Mazda RX4 Wag      0.1508848 -0.1049878 -0.57061982 -0.5350928  0.5675137
#> Datsun 710         0.4495434 -1.2248578 -0.99018209 -0.7830405  0.4739996
#> Hornet 4 Drive     0.2172534 -0.1049878  0.22009369 -0.5350928 -0.9661175
#> Hornet Sportabout -0.2307345  1.0148821  1.04308123  0.4129422 -0.8351978
#> Valiant           -0.3302874 -0.1049878 -0.04616698 -0.6080186 -1.5646078
#>                             wt       qsec         vs         am       gear
#> Mazda RX4         -0.610399567 -0.7771651 -0.8680278  1.1899014  0.4235542
#> Mazda RX4 Wag     -0.349785269 -0.4637808 -0.8680278  1.1899014  0.4235542
#> Datsun 710        -0.917004624  0.4260068  1.1160357  1.1899014  0.4235542
#> Hornet 4 Drive    -0.002299538  0.8904872  1.1160357 -0.8141431 -0.9318192
#> Hornet Sportabout  0.227654255 -0.4637808 -0.8680278 -0.8141431 -0.9318192
#> Valiant            0.248094592  1.3269868  1.1160357 -0.8141431 -0.9318192
#>                         carb
#> Mazda RX4          0.7352031
#> Mazda RX4 Wag      0.7352031
#> Datsun 710        -1.1221521
#> Hornet 4 Drive    -1.1221521
#> Hornet Sportabout -0.5030337
#> Valiant           -1.1221521
head(STD(mtcars)) # By default adds a prefix
#>                      STD.mpg    STD.cyl    STD.disp     STD.hp   STD.drat
#> Mazda RX4          0.1508848 -0.1049878 -0.57061982 -0.5350928  0.5675137
#> Mazda RX4 Wag      0.1508848 -0.1049878 -0.57061982 -0.5350928  0.5675137
#> Datsun 710         0.4495434 -1.2248578 -0.99018209 -0.7830405  0.4739996
#> Hornet 4 Drive     0.2172534 -0.1049878  0.22009369 -0.5350928 -0.9661175
#> Hornet Sportabout -0.2307345  1.0148821  1.04308123  0.4129422 -0.8351978
#> Valiant           -0.3302874 -0.1049878 -0.04616698 -0.6080186 -1.5646078
#>                         STD.wt   STD.qsec     STD.vs     STD.am   STD.gear
#> Mazda RX4         -0.610399567 -0.7771651 -0.8680278  1.1899014  0.4235542
#> Mazda RX4 Wag     -0.349785269 -0.4637808 -0.8680278  1.1899014  0.4235542
#> Datsun 710        -0.917004624  0.4260068  1.1160357  1.1899014  0.4235542
#> Hornet 4 Drive    -0.002299538  0.8904872  1.1160357 -0.8141431 -0.9318192
#> Hornet Sportabout  0.227654255 -0.4637808 -0.8680278 -0.8141431 -0.9318192
#> Valiant            0.248094592  1.3269868  1.1160357 -0.8141431 -0.9318192
#>                     STD.carb
#> Mazda RX4          0.7352031
#> Mazda RX4 Wag      0.7352031
#> Datsun 710        -1.1221521
#> Hornet 4 Drive    -1.1221521
#> Hornet Sportabout -0.5030337
#> Valiant           -1.1221521
qsu(STD(mtcars))                   # See that it works
#>            N  Mean  SD      Min     Max
#> STD.mpg   32    -0   1  -1.6079  2.2913
#> STD.cyl   32     0   1  -1.2249  1.0149
#> STD.disp  32    -0   1  -1.2879  1.9468
#> STD.hp    32     0   1   -1.381  2.7466
#> STD.drat  32    -0   1  -1.5646  2.4939
#> STD.wt    32    -0   1  -1.7418  2.2553
#> STD.qsec  32     0   1   -1.874  2.8268
#> STD.vs    32     0   1   -0.868   1.116
#> STD.am    32    -0   1  -0.8141  1.1899
#> STD.gear  32     0   1  -0.9318  1.7789
#> STD.carb  32    -0   1  -1.1222  3.2117
qsu(STD(mtcars, mean = 5, sd = 3)) # Assigning a mean of 5 and a standard deviation of 3
#>            N  Mean  SD      Min      Max
#> STD.mpg   32     5   3   0.1764  11.8738
#> STD.cyl   32     5   3   1.3254   8.0446
#> STD.disp  32     5   3   1.1363  10.8403
#> STD.hp    32     5   3   0.8569  13.2397
#> STD.drat  32     5   3   0.3062  12.4817
#> STD.wt    32     5   3  -0.2253   11.766
#> STD.qsec  32     5   3   -0.622  13.4803
#> STD.vs    32     5   3   2.3959   8.3481
#> STD.am    32     5   3   2.5576   8.5697
#> STD.gear  32     5   3   2.2045  10.3368
#> STD.carb  32     5   3   1.6335   14.635
qsu(STD(mtcars, mean = FALSE)) # No centering: Scaling is mean-preserving
#>            N      Mean  SD       Min       Max
#> STD.mpg   32   20.0906   1   18.4827   22.3819
#> STD.cyl   32    6.1875   1    4.9626    7.2024
#> STD.disp  32  230.7219   1   229.434  232.6686
#> STD.hp    32  146.6875   1  145.3065  149.4341
#> STD.drat  32    3.5966   1     2.032    6.0905
#> STD.wt    32    3.2172   1    1.4755    5.4726
#> STD.qsec  32   17.8487   1   15.9747   20.6755
#> STD.vs    32    0.4375   1   -0.4305    1.5535
#> STD.am    32    0.4062   1   -0.4079    1.5962
#> STD.gear  32    3.6875   1    2.7557    5.4664
#> STD.carb  32    2.8125   1    1.6903    6.0242

## Panel Data
head(fscale(get_vars(wlddev,9:12), wlddev$iso3c))   # Standardizing 4 series within each country
#>   PCGDP    LIFEEX GINI        ODA
#> 1    NA -1.551093   NA -0.6132859
#> 2    NA -1.506322   NA -0.5543146
#> 3    NA -1.462247   NA -0.6130676
#> 4    NA -1.418570   NA -0.5527772
#> 5    NA -1.375191   NA -0.5200308
#> 6    NA -1.331912   NA -0.4864215
head(STD(wlddev, ~iso3c, cols = 9:12)) # Same thing using STD, id's added
#>   iso3c STD.PCGDP STD.LIFEEX STD.GINI    STD.ODA
#> 1   AFG        NA  -1.551093       NA -0.6132859
#> 2   AFG        NA  -1.506322       NA -0.5543146
#> 3   AFG        NA  -1.462247       NA -0.6130676
#> 4   AFG        NA  -1.418570       NA -0.5527772
#> 5   AFG        NA  -1.375191       NA -0.5200308
#> 6   AFG        NA  -1.331912       NA -0.4864215
pwcor(fscale(get_vars(wlddev,9:12), wlddev$iso3c))  # Correlating panel series after standardizing
#>         PCGDP  LIFEEX   GINI    ODA
#> PCGDP      1     .60   -.21    .05
#> LIFEEX    .60     1    -.09    .30
#> GINI     -.21   -.09     1    -.03
#> ODA       .05    .30   -.03     1
fmean(get_vars(wlddev, 9:12)) # This calculates the overall means
#>        PCGDP       LIFEEX         GINI          ODA
#> 1.156365e+04 6.384109e+01 3.939757e+01 4.287465e+08
fsd(fwithin(get_vars(wlddev, 9:12), wlddev$iso3c)) # This calculates the within standard deviations
#>        PCGDP       LIFEEX         GINI          ODA
#> 6.334952e+03 5.829248e+00 3.040647e+00 6.070240e+08
head(qsu(fscale(get_vars(wlddev, 9:12),                   # This group-centers on the overall mean and
                wlddev$iso3c,                             # group-scales to the within standard deviation
                mean = "overall.mean", sd = "within.sd"), # -> data harmonized in the first 2 moments
         by = wlddev$iso3c))
#> , , PCGDP
#>
#>       N        Mean         SD          Min         Max
#> ABW  32  11563.6529  6334.9523  -11684.3802  19256.8244
#> AFG  16  11563.6529  6334.9523    3001.7971    18493.53
#> AGO  38  11563.6529  6334.9523      58.6091  22218.6163
#> ALB  38  11563.6529  6334.9523    3046.7453  23022.1251
#> AND  48  11563.6529  6334.9523     803.0491  25506.1511
#> ARE  43  11563.6529  6334.9523    3061.0074  25309.0701
#>
#> , , LIFEEX
#>
#>       N     Mean      SD      Min      Max
#> ABW  57  63.8411  5.8292   50.106  71.2163
#> AFG  57  63.8411  5.8292  54.7994  72.7759
#> AGO  57  63.8411  5.8292  56.0748  75.7701
#> ALB  57  63.8411  5.8292  50.9166  73.5876
#> AND   0        0       -        0        0
#> ARE  57  63.8411  5.8292  50.0791  70.6117
#>
#> , , GINI
#>
#>      N     Mean      SD      Min      Max
#> ABW  0        0       -        0        0
#> AFG  0        0       -        0        0
#> AGO  2  39.3976  3.0406  41.5476  41.5476
#> ALB  5  39.3976  3.0406  34.8534  42.8826
#> AND  0        0       -        0        0
#> ARE  0        0       -        0        0
#>
#> , , ODA
#>
#>       N        Mean          SD          Min             Max
#> ABW  20  428,746468  607,024040  -557,600788  1.82207583e+09
#> AFG  58  428,746468  607,024040  23,280279.7  1.95895363e+09
#> AGO  56  428,746468  607,024040  -264,229614  3.10365038e+09
#> ALB  30  428,746468  607,024040  -815,513556  1.94166639e+09
#> AND   0           0           -            0               0
#> ARE  45  428,746468  607,024040  -156,627600  3.56976530e+09
#>

## Using plm
pwlddev <- plm::pdata.frame(wlddev, index = c("iso3c","year"))
head(STD(pwlddev))                 # Standardizing all numeric variables by country
#>          iso3c year STD.decade STD.PCGDP STD.LIFEEX STD.GINI STD.ODA
#> ABW-1960   ABW 1960  -1.629936        NA  -2.356240       NA      NA
#> ABW-1961   ABW 1961  -1.629936        NA  -2.207971       NA      NA
#> ABW-1962   ABW 1962  -1.629936        NA  -2.074817       NA      NA
#> ABW-1963   ABW 1963  -1.629936        NA  -1.951379       NA      NA
#> ABW-1964   ABW 1964  -1.629936        NA  -1.834059       NA      NA
#> ABW-1965   ABW 1965  -1.629936        NA  -1.718179       NA      NA
head(STD(pwlddev, effect = 2L)) # Standardizing all numeric variables by year
#>          iso3c year STD.decade STD.PCGDP STD.LIFEEX STD.GINI STD.ODA
#> ABW-1960   ABW 1960        NaN        NA  0.9653371       NA      NA
#> ABW-1961   ABW 1961        NaN        NA  0.9521446       NA      NA
#> ABW-1962   ABW 1962        NaN        NA  0.9613612       NA      NA
#> ABW-1963   ABW 1963        NaN        NA  0.9690544       NA      NA
#> ABW-1964   ABW 1964        NaN        NA  0.9592609       NA      NA
#> ABW-1965   ABW 1965        NaN        NA  0.9563056       NA      NA

## Weighted Standardizing
weights = abs(rnorm(nrow(wlddev)))
head(fscale(get_vars(wlddev,9:12), wlddev$iso3c, weights))
#>   PCGDP    LIFEEX GINI        ODA
#> 1    NA -1.569523   NA -0.6868124
#> 2    NA -1.526868   NA -0.6333788
#> 3    NA -1.484877   NA -0.6866147
#> 4    NA -1.443264   NA -0.6319858
#> 5    NA -1.401936   NA -0.6023144
#> 6    NA -1.360703   NA -0.5718611
head(STD(wlddev, ~iso3c, weights, 9:12))
#>   iso3c STD.PCGDP STD.LIFEEX STD.GINI    STD.ODA
#> 1   AFG        NA  -1.569523       NA -0.6868124
#> 2   AFG        NA  -1.526868       NA -0.6333788
#> 3   AFG        NA  -1.484877       NA -0.6866147
#> 4   AFG        NA  -1.443264       NA -0.6319858
#> 5   AFG        NA  -1.401936       NA -0.6023144
#> 6   AFG        NA  -1.360703       NA -0.5718611

# Using dplyr
library(dplyr)
wlddev %>% group_by(iso3c) %>% select(PCGDP,LIFEEX) %>% STD
#> Adding missing grouping variables: `iso3c`
#> # A tibble: 12,744 x 3
#> # Groups:   iso3c [216]
#>    iso3c STD.PCGDP STD.LIFEEX
#>  * <fct>     <dbl>      <dbl>
#>  1 AFG          NA      -1.55
#>  2 AFG          NA      -1.51
#>  3 AFG          NA      -1.46
#>  4 AFG          NA      -1.42
#>  5 AFG          NA      -1.38
#>  6 AFG          NA      -1.33
#>  7 AFG          NA      -1.29
#>  8 AFG          NA      -1.25
#>  9 AFG          NA      -1.20
#> 10 AFG          NA      -1.16
#> # ... with 12,734 more rows
wlddev %>% group_by(iso3c) %>% select(PCGDP,LIFEEX) %>% STD(weights) # weighted standardizing
#> Adding missing grouping variables: `iso3c`
#> # A tibble: 12,744 x 3
#> # Groups:   iso3c [216]
#>    iso3c STD.PCGDP STD.LIFEEX
#>  * <fct>     <dbl>      <dbl>
#>  1 AFG          NA      -1.57
#>  2 AFG          NA      -1.53
#>  3 AFG          NA      -1.48
#>  4 AFG          NA      -1.44
#>  5 AFG          NA      -1.40
#>  6 AFG          NA      -1.36
#>  7 AFG          NA      -1.32
#>  8 AFG          NA      -1.28
#>  9 AFG          NA      -1.24
#> 10 AFG          NA      -1.20
#> # ... with 12,734 more rows
wlddev %>% group_by(iso3c) %>% select(PCGDP,LIFEEX,ODA) %>% STD(ODA) # weighting by ODA ->
#> Adding missing grouping variables: `iso3c`
#> # A tibble: 12,744 x 4
#> # Groups:   iso3c [216]
#>    iso3c       ODA STD.PCGDP STD.LIFEEX
#>  * <fct>     <dbl>     <dbl>      <dbl>
#>  1 AFG   114440000        NA      -4.15
#>  2 AFG   233350000        NA      -4.08
#>  3 AFG   114880000        NA      -4.02
#>  4 AFG   236450000        NA      -3.95
#>  5 AFG   302480000        NA      -3.88
#>  6 AFG   370250000        NA      -3.81
#>  7 AFG   338270000        NA      -3.74
#>  8 AFG   259230000        NA      -3.68
#>  9 AFG   182000000        NA      -3.61
#> 10 AFG   162680000        NA      -3.54
#> # ... with 12,734 more rows
# ..keeps the weight column unless keep.w = FALSE