The package “compositions” was released in 2005 to enable R users the analysis of compositional data in R. The main goals were twofold: to provide functions for the logratio analysis of compositions, and the facilitation of the comparison of results and interpretation among several analysis paradigms, in a transparent and consistent way. These paradigms correspond to different statistical scales for the data, thus for different geometries of the underlying sample space (the simplex). Each geometry was encoded in an S3 class. The following table summarizes these classes. Details on how to choose a class, and how to work with them can be found in UsingCompositions.pdf. Readers new to “compositions” are recommended to start there, as this vignette rather focuses on what has changed and how to do certain advanced analysis with the new functionality.
counts? | total? | relative? | geometry | class |
---|---|---|---|---|
no | no | no | real compositional | rcomp |
no | no | yes | relative compositional | acomp |
no | yes | no | real amounts | rplus |
no | yes | yes | relative amounts | aplus |
yes | yes | yes | count multinomial | ccomp |
no | – | no | real multivariate | rmult |
In this table, the first three columns correspond to three binary questions that guide you in choosing the right geometry for your data
Once you have chosen your geometry (or a couple of them that you want to compare), you can load “compositions” with
library(compositions)
#> Welcome to compositions, a package for compositional data analysis.
#> Find an intro with "? compositions"
#>
#> Attaching package: 'compositions'
#> The following objects are masked from 'package:stats':
#>
#> anova, cor, cov, dist, var
#> The following object is masked from 'package:graphics':
#>
#> segments
#> The following objects are masked from 'package:base':
#>
#> %*%, norm, scale, scale.default
and load your data with any data reading function
(read.csv()
, read_delim()
,
read_excel()
, etc). Here we use for illustration purposes
the data set “Hydrochem”
data("Hydrochem")
dim(Hydrochem)
#> [1] 485 19
colnames(Hydrochem)
#> [1] "Code" "Site" "Location" "River" "Date" "H"
#> [7] "Na" "K" "Mg" "Ca" "Sr" "Ba"
#> [13] "NH4" "Cl" "NO3" "PO4" "SO4" "HCO3"
#> [19] "TOC"
having 5 descriptive variables and 14 chemical components (from
H
to TOC
). These last form a composition,
which we preliminarily consider of an aplus geometry
xp = aplus(Hydrochem[, c("Ca","K", "Na", "Mg")])
summary(xp)
#> Ca K Na Mg
#> Min. 40.39 0.0251 2.233 4.545
#> 1st Qu. 87.13 3.2230 30.360 20.430
#> Median 104.70 13.3700 135.100 34.260
#> Mean 119.30 10.1600 90.300 32.110
#> 3rd Qu. 170.10 32.6200 227.000 58.110
#> Max. 441.50 217.6000 1121.000 190.700
#> attr(,"class")
#> [1] "summary.aplus" "matrix" "array"
The object is of course an “aplus” object
is.aplus(xp) # check with S3 paradigm
#> [1] TRUE
is(xp, "aplus") # check with S4 paradigm
#> [1] TRUE
but it is also an amounts object
The “amounts” class is an abstract superclass, that cannot be created directly. Also objects of class “rplus” and “ccomp” are amounts
If we consider the total sum of each row to be irrelevant, the appropriate class is an “acomp”
xc = acomp(Hydrochem[, c("Ca","K", "Na", "Mg")])
summary(xc)
#> $mean
#> Ca K Na Mg
#> "0.47366650" "0.04034558" "0.35849821" "0.12748971"
#> attr(,"class")
#> [1] "acomp"
#>
#> $mean.ratio
#> Ca K Na Mg
#> Ca 1.00000000 11.740233 1.3212521 3.7153311
#> K 0.08517718 1.000000 0.1125405 0.3164614
#> Na 0.75685786 8.885688 1.0000000 2.8119776
#> Mg 0.26915501 3.159943 0.3556216 1.0000000
#>
#> $variation
#> Ca K Na Mg
#> Ca 0.0000000 1.697259 1.1117720 0.2195639
#> K 1.6972588 0.000000 0.4538740 1.1885007
#> Na 1.1117720 0.453874 0.0000000 0.7442923
#> Mg 0.2195639 1.188501 0.7442923 0.0000000
#>
#> $expsd
#> Ca K Na Mg
#> Ca 1.000000 3.679544 2.870270 1.597718
#> K 3.679544 1.000000 1.961485 2.974821
#> Na 2.870270 1.961485 1.000000 2.369606
#> Mg 1.597718 2.974821 2.369606 1.000000
#>
#> $invexpsd
#> Ca K Na Mg
#> Ca 1.0000000 0.2717728 0.3483993 0.6258926
#> K 0.2717728 1.0000000 0.5098179 0.3361547
#> Na 0.3483993 0.5098179 1.0000000 0.4220110
#> Mg 0.6258926 0.3361547 0.4220110 1.0000000
#>
#> $min
#> Ca K Na Mg
#> Ca 1.0000000000 0.4980473 0.146331281 0.853172522
#> K 0.0004626728 1.0000000 0.008240315 0.005522552
#> Na 0.0381578947 1.7109562 1.000000000 0.259131129
#> Mg 0.0806127186 0.4801286 0.053971402 1.000000000
#>
#> $q1
#> Ca K Na Mg
#> Ca 1.00000000 3.535891 0.59303483 2.8447810
#> K 0.02604998 1.000000 0.06717914 0.1342102
#> Na 0.30390625 5.067818 1.00000000 1.5538624
#> Mg 0.22329706 1.109722 0.18490000 1.0000000
#>
#> $med
#> Ca K Na Mg
#> Ca 1.00000000 12.281003 0.9759519 3.499224
#> K 0.08142658 1.000000 0.1214027 0.272508
#> Na 1.02464066 8.237046 1.0000000 3.581553
#> Mg 0.28577763 3.669617 0.2792085 1.000000
#>
#> $q3
#> Ca K Na Mg
#> Ca 1.0000000 38.387745 3.2904884 4.4783394
#> K 0.2828142 1.000000 0.1973236 0.9011264
#> Na 1.6862416 14.885572 1.0000000 5.4083288
#> Mg 0.3515209 7.451001 0.6435576 1.0000000
#>
#> $max
#> Ca K Na Mg
#> Ca 1.000000 2161.3546 26.2068966 12.404990
#> K 2.007841 1.0000 0.5844685 2.082775
#> Na 6.833809 121.3546 1.0000000 18.528331
#> Mg 1.172096 181.0757 3.8590501 1.000000
#>
#> $missingness
#> missingType
#> variable NMV BDL MAR MNAR SZ Err
#> Ca 485 0 0 0 0 0
#> K 485 0 0 0 0 0
#> Na 485 0 0 0 0 0
#> Mg 485 0 0 0 0 0
#>
#> attr(,"class")
#> [1] "summary.acomp"
This can be thus checked
is.acomp(xc) # check with S3 paradigm
#> [1] TRUE
is(xc, "acomp") # check with S4 paradigm
#> [1] TRUE
There is also an abstract superclass for compositional objects, i.e. those where the total is irrelevant or an artifact
that contains objects of class “acomp”, “rcomp” and “ccomp”
Additionally, all comopostional classes have natural ways to be automatically converted to “data.frame” and to “structure” classes. This works like this
and it allows S4 classes with slots expecting a “data.frame” or one of “vector”, “matrix” or “array” to accept a compositional object. This will be discussed in the last section of this vignette.
Compositional data sets react now to the dollar notation. Since compositions v2 you can extract one particular variable with
As discussed in UsingCompositions.pdf, the key idea of compositional analysis is to transform the data prior to the analysis. All transformations are now accessible using the dollar notation
summary(xp$clr)
#> Ca K Na Mg
#> Min. -0.5765232 -4.4190306 -0.7626504 -1.14624527
#> 1st Qu. 0.4801820 -1.9501308 0.3878456 -0.69086350
#> Median 0.8936557 -1.5989396 0.8498314 -0.36473684
#> Mean 1.0135173 -1.4495044 0.7349375 -0.29895048
#> 3rd Qu. 1.5489812 -0.7802982 1.0572174 0.03228459
#> Max. 3.2594599 0.1205369 1.8932905 1.36923306
#> attr(,"class")
#> [1] "summary.rmult" "matrix" "array"
This includes each of alr
, ilr
,
clr
, log
, ilt
, iit
,
apt
, ipt
, cpt
, their generic
versions cdt
and idt
, as well as all their
inverses (e.g. cptInv
or alrInv
). Also
raw
, clo
, unclass
and
pwlr
resp. pwlrInv
(pairwise logratio and its
inverse) work with this trick
summary(xp)
#> Ca K Na Mg
#> Min. 40.39 0.0251 2.233 4.545
#> 1st Qu. 87.13 3.2230 30.360 20.430
#> Median 104.70 13.3700 135.100 34.260
#> Mean 119.30 10.1600 90.300 32.110
#> 3rd Qu. 170.10 32.6200 227.000 58.110
#> Max. 441.50 217.6000 1121.000 190.700
#> attr(,"class")
#> [1] "summary.aplus" "matrix" "array"
summary(cdt(xp))
#> Ca K Na Mg
#> Min. 3.698582 -3.684887 0.803346 1.514028
#> 1st Qu. 4.467401 1.170313 3.413126 3.017004
#> Median 4.651099 2.593013 4.906015 3.533978
#> Mean 4.781715 2.318693 4.503135 3.469247
#> 3rd Qu. 5.136386 3.484926 5.424950 4.062338
#> Max. 6.090178 5.382888 7.022422 5.250702
#> attr(,"class")
#> [1] "summary.rmult" "matrix" "array"
summary(xp$cdt)
#> Ca K Na Mg
#> Min. 3.698582 -3.684887 0.803346 1.514028
#> 1st Qu. 4.467401 1.170313 3.413126 3.017004
#> Median 4.651099 2.593013 4.906015 3.533978
#> Mean 4.781715 2.318693 4.503135 3.469247
#> 3rd Qu. 5.136386 3.484926 5.424950 4.062338
#> Max. 6.090178 5.382888 7.022422 5.250702
#> attr(,"class")
#> [1] "summary.rmult" "matrix" "array"
summary(xp$cdt$cdtInv)
#> Ca K Na Mg
#> Min. 40.39 0.0251 2.233 4.545
#> 1st Qu. 87.13 3.2230 30.360 20.430
#> Median 104.70 13.3700 135.100 34.260
#> Mean 119.30 10.1600 90.300 32.110
#> 3rd Qu. 170.10 32.6200 227.000 58.110
#> Max. 441.50 217.6000 1121.000 190.700
#> attr(,"class")
#> [1] "summary.aplus" "matrix" "array"
Another (potentially conflictive) novelty is the
[
-subsetting, in particular for closed classes (“acomp” and
“rcomp”). In compositions-v1, subseting invariably destroyed the class,
returning the selected rows or columns in a matrix, or even in a vector
if only one row or one column was selected. From compositions-v2 on, the
nature of the subsetting depends on class of the parent object and the
number of selected elements:
xp0 = xp[1:5,]
xp0
#> Ca K Na Mg
#> 1 "279.1" "6.318" "159.0" "92.41"
#> 2 "300.3" "6.449" "179.2" "96.20"
#> 3 "305.1" "5.978" "186.0" "96.83"
#> 4 "343.4" "5.171" "181.1" "88.58"
#> 5 "369.0" "7.371" "151.0" "84.90"
#> attr(,"class")
#> [1] "aplus"
xc0 = xc[1:5,]
xc0
#> Ca K Na Mg
#> 1 "0.5199058" "0.011769133" "0.2961843" "0.1721408"
#> 2 "0.5158473" "0.011077920" "0.3078250" "0.1652498"
#> 3 "0.5137159" "0.010065532" "0.3131798" "0.1630387"
#> 4 "0.5554378" "0.008363917" "0.2929231" "0.1432751"
#> 5 "0.6026743" "0.012038787" "0.2466228" "0.1386641"
#> attr(,"class")
#> [1] "acomp"
xc[10,]
#> Ca K Na Mg
#> "0.52155495" "0.01018437" "0.30386080" "0.16439989"
#> attr(,"class")
#> [1] "acomp"
xp0[,"Ca"]
#> 1 2 3 4 5
#> 279.1 300.3 305.1 343.4 369.0
xp0[,c("Ca","K")]
#> Ca K
#> 1 "279.1" "6.318"
#> 2 "300.3" "6.449"
#> 3 "305.1" "5.978"
#> 4 "343.4" "5.171"
#> 5 "369.0" "7.371"
#> attr(,"class")
#> [1] "aplus"
xc0[,c("Ca","K")]
#> Ca K
#> 1 "0.9778640" "0.02213595"
#> 2 "0.9789763" "0.02102370"
#> 3 "0.9807830" "0.01921705"
#> 4 "0.9851651" "0.01483485"
#> 5 "0.9804156" "0.01958440"
#> attr(,"class")
#> [1] "acomp"
xc0$Ca
#> 1 2 3 4 5
#> 0.5199058 0.5158473 0.5137159 0.5554378 0.6026743
xc0[,"Ca"]
#> 1 2 3 4 5
#> 0.5199058 0.5158473 0.5137159 0.5554378 0.6026743
xp0[,"Ca"]
#> 1 2 3 4 5
#> 279.1 300.3 305.1 343.4 369.0
drop=TRUE
in the subsetting, or globally by calling
setStickyClassOption(FALSE)
xc0[,c("Ca","K"), drop=TRUE]
#> Ca K
#> 1 0.5199058 0.011769133
#> 2 0.5158473 0.011077920
#> 3 0.5137159 0.010065532
#> 4 0.5554378 0.008363917
#> 5 0.6026743 0.012038787
xp0[,"Ca", drop=TRUE]
#> 1 2 3 4 5
#> 279.1 300.3 305.1 343.4 369.0
You can check which is the current status of sticky classes with
getStickyClassOption()
.
Matrix-like compositions can be embedded in other data containers, typically “data.frame”s or tibbles. For this, we just need to define an extra column
Hydrochem$comp <- xc
head(Hydrochem)
#> Code Site Location River Date H Na K Mg Ca
#> 1 9535612 95 Anoia Anoia 01.07.1997 7.943282e-09 159.0 6.318 92.41 279.1
#> 2 9535643 95 Anoia Anoia 01.08.1997 5.011872e-09 179.2 6.449 96.20 300.3
#> 3 9535674 95 Anoia Anoia 01.09.1997 7.943282e-09 186.0 5.978 96.83 305.1
#> 4 9535704 95 Anoia Anoia 01.10.1997 6.165950e-09 181.1 5.171 88.58 343.4
#> 5 9535735 95 Anoia Anoia 01.11.1997 1.202264e-08 151.0 7.371 84.90 369.0
#> 6 9535765 95 Anoia Anoia 01.12.1997 1.028016e-08 170.1 6.475 89.82 337.6
#> Sr Ba NH4 Cl NO3 PO4 SO4 HCO3
#> 1 8.58366 0.049900 0.22341340 345.9561 16.965779 0.4014300 730.6014 349.5047
#> 2 7.29700 0.054983 0.02694842 280.2735 8.647828 0.5218775 773.2336 314.6050
#> 3 7.34100 0.057367 0.07923994 300.2572 11.394282 0.5084960 884.5392 342.4435
#> 4 7.63000 0.059473 0.02231826 297.9269 11.293516 0.4014442 757.7849 335.2200
#> 5 8.24600 0.053410 0.46398830 229.2619 9.547616 0.4415887 616.2988 308.6500
#> 6 7.57200 0.050168 0.09921348 267.8800 13.253693 0.6958367 881.1794 334.1700
#> TOC comp.1 comp.2 comp.3 comp.4
#> 1 3.022 0.519905817 0.011769133 0.296184253 0.172140797
#> 2 3.039 0.515847317 0.011077920 0.307824973 0.165249790
#> 3 3.196 0.513715929 0.010065532 0.313179819 0.163038720
#> 4 3.871 0.555437840 0.008363917 0.292923101 0.143275142
#> 5 3.400 0.602674306 0.012038787 0.246622819 0.138664088
#> 6 3.625 0.558945024 0.010720287 0.281624848 0.148709840
dim(Hydrochem)
#> [1] 485 20
This composition can be then recovered with the dollar notation, and also used in further calls to statistical methods, e.g.
head(Hydrochem$comp)
#> [,1] [,2] [,3] [,4]
#> 1 "0.5199058" "0.011769133" "0.2961843" "0.1721408"
#> 2 "0.5158473" "0.011077920" "0.3078250" "0.1652498"
#> 3 "0.5137159" "0.010065532" "0.3131798" "0.1630387"
#> 4 "0.5554378" "0.008363917" "0.2929231" "0.1432751"
#> 5 "0.6026743" "0.012038787" "0.2466228" "0.1386641"
#> 6 "0.5589450" "0.010720287" "0.2816248" "0.1487098"
#> attr(,"class")
#> [1] "acomp"
codalm = lm(alr(comp)~River, data=Hydrochem)
summary(codalm)
#> Response v1 :
#>
#> Call:
#> lm(formula = v1 ~ River, data = Hydrochem)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.90144 -0.19728 0.06689 0.26947 0.82324
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.13957 0.03242 35.152 <2e-16 ***
#> RiverCardener 0.09677 0.05131 1.886 0.0599 .
#> RiverLowerLLobregat 0.01237 0.04652 0.266 0.7904
#> RiverUpperLLobregat 0.65173 0.04892 13.324 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.3877 on 481 degrees of freedom
#> Multiple R-squared: 0.3198, Adjusted R-squared: 0.3155
#> F-statistic: 75.38 on 3 and 481 DF, p-value: < 2.2e-16
#>
#>
#> Response v2 :
#>
#> Call:
#> lm(formula = v2 ~ River, data = Hydrochem)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.3186 -0.3621 0.1977 0.4878 1.6234
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.68613 0.07202 -23.413 <2e-16 ***
#> RiverCardener 0.99021 0.11399 8.687 <2e-16 ***
#> RiverLowerLLobregat 1.38843 0.10334 13.435 <2e-16 ***
#> RiverUpperLLobregat -0.19421 0.10867 -1.787 0.0745 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.8612 on 481 degrees of freedom
#> Multiple R-squared: 0.3798, Adjusted R-squared: 0.376
#> F-statistic: 98.2 on 3 and 481 DF, p-value: < 2.2e-16
#>
#>
#> Response v3 :
#>
#> Call:
#> lm(formula = v3 ~ River, data = Hydrochem)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.1345 -0.4578 0.2273 0.5083 1.7514
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.95089 0.06784 14.017 < 2e-16 ***
#> RiverCardener 0.21806 0.10738 2.031 0.042821 *
#> RiverLowerLLobregat 0.45005 0.09735 4.623 4.86e-06 ***
#> RiverUpperLLobregat -0.36803 0.10236 -3.595 0.000357 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.8112 on 481 degrees of freedom
#> Multiple R-squared: 0.1213, Adjusted R-squared: 0.1158
#> F-statistic: 22.13 on 3 and 481 DF, p-value: 1.94e-13
The transformation can even be called with the dollar notation, and transformed data keep the information necessary to construct their back-transformation.
codalm = lm(comp$idt~River, data=Hydrochem)
head(predict(codalm)$idtInv)
#> [,1] [,2] [,3] [,4]
#> 1 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 2 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 3 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 4 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 5 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 6 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> attr(,"class")
#> [1] "acomp"
To recover the column names in the backtransformation you might need
to make use of the orig
argument of the explicit
transformation (that is, without the dollar notation)
head(idtInv(predict(codalm), orig=xc))
#> Ca K Na Mg
#> 1 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 2 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 3 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 4 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 5 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 6 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> attr(,"class")
#> [1] "acomp"
Note as well the existence of the new function
backtransform
, which will take an “rmult” object and return
its expression in the appropriate original components. This can even be
allowed to be controlled by a third object through an extra argument
as
head(backtransform(idt(xp)))
#> Ca K Na Mg
#> 1 "279.1" "6.318" "159.0" "92.41"
#> 2 "300.3" "6.449" "179.2" "96.20"
#> 3 "305.1" "5.978" "186.0" "96.83"
#> 4 "343.4" "5.171" "181.1" "88.58"
#> 5 "369.0" "7.371" "151.0" "84.90"
#> 6 "337.6" "6.475" "170.1" "89.82"
#> attr(,"class")
#> [1] "aplus"
head(backtransform(predict(codalm), as=codalm$residuals))
#> [,1] [,2] [,3] [,4]
#> 1 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 2 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 3 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 4 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 5 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 6 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> attr(,"class")
#> [1] "acomp"
In the case of tibbles, the column names of the embedded object are even preserved, so that
would return the same as xc$Ca
. This is not included
here to redude the dependencies of this package. Note that this trick
does not work with “data.frame” objects, because [<-
takes care of removing the column names of the composition before
attaching it.
This property of self-back-transformability is the single most important addition to “compositions” v2, and will experiment improvements in the future. In particular, we strive to make the recovery of column names automatic in the near future.
The final addition to compositions v2 is the ability of compositional data to fake to be “data.frame”s or “structure”s themselves. This allows them to be embedded in S3 elements and S4 classes in slots expecting one of these objects, keeping their additional attributes (hence their compositional nature, backtransformability and so on). This is an example of how this could work, for instance in a spatial data frame of package “sp”. As before, this is not actually included here to avoid including one extra package. Readers are invited to check how it works. Packages “gstat” and “sp” are necessary
data("jura", package="gstat")
coords = jura.pred[, 1:2]
compo = jura.pred[, 7:13]
xc = cdt(acomp(compo))
spcompo = SpatialPointsDataFrame(coords=coords, data=xc)
summary(spcompo@data)
class(spcompo@data)
For an S4 class to admit compositional classes in their slots for
“data.frame”, “vector”, “matrix” or “array” objects, though, there is a
condition: validity checks must be made after
as(x,"data.frame")
, as(x,"matrix")
or
equivalent is applied.