Title: | A Basic Set of Functions for Compositional Data Analysis |
---|---|
Description: | A minimum set of functions to perform compositional data analysis using the log-ratio approach introduced by John Aitchison (1982). Main functions have been implemented in C++ for better performance. |
Authors: | Marc Comas-Cufí [aut, cre] |
Maintainer: | Marc Comas-Cufí <[email protected]> |
License: | GPL |
Version: | 0.5.5 |
Built: | 2024-12-24 06:56:46 UTC |
Source: | CRAN |
The alimentation data set contains the percentages of consumption of several types of food in 25 European countries during the 1980s. The categories are:
* RM: red meat (pork, veal, beef),
* WM: white meat (chicken),
* E: eggs,
* M: milk,
* F: fish,
* C: cereals,
* S: starch (potatoes),
* N: nuts, and
* FV: fruits and vegetables.
alimentation
An object of class data.frame with 25 rows and 13 columns.
Moreover, the dataset contains a categorical variable that shows if the country is from the North or a Southern Mediterranean country. In addition, the countries are classified as Eastern European or as Western European.
Compute the transformation matrix to express a composition using the oblique additive log-ratio coordinates.
alr_basis(dim, denominator = dim, numerator = which(denominator != 1:dim))
dim |
number of parts |
denominator |
part used as denominator (default behaviour is to use last part) |
numerator |
parts to be used as numerator. By default all except the denominator parts are chosen following original order. |
matrix
Aitchison, J. (1986) The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability. Chapman & Hall Ltd., London (UK). 416p.
alr_basis(5)
# Third part is used as denominator
alr_basis(5, 3)
# Third part is used as denominator, and
# other parts are rearranged
alr_basis(5, 3, c(1,5,2,4))
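To make the idea of a transformation matrix concrete, the following base-R sketch builds an ALR contrast matrix by hand and checks it against the definition log(x_i / x_D). The helper `alr_contrast` is hypothetical and is not the package's implementation; it only covers the default case where the last part is the denominator.

```r
# Hypothetical helper (not part of the package): ALR contrast matrix with the
# last part as denominator. Column i encodes the log-contrast log(x_i / x_D).
alr_contrast <- function(D) {
  rbind(diag(D - 1), rep(-1, D - 1))
}

x <- c(1, 2, 3, 4, 5)
B_alr <- alr_contrast(length(x))

# Coordinates are obtained by multiplying the log-composition by the matrix
alr_coords <- as.vector(log(x) %*% B_alr)

# Same values computed directly from the definition
alr_direct <- log(x[1:4] / x[5])
all.equal(alr_coords, alr_direct)
```

The columns of this matrix are not orthogonal, which is why the ALR coordinates are described as oblique.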
The arctic lake data set records the [sand, silt, clay] compositions of 39 sediment samples.
arctic_lake
An object of class data.frame with 39 rows and 5 columns.
Obtain coordinates basis
basis(H)
H |
coordinates for which basis should be shown |
basis used to create coordinates H
In humans the main blood group systems are the ABO system, the Rh system and the MN system. The MN blood system is a system of blood antigens also related to proteins of the red blood cell plasma membrane. The inheritance pattern of the MN blood system is autosomal with codominance, a type of lack of dominance in which the heterozygous manifests a phenotype totally distinct from the homozygous. The possible phenotypical forms are three blood types: type M blood, type N blood and type MN blood. The frequencies of M, N and MN blood types vary widely depending on the ethnic population. However, the Hardy-Weinberg principle states that allele and genotype frequencies in a population will remain constant from generation to generation in the absence of other evolutionary influences. This implies that, in the long run, it holds that
$$\frac{x_{MM}\, x_{NN}}{x_{MN}^2} = \frac{1}{4},$$
where $x_{MM}$ and $x_{NN}$ are the genotype relative frequencies of MM and NN homozygotes, respectively, and $x_{MN}$ is the genotype relative frequency of MN heterozygotes. The principle is named after G.H. Hardy and W. Weinberg, who demonstrated it mathematically.
blood_mn
An object of class data.frame with 49 rows and 5 columns.
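As a quick numerical illustration of the Hardy-Weinberg relation: with allele frequencies p (for M) and q = 1 - p (for N), the equilibrium genotype frequencies are p^2, 2pq and q^2, so the ratio x_MM * x_NN / x_MN^2 equals 1/4 whatever the value of p. The allele frequency below is an arbitrary, hypothetical value and is not taken from the blood_mn data.

```r
# Hypothetical allele frequency of M (any value in (0, 1) works)
p <- 0.3
q <- 1 - p

# Genotype frequencies under Hardy-Weinberg equilibrium
x_MM <- p^2
x_MN <- 2 * p * q
x_NN <- q^2

# The three frequencies form a composition (they sum to one) ...
sum(c(x_MM, x_MN, x_NN))

# ... and the log-contrast below is constant: the ratio is always 1/4
hw_ratio <- x_MM * x_NN / x_MN^2
hw_ratio
```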
The 'bmi_activity' data set records the proportion of daily time spent on sleep (sleep), sedentary behaviour (sedent), light physical activity (Lpa), moderate physical activity (Mpa) and vigorous physical activity (Vpa) measured on a small population of 393 children. Moreover, the standardized body mass index (zBMI) of each child was also registered.
This data set was used in the example of the article (Dumuid et al. 2019) to examine the expected differences in zBMI for reallocations of daily time between sleep, physical activity and sedentary behaviour. Because the original data is confidential, the data set BMIPhisActi includes simulated data that mimics the main features of the original data.
bmi_activity
An object of class data.frame with 393 rows and 8 columns.
D. Dumuid, Z. Pedisic, T.E. Stanford, J.A. Martín-Fernández, K. Hron, C. Maher, L.K. Lewis and T.S. Olds, The Compositional Isotemporal Substitution Model: a Method for Estimating Changes in a Health Outcome for Reallocation of Time between Sleep, Sedentary Behaviour, and Physical Activity. Statistical Methods in Medical Research 28(3) (2019), 846–857
Balance generated from the first canonical correlation component
cbalance_approx(Y, X)
Y |
compositional dataset |
X |
explanatory dataset |
matrix
Isometric log-ratio basis based on canonical correlations
cc_basis(Y, X)
Y |
compositional dataset |
X |
explanatory dataset |
matrix
The function returns the default balances used in the CoDaPack software.
cdp_basis(dim)
dim |
dimension to build the ILR basis based on balanced balances |
matrix
Compute the default binary partition used in the CoDaPack software
cdp_partition(ncomp)
ncomp |
number of parts |
matrix
cdp_partition(4)
Generic function to calculate the center of a compositional dataset
center(X, zero.rm = FALSE, na.rm = FALSE)
X |
compositional dataset |
zero.rm |
a logical value indicating whether zero values should be stripped before the computation proceeds. |
na.rm |
a logical value indicating whether NA values should be stripped before the computation proceeds. |
X = matrix(exp(rnorm(5*100)), nrow=100, ncol=5)
g = rep(c('a','b','c','d'), 25)
center(X)
(by_g <- by(X, g, center))
center(t(simplify2array(by_g)))
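For reference, the center of a compositional data set is usually defined as the closed vector of column-wise geometric means. The base-R sketch below computes it directly under that assumption; whether the package closes the result in exactly the same way is not stated here, so treat the names and the closure step as illustrative.

```r
set.seed(1)
X <- matrix(exp(rnorm(5 * 100)), nrow = 100, ncol = 5)

# Column-wise geometric means, computed on the log scale
gm <- exp(colMeans(log(X)))

# Closure: rescale so that the parts sum to one
center_sketch <- gm / sum(gm)
center_sketch
```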
Compute the transformation matrix to express a composition using the linearly dependent centered log-ratio coordinates.
clr_basis(dim)
dim |
number of parts |
matrix
Aitchison, J. (1986) The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability. Chapman & Hall Ltd., London (UK). 416p.
(B <- clr_basis(5))
# CLR coordinates are linearly dependent coordinates.
(clr_coordinates <- coordinates(c(1,2,3,4,5), B))
# The sum of all coordinates is equal to zero (up to floating-point error)
abs(sum(clr_coordinates)) < 1e-12
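The linear dependence can also be seen from the matrix itself. The sketch below assumes the standard CLR contrast matrix, the centering matrix I - (1/D) 11', built with a hypothetical helper `clr_contrast`; it is not taken from the package source.

```r
# Hypothetical helper: standard CLR contrast (centering) matrix
clr_contrast <- function(D) diag(D) - matrix(1 / D, nrow = D, ncol = D)

x <- c(1, 2, 3, 4, 5)
B_clr <- clr_contrast(length(x))

# Coordinates via the matrix and via the definition log(x) - mean(log(x))
clr_coords <- as.vector(log(x) %*% B_clr)
clr_direct <- log(x) - mean(log(x))
all.equal(clr_coords, clr_direct)

# D coordinates, but the matrix only has rank D - 1
clr_rank <- qr(B_clr)$rank
clr_rank
```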
A minimum set of functions to perform compositional data analysis using the log-ratio approach introduced by John Aitchison (1982) <https://www.jstor.org/stable/2345821>. Main functions have been implemented in C++ for better performance.
Marc Comas-Cufí
Calculate a composition from coordinates with respect to a given basis
composition(H, basis = NULL)
comp(H, basis = NULL)
H |
coordinates of a composition. Either a matrix, a data.frame or a vector |
basis |
basis used to calculate the coordinates |
coordinates with respect to the given basis
See functions ilr_basis, alr_basis, clr_basis, sbp_basis to define different compositional bases.
See function coordinates for details on how to calculate the coordinates of a given composition.
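The round trip between a composition and its coordinates can be sketched in base R for an orthonormal basis. Here `closure` and the 3-part basis `B` are hypothetical illustrations chosen for the example, not objects produced by the package.

```r
# Rescale a positive vector so that its parts sum to one
closure <- function(x) x / sum(x)

# A hand-written orthonormal log-contrast basis for 3 parts (columns sum to 0)
B <- cbind(c(1, -1, 0) / sqrt(2),
           c(1, 1, -2) / sqrt(6))

x <- closure(c(1, 2, 7))

# Composition -> coordinates
h <- as.vector(log(x) %*% B)

# Coordinates -> composition: map back to the clr scale, exponentiate, close
x_back <- closure(exp(as.vector(B %*% h)))
all.equal(x, x_back)
```

The inverse works because, for an orthonormal basis whose columns sum to zero, B %*% t(B) is the centering matrix, so B %*% h recovers the clr coefficients of x.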
Calculate the coordinates of a composition with respect to a given basis
coordinates(X, basis = "ilr", basis_return = TRUE)
coord(..., basis = "ilr")
alr_c(X)
clr_c(X)
ilr_c(X)
olr_c(X)
X |
compositional dataset. Either a matrix, a data.frame or a vector |
basis |
basis used to calculate the coordinates. |
basis_return |
Should the basis be returned as attribute? (default: TRUE) |
... |
components of the compositional data |
The coordinates function calculates the coordinates of a composition with respect to a given basis. The 'basis' parameter sets the basis; it can be either a matrix defining the log-contrasts in columns or a string naming a well-known log-contrast: 'alr', 'clr', 'ilr', 'pw', 'pc', 'pb' and 'cdp', for the additive log-ratio, centered log-ratio, isometric log-ratio, pairwise log-ratio, clr principal components, clr principal balances, or CoDaPack's default balances, respectively.
Coordinates of composition X with respect to the given basis.
See functions ilr_basis, alr_basis, clr_basis, sbp_basis to define different compositional bases.
See function composition for details on how to calculate a composition from given coordinates.
coordinates(c(1,2,3,4,5))
h = coordinates(c(1,2,3,4,5))
basis(h)
# basis is shown if 'coda.base.basis' option is set to TRUE
options('coda.base.basis' = TRUE)
coordinates(c(1,2,3,4,5))
# Default transformation can improve performance.
N = 100
K = 1000
X = matrix(exp(rnorm(N*K)), nrow=N, ncol=K)
system.time(coordinates(X, alr_basis(K)))
system.time(coordinates(X, 'alr'))
This function overrides the dist function to add the Aitchison distance between compositions.
dist(x, method = "euclidean", ...)
x |
compositions |
method |
the distance measure to be used. This must be one of "aitchison", "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". Any unambiguous substring can be given. |
... |
arguments passed to dist |
dist returns an object of class "dist".
See function dist.
X = exp(matrix(rnorm(10*50), ncol=50, nrow=10))
(d <- dist(X, method = 'aitchison'))
plot(hclust(d))
# In contrast to Euclidean distance
dist(rbind(c(1,1,1), c(100, 100, 100)), method = 'euc') # method = 'euclidean'
# using Aitchison distance, only relative information is of importance
dist(rbind(c(1,1,1), c(100, 100, 100)), method = 'ait') # method = 'aitchison'
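The Aitchison distance is commonly defined as the Euclidean distance between clr-transformed compositions. The base-R sketch below uses that definition with two hypothetical helpers (`clr` and `aitchison_dist`) to show why rescaling a composition does not change the distance; it is not the package's implementation.

```r
# Centered log-ratio transform of a single composition
clr <- function(x) log(x) - mean(log(x))

# Aitchison distance as Euclidean distance between clr vectors
aitchison_dist <- function(x, y) sqrt(sum((clr(x) - clr(y))^2))

# Proportional vectors carry the same relative information: distance zero
d_scaled <- aitchison_dist(c(1, 1, 1), c(100, 100, 100))

# Multiplying one composition by a constant leaves the distance unchanged
d_a <- aitchison_dist(c(1, 2, 3), c(3, 2, 1))
d_b <- aitchison_dist(10 * c(1, 2, 3), c(3, 2, 1))
c(d_scaled, d_a, d_b)
```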
According to the three–sector theory, as a country’s economy develops, employment shifts from the primary sector (raw material extraction: farming, hunting, fishing, mining) to the secondary sector (industry, energy and construction) and finally to the tertiary sector (services). Thus, a country’s employment distribution can be used as a predictor of economic wealth.
The 'eurostat_employment' data set contains EUROSTAT data on employment aggregated for both sexes, and all ages distributed by economic activity (classification 1983-2008, NACE Rev. 1.1) in 2008 for the 29 EUROSTAT member countries, thus reflecting reality just before the 2008 financial crisis. Country codes in alphabetical order according to the country name in its own language are: Belgium (BE), Cyprus (CY), Czechia (CZ), Denmark (DK), Deutschland–Germany (DE), Eesti–Estonia (EE), Eire–Ireland (IE), España–Spain (ES), France (FR), Hellas–Greece (GR), Hrvatska–Croatia (HR), Iceland (IS), Italy (IT), Latvia (LV), Lithuania (LT), Luxembourg (LU), Macedonia (MK), Magyarország–Hungary (HU), Malta (MT), Netherlands (NL), Norway (NO), Österreich–Austria (AT), Portugal (PT), Romania (RO), Slovakia (SK), Suomi–Finland (FI), Switzerland (CH), Turkey (TR), United Kingdom (GB).
A key related variable is the logarithm of gross domestic product per person in EUR at current prices (“logGDP”). For the purposes of exploratory data analyses it has also been categorised as a binary variable indicating values higher or lower than the median (“Binary GDP”). The employment composition (D = 11) is:
* Primary sector (agriculture, hunting, forestry, fishing, mining, quarrying)
* Manufacturing
* Energy (electricity, gas and water supply)
* Construction
* Trade repair transport (wholesale and retail trade, repair, transport, storage, communications)
* Hotels restaurants
* Financial intermediation
* Real estate (real estate, renting and business activities)
* Educ admin defense soc sec (education, public administration, defence, social security)
* Health social work
* Other services (other community, social and personal service activities)
eurostat_employment
An object of class data.frame with 29 rows and 17 columns.
The foraminiferal data set (Aitchison, 1986) is a typical example of paleoecological data. It contains compositions of 4 different fossils (Neogloboquadrina atlantica, Neogloboquadrina pachyderma, Globorotalia obesa, and Globigerinoides triloba) at 30 different depths. Due to the rounded zeros present in the data set, some zero replacement technique should be applied to impute these values in advance. After data preprocessing, the analysis that should be undertaken is the association between the composition and the depth.
foraminiferals
An object of class data.frame with 30 rows and 5 columns.
Generic function for the (trimmed) geometric mean.
gmean(x, zero.rm = FALSE, trim = 0, na.rm = FALSE)
x |
A nonnegative vector. |
zero.rm |
a logical value indicating whether zero values should be stripped before the computation proceeds. |
trim |
the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint. |
na.rm |
a logical value indicating whether NA values should be stripped before the computation proceeds. |
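As an illustration of what a (trimmed) geometric mean computes, here is a minimal base-R sketch. The function `gmean_sketch` is hypothetical; it assumes the trimming is applied on the log scale through `mean(..., trim = )`, which selects the same observations as trimming `x` itself because the logarithm is monotone. The package's own implementation may differ in details.

```r
# Hypothetical re-implementation for illustration only
gmean_sketch <- function(x, zero.rm = FALSE, trim = 0, na.rm = FALSE) {
  # Optionally drop exact zeros (NA values are kept here and handled by na.rm)
  if (zero.rm) x <- x[is.na(x) | x != 0]
  # Geometric mean = exponential of the arithmetic mean of the logs
  exp(mean(log(x), trim = trim, na.rm = na.rm))
}

g1 <- gmean_sketch(c(1, 10, 100))
g2 <- gmean_sketch(c(0, 1, 10, 100), zero.rm = TRUE)
g3 <- gmean_sketch(c(0, 1, 10, 100))  # a single zero drives the mean to zero
c(g1, g2, g3)
```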
From Eurostat (the European Union’s statistical information service), the houseexpend data set records the composition, as proportions, of mean household consumption expenditure on 12 domestic yearly costs in 27 states of the European Union. Some values in the data set are rounded zeros. In addition, the data set contains the gross domestic product in 2005 (GDP05) and in 2014 (GDP14). An interesting analysis is the potential association between expenditure compositions and GDP. Once a linear regression model is established, predictions can be provided.
house_expend
An object of class data.frame with 27 rows and 15 columns.
In a sample survey of single persons living alone in rented accommodation, twenty men and twenty women were randomly selected and asked to record over a period of one month their expenditures on the following four mutually exclusive and exhaustive commodity groups:
* Hous: Housing, including fuel and light.
* Food: Foodstuffs, including alcohol and tobacco.
* Serv: Services, including transport and vehicles.
* Other: Other goods, including clothing, footwear and durable goods.
household_budget
An object of class data.frame with 40 rows and 6 columns.
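Build an isometric log-ratio basis for a composition with k+1 parts. By default, the basis is the one given by Egozcue et al. (2003):
$$h_i = \sqrt{\frac{i}{i+1}} \log\frac{\sqrt[i]{\prod_{j=1}^{i} x_j}}{x_{i+1}}$$
for $i = 1, \ldots, k$.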
ilr_basis(dim, type = "default")
olr_basis(dim, type = "default")
dim |
number of components |
type |
if different from 'pivot' (pivot balances) or 'cdp' (CoDaPack balances), the default balances are returned, computed as a triangular Helmert matrix as defined by Egozcue et al. (2003). |
By modifying the type parameter ('pivot' or 'cdp'), other ilr/olr bases can be generated.
matrix
Egozcue, J.J., Pawlowsky-Glahn, V., Mateu-Figueras, G. and Barceló-Vidal C. (2003). Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35(3) 279-300
ilr_basis(5)
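A Helmert-type orthonormal basis can be written down explicitly. The sketch below assumes the form h_i = sqrt(i/(i+1)) * log(gmean(x_1, ..., x_i) / x_{i+1}), so that column i has the value sqrt(i/(i+1))/i in rows 1 to i and -sqrt(i/(i+1)) in row i+1. The helper `ilr_default_sketch` is hypothetical, and the sign and ordering conventions are assumptions rather than a transcription of the package code.

```r
# Hypothetical construction of a Helmert-type ilr basis for D parts
ilr_default_sketch <- function(D) {
  B <- matrix(0, nrow = D, ncol = D - 1)
  for (i in seq_len(D - 1)) {
    w <- sqrt(i / (i + 1))
    B[1:i, i] <- w / i      # parts 1..i enter through their geometric mean
    B[i + 1, i] <- -w       # part i+1 is the denominator
  }
  B
}

B_ilr <- ilr_default_sketch(5)

# Columns are orthonormal ...
round(crossprod(B_ilr), 12)
# ... and each column is a log-contrast (coefficients sum to zero)
colSums(B_ilr)
```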
The mammalsmilk data set contains the percentages of five constituents (W: water, P: protein, F: fat, L: lactose, and A: ash) of the milk of 24 mammals. The data are taken from [Har75].
mammals_milk
An object of class data.frame with 24 rows and 6 columns.
In an attempt to improve the quality of cow milk, milk from each of thirty cows was assessed by dietary composition before and after a strictly controlled dietary and hormonal regime over a period of eight weeks. Although seasonal variations in milk quality might have been regarded as negligible over this period, it was decided to have a control group of thirty cows kept under the same conditions but on a regular established regime. The sixty cows were of course allocated to control and treatment groups at random. The 'milk_cows' data set provides the complete set of before and after milk compositions for the sixty cows, showing the protein (pr), milk fat (mf), carbohydrate (ch), calcium (Ca), sodium (Na) and potassium (K) proportions by weight of total dietary content.
milk_cows
An object of class tbl_df (inherits from tbl, data.frame) with 116 rows and 10 columns.
The montana data set consists of 229 samples of the concentration (in ppm) of minor elements [Cr, Cu, Hg, U, V] in carbon ashes from the Fort Union formation (Montana, USA), side of the Powder River Basin. The formation is mostly Palaeocene in age, and the coal is the result of deposition in conditions ranging from fluvial to lacustrine. All samples were taken from the same seam at different sites over an area of 430 km by 300 km, which implies that on average, the sampling spacing is 24 km. Using the spatial coordinates of the data, a semivariogram analysis was conducted for each chemical element in order to check for a potential spatial dependence structure in the data (not shown here). No spatial dependence patterns were observed for any component, which allowed us to assume an independence of the chemical samples at different locations.
The aforementioned chemical components actually represent a fully observed subcomposition of a much larger chemical composition. The five elements are not closed to a constant sum. Note that, as the samples are expressed in parts per million and all concentrations were originally measured, a residual element could be defined to fill up the gap to 10^6.
montana
An object of class data.frame with 229 rows and 6 columns.
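The residual element mentioned above can be computed as the difference between 10^6 and the row total of the observed parts. The values below are hypothetical ppm concentrations made up for the illustration; they are not rows of the montana data set.

```r
# Hypothetical concentrations in ppm (not taken from the montana data)
ppm <- data.frame(Cr = c(12, 30), Cu = c(8, 15), Hg = c(0.1, 0.2),
                  U = c(1.5, 2), V = c(20, 40))

# Residual part filling the gap up to one million
ppm$Residual <- 1e6 - rowSums(ppm)

# The extended composition is now closed to a constant sum
closed_totals <- rowSums(ppm)
closed_totals
```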
The function returns all combinations of pairs of log-ratios.
pairwise_basis(dim)
dim |
dimension to build the pairwise log-ratio generator system |
matrix
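A generator system of pairwise log-ratios can be sketched with `combn`: each column holds +1 and -1 for one pair of parts. The helper `pairwise_contrasts` is hypothetical, and the ordering and sign convention (first part of the pair in the numerator) are assumptions that may differ from the package's output.

```r
# Hypothetical helper: one log-contrast column per unordered pair of parts
pairwise_contrasts <- function(D) {
  pairs <- combn(D, 2)
  B <- matrix(0, nrow = D, ncol = ncol(pairs))
  for (k in seq_len(ncol(pairs))) {
    B[pairs[1, k], k] <- 1    # numerator part
    B[pairs[2, k], k] <- -1   # denominator part
  }
  B
}

B_pw <- pairwise_contrasts(4)
dim(B_pw)  # 4 parts give choose(4, 2) = 6 log-ratios

x <- c(1, 2, 4, 8)
pw_coords <- as.vector(log(x) %*% B_pw)
pw_coords
```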
Results of the Catalan parliament elections in 2017, by region.
parliament2017
A data frame with 42 rows and 9 variables:
Region
Votes to Ciutadans party
Votes to Junts per Catalunya party
Votes to Esquerra republicana de Catalunya party
Votes to Partit socialista de Catalunya party
Votes to Catalunya si que es pot party
Votes to Candidatura d'unitat popular party
Votes to Partit popular party
Votes to other parties
https://www.idescat.cat/tema/elecc
Exact method to calculate the principal balances of a compositional dataset. Different methods to approximate the principal balances of a compositional dataset are also included.
pb_basis( X, method, constrained.complete_up = FALSE, cluster.method = "ward.D2", ordering = TRUE, ... )
X |
compositional dataset |
method |
method to be used with Principal Balances. Methods available are: 'exact', 'constrained' or 'cluster'. |
constrained.complete_up |
When searching up, should the algorithm try to find possible siblings for the current balance (TRUE) or build a parent directly, forcing the current balance to be part of the next balance (default: FALSE). While the first is more exhaustive and gives better results, the second is faster and can be used with high-dimensional datasets. |
cluster.method |
Method to be used with the hclust function (default: 'ward.D2') or any other method available in hclust function |
ordering |
should the principal balances found be returned ordered? (first column, first principal balance and so on) |
... |
parameters passed to hclust function |
matrix
Martín-Fernández, J.A., Pawlowsky-Glahn, V., Egozcue, J.J., Tolosana-Delgado R. (2018). Advances in Principal Balances for Compositional Data. Mathematical Geosciences, 50, 273-298.
set.seed(1)
X = matrix(exp(rnorm(5*100)), nrow=100, ncol=5)
# Optimal variance obtained with Principal components
(v1 <- apply(coordinates(X, 'pc'), 2, var))
# Optimal variance obtained with Principal balances
(v2 <- apply(coordinates(X,pb_basis(X, method='exact')), 2, var))
# Solution obtained using constrained method
(v3 <- apply(coordinates(X,pb_basis(X, method='constrained')), 2, var))
# Solution obtained using Ward method
(v4 <- apply(coordinates(X,pb_basis(X, method='cluster')), 2, var))
# Plotting the variances
barplot(rbind(v1,v2,v3,v4), beside = TRUE, ylim = c(0,2),
        legend = c('Principal Components','PB (Exact method)',
                   'PB (Constrained)','PB (Ward approximation)'),
        names = paste0('Comp.', 1:4), args.legend = list(cex = 0.8), ylab = 'Variance')
Different approximations to approximate the principal balances of a compositional dataset.
pc_basis(X)
X |
compositional dataset |
matrix
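Assuming pc_basis corresponds to the 'pc' option of coordinates (clr principal components), a comparable basis can be sketched in base R by running a principal component analysis on the clr-transformed data. This is an illustration under that assumption, not the package's code; signs and scaling of the columns may differ.

```r
set.seed(1)
X <- matrix(exp(rnorm(5 * 100)), nrow = 100, ncol = 5)

# clr transform: subtract the row mean of the logs from every part
clrX <- log(X) - rowMeans(log(X))

# PCA on the clr data; the last component has (numerically) zero variance
pc <- prcomp(clrX)
B_pc <- pc$rotation[, 1:4]

# The retained loadings are orthonormal log-contrasts
round(colSums(B_pc), 10)

# Variances of the coordinates decrease from the first to the last column
pc_var <- apply(log(X) %*% B_pc, 2, var)
pc_var
```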
This petrafm data set is formed by 100 classified volcanic rock samples from Ontario (Canada). The three parts are:
Rocks from the calc-alkaline magma series (25) can be well distinguished from samples from the tholeiitic magma series (75) on an AFM diagram.
petrafm
An object of class data.frame with 100 rows and 4 columns.
Plot a balance
plot_balance(B, data = NULL, main = "Balance dendrogram", ...)
B |
Balance to plot |
data |
(Optional) Data used to calculate the statistics associated to a balance |
main |
Plot title |
... |
further arguments passed to plot |
Balance plot
The pollen data set is formed by 30 fossil pollen samples from three different locations (recorded in variable group). The samples were analysed and the 3-part composition [pinus, abies, quercus] was measured.
pollen
An object of class data.frame with 30 rows and 4 columns.
The pottery data set consists of data pertaining to the chemical composition of 45 specimens of Romano-British pottery. The method used to generate these data is atomic absorption spectrophotometry, and readings for nine oxides (Al2O3, Fe2O3, MgO, CaO, Na2O, K2O, TiO2, MnO, BaO) are provided. These samples come from five different kiln sites.
pottery
An object of class data.frame with 45 rows and 11 columns.
The function hides the basis attribute. An option is included to show such basis.
## S3 method for class 'coda'
print(x, ..., basis = getOption("coda.base.basis"))
x |
coordinates |
... |
parameters passed to print function |
basis |
boolean to show or not the basis with the output |
Import data from a CoDaPack workspace
read_cdp(fname)
fname |
cdp file name |
Build an ilr_basis using a sequential binary partition or a generic coordinate system based on balances.
sbp_basis(sbp, data = NULL, fill = FALSE, silent = FALSE)
sbp |
parts to consider in the numerator and the denominator. Can be defined either using a list of formulas setting parts (see examples) or using a matrix where each column defines a balance. Positive values are parts in the numerator, negative values are parts in the denominator, zeros are parts not used to build the balance. |
data |
composition from where name parts are extracted |
fill |
should the balances be completed to become an orthonormal basis? if the given balances are not orthonormal, the function will complete the balance to become a basis. |
silent |
inform about orthogonality |
matrix
X = data.frame(a=1:2, b=2:3, c=4:5, d=5:6, e=10:11, f=100:101, g=1:2)
sbp_basis(list(b1 = a~b+c+d+e+f+g,
               b2 = b~c+d+e+f+g,
               b3 = c~d+e+f+g,
               b4 = d~e+f+g,
               b5 = e~f+g,
               b6 = f~g), data = X)
sbp_basis(list(b1 = a~b,
               b2 = b1~c,
               b3 = b2~d,
               b4 = b3~e,
               b5 = b4~f,
               b6 = b5~g), data = X)
# A non-orthogonal basis can also be calculated.
sbp_basis(list(b1 = a+b+c~e+f+g,
               b2 = d~a+b+c,
               b3 = d~e+g,
               b4 = a~e+b,
               b5 = b~f,
               b6 = c~g), data = X)
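The matrix form of a partition (positive entries for numerator parts, negative for denominator parts, zeros for unused parts) maps to balance coefficients through the usual normalisation: with r numerator parts and s denominator parts, numerator parts get sqrt(s / (r (r + s))) and denominator parts get -sqrt(r / (s (r + s))). The sketch below applies that textbook formula with a hypothetical helper `balance_from_signs`; it is not the package's implementation.

```r
# Hypothetical helper: turn one sign column into a unit-norm balance
balance_from_signs <- function(s) {
  n_pos <- sum(s > 0)   # parts in the numerator
  n_neg <- sum(s < 0)   # parts in the denominator
  b <- numeric(length(s))
  b[s > 0] <-  sqrt(n_neg / (n_pos * (n_pos + n_neg)))
  b[s < 0] <- -sqrt(n_pos / (n_neg * (n_pos + n_neg)))
  b
}

# A sequential binary partition of 4 parts, one balance per column
sbp <- cbind(c(1, 1, -1, -1),
             c(1, -1, 0, 0),
             c(0, 0, 1, -1))

B_sbp <- apply(sbp, 2, balance_from_signs)
B_sbp

# For a proper sequential binary partition the columns are orthonormal
round(crossprod(B_sbp), 12)
```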
The 'serprot' data set records the percentages of the four serum proteins from the blood samples of 30 patients. Fourteen patients have one disease (1) and sixteen are known to have another different disease (2). The 4-compositions are formed by the proteins [albumin, pre-albumin, globulin A, globulin B].
serprot
An object of class data.frame with 36 rows and 7 columns.
Time budgets (how a day or a period of work is divided up into different activities) have become a popular source of data in psychology and sociology. To illustrate such problems we consider six daily activities undertaken by an academic statistician: teaching (T); consultation (C); administration (A); research (R); other wakeful activities (O); and sleep (S).
The 'statistician_time' data set records the daily time (in hours) devoted to each activity, recorded on each of 20 days, selected randomly from working days in alternate weeks so as to avoid possible carry-over effects such as a short-sleep day being compensated by make-up sleep on the succeeding day. The six activities may be divided into two categories: 'work' comprising activities T, C, A, and R, and 'leisure', comprising activities O and S. Our analysis may then be directed towards the work pattern consisting of the relative times spent in the four work activities, the leisure pattern, and the division of the day into work time and leisure time. Two obvious questions are as follows. To what extent, if any, do the patterns of work and of leisure depend on the times allocated to these major divisions of the day? Is the ratio of sleep to other wakeful activities dependent on the times spent in the various work activities?
statistitian_time
An object of class data.frame with 20 rows and 7 columns.
Variation array is returned.
variation_array(X, include_means = FALSE)
X |
Compositional dataset |
include_means |
if TRUE logratio means are included in the lower-left triangle |
variation array matrix
set.seed(1)
X = matrix(exp(rnorm(5*100)), nrow=100, ncol=5)
variation_array(X)
variation_array(X, include_means = TRUE)
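For orientation, the variances in a variation array are the variances of all pairwise log-ratios, var(log(x_i / x_j)). The base-R sketch below computes the full symmetric matrix of these variances; how the package arranges variances and means within the returned array is not reproduced here.

```r
set.seed(1)
X <- matrix(exp(rnorm(5 * 100)), nrow = 100, ncol = 5)

D <- ncol(X)
L <- log(X)

# V[i, j] = variance of the log-ratio between parts i and j
V <- matrix(0, nrow = D, ncol = D)
for (i in seq_len(D)) {
  for (j in seq_len(D)) {
    V[i, j] <- var(L[, i] - L[, j])
  }
}
round(V, 3)
```

Small entries indicate pairs of parts that stay close to proportional across the data set.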
The actual population residing in a municipality of Catalonia is composed of the census count and the so-called floating population (tourists, seasonal visitors, hostel students, short-time employees, and the like). Since the actual population combines long- and short-term residents, it is convenient to express it as equivalent full-time residents. Floating population may be positive if the municipality is receiving more short-term residents than it is sending elsewhere, or negative if the opposite holds (expressed as a percentage above, if positive, or below, if negative, the census count). The waste data set includes this information in the variable floating population. Floating population has a large impact on solid waste generation, and thus waste can be used to predict floating population, which is a hard-to-estimate demographic variable. This case study was presented in Coenders et al. (2017).
waste
An object of class data.frame with 215 rows and 10 columns.
Tourists and census population do not generate the same volume of waste and have different consumption and recycling patterns (waste composition). The Catalan Statistical Institute (IDESCAT) publishes official floating population data for all municipalities in Catalonia (Spain) above 5000 census inhabitants. The composition of urban solid waste is classified into D = 5 parts:
* x1: non recyclable (grey waste container in Catalonia),
* x2: glass (bottles and jars of any colour: green waste container),
* x3: light containers (plastic packaging, cans and tetra packs: yellow container),
* x4: paper and cardboard (blue container), and
* x5: biodegradable waste (brown container).
G. Coenders, J.A. Martín-Fernández and B. Ferrer-Rosell, When relative and absolute information matter: compositional predictor with a total in generalized linear models. Statistical Modelling 17(6) (2017), 494–512.
The 'weibo_hotels' data set aims at comparing the use of Weibo (Facebook equivalent in China) in hospitality e-marketing between small and medium accommodation establishments (private hostels, small hotels) and big and well-established businesses (such as international hotel chains or large hotels) in China. The 50 latest posts of the Weibo pages of each hotel (n = 10) are content-analyzed and coded regarding the count of posts featuring information on a 4-part composition [facilities, food, events, promotions]. Hotels were coded as large “L” or small “S” in the hotel size categorical variable.
weibo_hotels
An object of class data.frame with 10 rows and 5 columns.