Title: | Auxiliary Functions for Phenological Data Analysis |
---|---|
Description: | Provides some easy-to-use functions for time series analyses of (plant-) phenological data sets. These functions mainly deal with the estimation of combined phenological time series and are usually wrappers for functions that are already implemented in other R packages adapted to the special structure of phenological data and the needs of phenologists. Some date conversion functions to handle Julian dates are also provided. |
Authors: | Joerg Schaber |
Maintainer: | Maximilian Lange <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.7-0 |
Built: | 2024-10-31 19:55:58 UTC |
Source: | CRAN |
Finds connected data sets, i.e. connected rows and columns of a numeric matrix M.
connectedSets(M)
connectedSets(M)
M |
Numeric matrix with missing values assumed to be NA or 0. |
In a two-way classification of linear models sometimes independent sets of normal equations are obtained due to missing data in the experiments design, i.e. the complete design matrix is not of full rank and thus no solution can be found. However, solutions of the independent sets of normal equations can still exist. This phenomenon is called 'connectedness' of the data. Especially in phenological analysis experimental designs are almost always unbalanced because of missing data. Thus, when combined time series are to be estimated, it is worth checking for and finding connected data sets for which combined time series can then be estimated. Example (also see example data(Simple) and example in 'maxConnectedSet'): In the following matrix dots represent missing values, X represent observations and the lines join the connected sets:
: X\_\_\_X . .
:
: X\_\_\_X . .
:
: . . X\_\_\_X
Thus, in this matrix observations in rows 1 and 2 or colums 1 and 2 form one connected set. Likewise row 3 (or columns 3 and 4) form also one connected set.
rowclasses |
Vector of set numbers of rows of M (>=1). A value of |
colclasses |
Vector of set numbers of columns of M(>=1). A value of |
Joerg Schaber
Searle (1997) 'Linear Models'. Wiley. page 318.
maxConnectedSet
getConnectedSets
data(Simple) connectedSets(Simple)
data(Simple) connectedSets(Simple)
Converts a string date "DD.MM.YYYY" into a Julian day of year (DOY).
date2jul1(d)
date2jul1(d)
d |
Date as charater string 'DD.MM.YYYY'. |
doy |
Day of year as integer. |
year |
Year as integer. |
Joerg Schaber
date2jul1('31.05.1970')
date2jul1('31.05.1970')
Converts an integer date (day,month,year) into a Julian day of year (DOY). If y is missing, 2000 is assumed.
date2jul2(d,m,y)
date2jul2(d,m,y)
d |
Day of month, numeric coecerd into an integer. |
m |
Month of year, numeric coerced into an integer. |
y |
Year, numeric coerced into an integer, default 2000. |
doy |
Day of year as integer. |
year |
Year as integer. |
Joerg Schaber
date2jul2(31,5,1970)
date2jul2(31,5,1970)
Calculates daylength [h] and declination angle delta [radians] on day i [julian day of year] for latitude l [degrees].
daylength(i,l)
daylength(i,l)
i |
Integer as julian day of year (1-365) |
l |
Float as latitude [degress] |
dl |
daylength [h] |
delta |
declination angle [degrees] |
Joerg Schaber
daylength(as.integer(120),63)
daylength(as.integer(120),63)
Number of days between date1 and date2.
daysbetween(d1,d2)
daysbetween(d1,d2)
d1 |
Date as a character string 'DD.MM.YYYY'. |
d2 |
Date as s character string 'DD.MM.YYYY'. |
ndays |
Number of days between d1 and d2. |
Joerg Schaber
daysbetween('31.05.1970','10.03.2004')
daysbetween('31.05.1970','10.03.2004')
Phenological observations of nine stations from 1951 to 1998. Data from the German Weather Service.
data(DWD)
data(DWD)
Data frame containing three columns (day of year of observations,year,station-id)
German Weather Service
Schaber J, Badeck F-W (2002) 'Evaluation of methods for the combination of phenological time series and outlier detection'. Tree Physiology 22:973-982
Finds a list of connected data sets in a matrix or data frame and returns them accordingly.
getConnectedSets(M)
getConnectedSets(M)
M |
Numeric matrix with missing values considered as 0, or a data frame. The data frame is internally converted to a matrix and should have three columns (x, factor 1, factor 2) where x are considered the entries of the matrix, rows correspond to levels of factor 2 and columns correspond to levels of factor 1. |
getConnctedSets returns a list of connected data sets as numeric data frames D with three columns (x, factor 1, factor 2) or a n*m matrix M, where the n rows correspond to n levels of factor 2 and m columns correspond to m levels of factor the respective factors. Output as data frame or matrix, depending on input.
cs_i |
List of connected sets as matrix or data frame, corresponding to the input. named as cs_i with i being the number of the connected sets. |
Joerg Schaber
Searle (1997) 'Linear Models'. Wiley. page 318.
data(Searle) getConnectedSets(Searle)
data(Searle) getConnectedSets(Searle)
Converts Julian day of year (DOY) into a string date 'DD.MM.YYYY'. If y is missing a non-leap year is assumed.
jul2date1(d,y)
jul2date1(d,y)
d |
DOY, numeric coerced into an integer. |
y |
Year, numeric coerced into an integer, default 2000. |
date |
Date, as character string 'DD.MM.YYYY' |
Joerg Schaber
jul2date1(151,1970)
jul2date1(151,1970)
Converts Julian day of year (DOY) into an integer date (day,month,year). If y is missing a non-leap year is assumed.
jul2date2(d,y)
jul2date2(d,y)
d |
DOY, numeric coerced into an integer. |
y |
Year, numeric coerced into an integer, default 2000. |
day |
Day of month as integer. |
month |
Month of year as integer. |
year |
Year as integer. |
Joerg Schaber
jul2date2(151,1970)
jul2date2(151,1970)
Tests whether a given year is a leap year or not.
leapyear(y)
leapyear(y)
y |
Year, numeric coerced into integer. |
TRUE |
leap year |
FALSE |
non leap year |
Joerg Schaber
leapyear(2000) leapyear(2004)
leapyear(2000) leapyear(2004)
Converts a numeric matrix M into a dataframe D with three columns (x, factor 1, factor 2) where rows of M are ranks of factor 1 levels and columns of M are ranks of factor 2 levels, missing values are assumed to be 0 or NA. The resulting dataframe D has no missing values.
matrix2raw(M,l1,l2)
matrix2raw(M,l1,l2)
M |
Numeric matrix, with missing values assumed to be NA or 0. |
l1 |
Optional numeric vector of level names of column 2 (factor 1)
of returned data frame. If missing it is assigned row numbers of |
l2 |
Optional numeric vector of level names of column 3 (factor 2)
of returned data frame. If missing it is assigned column numbers of |
D |
Data frame with three columns: (y,f1,f1). |
Joerg Schaber
data(DWD) M <- raw2matrix(DWD) # conversion to matrix D1 <- matrix2raw(M) # back conversion, but with different level names D2 <- matrix2raw(M,c(1951:1998),c(1:9)) # with original level names
data(DWD) M <- raw2matrix(DWD) # conversion to matrix D1 <- matrix2raw(M) # back conversion, but with different level names D2 <- matrix2raw(M,c(1951:1998),c(1:9)) # with original level names
Finds connected data set, i.e. connected rows and columns of a numeric matrix M, that has the largest number of data entries.
maxConnectedSet(M)
maxConnectedSet(M)
M |
Numeric matrix with missing values considered as 0, or a data frame. The data frame is internally converted to a matrix and should have three columns (x, factor 1, factor 2) where x are considered the entries of the matrix, rows correspond to levels of factor 2 and columns correspond to levels of factor 1. |
In a two-way classification of linear models sometimes independent sets of normal equations are obtained due to missing data in the experiments design, i.e. the complete design matrix is not of full rank and thus no solution can be found. However, solutions of the independent sets of normal equations can still exist. This phenomenon is called 'connectedness' of the data. Especially in phenological analysis experimental designs are almost always unbalanced because of missing data. Thus, when combined time series are to be estimated, it is worth checking for and finding connected data sets for which combined time series can then be estimated. This can also be interpreted in the way that a prerequisite to obtain a combined time series is to have overlapping time series. Example (also see example data(Searle) from Searle (1997), page 324 and example in 'connectedSets'): In the following matrix dots represent missing values, X represent observations and the lines join the connected sets:
: X\_\_\_.\_\_\_.\_\_\_.\_\_\_X . . .
:
: . . X\_\_\_.\_\_\_!\_\_\_.\_\_\_.\_\_\_X
:
: . X\_\_\_.\_\_\_.\_\_\_!\_\_\_X\_\_\_X !
:
: . X\_\_\_.\_\_\_.\_\_\_!\_\_\_X\_\_\_X !
:
: . . . . X . . !
:
: . . X\_\_\_.\_\_\_!\_\_\_.\_\_\_.\_\_\_X
:
: . . . X\_\_\_X . . .
Thus, in this matrix observations of rows 1, 5 and 7 or colums 1, 4 and 5 form one connected set. Likewise observations of rows 2 and 6 (or columns 3 and 8) and rows 3 and 4 (or columns 2, 6 and 7) form also connected sets, respectively.
ms |
maximal connected set as matrix or data frame, corresponding to the input. |
maxl |
Number of observations in the maximal connected data set. |
nsets |
Number of connected data sets. |
lsets |
Vector with number of observations in each connected data sets, i.e. lsets[i] is the number of observations in connected data set i. |
Joerg Schaber
Searle (1997) 'Linear Models'. Wiley. page 318.
data(Searle) maxConnectedSet(Searle)
data(Searle) maxConnectedSet(Searle)
Calculates maximal daylength maxdl [h] at a certain latitude l [degrees].
maxdaylength(l)
maxdaylength(l)
l |
Latitude in degrees. |
maxdl |
Maximal daylength [h] at a certain latitude l [degrees] |
Joerg Schaber
maxdaylength(60)
maxdaylength(60)
Creation of dense two-way classification design matrix. The sum of the second factor is constrained to be zero. No general mean.
pheno.ddm(D,na.omit=TRUE)
pheno.ddm(D,na.omit=TRUE)
D |
Data frame with three columns: (observations, factor 1, factor 2). |
na.omit |
Determined whether missing values should be omitted or not. Default is TRUE. |
In phenological applications observations should be the julian day
of observation of a certain phase, factor 1 should be the observation year
and factor 2 should be a station-id.
Usually this is much easier created by:
y <- factor(f1),
s <- factor(f2),
ddm <- as.matrix.csr(model.matrix(~ y + s -1, contrasts=list(s=("contr.sum"))))
.
However, this procedure can be quite memory demanding and might exceed storage
capacity for large problems.
This procedure here is much less memory comsuming.
Moreover, in order to get direct estimates for all coefficients,
an additional row is appended to the matrix, where the columns for the second factor are set to 1.
Therefore, dimensions of ddm
are (nlevels(factor1)+1)x(nlevels(factor2)).
ddm |
Dense roworder matrix, matrix.csr format (see matrix.csr in package SparseM) |
D |
Data frame D sorted first by f2 then by f1 and with rows containing NA's removed. |
na.rows |
Rows in D that were omitted due to missing values. |
Joerg Schaber
data(DWD) ddm1 <- pheno.ddm(DWD) attach(DWD) y <- factor(DWD[[2]]) s <- factor(DWD[[3]]) ddm2 <- as.matrix.csr(model.matrix(~ y + s -1, contrasts=list(s=("contr.sum")))) identical(ddm1$ddm,ddm2)
data(DWD) ddm1 <- pheno.ddm(DWD) attach(DWD) y <- factor(DWD[[2]]) s <- factor(DWD[[3]]) ddm2 <- as.matrix.csr(model.matrix(~ y + s -1, contrasts=list(s=("contr.sum")))) identical(ddm1$ddm,ddm2)
Fits a two-way linear fixed model. The model assumes the first factor f1 the second factor f2 to be fixed. Errors are assumed to be i.i.d. No general mean and sum of f2 is constrained to be zero.
pheno.flm.fit(D,limit=1000)
pheno.flm.fit(D,limit=1000)
D |
Data frame with three columns (x, f1, f2) or a matrix where rows are ranks of factor f1 levels and columns are ranks of factor f2 levels and missing values are assumed to be NA or 0. |
limit |
Integer that determines which algorithm to use (see Details). |
This function is basically a wrapper for the slm.fit()
function
form the SparseM
package, adapted for the estimation of combined phenological time series.
In phenological application, x should be the julian day
of observation of a certain phase, factor f1 should be the observation year
and factor f2 should be a station-id.
For large problems length(x)>limit
, the linear model is calculated
for treatment contrasts for efficiency reasons, and the constraint that the sum of f2 is zero,
is adjusted afterwards. This results in a slight over-estimation of
standard errors.
Note that the input data is sorted before fitting, such that subsequent
analyses using the input data should be done using the sorted output data frame.
f1 |
Estimated fixed effects f1, in phenology this is precisely the combined time series. |
f1.se |
f1 estimated standard error. |
f1.lev |
Levels of f1. Should be the same order as f1. |
f2 |
Estimated fixed effects f2, in phenology these are the station effects. |
f2.se |
f2 estimated standard error. |
f2.lev |
Levels of f2. Should be the same order as f2. |
resid |
Residuals |
lclf1 |
Lower 95 percent confidence limit of factor f1. |
uclf1 |
Upper 95 percent confidence limit of factor f1. |
lclf2 |
Lower 95 percent confidence limit of factor f2. |
uclf2 |
Upper 95 percent confidence limit of factor f2. |
D |
The input as ordered data frame, ordered first by f2 then by f1 |
fit |
The fitted lm model object. |
Joerg Schaber
Searle (1997) 'Linear Models'. Wiley. Schaber J, Badeck F-W (2002) 'Evaluation of methods for the combination of phenological time series and outlier detection'. Tree Physiology 22:973-982
data(DWD) R <- pheno.flm.fit(DWD) # parameter estimation
data(DWD) R <- pheno.flm.fit(DWD) # parameter estimation
Fits a robust two-way linear model. The model assumes both factors (f1 and f2) to be fixed. Errors are assumed to be i.i.d. No general mean and sum of f2 is constrained to be zero.
pheno.lad.fit(D,limit=1000)
pheno.lad.fit(D,limit=1000)
D |
Data frame with three columns (x, f1, f2) or a matrix where rows are ranks of factor f1 levels and columns are ranks of factor f2 levels and missing values are assumed to be NA or 0. |
limit |
Integer that determines which algorithm to use (see Details). |
The function minimizes the least absolute deviations (LAD or L1 norm)
of the residuals of a two-way linear model.
This function is basically a wrapper for the rq.fit()
or rq.fit.sfn()
functions of the quantreg
package, respectively,
adapted for the estimation of combined phenological time series.
Depending on the size of the problem length(x)<=limit
either the rq.fit()
function using the Barrodale-Roberts algorithm is used or
(length(x)>1000) the corresponding dense matrix implementation with
rq.fit.sfn()
using the Interior-Point method.
In phenological applications, x should be the julian day
of observation of a certain phase, factor f1 should be the observation year
and factor f2 should be a station-id.
For efficiency reasons, the linear model is calcualted for treatment contrasts
and the constraint that the sum of f2 is zero, is adjusted afterwards.
Note that the input data is sorted before fitting, such that subsequent
analyses using the input data should be done using the sorted output data frame.
f1 |
Estimated parameters of factor f1, in phenology this is precisely the combined time series. |
f1.lev |
Levels of f1. Should be the same order as f1. |
f2 |
Estimated parameters of factor f2, in phenology these are precisely the station effects. |
f2.lev |
Levels of f2. Should be the same order as f2. |
resid |
Residuals |
ierr |
For length(x) > 1000 this is the return error code of |
D |
The input as ordered data frame, ordered first by f2 then by f1 |
fit |
The fitted rq.fit model object. |
Joerg Schaber
Rousseeuw PJ, Leroy AM (1987) 'Robust estimation and outlier detection'. Wiley. Schaber J, Badeck F-W (2002) 'Evaluation of methods for the combination of phenological time series and outlier detection'. Tree Physiology 22:973-982
data(DWD) R <- pheno.lad.fit(DWD) # robust parameter estimation plot(levels(factor(R$D[[2]])),R$p1,type="l") # plot combined time series R$D[R$resid >= 30,] # observation whose residuals # are > 30 days (outliers)
data(DWD) R <- pheno.lad.fit(DWD) # robust parameter estimation plot(levels(factor(R$D[[2]])),R$p1,type="l") # plot combined time series R$D[R$resid >= 30,] # observation whose residuals # are > 30 days (outliers)
Fits a two-way linear mixed model. The model assumes the first factor f1 to be fixed and the second factor f2 to be random. Errors are assumed to be i.i.d. No general mean and sum of f2 is constrained to be zero.
pheno.mlm.fit(D)
pheno.mlm.fit(D)
D |
Data frame with three columns (x, f1, f2) or a matrix where rows are ranks of factor f1 levels and columns are ranks of factor f2 levels and missing values are set to 0. |
This function is basically a wrapper for the lme()
function of
the nlme
package, adapted for the estimation of combined
phenological time series. Estimation method: restricted maximum likelihood (REML)
In phenological application, x should be the julian day
of observation of a certain phase, factor f1 should be the observation year
and factor f2 should be a station-id.
Note that the input data is sorted before fitting, such that subsequent
analyses using the input data should be done using the sorted output data frame.
fixed |
Estimated fixed effects, in phenology this is precisely the combined time series. |
fixed.lev |
Levels of fixed effects. Should be the same order as fixed effects. |
random |
Estimated random effects, in phenology these are the station effects. |
random.lev |
Levels of random effects. Should be the same order as random effects. |
SEf1 |
Standard error group f1, i.e. square root of variance component fixed effect. |
SEf2 |
Standard error group f2, i.e. square root of variance component random effect. |
lclf |
Lower 95 percent confidence limit of fixed effects. |
uclf |
Upper 95 percent confidence limit of fixed effects. |
D |
The input as ordered data frame, ordered first by f2 then by f1 |
fit |
The fitted lme model object. |
Joerg Schaber
Searle (1997) 'Linear Models'. Wiley. Schaber J, Badeck F-W (2002) 'Evaluation of methods for the combination of phenological time series and outlier detection'. Tree Physiology 22:973-982
data(DWD) R <- pheno.mlm.fit(DWD) # parameter estimation plot(levels(factor(DWD[[2]])),R$fixed,type="l") # plot combined time series tr <- lm(R$fixed~rank(levels(factor(DWD[[2]]))))# trend estimation summary(tr)$coef[2] # slope of trend summary(tr)$coef[4] # standard error of trend
data(DWD) R <- pheno.mlm.fit(DWD) # parameter estimation plot(levels(factor(DWD[[2]])),R$fixed,type="l") # plot combined time series tr <- lm(R$fixed~rank(levels(factor(DWD[[2]]))))# trend estimation summary(tr)$coef[2] # slope of trend summary(tr)$coef[4] # standard error of trend
Converts a numeric data frame D with three columns (x, factor 1, factor 2) to a numeric matrix M where rows are ranks of levels of factor 1 and columns are ranks of levels of factor 2, missing values are set to NA.
raw2matrix(D)
raw2matrix(D)
D |
Data frame with three columns (x, factor 1, factor 2) |
M |
Numeric matrix where rows are ranks of levels of factor 1 and columns are ranks of levels of factor 2, missing values are set to NA. |
Joerg Schaber
data(DWD) raw2matrix(DWD)
data(DWD) raw2matrix(DWD)
Example of a two-way classification table where lacking data creates three distinct connected sets.
data(Searle)
data(Searle)
R source file
Searle (1997) 'Linear Models'. Wiley. 324p.
The sequential Mann-Kendall test on time series x detects approximate potential trend turning points in time series.
seqMK(x)
seqMK(x)
x |
Numeric vector x. |
Implicitly assumes a equidistant time series x. Calculates a progressive and a retrograde series of Kendall normalized tau's. Points where the two lines cross are considered as approximate potential trend turning points. When either the progressive or retrograde row exceed certain confidence limits before and after the crossing points, this trend turning point is considered significant at the corresponding level, i.e. 1.96 for 95
prog |
Progressive row of Kendall's normalized tau's |
retr |
Retrograde row of Kendall's normalized tau's |
tp |
Boolean vector indicating at what indices of the original timeseries the prog and retr cross, i.e. TRUE at potential trend turning points. |
Joerg Schaber
Kendall M, Gibbons JD (1990) 'Rank correlation methods'. Arnold. Sneyers R (1990) 'On statistical analysis of series of observations. Technical Note No 143. Geneva. Switzerland. World Meteorological Society. Schaber J (2003) 'Phenology in German in the 20th Century: Methods, analyses and models. Ph.D. Thesis. University of Potsdam. Germany. https://nbn-resolving.org/urn:nbn:de:kobv:517-0000532
Simple example of a two-way classification table where missing data creates two distinct connected sets.
data(Simple)
data(Simple)
R source file
Kendall's normalized tau for time series x
tau(x)
tau(x)
x |
Numeric vector x. |
Implicitly assumes a equidistant time series x.
t |
Kendall's normalized tau. |
Joerg Schaber
Kendall M, Gibbons JD (1990) 'Rank correlation methods'. Arnold. Sneyers R (1990) 'On statistical analysis of series of observations. Technical Note No 143. Geneva. Switzerland. World Meteorological Society. Schaber J (2003) 'Phenology in German in the 20th Century: Methods, analyses and models. Ph.D. Thesis. University of Potsdam. Germany. https://nbn-resolving.org/urn:nbn:de:kobv:517-0000532