Title: | K-Means for Longitudinal Data |
---|---|
Description: | An implementation of k-means specifically designed to cluster longitudinal data. It provides facilities to deal with missing values, computes several quality criteria (Calinski and Harabasz, Ray and Turi, Davies and Bouldin, BIC, ...) and proposes a graphical interface for choosing the 'best' number of clusters. |
Authors: | Christophe Genolini [cre, aut], Bruno Falissard [ctb], Patrice Kiener [ctb] |
Maintainer: | Christophe Genolini <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.5.0 |
Built: | 2024-11-23 06:20:24 UTC |
Source: | CRAN |
This package is an implementation of k-means for longitudinal data (or trajectories).
Here is an overview of the package. For the description of the algorithm, see kml.
Package: | kml |
Type: | Package |
Version: | 2.4.1 |
Date: | 2016-02-02 |
License: | GPL (>= 2) |
LazyData: | yes |
Depends: | methods,clv,longitudinalData(>= 2.1.2) |
URL: | http://www.r-project.org |
URL: | http://christophe.genolini.free.fr/kml |
To cluster data, KmL goes through three steps, each of which is associated with some functions:
Data preparation
Building "optimal" partition
Exporting results
KmL works on objects of class ClusterLongData.
Data preparation therefore simply consists of transforming data into a ClusterLongData object.
This can be done via the function clusterLongData (cld for short).
It converts a data.frame or a matrix into a ClusterLongData.
Instead of working on real data, one can also work on artificial data. Such data can be created with generateArtificialLongData (gald for short).
Once an object of class ClusterLongData has been created, the algorithm kml can be run.
Starting with a ClusterLongData, kml builds a Partition, a class in package longitudinalData.
An object of class Partition is a partition of trajectories into subgroups. It also contains some information such as the percentage of trajectories contained in each group or some quality criteria.
kml is a "hill-climbing" algorithm. The specificity of this kind of algorithm is that it always converges towards a maximum, but one cannot know whether it is a local or a global maximum. It offers no guarantee of optimality.
To maximize one's chances of getting a quality Partition, it is better to run the hill-climbing algorithm several times, then to choose the best solution. By default, kml executes the hill-climbing algorithm 20 times and chooses the Partition maximizing the determinant of the between-cluster matrix.
Likewise, it is not possible to know the optimal number of clusters beforehand.
Afterwards, however, it is possible to compute quality criteria that help to choose.
In the end, kml tests by default 2, 3, 4, 5 and 6 clusters, 20 times each.
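The restart strategy described above can be sketched in base R. This is not the package's code: it uses stats::kmeans on a plain matrix of trajectories, the helper names calinski and best_of_restarts are hypothetical, and the Calinski-Harabasz index stands in for whatever criterion kml uses internally to rank the runs.

```r
# Sketch only: repeated k-means restarts on a trajectory matrix (base R).
set.seed(1)
time <- 0:10
# Two artificial groups of 30 trajectories each: one flat, one increasing.
traj <- rbind(
  t(replicate(30, rnorm(length(time)))),
  t(replicate(30, time + rnorm(length(time))))
)

# Calinski-Harabasz index computed from a stats::kmeans fit.
calinski <- function(fit, n) {
  k <- length(fit$size)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

# Hypothetical helper: restart k-means nbRedrawing times for one k
# and keep the run with the highest criterion value.
best_of_restarts <- function(traj, k, nbRedrawing = 20) {
  fits <- lapply(seq_len(nbRedrawing), function(i) kmeans(traj, centers = k))
  scores <- vapply(fits, calinski, numeric(1), n = nrow(traj))
  list(fit = fits[[which.max(scores)]], score = max(scores))
}

# Try 2 to 6 clusters (5 restarts each to keep the sketch fast).
results <- lapply(2:6, function(k) best_of_restarts(traj, k, nbRedrawing = 5))
scores <- vapply(results, function(r) r$score, numeric(1))
names(scores) <- 2:6
best_k <- as.integer(names(scores)[which.max(scores)])
```

Each restart may end in a different local optimum, which is why the best of several runs is kept for every candidate number of clusters before the candidates are compared.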
When kml has constructed some Partition, the user can examine them one by one and choose to export some. This can be done via the function choice. choice opens a graphics window showing various information, including the trajectories clustered by a specific Partition.
When some Partition have been selected (the user can select more than one), it is possible to save them. The clusters are exported to the file name-cluster.csv. Criteria are exported to name-criteres.csv. The graphs are exported according to their extension.
It is also possible to extract a partition from the object ClusterLongData using the function getClusters.
Classes : ClusterLongData, Partition in package longitudinalData
Methods : clusterLongData, kml, choice
Plot : plot(ClusterLongData)
    ### Move to tempdir
    wd <- getwd()
    setwd(tempdir()); getwd()

    ### 1. Data Preparation
    data(epipageShort)
    names(epipageShort)
    cldSDQ <- cld(epipageShort,timeInData=3:6,time=c(3,4,5,8))

    ### 2. Building "optimal" clusteration (with only 3 redrawings)
    kml(cldSDQ,nbRedrawing=3,toPlot="both")

    ### 3. Exporting results
    ### To check the best's cluster numbers
    plotAllCriterion(cldSDQ)

    # To see the best partition
    try(choice(cldSDQ))

    ### 4. Further analysis
    epipageShort$clust <- getClusters(cldSDQ,4)
    summary(glm(gender~clust,data=epipageShort,family="binomial"))

    ### Go back to current dir
    setwd(wd)
Given some longitudinal data (trajectories) and k cluster centers, affectFuzzyIndiv computes the matrix of individual memberships (according to the fuzzy k-means algorithm).
affectFuzzyIndiv(traj, clustersCenter, fuzzyfier=1.25)
traj |
|
clustersCenter |
|
fuzzyfier |
|
Given a matrix of cluster centers clustersCenter (each line is a cluster center), the function affectFuzzyIndiv computes a "membership" for each individual and each cluster.
affectFuzzyIndiv used with calculTrajFuzzyMean simulates one fuzzy k-means step.
Matrix of memberships. Each line is an individual, each column is a cluster.
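As an illustration of what such a membership matrix looks like, here is a minimal base R sketch. It assumes the textbook fuzzy c-means membership formula with Euclidean distance, which may differ from what affectFuzzyIndiv actually computes; the helper name fuzzy_membership is hypothetical and missing values are not handled.

```r
# Sketch only: textbook fuzzy c-means memberships, not the package's code.
fuzzy_membership <- function(traj, centers, fuzzyfier = 1.25) {
  # Euclidean distance from every trajectory (row) to every center (row);
  # the result has one line per individual and one column per cluster.
  d <- apply(centers, 1, function(ctr) sqrt(rowSums(sweep(traj, 2, ctr)^2)))
  d <- pmax(d, .Machine$double.eps)      # avoid division by zero
  ratio <- d / apply(d, 1, min)          # scale each line by its smallest distance
  w <- ratio^(-2 / (fuzzyfier - 1))      # closer centers get larger weights
  w / rowSums(w)                         # memberships of an individual sum to 1
}

traj <- rbind(c(0, 0, 0), c(10, 10, 10), c(5, 5, 5))
centers <- rbind(c(0, 0, 0), c(10, 10, 10))
u <- fuzzy_membership(traj, centers)
```

The third trajectory is equidistant from both centers, so its two memberships are equal; the first two coincide with a center and belong almost entirely to it.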
    #######################
    ### affectFuzzyIndiv

    ### Some LongitudinalData
    traj <- gald()["traj"]

    ### 4 clusters centers
    center <- traj[runif(4,1,nrow(traj)),]

    ### Affectation of each individual
    affectFuzzyIndiv(traj,center)
Given some longitudinal data (trajectories) and k cluster centers, affectIndiv and affectIndivC assign each individual to the cluster whose center is the closest.
affectIndiv(traj, clustersCenter, distance = function(x,y){dist(rbind(x, y))})
affectIndivC(traj, clustersCenter)
traj |
|
clustersCenter |
|
distance |
|
Given a matrix of cluster centers clustersCenter (each line is a cluster center), the function affectIndiv assigns each individual of the matrix traj to the closest cluster (according to distance). affectIndivC does the same but assumes that the distance is the Euclidean distance. affectIndivC is written in C (and is therefore much faster).
affectIndiv used with calculTrajMean simulates one k-means step.
Object of class Partition.
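The assignment step can be sketched in a few lines of base R. The helper assign_nearest is hypothetical: it assumes Euclidean distance and no missing values, and it returns a plain integer vector rather than a Partition object.

```r
# Sketch only: nearest-center assignment with Euclidean distance.
assign_nearest <- function(traj, centers) {
  # One line per individual, one column per center.
  d <- apply(centers, 1, function(ctr) sqrt(rowSums(sweep(traj, 2, ctr)^2)))
  # Index of the smallest distance in each line (first one in case of ties).
  max.col(-d, ties.method = "first")
}

traj <- rbind(c(0, 0, 0), c(1, 1, 1), c(9, 9, 9))
centers <- rbind(c(0, 0, 0), c(10, 10, 10))
clust <- assign_nearest(traj, centers)
```

Alternating this step with a recomputation of the cluster centers gives the classic k-means iteration.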
    #######################
    ### affectIndiv

    ### Some trajectories
    traj <- gald()["traj"]

    ### 4 clusters centers
    center <- traj[runif(4,1,nrow(traj)),]

    ### Affectation of each individual
    system.time(part <- affectIndiv(traj,center))
    system.time(part <- affectIndivC(traj,center))
Given some longitudinal data and a matrix of group memberships, calculTrajFuzzyMean computes the mean trajectory of each cluster.
calculTrajFuzzyMean(traj, fuzzyClust)
traj |
|
fuzzyClust |
|
Given a matrix of individual memberships, the function calculTrajFuzzyMean computes the mean trajectory of each cluster.
affectFuzzyIndiv used with calculTrajFuzzyMean simulates one fuzzy k-means step.
A matrix with k lines and t columns containing the k cluster centers. Each line is a center, each column is a time measurement.
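A membership-weighted mean can be sketched as follows. The helper fuzzy_centers is hypothetical, and raising the memberships to the power fuzzyfier is the textbook fuzzy c-means weighting, assumed here rather than taken from the package (calculTrajFuzzyMean itself takes no fuzzyfier argument). Missing values are not handled.

```r
# Sketch only: membership-weighted mean trajectories (textbook fuzzy c-means).
fuzzy_centers <- function(traj, membership, fuzzyfier = 1.25) {
  w <- membership^fuzzyfier        # n x k matrix of weights
  # (k x n) %*% (n x t) gives k x t sums; each line is divided by its total weight.
  t(w) %*% traj / colSums(w)
}

traj <- rbind(c(0, 2), c(2, 4), c(10, 10))
membership <- rbind(c(1, 0), c(1, 0), c(0, 1))   # hard memberships for clarity
ctrs <- fuzzy_centers(traj, membership)
```

With hard (0/1) memberships the result reduces to the ordinary per-cluster mean, which makes the sketch easy to check by hand.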
    #######################
    ### calculTrajFuzzyMean

    ### Some LongitudinalData
    traj <- gald()["traj"]

    ### 4 clusters centers
    center <- traj[runif(4,1,nrow(traj)),]

    ### Affectation of each individual
    membership <- affectFuzzyIndiv(traj,center)

    ### Computation of the mean's trajectories
    calculTrajFuzzyMean(traj,membership)
Given some longitudinal data and a cluster assignment, calculTrajMean and calculTrajMeanC compute the mean trajectory of each cluster.
calculTrajMean(traj, clust, centerMethod = function(x){mean(x, na.rm =TRUE)})
calculTrajMeanC(traj, clust)
traj |
|
clust |
|
centerMethod |
|
Given a vector assigning each individual to a cluster, the function calculTrajMean computes the "central" trajectory of each cluster. The "center" can be defined using the argument centerMethod.
calculTrajMeanC does the same but assumes that the center definition is the classic "mean". calculTrajMeanC is written in C (and is therefore much faster).
affectIndiv used with calculTrajMean simulates one k-means step.
A matrix with k lines and t columns containing the k cluster centers. Each line is a center, each column is a time measurement.
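The computation of the centers can be sketched in base R. The helper cluster_means is hypothetical; it mirrors the centerMethod default shown in the usage above (a mean that ignores missing values) and takes the cluster assignment as a plain integer vector.

```r
# Sketch only: per-cluster "central" trajectory, applied column by column.
cluster_means <- function(traj, clust,
                          centerMethod = function(x) mean(x, na.rm = TRUE)) {
  groups <- sort(unique(clust[!is.na(clust)]))
  centers <- do.call(rbind, lapply(groups, function(g) {
    members <- traj[!is.na(clust) & clust == g, , drop = FALSE]
    apply(members, 2, centerMethod)      # one value per time measurement
  }))
  rownames(centers) <- groups
  centers
}

traj <- rbind(c(1, NA), c(3, 4), c(10, 20))
clust <- c(1, 1, 2)
centers <- cluster_means(traj, clust)
```

Passing another function, for example `function(x) median(x, na.rm = TRUE)`, changes the definition of the center without touching the rest of the step.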
    #######################
    ### calculMean

    ### Some trajectories
    traj <- gald()["traj"]

    ### A cluster affectation
    clust <- initializePartition(3,200,"randomAll")

    ### Computation of the cluster's centers
    system.time(centers <- calculTrajMean(traj,clust))
    system.time(centers <- calculTrajMeanC(traj,clust))
choice lets the user choose the Partition objects to export.
choice(object, typeGraph = "bmp")
object |
|
typeGraph |
|
choice is a function that lets the user see the Partition found by kml.
At first, choice opens a graphics window (for Linux users, the window should be explicitly opened using x11(type = "Xlib")). On the left side, all the Partition contained in object are plotted as a number (the number of clusters of the Partition). The vertical position of the number is proportional to a quality criterion (like Calinski & Harabasz). One Partition is 'active'; it is the one marked by a black dot.
On the right side, the trajectories of object are drawn according to the active Partition.
From there, choice offers numerous options:
Change the active Partition.
Select/unselect a Partition (the selected Partition are surrounded by a circle).
Export all the selected Partition, then quit the function choice.
Change the display (trajectories alone / quality criterion alone / both).
Change the active criterion.
Sort the Partition according to the active criterion.
Change the trajectories' style.
Change the mean trajectories' style.
Change the symbol size.
Change the number of symbols.
When 'return' is pressed (or 'm' using Linux), the selected Partition are exported. Each one is exported under the name objectName-Cx-y, where x is the number of clusters and y is the rank in the sublist. Four files are created:
Table with two columns. The first is the identifier of each trajectory (idAll); the second holds the cluster to which the trajectory is assigned.
Table containing information about the clustering (percentage of individuals in each cluster, various quality criteria, algorithm used to find the partition and convergence time).
Graph representing the trajectories. All the parameters set during the visualization (color of the trajectories, symbols used, mean color) are used for the export. Note that the 'typeGraph' argument can be used to export the graph in a format other than 'bmp'.
Graph representing the mean trajectory of each cluster. All the parameters set during the visualization (color of the trajectories, symbols used, mean color) are used for the export.
These four files are created for each selected Partition. In addition, two 'global' graphs are created:
Graph presenting the values of the criterionActif for all the Partition.
For each number of clusters, the first Partition is considered. This graph presents on a single display the values of all the criteria for each first Partition. It is helpful to compare the various quality criteria.
For each selected Partition, four files are saved, plus two global files.
Overview: kml-package
Classes : ClusterLongData, Partition in package longitudinalData
Methods : kml
Plot : plot
    ### Move to tempdir
    wd <- getwd()
    setwd(tempdir()); getwd()

    ### Creation of artificial data
    cld1 <- gald(25)

    ### Clusterisation
    kml(cld1,3:5,nbRedrawing=2,toPlot='both')

    ### Selection of the clustering we want
    # (note that "try" is for compatibility with CRAN only,
    # you probably can use "choice(cld1)")
    try(choice(cld1))

    ### Go back to current dir
    setwd(wd)
clusterLongData (or cld for short) is the constructor for ClusterLongData objects.
clusterLongData(traj, idAll, time, timeInData, varNames, maxNA)
cld(traj, idAll, time, timeInData, varNames, maxNA)
traj |
|
idAll |
|
time |
|
timeInData |
|
varNames |
|
maxNA |
|
clusterLongData constructs an object of class ClusterLongData.
Two cases can be distinguished:
traj is an array: lines are individuals, columns are times of measurement.
If idAll is missing, the individuals are labelled i1, i2, i3, ...
If timeInData is missing, all the columns are used (timeInData=1:ncol(traj)).
traj is a data.frame: lines are individuals, columns are times of measurement.
If idAll is missing, then the first column of the data.frame is used for idAll.
If timeInData is missing and idAll is missing, then all the columns but the first are used for timeInData (the first is omitted since it is already used for idAll): idAll=traj[,1],timeInData=2:ncol(traj).
If timeInData is missing but idAll is not missing, then all the columns including the first are used for timeInData: timeInData=1:ncol(traj).
An object of class ClusterLongData
.
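The defaulting rules for the data.frame case can be written out as a small base R sketch. The helper split_id_and_traj is hypothetical and only mimics the rules stated above, using NULL in place of a missing argument; it does not build a ClusterLongData object.

```r
# Sketch only: how idAll and timeInData default when traj is a data.frame.
split_id_and_traj <- function(df, idAll = NULL, timeInData = NULL) {
  if (is.null(idAll)) {
    # First column holds the identifiers; skip it unless timeInData says otherwise.
    if (is.null(timeInData)) timeInData <- 2:ncol(df)
    idAll <- df[, 1]
  } else if (is.null(timeInData)) {
    # Identifiers supplied by the caller: every column is a measurement.
    timeInData <- 1:ncol(df)
  }
  list(idAll = as.character(idAll),
       traj = as.matrix(df[, timeInData, drop = FALSE]))
}

dn <- data.frame(id = 1:3, v1 = c(NA, 2, 1), v2 = c(NA, 1, 0),
                 v3 = c(3, 2, 2), v4 = c(4, 2, NA))
res_default <- split_id_and_traj(dn)
res_with_id <- split_id_and_traj(dn, idAll = c("a", "b", "c"))
```

Note the consequence of the last rule: when idAll is supplied but timeInData is not, an identifier column left in the data.frame is treated as a measurement, so timeInData should be given explicitly in that situation.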
Christophe Genolini
1. UMR U1027, INSERM, Université Paul Sabatier / Toulouse III / France
2. CeRSME, EA 2931, UFR STAPS, Université de Paris Ouest-Nanterre-La Défense / Nanterre / France
[1] C. Genolini and B. Falissard
"KmL: k-means for longitudinal data"
Computational Statistics, vol 25(2), pp 317-328, 2010
[2] C. Genolini and B. Falissard
"KmL: A package to cluster longitudinal data"
Computer Methods and Programs in Biomedicine, 104, pp e112-121, 2011
Overview: kml-package
Classes : ClusterLongData
Methods : choice, kml
Plot : plot(ClusterLongData)
    #####################
    ### From matrix

    ### Small data
    mat <- matrix(c(1,NA,3,2,3,6,1,8,10),3,3,dimnames=list(c(101,102,104),c("T2","T4","T8")))
    clusterLongData(mat)
    (ld1 <- clusterLongData(traj=mat,idAll=as.character(c(101,102,104)),time=c(2,4,8),varNames="V"))
    plot(ld1)

    ### Big data
    mat <- matrix(runif(1051*325),1051,325)
    (ld2 <- clusterLongData(traj=mat,idAll=paste("I-",1:1051,sep=""),time=(1:325)+0.5,varNames="R"))

    ####################
    ### From data.frame
    dn <- data.frame(id=1:3,v1=c(NA,2,1),v2=c(NA,1,0),v3=c(3,2,2),v4=c(4,2,NA))

    ### Basic
    clusterLongData(dn)

    ### Selecting some times
    (ld3 <- clusterLongData(dn,timeInData=c(1,2,4),varNames=c("Hyp")))

    ### Excluding trajectories with more than 1 NA
    (ld3 <- clusterLongData(dn,maxNA=1))
ClusterLongData is an object containing trajectories and the associated Partition (from package longitudinalData).
kml is an algorithm that builds a set of Partition from longitudinal data. ClusterLongData is the object containing the original longitudinal data and all the Partition that kml finds.
When created, a ClusterLongData object simply contains the initial data (the trajectories). After the execution of kml, it contains the original data and the Partition which have just been calculated by kml.
Note that if kml is executed several times, every new Partition is added to the existing ones; no pre-existing Partition is erased.
idAll [vector(character)]: Single identifier for each trajectory (each individual). Useful for exporting clusters.
idFewNA [vector(character)]: Restriction of idAll to the trajectories that do not have 'too many' missing values. See maxNA for details.
time [numeric]: Times at which measures are made.
varNames [character]: Name of the variable measured.
traj [matrix(numeric)]: Contains the longitudinal data. Each line is the trajectory of an individual. Each column is a time at which measures are made.
dimTraj [vector2(numeric)]: Size of the matrix traj (i.e. dimTraj=c(length(idFewNA),length(time))).
maxNA [numeric] or [vector(numeric)]: Individuals whose trajectories contain 'too many' missing values are excluded from traj and will not be used in the analysis. Their identifier is preserved in idAll but not in idFewNA. 'Too many' is defined by maxNA: a trajectory with more missing values than maxNA is excluded.
reverse [matrix(numeric)]: If the trajectories are scaled using the function scale, the 'scaling parameters' (probably mean and standard deviation) are saved in reverse. This is useful to restore the original data after a scaling operation.
criterionActif [character]: Stores the name of the criterion that will be used by functions that need a single criterion (like plotCriterion or ordered).
initializationMethod [vector(character)]: Lists all the initialization methods that have already been used to find some Partition (useful to avoid running a deterministic method several times).
sorted [logical]: Are the Partition currently held in the object sorted in decreasing order?
c1 [list(Partition)]: list of Partition with 1 cluster.
c2 [list(Partition)]: list of Partition with 2 clusters.
c3 [list(Partition)]: list of Partition with 3 clusters.
...
c26 [list(Partition)]: list of Partition with 26 clusters.
Class LongData, directly.
Class ListPartition, directly.
Objects of class ClusterLongData can be constructed via the function clusterLongData, which turns a data.frame or a matrix into a ClusterLongData. Note that some artificial data can be generated using gald.
object['xxx']: Get the value of the field xxx. Inherits from LongData and ListPartition.
object['xxx']<-value: Set the field xxx to value. Inherits from class ListPartition.
plot: Display the ClusterLongData according to a Partition.
Special thanks to Boris Hejblum for debugging the '[' and '[<-' operators (the previous version was not compatible with the matrix package, which is used by lme4).
Overview: kml-package
Classes : Partition, LongData, ListPartition
Methods : clusterLongData, kml, choice
Plot : plot(ClusterLongData), plotCriterion
    ### Move to tempdir
    wd <- getwd()
    setwd(tempdir()); getwd()

    ################
    ### Creation of some trajectories
    traj <- matrix(c(1,2,3,1,4, 3,6,1,8,10, 1,2,1,3,2, 4,2,5,6,3, 4,3,4,4,4, 7,6,5,5,4),6)

    myCld <- clusterLongData(
        traj=traj,
        idAll=as.character(c(100,102,103,109,115,123)),
        time=c(1,2,4,8,15),
        varNames="P",
        maxNA=3
    )

    ################
    ### get and set
    myCld["idAll"]
    myCld["varNames"]
    myCld["traj"]

    ################
    ### Creation of a Partition
    part2 <- partition(clusters=rep(1:2,3),myCld)
    part3 <- partition(clusters=rep(1:3,2),myCld)

    ################
    ### Adding a clusterization to a clusterizLongData
    myCld["add"] <- part2
    myCld["add"] <- part3
    myCld

    ### Go back to current dir
    setwd(wd)
A subset of the longitudinal study EPIPAGE.
data(epipageShort)
id: unique identifier for each patient.
gender: Male or Female.
sdq3: score of the Strengths and Difficulties Questionnaire at 3 years old.
sdq4: score of the Strengths and Difficulties Questionnaire at 4 years old.
sdq5: score of the Strengths and Difficulties Questionnaire at 5 years old.
sdq8: score of the Strengths and Difficulties Questionnaire at 8 years old.
The EPIPAGE cohort, funded by INSERM and the French general health authority, is a multi-regional French follow-up survey of severely premature children. It included more than 4000 children born at less than 33 weeks gestational age, and two control samples of children, respectively born at 33-34 weeks of gestational age and born full term. The general objectives were to study short and long term motor, cognitive and behavioural outcomes in these children, and to determine the impact of medical practice, care provision and organization of perinatal care, environment, family circle and living conditions on child health and development. About 2600 children born severely premature and 400 and 600 controls respectively were followed up to the age of 5 years and then to the age of 8.
The SDQ is a behavioral questionnaire for children and adolescents aged 4 through 16 years. It measures the severity of the disability (higher scores indicate higher disability).
The database belongs to the INSERM unit U953 (P.Y. Ancel), which has agreed to include the variable SDQ in the library.
Larroque B, Ancel P, Marret S, Marchand L, André M, Arnaud C, Pierrat V, Rozé J, Messer J, Thiriez G, et al. (2008). "Neurodevelopmental disabilities and special care of 5-year-old children born before 33 weeks of gestation (the EPIPAGE study): a longitudinal cohort study." The Lancet, 371(9615), 813-820.
Laurent C, Kouanfack C, Laborde-Balen G, Aghokeng A, Mbougua J, Boyer S, Carrieri M, Mben J, Dontsop M, Kazé S, et al. (2011). "Monitoring of HIV viral loads, CD4 cell counts, and clinical assessments versus clinical monitoring alone for antiretroviral therapy in rural district hospitals in Cameroon (Stratall ANRS 12110/ESTHER): a randomised non-inferiority trial." The Lancet Infectious Diseases, 11(11), 825-833.
    data(epipageShort)
    str(epipageShort)
fuzzyKmlSlow is a new implementation of fuzzy k-means for longitudinal data (or trajectories).
fuzzyKmlSlow(traj, clusterAffectation, toPlot = "traj", fuzzyfier = 1.25, parAlgo = parALGO())
traj |
|
clusterAffectation |
|
toPlot |
|
fuzzyfier |
|
parAlgo |
|
fuzzyKmlSlow is a new implementation of fuzzy k-means for longitudinal data (or trajectories). To date, it is written in R (and not in C, which explains the "slow").
The matrix of individual memberships.
    ### Data generation
    traj <- gald(25)["traj"]
    partInit <- initializePartition(3,100,"kmeans--",traj)

    ### fuzzy Kml
    partResult <- fuzzyKmlSlow(traj,partInit)
This function builds up an artificial longitudinal data set (single variable-trajectory) and turns it into an object of class ClusterLongData.
gald(nbEachClusters=50,time=0:10,varNames="V",
    meanTrajectories=list(function(t){0},function(t){t},
        function(t){10-t},function(t){-0.4*t^2+4*t}),
    personalVariation=function(t){rnorm(1,0,2)},
    residualVariation=function(t){rnorm(1,0,2)},
    decimal=2,percentOfMissing=0)

generateArtificialLongData(nbEachClusters=50,time=0:10,varNames="V",
    meanTrajectories=list(function(t){0},function(t){t},
        function(t){10-t},function(t){-0.4*t^2+4*t}),
    personalVariation=function(t){rnorm(1,0,2)},
    residualVariation=function(t){rnorm(1,0,2)},
    decimal=2,percentOfMissing=0)
nbEachClusters |
[numeric] or [vector(numeric)]: number of trajectories that each cluster must contain. If a single number is given, it is duplicated for all groups. |
time |
[vector(numeric)]: time at which measures are made. |
varNames |
[character]: name of the variable. |
meanTrajectories |
[list(function)]: list of the functions that define the mean trajectory of each cluster. |
personalVariation |
[function] or [list(function)]: lists the functions defining the personal variation between an individual and the mean trajectory of its cluster. Note that these functions should be constant functions (the personal variation cannot evolve with time). If a single function is given, it is duplicated for all groups (see details). |
residualVariation |
[function] or [list(function)]: lists the functions generating the noise of each trajectory within its own cluster. If a single function is given, it is duplicated for all groups (see detail). |
decimal |
[numeric]: number of decimals used to round up values. |
percentOfMissing |
[numeric]: proportion (between 0 and 1) of missing data generated in each cluster. If a single value is given, it is duplicated for all groups. The missing values are Missing Completely At Random (MCAR). |
generateArtificialLongData (gald for short) is a function that constructs a set of artificial longitudinal data.
Each individual is considered as belonging to a group. This group follows a theoretical trajectory, a function of time. These functions (one per group) are given via the argument meanTrajectories.
Within a group, each individual undergoes individual variations. Individual variations are given via the argument residualVariation.
The number of individuals in each group is given by nbEachClusters.
Finally, it is possible to add missing values that strike the data completely at random (MCAR) thanks to percentOfMissing.
An object of class ClusterLongData
.
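The generation scheme can be sketched in base R. The helper make_artificial_traj is hypothetical and simplified: it returns a plain matrix instead of a ClusterLongData, replaces the personalVariation and residualVariation functions by two standard deviations (personalSd, residualSd, both assumptions), and draws missing values independently per cell, so the proportion of NA is only approximately percentOfMissing.

```r
# Sketch only: mean trajectory + constant personal shift + residual noise + MCAR.
make_artificial_traj <- function(nbEachClusters = 50, time = 0:10,
    meanTrajectories = list(function(t) 0, function(t) t,
                            function(t) 10 - t, function(t) -0.4 * t^2 + 4 * t),
    personalSd = 2, residualSd = 2, decimal = 2, percentOfMissing = 0) {
  k <- length(meanTrajectories)
  nbEachClusters <- rep(nbEachClusters, length.out = k)  # duplicate a single value
  rows <- lapply(seq_len(k), function(g) {
    t(replicate(nbEachClusters[g], {
      # rep() handles constant functions that return a single number.
      mean_traj <- rep(meanTrajectories[[g]](time), length.out = length(time))
      shift <- rnorm(1, 0, personalSd)              # constant over time
      noise <- rnorm(length(time), 0, residualSd)   # one draw per time point
      mean_traj + shift + noise
    }))
  })
  traj <- round(do.call(rbind, rows), decimal)
  traj[runif(length(traj)) < percentOfMissing] <- NA    # missing completely at random
  traj
}

set.seed(42)
traj_demo <- make_artificial_traj(nbEachClusters = 5)
```

Each line of the result is one individual; the first five lines follow the first mean trajectory, the next five the second, and so on.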
ClusterLongData, clusterLongData
    par(ask=TRUE)

    #####################
    ### Default example
    (ex1 <- generateArtificialLongData())
    plot(ex1)
    plot(ex1,parTraj=parTRAJ(col=rep(2:5,each=50)))

    #####################
    ### Three diverging lines
    ex2 <- generateArtificialLongData(meanTrajectories=list(function(t)0,function(t)-t,function(t)t))
    plot(ex2,parTraj=parTRAJ(col=rep(2:4,each=50)))

    #####################
    ### Three diverging lines with high variance, unbalance groups and missing value
    ex3 <- generateArtificialLongData(
        meanTrajectories=list(function(t)0,function(t)-t,function(t)t),
        nbEachClusters=c(100,30,10),
        residualVariation=function(t){rnorm(1,0,3)},
        percentOfMissing=c(0.25,0.5,0.25)
    )
    part3 <- partition(rep(1:3,c(100,30,10)))
    plot(ex3,parTraj=parTRAJ(col=rep(2:4,c(100,30,10))))

    #####################
    ### Four strange functions
    ex4 <- generateArtificialLongData(
        nbEachClusters=c(300,200,100,100),
        meanTrajectories=list(function(t){-10+2*t},function(t){-0.6*t^2+6*t-7.5},
            function(t){10*sin(t)},function(t){30*dnorm(t,2,1.5)}),
        residualVariation=function(t){rnorm(1,0,3)},
        time=0:10,decimal=2,percentOfMissing=0.3)
    plot(ex4,parTraj=parTRAJ(col=rep(2:5,c(300,200,100,100))))

    #####################
    ### To get only longData (if you want some artificial longData
    ### to deal with another algorithm), use the getteur ["traj"]
    ex5 <- gald(nbEachCluster=3,time=1:3)
    ex5["traj"]

    par(ask=FALSE)
Given a ClusterLongData object that holds a Partition, this function extracts the best posterior probability of each individual.
getBestPostProba(xCld, nbCluster, clusterRank = 1)
xCld |
|
nbCluster |
|
clusterRank |
|
Given a ClusterLongData object that holds a Partition, this function extracts the best posterior probability of each individual.
A numeric vector.
    ### Move to tempdir
    wd <- getwd()
    setwd(tempdir()); getwd()

    ### Creation of an object ClusterLongData
    myCld <- gald(20)

    ### Computation of some partition
    kml(myCld,2:4,3)

    ### Extraction the best posterior probabilities
    ### form the list of partition with 3 clusters of the second clustering
    getBestPostProba(myCld,3,2)

    ### Go back to current dir
    setwd(wd)
This function extracts a cluster assignment from a ClusterLongData object.
getClusters(xCld, nbCluster, clusterRank = 1, asInteger = FALSE)
xCld |
|
nbCluster |
|
clusterRank |
|
asInteger |
|
This function extracts the clusters from a ClusterLongData object.
It is almost the same as xCld[paste("c",nbCluster,sep="")][[clusterRank]], except that individuals with too many missing values (and thus excluded from the analysis) are marked with NA.
A vector of numeric values or of LETTERS, according to the value of asInteger.
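As a rough illustration of the expression above, this base R sketch shows how the slot name is built from nbCluster and how excluded individuals could end up as NA in the result. The vectors kept and clustersOfKept are invented for the example; the actual bookkeeping inside getClusters may differ.

```r
# Name of the slot holding the partitions with 3 clusters.
nbCluster <- 3
slotName <- paste("c", nbCluster, sep = "")

# Hypothetical data: 6 individuals, of which the 2nd and 5th were excluded
# from the analysis because they had too many missing values.
kept <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
clustersOfKept <- c("A", "B", "A", "C")   # one label per kept individual

# Re-expand to one entry per individual, with NA for the excluded ones.
clusters <- rep(NA_character_, length(kept))
clusters[kept] <- clustersOfKept

# Integer coding of the letters (A = 1, B = 2, ...); NA is preserved.
clustersAsInteger <- match(clusters, LETTERS)
```

Keeping one entry per individual, NA included, means the result can be bound back to the original data without re-indexing.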
    ### Move to tempdir
    wd <- getwd()
    setwd(tempdir()); getwd()

    ### Creation of an object ClusterLongData
    myCld <- gald(20)

    ### Computation of some partitions
    kml(myCld,2:4,3)

    ### Extraction from the list of partitions with 3 clusters
    ### of the second clustering
    getClusters(myCld,3,2)

    ### Go back to current dir
    setwd(wd)
kml is an implementation of k-means for longitudinal data (or trajectories). This algorithm is able to deal with missing values and provides an easy way to re-run the algorithm several times, varying the starting conditions and/or the number of clusters looked for.
Here is the description of the algorithm. For an overview of the package, see kml-package.
kml(object,nbClusters=2:6,nbRedrawing=20,toPlot="none",parAlgo=parALGO())
object |
[ClusterLongData]: contains trajectories to cluster as
well as previous |
nbClusters |
[vector(numeric)]: Vector containing the number of clusters
with which |
nbRedrawing |
[numeric]: Sets the number of times that k-means must be re-run (with different starting conditions) for each number of clusters. |
toPlot |
|
parAlgo |
|
kml works on objects of class ClusterLongData.
For each number included in nbClusters, kml computes a Partition, then stores it in the field cX of the ClusterLongData object, where 'X' is the number of clusters. The algorithm starts over as many times as specified by nbRedrawing. By default, it is executed for 2, 3, 4, 5 and 6 clusters, 20 times each, namely 100 times.
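The figure of 100 runs follows directly from the default arguments. A quick check in base R, using only the default values given in the usage of kml (the variable names totalRuns and fieldNames are introduced here for illustration):

```r
# Default arguments of kml, as given in its usage.
nbClusters  <- 2:6
nbRedrawing <- 20

# One run of k-means per (number of clusters, redrawing) pair.
totalRuns <- length(nbClusters) * nbRedrawing

# Names of the fields in which the resulting partitions are stored.
fieldNames <- paste("c", nbClusters, sep = "")
```

With these defaults, totalRuns is 100 and the partitions land in the fields c2 to c6, 20 in each.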
When a Partition has been found, it is added to the corresponding slot c1, c2, c3, ... or c26. The sublist cX stores all the Partition objects with X clusters. Inside a sublist, the Partition objects can either be sorted from the biggest quality criterion to the smallest (the best are stored first, using ordered,ListPartition) or left unsorted.
Note that Partition objects are saved throughout the algorithm. If the user interrupts the execution of kml, the result is not lost. If the user runs kml on an object, then running kml again on the same object will add some new Partition objects to the ones already found.
The possible starting conditions are defined in initializePartition.
A ClusterLongData object, to which some Partition objects have been added.
Behind kml, there are two different procedures:
Fast: when the parameter distance is set to "euclidean" and toPlot is set to 'none' or 'criterion', kml calls a compiled (optimized) C procedure.
Slow: when the user defines their own distance or wants to see the construction of the clusters by setting toPlot to 'traj' or 'both', kml uses a non-compiled R program.
The C procedure is 25 times faster than the R one.
So we advise using the R procedure 1/ to try some new method (like using a new distance) or 2/ to "see" the construction of the very first clusters, in order to check that everything goes right. Then it is better to switch to the C procedure (as we do in the Examples section).
If for a specific use, you need a different distance, feel free to contact the author.
Overview: kml-package
Classes: ClusterLongData, Partition in package longitudinalData
Methods: clusterLongData, choice
    ### Move to tempdir
    wd <- getwd()
    setwd(tempdir()); getwd()

    ### Generation of some data
    cld1 <- generateArtificialLongData(25)

    ### We suspect 3, 4 or 6 clusters, we want 3 redrawings.
    ### We want to "see" what happens (so toPlot is 'both')
    kml(cld1,c(3,4,6),3,toPlot='both')

    ### 4 seems to be the best. We want 10 more redrawings.
    ### We don't want to see again, we want to get the result as fast as possible.
    kml(cld1,4,10)

    ### Go back to current dir
    setwd(wd)
parKml and parALGO are constructors for the object ParKml.
    parKml(saveFreq,maxIt,imputationMethod,distanceName,power,distance,
        centerMethod,startingCond,nbCriterion,scale)

    parALGO(saveFreq=100,maxIt=200,imputationMethod="copyMean",
        distanceName="euclidean",power=2,distance=function(){},
        centerMethod=meanNA,startingCond="nearlyAll",nbCriterion=1000,scale=TRUE)
saveFreq |
|
maxIt |
|
imputationMethod |
|
distanceName |
|
power |
|
distance |
|
centerMethod |
|
startingCond |
|
nbCriterion |
|
scale |
|
parKml is the constructor of the object ParKml.
An object ParKml.
    ### Move to tempdir
    wd <- getwd()
    setwd(tempdir()); getwd()

    ### Generation of some data
    cld1 <- generateArtificialLongData()

    ### Setting two different sets of options:
    (option1 <- parALGO())
    (option2 <- parALGO(distanceName="maximum",centerMethod=function(x)median(x,na.rm=TRUE)))

    ### Running kml. We suspect 3, 4 or 5 clusters, we want 3 redrawings.
    kml(cld1,3:5,3,toPlot="both",parAlgo=option1)
    kml(cld1,3:5,3,toPlot="both",parAlgo=option2)

    ### Go back to current dir
    setwd(wd)
ParKml is an object containing some parameters used by kml.
saveFreq [numeric]: Long computations can take several days, so it is possible to save the ClusterLongData object on which kml works once in a while. saveFreq defines the frequency of the saving process: the ClusterLongData is saved every saveFreq clustering calculations. The object is saved in the file objectName.Rdata in the current folder. If saveFreq is set to Inf, the object is never saved.
maxIt [numeric]: Sets a limit on the number of iterations if convergence is not reached.
imputationMethod [character]: the quality criteria cannot be computed if some values are missing. imputationMethod defines the method used to impute the missing values. See imputation for details.
distanceName [character]: name of the distance used by k-means. If distanceName is one of "manhattan", "euclidean", "minkowski", "maximum", "canberra" or "binary", a compiled, optimized version specifically designed for trajectories is used. Otherwise, the function defined in the slot distance is used.
power [numeric]: If distanceName="minkowski", this defines the power that will be used.
distance [numeric <- function(trajA,trajB)]: function that computes the distance between two trajectories. This field is used only if distanceName is not one of the classical distances.
centerMethod [numeric <- function(vector(numeric))]: the k-means algorithm computes the center of each cluster. It is possible to customize the definition of "center" by defining a function centerMethod. This function should take a vector of numeric as argument and return a single numeric (the center of the vector).
startingCond [character]: specifies the starting condition. Should be one of "randomAll", "randomK", "maxDist", "kmeans++", "kmeans+", "kmeans-" or "kmeans--" (see initializePartition for details). It can also take two specific values: "all" stands for c("maxDist","kmeans-") followed by an alternation of "kmeans--" and "randomK", while "nearlyAll" stands for "kmeans-" followed by an alternation of "kmeans--" and "randomK".
nbCriterion [numeric]: sets the maximum number of quality criteria that are displayed on the graph (since displaying a large number of criteria can slow down the overall process). The default value set by parALGO is 1000.
scale [logical]: if TRUE, then the data will be automatically scaled (using the function scale with default values) before the execution of k-means on joint trajectories. Then the data will be restored (using the function restoreRealData) just before the end of the function kml3d. This option has no effect on kml.
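The distance and centerMethod slots both expect plain R functions. The sketch below, in base R only, shows functions with the shapes described above. The names medianNA and distMaxNA and the sample data are hypothetical, and the way kml itself handles missing values is not covered here.

```r
# A center function: takes a numeric vector, returns a single number.
# Here the median, ignoring missing values (an alternative to the mean).
medianNA <- function(x) median(x, na.rm = TRUE)

# A distance function: takes two trajectories, returns a single number.
# Here the maximum absolute difference over the times observed in both.
distMaxNA <- function(trajA, trajB) {
  max(abs(trajA - trajB), na.rm = TRUE)
}

# Hypothetical trajectories measured at five time points.
trajA <- c(1, 2, NA, 4, 5)
trajB <- c(1, 4, 3, NA, 2)

centerValue   <- medianNA(c(3, NA, 1, 10))  # center of one time point across individuals
distanceValue <- distMaxNA(trajA, trajB)    # distance between the two trajectories
```

A center function of this kind can then be supplied through parALGO(centerMethod=...), as done with the median in the examples of this manual.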
object['xxx']: Gets the value of the field xxx.
    ### Move to tempdir
    wd <- getwd()
    setwd(tempdir()); getwd()

    ### Building data
    myCld <- gald()

    ### Standard kml
    kml(myCld,,3,toPlot="both")

    ### Using median instead of mean
    parWithMedian <- parALGO(centerMethod=function(x){median(x,na.rm=TRUE)})
    kml(myCld,,3,toPlot="both",parAlgo=parWithMedian)

    ### Using distance max
    parWithMax <- parALGO(distanceName="maximum")
    kml(myCld,,3,toPlot="both",parAlgo=parWithMax)

    ### Go back to current dir
    setwd(wd)
plot draws the trajectories of a ClusterLongData object relative to a Partition.
    ## S4 method for signature 'ClusterLongData,ANY'
    plot(x,y=NA,parTraj=parTRAJ(),parMean=parMEAN(),
        addLegend=TRUE, adjustLegend=-0.12,toPlot="both",criterion=x["criterionActif"],
        nbCriterion=1000, ...)
x |
|
y |
|
parTraj |
|
parMean |
|
toPlot |
|
criterion |
|
nbCriterion |
|
addLegend |
|
adjustLegend |
|
... |
Some other parameters can be passed to the method (like "xlab" or "ylab"). |
plot draws the trajectories of a ClusterLongData object relative to the 'best' Partition, or to the Partition defined by y.
Graphical options concerning the individual trajectories (col, type, pch and xlab) can be changed using parTraj.
Graphical options concerning the cluster mean trajectories (col, type, pch, pchPeriod and cex) can be changed using parMean. For more detail on parTraj and parMean, see objects of class ParLongData in package longitudinalData.
Overview: kml-package
Classes: ClusterLongData
Plot: plot: overview, plotCriterion
    ### Move to tempdir
    wd <- getwd()
    setwd(tempdir()); getwd()

    ##################
    ### Construction of the data

    ld <- gald()

    ### Basic plotting
    plot(ld)

    ##################
    ### Changing graphical parameters 'par'

    kml(ld,3:4,1)

    ### No letters on the mean trajectories
    plot(ld,3,parMean=parMEAN(type="l"))

    ### Only one letter on the mean trajectories
    plot(ld,4,parMean=parMEAN(pchPeriod=Inf))

    ### Color individual according to its clusters (col="clusters")
    plot(ld,3,parTraj=parTRAJ(col="clusters"))

    ### Mean without individual
    plot(ld,4,parTraj=parTRAJ(type="n"))

    ### No mean trajectories (type="n")
    ### Color individual according to its clusters (col="clusters")
    plot(ld,3,parTraj=parTRAJ(col="clusters"),parMean=parMEAN(type="n"))

    ### Only few trajectories
    plot(ld,4,nbSample=10,parTraj=parTRAJ(col='clusters'),parMean=parMEAN(type="n"))

    ### Go back to current dir
    setwd(wd)
plotMeans plots the mean trajectories of a ClusterLongData object relative to a Partition.
    ## S4 method for signature 'ClusterLongData,ANY'
    plotMeans(x,y,parMean=parMEAN(),
        parWin=windowsCut(x['nbVar'],addLegend=TRUE),...)
x |
|
y |
|
parMean |
|
parWin |
|
... |
Some other parameters can be passed to the method. |
plotMeans plots the mean trajectories of a ClusterLongData object relative to the 'best' Partition, or to the Partition defined by y.
Graphical options (col, type, pch, pchPeriod and cex) can be changed using parMean. For more detail on parTraj and parMean, see objects of class ParLongData in package longitudinalData.
Overview: kml-package
Classes: ClusterLongData
PlotMeans: plotMeans: overview, plotCriterion
    ### Move to tempdir
    wd <- getwd()
    setwd(tempdir()); getwd()

    ##################
    ### Construction of the data

    ld <- gald(10)
    kml(ld,3:4,2)

    ### Basic plotting of the mean trajectories
    plotMeans(ld,3)

    ### Go back to current dir
    setwd(wd)
plotTraj plots the trajectories of a ClusterLongData object relative to a Partition.
    ## S4 method for signature 'ClusterLongData,ANY'
    plotTraj(x,y,parTraj=parTRAJ(col="clusters"),
        parWin=windowsCut(x['nbVar'],addLegend=TRUE),nbSample=1000,...)
x |
|
y |
|
parTraj |
|
parWin |
|
nbSample |
|
... |
Some other parameters can be passed to the method. |
plotTraj plots the trajectories of a ClusterLongData object relative to the 'best' Partition, or to the Partition defined by y.
Graphical options (col, type, pch and xlab) can be changed using parTraj.
For more detail on parTraj, see objects of class ParLongData in package longitudinalData.
Overview: kml-package
Classes: ClusterLongData
PlotTraj: plotTraj: overview, plotCriterion
    ### Move to tempdir
    wd <- getwd()
    setwd(tempdir()); getwd()

    ##################
    ### Construction of the data

    ld <- gald()
    kml(ld,3:4,1)

    ### Basic plotting of the trajectories
    plotTraj(ld,3)

    ### Go back to current dir
    setwd(wd)