Title: | Optimal Partitioning of Similarity Relations |
---|---|
Description: | Contains a set of algorithms for creating partitions and coverings of objects largely based on operations on (dis)similarity relations (or matrices). There are several iterative re-assignment algorithms optimizing different goodness-of-clustering criteria. In addition, there are covering algorithms 'clique' which derives maximal cliques, and 'maxpact' which creates a covering of maximally compact sets. Graphical analyses and conversion routines are also included. |
Authors: | David W. Roberts <[email protected]> |
Maintainer: | David W. Roberts <[email protected]> |
License: | GPL (>= 2) |
Version: | 3.0-3 |
Built: | 2024-10-31 21:25:17 UTC |
Source: | CRAN |
Archipelago analysis finds connected clusters in a dissimilarity matrix. Samples in the same cluster are at most alpha dissimilar to at least one other sample in the cluster, and are more than alpha dissimilar to all samples in all other clusters. The solution is equivalent to slicing a nearest neighbor cluster analysis at alpha, but does not require (or produce) a hierarchical structure.
archi(dist,alpha)
dist |
a dissimilarity or distance object of class ‘dist’ |
alpha |
the dissimilarity threshold to establish the relationship |
Archipelago analysis is a topological, as opposed to metric space, cluster routine that returns connected clusters. Every sample in a cluster is connected by a path with step lengths of at most alpha dissimilarity to every other sample in the cluster, and is more than alpha dissimilar to all other samples in all other clusters.
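The equivalence with slicing a nearest neighbor (single linkage) analysis at alpha can be checked on toy data with base R alone. This is a minimal sketch, not the package's implementation: the helper `connected_clusters` and the toy vector `x` are hypothetical, introduced only for illustration.

```r
# Toy data: two chains of points on a line, far apart from each other
x <- c(0, 0.1, 0.25, 1.0, 1.1)
d <- dist(x)
alpha <- 0.2

# Hypothetical helper: connected components of the graph that links
# every pair of samples with dissimilarity <= alpha
connected_clusters <- function(d, alpha) {
  adj <- as.matrix(d) <= alpha
  n <- nrow(adj)
  cl <- integer(n)                 # 0 means "not yet assigned"
  k <- 0L
  for (i in seq_len(n)) {
    if (cl[i] != 0L) next
    k <- k + 1L
    frontier <- i
    while (length(frontier) > 0) {
      cl[frontier] <- k
      reach <- which(colSums(adj[frontier, , drop = FALSE]) > 0)
      frontier <- reach[cl[reach] == 0L]
    }
  }
  cl
}

by_graph <- connected_clusters(d, alpha)
by_tree  <- cutree(hclust(d, "single"), h = alpha)  # slice single linkage at alpha
```

Samples 1 and 3 are more than alpha apart, yet they share a cluster because sample 2 links them by steps of at most alpha.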
produces an object of class ‘clustering’, a list with a vector ‘clustering’ of cluster memberships
David W. Roberts [email protected]
data(shoshveg) # produces a vegetation dataframe
dis.bc <- dsvdis(shoshveg,'bray/curtis') # produces a Bray/Curtis dissimilarity matrix
arc.50 <- archi(dis.bc,0.5) # clusters at 0.5 dissimilarity
table(arc.50$clustering)
Sorts the members of clusters by maximum similarity goodness-of-fit
bestfit(x,cluster)
x |
an object of class ‘partana’ or ‘silhouette’ |
cluster |
a specific cluster number |
Simply finds all members of a specific cluster and lists them in order of (1) mean similarity to their cluster (if x is an object of class ‘partana’) or (2) silhouette width (if x is an object of class ‘silhouette’ as produced by functions in package ‘cluster’).
returns a data.frame with cluster members in column ‘ID’ and goodness-of-fit in column ‘fit’
David W. Roberts [email protected]
http://ecology.msu.montana.edu/labdsv/
data(shoshveg) # returns vegetation matrix
dis.bc <- dsvdis(shoshveg,'bray') # Bray/Curtis dissimilarity matrix
opt.5 <- optpart(5,dis.bc) # 5 cluster partition
print(class(opt.5))
fit <- bestfit(opt.5,1) # goodness-of-fit for cluster 1
sil.5 <- silhouette(opt.5,dis.bc) # calculate silhouette widths
fit2 <- bestfit(sil.5,1) # goodness-of-fit for cluster 1
Produces a specified number of optpart solutions from random starts, keeping the best result of the set
bestopt(dist,numclu,numrep,maxitr=100)
dist |
an object of class ‘dist’ from ‘dist’, ‘dsvdis’, or ‘vegdist’ |
numclu |
the number of clusters desired |
numrep |
the number of random starts requested |
maxitr |
the maximum number of iterations per replicate |
Calls function optpart with a random initial assignment of items to clusters ‘numrep’ times, keeping the best result (highest within/among ratio observed). See optpart for more details.
an object of class partana, with components:
ptc |
the mean similarity of each item to each cluster |
ctc |
the mean similarity of each cluster to other clusters |
musubx |
the membership of each item in each cluster |
clustering |
the best ‘crisp’ partition from musubx |
ratio |
the within-cluster/among-cluster similarity ratio achieved at each iteration of the selected result. |
This is a simple wrapper function to automate independent random starts of function optpart.
David W. Roberts [email protected]
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray/curtis')
x <- bestopt(dis.bc,5,10)
summary(x)
## Not run: plot(x)
Compares two classifications by cross-tabulating the assignment of objects to classes, and (optionally) produces a new classification to reflect the congruences and differences
classmatch(x,y,type='full')
x |
an object of class ‘clustering’, ‘partana’, ‘partition’ or a vector identifying membership of objects in classes |
y |
an object of class ‘clustering’, ‘partana’, ‘partition’ or a vector identifying membership of objects in classes |
type |
a switch, either ‘full’ or ‘direct’, to control the parameters of the algorithm |
classmatch first calculates a cross-tabulation of the two classifications. Then, if ‘type=="full"’, the default, it finds all cases of agreement in order of number of objects. Objects are assigned to new clusters to reflect that order. Note that a single class may be partitioned into several new classes, and that the number of new classes produced may be higher than in either of the classifications considered.
If ‘type=="direct"’ classmatch assumes a one-to-one relation between the two classifications compared. Classmatch finds the largest case of agreement, and assigns that match to class 1. It then zeros out the rows and columns corresponding to those classes, and iterates.
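The ‘direct’ matching described above can be sketched in base R as a greedy loop over the cross-tabulation. This is an illustrative reading of the description, not the package's code; the vectors `x` and `y` are made-up memberships, and ties are broken by taking the first maximum, which is an assumption.

```r
# Two hypothetical classifications of the same nine objects
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)
y <- c(2, 2, 1, 1, 1, 3, 3, 3, 2)

work <- unclass(table(x, y))       # cross-tabulation as a plain matrix
pairs <- NULL
while (max(work) > 0) {
  # largest remaining case of agreement (first one on ties)
  idx <- which(work == max(work), arr.ind = TRUE)[1, ]
  pairs <- rbind(pairs, c(x = as.numeric(rownames(work)[idx[1]]),
                          y = as.numeric(colnames(work)[idx[2]]),
                          n = work[idx[1], idx[2]]))
  work[idx[1], ] <- 0              # zero out the matched row ...
  work[, idx[2]] <- 0              # ... and the matched column, then iterate
}

agreement <- sum(pairs[, "n"]) / length(x)  # fraction of objects in matched pairs
```

Each pass consumes one class from each classification, so the loop ends after at most min(number of x classes, number of y classes) matches.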
A list with components:
tab |
the cross-tabulation analyzed |
pairs |
the x and y values considered matched in order of solution |
partial |
a cumulative fraction of agreement as a function of number of clusters |
ord |
a table showing the order of new clusters |
combo |
a new vector of assignment of objects to clusters, only produced if ‘type=="full"’ |
David W. Roberts [email protected]
http://ecology.msu.montana.edu/labdsv/R
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray/curtis')
opt.5 <- optpart(5,dis.bc)
pam.5 <- pam(dis.bc,5)
classmatch(opt.5,pam.5)
Maximal clique analysis produces the set of maximal cliques of a dissimilarity or distance matrix. Maximal cliques are sets where every member of the set is <= alpha-dissimilar to every other member.
clique(dist,alphac,minsize=1,mult=100)
## S3 method for class 'clique'
summary(object, ...)
## S3 method for class 'clique'
plot(x, panel = 'all', ...)
dist |
a dissimilarity or distance object of class ‘dist’ |
alphac |
the dissimilarity threshold to establish the relationship |
minsize |
the minimum size clique to list in the results |
mult |
scratch space multiplier to control stack size (see below) |
object |
an object of class ‘clique’ |
... |
ancillary arguments to ‘summary’ or ‘plot’ |
x |
an object of class ‘clique’ |
panel |
an integer switch to indicate which panel to plot |
Maximal clique analysis produces a covering, as opposed to a partition, i.e. objects can belong to more than one clique, and every object belongs to at least one clique. The maximal clique solution is solved for by symbolic computation, as opposed to numerical computation, and produces a unique solution. The number of cliques produced cannot be known beforehand, and can significantly exceed the number of objects. The ‘mult’ argument controls the size of the stack to hold intermediate terms in the equation as the solution proceeds. At each iteration, the algorithm simplifies the equation to the extent possible, and recovers space used to hold terms that have been eliminated. Nonetheless, it is possible for the equation to grow quite large at intermediate steps. The initial value of ‘mult=100’ sets the stack to 100 times the number of objects in the dissimilarity/distance matrix. If the memory allocated is exceeded, the output is set to NULL, and a message is printed to increase the ‘mult’ argument to a higher value.
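The covering property is easy to see on a tiny example. The sketch below enumerates maximal cliques by brute force over all subsets, which is only feasible for a handful of objects and is not the symbolic algorithm the package uses; the data and helper names are hypothetical.

```r
# Four objects on a line; object 4 is isolated
x <- c(0, 0.3, 0.6, 2)
d <- as.matrix(dist(x))
alphac <- 0.35
n <- nrow(d)

# A set is a clique if every pair of members is at most alphac dissimilar
is_clique <- function(s) all(d[s, s, drop = FALSE] <= alphac)

# All non-empty subsets of the objects (2^n - 1 of them)
subsets <- unlist(lapply(seq_len(n), function(k) combn(n, k, simplify = FALSE)),
                  recursive = FALSE)
cliques <- Filter(is_clique, subsets)

# A clique is maximal if no larger clique contains it
is_maximal <- function(s) {
  !any(vapply(cliques,
              function(t) length(t) > length(s) && all(s %in% t),
              logical(1)))
}
maximal <- Filter(is_maximal, cliques)

# How many maximal cliques each object belongs to
times_covered <- vapply(seq_len(n), function(i) {
  sum(vapply(maximal, function(s) i %in% s, logical(1)))
}, numeric(1))
```

Object 2 sits within alphac of both object 1 and object 3, while objects 1 and 3 are too far apart to share a clique, so object 2 appears in two maximal cliques.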
produces a list with elements:
alphac |
the threshold value used to establish the cliques |
musubx |
a matrix of object membership in each of the maximal cliques |
member |
a list of members of each clique |
WARNING. The run time of maximal clique analysis increases very steeply with the number of objects. The number of cliques generated, and the run time, are sensitive to ‘alphac’, as values of ‘alphac’ close to the mean dissimilarity of the matrix are likely to produce the most cliques and the longest run time. A solution for 1200 objects once took approximately 20 CPU days on a SparcStation. The example shown below (100 objects) runs in a few seconds on a modern computer.
David W. Roberts [email protected] http://ecology.msu.montana.edu/labdsv/R
data(shoshveg) # produces a vegetation dataframe
dis.bc <- dsvdis(shoshveg,'bray/curtis') # produces a Bray/Curtis dissimilarity matrix
cli.50 <- clique(dis.bc,0.5) # clusters at 0.5 dissimilarity, likely
                             # to run for a few seconds on most PCs
summary(cli.50)
The ‘clique.test’ function analyzes within-clique variability in attributes of objects other than those used to calculate the similarity relation. If the cliques exhibit a narrower range of values than expected at random it may be that the variable analyzed has an underlying role in determining the attributes on which the similarity is calculated.
clique.test(cliq,env,minsize=2,plotit=FALSE)
cliq |
an object of class ‘clique’ |
env |
a continuous environmental variable to test |
minsize |
the minimum size clique to test for range |
plotit |
a switch to control plotting each clique individually |
Produces a vector of probabilities, one for each clique that expresses the probability of obtaining a range of ‘env’ as small as observed. Also produces a plot of the sorted probabilities on the current device.
The ‘clique.test’ function actually calls the envrtest function once for each clique and stores the associated probability as determined by envrtest.
David W. Roberts [email protected]
data(shoshveg)
data(shoshsite)
dis.bc <- dsvdis(shoshveg,'bray')
## Not run: cli.60 <- clique(dis.bc,0.60) # will run for several
## Not run: print(clique.test(cli.60,shoshsite$swb)) # minutes
A clustering object is a list with a component called ‘clustering’ which is an integer vector of length n where n is the number of elements in a classification.
The clustering object is defined simply to allow a cleaner interface to functions in package ‘cluster’.
David W. Roberts [email protected]
http://ecology.msu.montana.edu/labdsv/R
Extracts specified columns from a constancy table (see const) and identifies species which occur in one of the two clusters (potential diagnostic species) or in both.
compare(const,left,right,thresh=0.2)
const |
a constancy table produced by function ‘const’ |
left |
a cluster number |
right |
a cluster number |
thresh |
a minimum differential abundance to list in the table |
compare extracts two columns (left and right) from a constancy table produced by const, and calculates the pairwise differences. Differences greater than the specified threshold appear in the set ‘left’; negative differences less than minus one times the threshold appear in the set ‘right’; and species which occur in both columns but with an absolute value of difference less than the threshold appear in the set ‘both’.
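The three sets can be illustrated with a hand-made constancy table in base R. This is a sketch of the rule as described, not the package's code; the table `const_tab` and its species names are invented for the example.

```r
# Hypothetical constancy values (fraction of plots occupied) in two clusters
const_tab <- data.frame(left  = c(0.90, 0.10, 0.50, 0.00),
                        right = c(0.20, 0.80, 0.45, 0.00),
                        row.names = c("sp_a", "sp_b", "sp_c", "sp_d"))
thresh <- 0.2

delta <- const_tab$left - const_tab$right           # pairwise differences
in_both <- const_tab$left > 0 & const_tab$right > 0  # species present in both columns

set_left  <- rownames(const_tab)[delta >  thresh]
set_right <- rownames(const_tab)[delta < -thresh]
set_both  <- rownames(const_tab)[in_both & abs(delta) < thresh]
```

Species absent from both columns, such as sp_d here, fall into none of the three sets.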
a list with elements
left |
a data.frame of species diagnostic of set ‘left’ |
right |
a data.frame of species diagnostic of set ‘right’ |
both |
species occurring in both sets and diagnostic of neither |
David W. Roberts [email protected]
http://ecology.msu.montana.edu/labdsv/R
data(shoshveg) # returns vegetation data set
data(shoshsite) # returns site data
elev.clust <- as.numeric(factor(cut(shoshsite$elevation,5))) # 5 elevation bands
elev.const <- const(shoshveg,elev.clust)
compare(elev.const,1,2) # identify diagnostic species
A confusion matrix is a cross-tabulation of actual class membership with memberships predicted by a discriminant function, classification tree, or other predictive model. A fuzzy confusion matrix is a confusion matrix that corrects for ‘near misses’ in prediction by comparing the similarity of the predicted type to the actual type and giving credit for the similarity.
confus(clustering,model,diss=NULL)
clustering |
an object of class ‘clustering’ or a vector of (integer or factor) class membership values |
model |
a predictive model of class ‘tree’ or ‘randomForest’ |
diss |
optionally, a dissimilarity object of class ‘dist’ from ‘dist’, ‘dsvdis’, or ‘vegdist’ |
Cross-classifies each sample by actual class membership and predicted membership, computing overall accuracy and the Kappa statistic of agreement. If a dissimilarity matrix is passed, calculates a fuzzy confusion matrix. In this case, correct predictions are assigned values of 1.0, and other predictions are given the value of the similarity of the two types and placed on the diagonal. The dissimilarity of the two types is added off the diagonal as fuzzy error.
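For the crisp (non-fuzzy) case, overall accuracy and a Kappa statistic can be computed from the cross-tabulation in a few lines of base R. The sketch assumes the usual Cohen's kappa, which may differ in detail from what ‘confus’ computes; the `actual` and `pred` vectors are made up.

```r
# Hypothetical actual and predicted class memberships for ten samples
actual <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))
pred   <- factor(c(1, 1, 2, 2, 2, 3, 3, 3, 3, 1), levels = levels(actual))

tab <- table(actual, pred)          # confusion matrix
n <- sum(tab)

correct  <- sum(diag(tab)) / n                           # observed agreement
expected <- sum(rowSums(tab) * colSums(tab)) / n^2       # agreement expected by chance
kappa    <- (correct - expected) / (1 - expected)        # Cohen's kappa (assumed form)
```

Kappa rescales the observed accuracy so that 0 corresponds to chance agreement given the marginal totals and 1 to perfect agreement.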
produces a list with elements
matrix |
the (fuzzy) cross-tabulation matrix as a data.frame |
correct |
the fraction of (fuzzily) correctly predicted samples |
kappa |
the value of the Kappa statistic |
legend |
the text legend for the cross-tabulation matrix |
Confusion matrices are commonly computed in remote sensing applications, but are equally suited to the evaluation of any predictive methods of class membership or factors.
David W. Roberts [email protected] http://ecology.msu.montana.edu/labdsv/R
data(shoshveg) # returns a data frame of vegetation data
data(shoshsite) # returns a data frame of site data
dis.bc <- dsvdis(shoshveg,'bray')
opt.5 <- optpart(5,dis.bc)
library(tree)
mod <- tree(factor(opt.5$clustering)~ elevation+slope+av, data=shoshsite)
confus(opt.5,mod)
confus(opt.5,mod,dis.bc)
Presents an ordered list of possible cluster combinations to consider for merging to simplify a classification.
consider(part)
part |
an object of class ‘partana’ from functions such as ‘optpart’ or ‘bestopt’ |
Simply sorts the cluster-to-cluster mean similarity matrix of a ‘partana’ object into a list sorted by mean similarity.
a data.frame with three elements:
row |
the current cluster number |
col |
the cluster to which it is most similar |
vals |
the mean similarity of the two clusters |
The listed combinations are not suggested to be optimal by any specific criterion. In fact, if the ‘partana’ object was generated by optpart or bestopt it is known that the suggested combinations are sub-optimal. Nevertheless, sometimes it is desirable to simplify a classification for other reasons.
David W. Roberts [email protected]
http://ecology.msu.montana.edu/labdsv/R
optpart, bestopt
data(shoshveg) # returns a vegetation data.frame
dis.bc <- dsvdis(shoshveg,'bray') # calculates a Bray/Curtis dissimilarity matrix
opt.5 <- optpart(5,dis.bc) # generates a 5 cluster partition
consider(opt.5) # recommends possible clusters to merge
Calculates the diameter (maximum within-cluster dissimilarity) of all clusters in a partition, as well as the average diameter across all clusters.
disdiam(x,dist,digits)
## S3 method for class 'stride'
disdiam(x,dist,digits=3)
## S3 method for class 'disdiam'
print(x, ...)
x |
a vector of integers or an object of class ‘clustering’, ‘partition’, ‘partana’, or ‘stride’ |
dist |
a dissimilarity or distance object of class ‘dist’ |
digits |
the number of significant digits reported in the output |
... |
ancillary arguments to the print function |
disdiam is a cluster validation routine that calculates the diameter (maximum within-cluster dissimilarity) of each cluster, as well as the average diameter across all clusters of size greater than one.
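The diameter calculation can be reproduced on toy data with base R. This is a sketch under the assumption that the mean is weighted by the number of members in each cluster; the vectors `x` and `cl` are invented.

```r
# Six objects on a line, assigned to three clusters (the last is a singleton)
x  <- c(0, 1, 3, 10, 12, 20)
cl <- c(1, 1, 1, 2, 2, 3)
d  <- as.matrix(dist(x))

ids  <- sort(unique(cl))
size <- as.vector(table(cl))        # cluster sizes, in the same order as ids

# Diameter = largest dissimilarity between any two members of a cluster
diam <- vapply(ids, function(k) max(d[cl == k, cl == k]), numeric(1))

diameters <- data.frame(cluster = ids, N = size, diameter = diam)

# Size-weighted mean over clusters with more than one member
keep <- size > 1
mean_diam <- sum(diam[keep] * size[keep]) / sum(size[keep])
```

Singleton clusters have a diameter of zero by construction, which is why they are left out of the mean.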
A list with components:
diameters |
a data.frame with clusters as rows, and cluster ID, cluster size, and diameter as columns |
mean |
the weighted mean diameter of clusters of size greater than one. The mean is weighted for cluster size. |
David W. Roberts [email protected]
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray')
opt.5 <- optpart(5,dis.bc)
disdiam(opt.5,dis.bc)
Extracts a specified cluster solution from an object of class ‘stride’. The desired solution is specified by the number of clusters.
## S3 method for class 'stride'
extract(stride,k)
stride |
an object of class ‘stride’ from function ‘stride’ |
k |
the number of clusters desired |
A stride object consists of a list with a data.frame of cluster solutions for varying numbers of clusters. Extract simply selects one column of this data.frame (specified by number of clusters desired, not column number) and returns that solution as an object of class ‘clustering’.
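The idea of selecting a solution by number of clusters rather than by column position can be sketched with base R. The layout used here (a data.frame with one column per number of clusters, named by that number) and the helper `pick_solution` are assumptions for illustration, not the actual internals of a ‘stride’ object.

```r
# Average linkage tree on five toy objects
hc <- hclust(dist(c(0, 1, 5, 6, 20)), "average")

# One column of cluster memberships per requested number of clusters
ks <- 2:4
sols <- as.data.frame(sapply(ks, function(k) cutree(hc, k)))
names(sols) <- ks                   # columns named "2", "3", "4"

# Hypothetical helper: look the column up by name, not by position
pick_solution <- function(sols, k) {
  list(clustering = sols[[as.character(k)]])
}

res <- pick_solution(sols, 3)       # the 3-cluster solution is column 2, not column 3
```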
an object of class ‘clustering’.
David W. Roberts [email protected]
data(shoshveg) # get vegetation data
dis.bc <- dsvdis(shoshveg,'bray') # calculate dissimilarity matrix
avg.hcl <- hclust(dis.bc,'average') # average linkage cluster analysis
avg.2.10 <- stride(2:10,avg.hcl) # compute stride
res <- extract(avg.2.10,8) # extract 8-cluster solution
Calculates a Lance and Williams flexible-beta dendrogram with simplified arguments
flexbeta(dis,beta=-0.25,alpha=(1-beta)/2,gamma=0)
dis |
a distance or dissimilarity object of class ‘dist’ |
beta |
the Beta coefficient |
alpha |
the Alpha coefficients (assumed equal) |
gamma |
the gamma coefficient |
Calculates a flexible-beta dendrogram from a dissimilarity matrix, specifying a minimum of parameters. The routine is simply a wrapper for the agnes function from package cluster with suitable arguments specified to achieve the desired results. Ecologists in particular (and many others) find beta = -0.25 a good default.
An object of class ‘hclust’ for plotting and analysis like other hclust objects, as compared to objects of class ‘agnes’ as generated by the agnes function in package cluster.
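A wrapper of this kind might look roughly like the following. It is a sketch, not the package source: the name `flexbeta_sketch` is hypothetical, and it assumes that passing the four Lance-Williams coefficients to `agnes` through `par.method` and converting with `as.hclust` is all that is needed.

```r
library(cluster)  # provides agnes(); a recommended package shipped with R

# Hypothetical re-implementation of the wrapper idea
flexbeta_sketch <- function(dis, beta = -0.25, alpha = (1 - beta) / 2, gamma = 0) {
  # par.method of length 4 is (alpha_1, alpha_2, beta, gamma); both alphas equal here
  fit <- agnes(dis, method = "flexible",
               par.method = c(alpha, alpha, beta, gamma))
  as.hclust(fit)                    # convert from class 'agnes' to class 'hclust'
}

d  <- dist(c(0, 1, 2, 6, 7, 15))    # toy data with three obvious groups
hc <- flexbeta_sketch(d)
groups <- cutree(hc, k = 3)
```

With the default beta = -0.25 the two alpha coefficients are 0.625 each, so alpha_1 + alpha_2 + beta = 1 as the flexible strategy requires.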
for the agnes function, Peter Rousseeuw for the original Fortran, Martin Maechler for the R code
for this function, David W. Roberts [email protected]
Lance, G.N., and W.T. Williams (1966). A General Theory of Classificatory Sorting Strategies, I. Hierarchical Systems. Computer J. 9, 373-380.
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray')
flexb <- flexbeta(dis.bc)
Calculates mean cluster silhouette widths using a generalized mean.
gensilwidth(clust, dist, p=1)
clust |
an integer vector of cluster memberships or a classification object of class ‘clustering’ |
dist |
an object of class ‘dist’ |
p |
the scaling parameter of the analysis |
gensilwidth calculates mean cluster silhouette widths using a generalized mean. The scaling parameter $p$ can be set between $[-\infty, \infty]$, where values less than one emphasize connectivity, and values greater than one emphasize compactedness. Individual sample unit silhouette widths are still calculated as
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$
where $a_i$ is the mean dissimilarity of a sample unit to the cluster to which it is assigned, and $b_i$ is the mean dissimilarity to the nearest neighbor cluster. Given $s_i$ for all $n$ members of a cluster, the generalized mean is calculated as
$$\bar{s} = \left( \frac{1}{n} \sum_{i=1}^{n} s_i^{p} \right)^{1/p}$$
Exceptions exist for specific values:
for $p=0$, $\bar{s} = \left( \prod_{i=1}^{n} s_i \right)^{1/n}$
for $p=-\infty$, $\bar{s} = \min(s_i)$
for $p=\infty$, $\bar{s} = \max(s_i)$
$p=-1$ = harmonic mean, $p=0$ = geometric mean, and $p=1$ = arithmetic mean.
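The generalized (power) mean and its special cases are straightforward to write out in base R. The function `gen_mean` below is a hypothetical helper that follows the standard definition of the power mean; it assumes strictly positive inputs and is not taken from the package.

```r
# Hypothetical helper: generalized mean of a positive numeric vector
gen_mean <- function(s, p = 1) {
  if (p == 0)    return(prod(s)^(1 / length(s)))  # geometric mean (limit as p -> 0)
  if (p == -Inf) return(min(s))                   # limit as p -> -Inf
  if (p == Inf)  return(max(s))                   # limit as p -> +Inf
  mean(s^p)^(1 / p)
}

s <- c(0.2, 0.4, 0.8)               # made-up silhouette widths for one cluster

harmonic   <- gen_mean(s, -1)
geometric  <- gen_mean(s, 0)
arithmetic <- gen_mean(s, 1)
```

For a fixed set of positive values the generalized mean never decreases as p grows, so lower p lets the smallest widths in a cluster dominate and higher p lets the largest ones dominate.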
an object of class ‘silhouette’, a list with components
cluster |
the assigned cluster for each sample unit |
neighbor |
the identity of the nearest neighbor cluster for each sample unit |
sil_width |
the silhouette width for each sample unit |
Attila Lengyel and Zoltan Botta-Dukat wrote the algorithm
David W. Roberts [email protected] http://ecology.msu.montana.edu/labdsv/R
Lengyel, A. and Z. Botta-Dukat. 2019. Silhouette width using generalized mean: A flexible method for assessing clustering efficiency. Ecology and Evolution https://doi.org/10.1002/ece3.5774
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray')
opt.5 <- optpart(5,dis.bc)
gensilwidth(opt.5,dis.bc)
Compares two classifications by calculating the Goodman-Kruskal Index of association
lambda(x,y,digits=5)
x |
an object of class ‘clustering’, ‘partana’, ‘partition’ or a vector identifying membership of objects in classes with names attribute |
y |
an object of class ‘clustering’, ‘partana’, ‘partition’ or a vector identifying membership of objects in classes with names attribute |
digits |
the number of digits of the statistic to report |
lambda calculates the Goodman-Kruskal index of association:
Prints a cross-tabulated table and the lambda statistic, and (invisibly) returns the lambda statistic
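As a point of reference, the symmetric form of Goodman and Kruskal's lambda can be computed from a cross-tabulation with base R. That ‘lambda’ uses exactly this symmetric form is an assumption; the membership vectors `x` and `y` are invented.

```r
# Two hypothetical classifications of the same ten objects
x <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
y <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 1)

tab <- table(x, y)
n <- sum(tab)

row_max <- apply(tab, 1, max)       # best guess of y within each class of x
col_max <- apply(tab, 2, max)       # best guess of x within each class of y
max_r <- max(rowSums(tab))          # largest marginal total of x
max_c <- max(colSums(tab))          # largest marginal total of y

# Symmetric lambda: proportional reduction in prediction error, both directions pooled
lam <- (sum(row_max) + sum(col_max) - max_r - max_c) / (2 * n - max_r - max_c)
```

A value of 0 means that knowing one classification does not improve prediction of the other beyond guessing the largest class; 1 means each classification fully determines the other.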
David W. Roberts [email protected]
http://ecology.msu.montana.edu/labdsv/R
partition, optpart, slice, classmatch
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray/curtis')
opt.5 <- optpart(5,dis.bc)
pam.5 <- pam(dis.bc,5)
lambda(opt.5,pam.5)
Maximally similar sets is an approach to deriving relatively homogeneous subsets of objects as determined by similarity of the composition of the objects. Maximally similar sets are a covering, as opposed to a partition, of objects. The sets so derived can be tested against random sets of the same size to determine whether a vector of independent data exhibits an improbably restricted distribution within the sets.
maxsimset(dist,size=NULL,alphac=NULL,mean=FALSE)
mss.test(mss, env, panel = 'all', main = deparse(substitute(env)), ...)
## S3 method for class 'mss'
plot(x, ...)
## S3 method for class 'mss'
getsets(mss)
dist |
a dist object from ‘dist’, ‘dsvdis’, or ‘vegdist’ |
size |
the size of desired sets |
alphac |
the alpha-cut to specify maximum dissimilarity for inclusion in a set |
mean |
if mean is FALSE (the default), the algorithm uses a furthest neighbor criterion; if mean is TRUE, it uses a mean similarity criterion |
mss |
an object of class ‘mss’ |
env |
a quantitative environmental variable for analysis |
main |
a title for the plot of mss.test |
panel |
an integer switch to indicate which panel to draw |
x |
an object of class ‘mss’ from maxsimset |
... |
ancillary arguments for ‘plot’ |
maxsimset starts with each sample as a seed, and adds the most similar plot to the set. Plots are added in turn to the set (up to the size specified, or to the maximum dissimilarity specified) in order of maximum similarity. If mean is FALSE, the sample most similar to a set is the sample with the max-min similarity, that is, the sample whose minimum similarity to the set is highest, equivalent to furthest-neighbor or complete-linkage in cluster analysis. If mean is TRUE, the sample most similar to a set is the sample with highest mean similarity to the set. Once the sets are determined for each seed, the list is examined for duplicate sets, which are deleted, to return the list of unique sets.
If ‘alphac’ is specified, sets are grown to maximum size, or to maximum dissimilarity as specified by alphac, whichever is smaller.
The ‘mss.test’ function analyzes within-set variability in attributes of the objects other than those used to calculate the similarity relation. If maximally similar sets exhibit a narrower range of values than expected at random it may be that the variable analyzed has an underlying role in determining the attributes on which the similarity is calculated. The function ‘plot’ plots the sorted within-set range of values in red, and the sorted range of values of random sets of the same size in black. This is followed by a boxplot of within-set values for the random replicates versus the observed sets, and a Wilcoxon rank sum test of the difference.
‘getsets’ expands and pulls out the maximally similar sets as a list of logical membership vectors for use in other analyses.
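The set-growing procedure can be sketched in base R for a fixed set size. This is an illustrative reading of the description and not the package's algorithm: the helper `grow_set`, the conversion of distances to similarities, and the first-maximum tie-breaking are all assumptions.

```r
# Hypothetical helper: grow one set from a seed, adding the most similar sample each time
grow_set <- function(sim, seed, size, mean = FALSE) {
  set <- seed
  while (length(set) < size) {
    cand <- setdiff(seq_len(nrow(sim)), set)
    score <- vapply(cand, function(j) {
      # mean = FALSE: minimum similarity to the set (furthest-neighbor criterion)
      if (mean) mean(sim[j, set]) else min(sim[j, set])
    }, numeric(1))
    set <- c(set, cand[which.max(score)])
  }
  set
}

x <- c(0, 1, 2, 5, 6, 7)                         # six toy samples in two groups
sim <- 1 - as.matrix(dist(x)) / max(dist(x))     # rescale distances to similarities

sets <- lapply(seq_along(x), grow_set, sim = sim, size = 3)  # one set per seed
unique_sets <- unique(lapply(sets, sort))                    # drop duplicate sets
```

Every sample seeds its own set, so duplicates are common; here the six seeds collapse to two unique sets.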
an object of class ‘mss’, a list with elements:
musubx |
a matrix of sample membership in the sets where membership is given by the similarity with which a sample joined the set |
member |
a list of set members in the order they were added to the set |
numset |
the number of unique sets derived |
size |
the number of members in each set |
distname |
the name of the dissimilarity/distance object employed |
David W. Roberts [email protected]
data(shoshveg)
data(shoshsite)
dis.bc <- dsvdis(shoshveg,'bray/curtis')
mss.10 <- maxsimset(dis.bc,10)
## Not run: mss.test(mss.10,shoshsite$elevation)
# plots graph and produces summary
Re-assigns members of one cluster to another specified cluster, reducing the number of clusters by one.
mergeclust(clustering,from,to)
clustering |
a vector of (integer) cluster memberships, or an object of class ‘partition’, ‘partana’, or ‘clustering’ |
from |
the cluster number to be vacated |
to |
the cluster to which members will be re-assigned |
The function simply renumbers members of one cluster with the number of another, but greatly simplifies managing list objects of class ‘partition’, ‘partana’, or ‘clustering’ and simplifies the syntax.
A list object of class ‘clustering’ specifying cluster membership for every object.
David W. Roberts [email protected]
http://ecology.msu.montana.edu/labdsv/R
partition, partana, and clustering
data(shoshveg) # returns a vegetation data.frame
dis.bc <- dsvdis(shoshveg,'bray/curtis') # returns a Bray/Curtis dissimilarity matrix
opt.5 <- optpart(5,dis.bc) # five cluster partition
opt.5a <- mergeclust(opt.5,5,4) # reassigns members from cluster 5 to 4
Calculates the indicator value of species in a single cluster or environment type using the Murdoch Preference Function
murdoch(comm,type,minval=0,minplt=10)
## S3 method for class 'murdoch'
summary(object,pval=0.05,digits=3,...)
## S3 method for class 'murdoch'
plot(x,axtype=1,pval=0.05,...)
## S3 method for class 'murdoch'
print(x,digits = 5, ...)
comm |
a matrix or data.frame of samples with species as columns and samples as rows |
type |
a logical vector with values of TRUE for samples in a specific cluster or type |
minval |
a threshold minimum abundance value to count as a presence |
minplt |
the minimum number of presences to include a species in the calculation |
object |
an object of class ‘murdoch’ |
pval |
the maximum probability to include a species in the summary table |
digits |
the number of digits to report |
... |
ancillary arguments to maintain compatibility with the generic summary function |
x |
an object of class ‘murdoch’ |
axtype |
a switch to control scaling of the x axis in the plot. 1=number of plots in the data set, other = number of presences in the type |
Calculates the indicator value of species for a specific type using the modified Murdoch statistic:
where:
$p$ = number of samples where species is present,
$a$ = number of samples where species is absent,
$n$ = total number of samples ($p+a$),
$p_i$ = number of samples in type $i$ where species is present,
$n_i$ = number of samples in type $i$.
Probabilities are based on the hypergeometric distribution calculation of having as many or more presences in a type as observed.
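The tail probability described here, that of seeing as many or more presences in a type as observed, can be computed with base R's `phyper`. The counts below are invented, and the mapping of the quantities onto `phyper`'s arguments is a sketch of the standard hypergeometric setup rather than the package's code.

```r
# Hypothetical counts for one species and one type
n   <- 50    # total number of samples
p   <- 12    # samples where the species is present
n_i <- 10    # samples in the type
p_i <- 6     # samples in the type where the species is present

# P(X >= p_i) when n_i samples are drawn from p presences and n - p absences.
# phyper() gives P(X <= q), so the upper tail starting at p_i uses q = p_i - 1.
pval <- phyper(p_i - 1, p, n - p, n_i, lower.tail = FALSE)

# Same quantity by summing the point probabilities directly
check <- sum(dhyper(p_i:min(p, n_i), p, n - p, n_i))
```

Subtracting one from the observed count is the step most often missed: without it the observed value itself would be excluded from the tail.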
a list object of class ‘murdoch’ with components:
minplt |
the minimum number of presences to be included |
nplots |
the number of plots a species occurs in |
type |
the plot membership vector for the type |
pres |
the number of presences for species in the type |
abs |
the number of absences of species in the type |
murdoch |
the Murdoch value for species in the type |
pval |
the probability of getting such a high murdoch value |
Indicator value analysis is a set of techniques designed to identify species of special interest in clusters or types. The most widely used indicator species analysis was proposed by Dufrene and Legendre (1997), and is included in package ‘labdsv’ as indval. murdoch differs significantly from indval in assumption and objective, seeking to identify species that have improbable occurrences in types, regardless of their relative frequency in the type.
David W. Roberts with help from Ken Aho [email protected] http://ecology.msu.montana.edu/labdsv/R
data(shoshveg) # returns a vegetation dataframe
dis.bc <- dsvdis(shoshveg,'bray/curtis') # returns a dissimilarity matrix
opt.5 <- optpart(5,dis.bc)
plot(murdoch(shoshveg,opt.5$clustering==1))
Calculates the nearest neighbor (least dissimilar cluster) for each item in partition to identify the topology of the partition.
neighbor(x,all=FALSE)
x |
an object of class ‘pam’ or class ‘partana’ |
all |
a logical switch to control which items are included in the calculation |
Each item in a partition has membership in a cluster. The nearest neighbor of an item is the cluster to which the item is least dissimilar, other than the one to which it belongs. If ‘all’ is TRUE, then every item is included in the analysis. If ‘all’ is FALSE, only ‘misfits’ are included in the calculation. If the first argument is an object of class ‘pam’, then a misfit is an item with a negative silhouette width (see silhouette). If the first argument is an object of class ‘partana’, a misfit is an item with lower mean dissimilarity to another cluster than to the one to which it belongs.
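The ‘partana’ style definition of a misfit can be illustrated in base R. The sketch below uses a toy one-dimensional data set and hypothetical variable names; it is not the code used by neighbor, only an illustration of the rule stated above.

```r
# Toy data: six points on a line in two clusters (hypothetical values)
x <- c(0, 1, 2, 10, 11, 5.9)
clustering <- c(1, 1, 1, 2, 2, 1)
d <- as.matrix(dist(x))
ids <- sort(unique(clustering))
n <- length(clustering)

# Mean dissimilarity of every item to every cluster, leaving the item itself out
mean_dis <- sapply(ids, function(k) {
  sapply(seq_len(n), function(i) {
    members <- setdiff(which(clustering == k), i)
    mean(d[i, members])
  })
})

# Mean dissimilarity of each item to its own cluster
own_index <- cbind(seq_len(n), match(clustering, ids))
own <- mean_dis[own_index]

# Nearest neighbor: the least dissimilar cluster other than the item's own
other <- mean_dis
other[own_index] <- Inf
neighbor_id <- ids[apply(other, 1, which.min)]

# Misfits are closer, on average, to their neighbor than to their own cluster
misfit <- apply(other, 1, min) < own

# Clusters as rows, neighbors as columns
tab <- table(cluster = clustering[misfit], neighbor = neighbor_id[misfit])
```

In this toy example only the sixth item is a misfit: it is assigned to cluster 1 but is on average closer to the members of cluster 2.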
A table with clusters as rows, and neighbors as columns.
David W. Roberts [email protected]
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray')
opt.5 <- optpart(5,dis.bc)
neighbor(opt.5,all=TRUE)
Calculates the number of indicator species/cluster across a range of partitions
optimclass(comm, stride, pval = 0.01, counts = 2)
comm |
a community matrix with sample units as rows and species as columns |
stride |
an object of class ‘stride’ from function stride |
pval |
the minimum probability for inclusion in the list of indicators |
counts |
the minimum number of clusters for inclusion in the list |
Calculates the number of indicator species/cluster and the number of
clusters with at least ‘counts’ indicators, using the $\phi$ index to
identify indicators with probabilities less than or equal to ‘pval’.
Arguably the optimal partition is the one with the most indicator species and
the most clusters with adequate indicators.
A data.frame of
clusters |
number of clusters |
sig.spc |
the number of species with significant indicator value |
sig.clust |
the number of clusters with at least ‘counts’ indicator species |
The concept and first implementation were by Tichy in software package ‘Juice’, and this is a simple port of the algorithm to R.
Lubomir Tichy wrote the original algorithm
David W. Roberts [email protected]
Tichy, L., M. Chytry, M. Hajek, S. Talbot, and Z. Botta-Dukat. 2010. OptimClass: Using species-to-cluster fidelity to determine the optimal partition in classification of ecological communities. J. Veg. Sci. 21:287-299.
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray')
opt.2.10 <- stride(2:20,dis.bc)
## Not run: optimclass(shoshveg,opt.2.10)
optindval is an iterative re-assignment classification algorithm that assigns samples to clusters to maximize the sum of indicator values.
optindval(comm,clustering,maxitr=100,minsiz=5)
comm |
a vegetation or other taxon table with samples as rows and taxa as columns |
clustering |
an index of cluster membership for each sample. May be either a numeric vector of length equal to the number of samples, or an object that inherits from class ‘cluster’ |
maxitr |
the maximum number of iterations to attempt |
minsiz |
the minimum size cluster to consider reassigning a sample out of |
Iterative re-allocation algorithms temporarily re-assign each sample to each of the other possible clusters and calculate a goodness-of-clustering statistic for each re-assignment. The best of all possible re-assignments is then executed and the algorithm iterates until there are no more good re-assignments or the maximum number of iterations is reached. In optindval, the goodness-of-clustering statistic is the sum of Dufrene and Legendre indicator values.
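The general re-allocation scheme can be sketched in a few lines of base R. This is a hedged illustration only: greedy_reassign and its score argument are hypothetical names, the treatment of minsiz (clusters smaller than minsiz are never drained) is an assumption, and the example uses a stand-in statistic (negative within-cluster sum of squares) rather than Dufrene and Legendre indicator values.

```r
# Generic greedy re-assignment: 'score' is any function of a membership vector
# where larger values mean a better clustering (hypothetical helper).
greedy_reassign <- function(clustering, score, maxitr = 100, minsiz = 1) {
  ids <- sort(unique(clustering))
  best <- score(clustering)
  trace <- best
  for (itr in seq_len(maxitr)) {
    best_move <- NULL
    for (i in seq_along(clustering)) {
      # assumption: do not move samples out of clusters smaller than minsiz
      if (sum(clustering == clustering[i]) < minsiz) next
      for (k in setdiff(ids, clustering[i])) {
        trial <- clustering
        trial[i] <- k                 # temporary re-assignment
        s <- score(trial)
        if (s > best) {               # remember the best move seen so far
          best <- s
          best_move <- c(i, k)
        }
      }
    }
    if (is.null(best_move)) break     # no re-assignment improves the statistic
    clustering[best_move[1]] <- best_move[2]
    trace <- c(trace, best)
  }
  list(clustering = clustering, trace = trace, numitr = length(trace) - 1)
}

# Stand-in statistic: negative within-cluster sum of squares of a toy variable
x <- c(1, 2, 3, 11, 12, 13)
neg_wss <- function(cl) -sum(tapply(x, cl, function(v) sum((v - mean(v))^2)))
res <- greedy_reassign(c(1, 1, 2, 1, 2, 2), neg_wss)
```

Because every candidate move requires re-evaluating the statistic, the cost per iteration grows with the number of samples times the number of clusters, which is why a good starting classification matters.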
a list of class "optindval","clustering" with components:
numitr |
the number of iterations performed |
sums |
a vector of indicator value probability sums |
clustering |
the vector of cluster memberships (as integers) for each sample |
Like many iterative re-assignment algorithms, optindval is likely to be VERY slow from a random start or poor initial condition. optindval is probably better used to polish existing classifications.
David W. Roberts [email protected]
http://ecology.msu.montana.edu/labdsv/R
optpart, opttdev, optsil
data(shoshveg) # returns a data.frame of vegetation data called shoshveg
dis.bc <- dsvdis(shoshveg,'bray') # generate Bray/Curtis dissimilarity matrix
opt.5 <- optpart(5,dis.bc) # generate 5-cluster optpart
## Not run: res <- optindval(shoshveg,opt.5) # polish the optpart result
## Not run: classmatch(opt.5,res) # see the plot re-assignments
Optimal partitioning is an iterative re-allocation algorithm to maximize the ratio of within-cluster similarity/among-cluster similarity for a given number of clusters. Optpart can operate as either a crisp (classical) partitioning, or a fuzzy partitioning algorithm.
optpart(x, dist, maxitr = 100, mininc = 0.001, maxdmu = 1)
x |
an integer, integer vector, factor vector, or objects of class ‘clustering’, ‘partana’, ‘partition’ or ‘stride’ |
dist |
an object of class ‘dist’ |
maxitr |
the maximum number of iterations to perform |
mininc |
the minimum increment in the within/among similarity ratio to continue iterating |
maxdmu |
the ‘maximum delta mu’. If 1, a crisp (non-fuzzy) partition results. If (0,1) a fuzzy partition results. |
optpart produces a partition, or clustering, of items into clusters by iterative reallocation of items to clusters so as to maximize the within cluster/ among cluster similarity ratio. At each iteration optpart ranks all possible re-allocations of a sample from one cluster to another. The re-allocation that maximizes the change in the within-cluster/among-cluster ratio is performed. The next best reallocation is considered, and if it does not include any clusters already modified, it is also performed, as re-allocations of independent clusters are independent and additive in effect. When no further re-allocations can be performed in that iteration, the algorithm recalculates all possible re-allocations and iterates again. When no re-allocations exist that improve the within/among ratio greater than ‘mininc’, or the maximum number of iterations is reached, the algorithm stops.
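The quantity being optimized can be illustrated with a small base R function. This sketch assumes that similarity is 1 minus dissimilarity and that the ratio is the mean within-cluster similarity divided by the mean among-cluster similarity, with each pair counted once; wa_ratio is a hypothetical name, not a function of the package.

```r
# Within/among similarity ratio for a crisp partition (illustrative sketch)
wa_ratio <- function(dis, clustering) {
  sim <- 1 - as.matrix(dis)                       # assumed similarity transform
  upper <- upper.tri(sim)                         # count each pair exactly once
  same <- outer(clustering, clustering, "==")     # TRUE for within-cluster pairs
  mean(sim[upper & same]) / mean(sim[upper & !same])
}

# Toy dissimilarity matrix for four items (hypothetical values)
m <- matrix(c(0.0, 0.1, 0.8, 0.9,
              0.1, 0.0, 0.7, 0.8,
              0.8, 0.7, 0.0, 0.2,
              0.9, 0.8, 0.2, 0.0), nrow = 4)
dis <- as.dist(m)

good <- wa_ratio(dis, c(1, 1, 2, 2))   # groups the similar items together
poor <- wa_ratio(dis, c(1, 2, 1, 2))   # splits the similar items apart
```

A re-allocation is worthwhile in this scheme when it raises the ratio by more than ‘mininc’.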
optpart is designed to run from a random start or the levels of a factor, or preferably from existing initial partitions. Specifying a single integer gives the number of clusters desired using a random start. Specifying an integer vector gives the initial assignments of items to clusters. Initial assignments can also be extracted from a number of objects. Specific methods exist for objects of class ‘clustering’ from functions slice or archi, class ‘partana’ from function partana, class ‘stride’ from stride, or class ‘partition’ from functions pam or diana. optpart is deterministic from a given initial condition. To get good results from a random start, multiple random starts should be attempted, using function bestopt.
Optpart is an unweighted algorithm, i.e. each of the pairwise distances or dissimilarities is included in the calculation of the ratio exactly once. Optpart somewhat penalizes small clusters, as a cluster of $n$ members contributes only $(n^2-n)/2$ values to the numerator; the extreme case is that a cluster with a single member does not contribute anything to the numerator.
It is an interesting characteristic of optpart that no minimum cluster size is enforced, and it is common for partitions with a large number of clusters to contain null clusters, i.e. clusters with no members. This is not a bug or error, but rather an indication that a partition with fewer clusters achieves a better within/among similarity ratio than one with more clusters. It is also somewhat common that for solutions with a small or intermediate number of clusters, optpart places outliers in a small ‘trash’ cluster.
When optpart is run as a fuzzy partitioning algorithm, it often achieves a surprisingly low entropy, with many items assigned completely to a single cluster.
an object of class partana, a list with elements:
ptc |
a matrix of item mean similarity to each cluster |
ctc |
a matrix of mean cluster-to-cluster similarity |
musubx |
a matrix of membership of each item to each cluster. If |
clustering |
a vector giving the cluster each item is assigned to. If optpart is run as a fuzzy partitioning, this is determined by the maximum membership observed. |
ratio |
the vector of within/among similarities achieved at each iteration. The final non-zero value is the final ratio achieved. |
numitr |
the number of iterations performed |
names |
the names of the items clustered |
David W. Roberts [email protected] http://ecology.msu.montana.edu/labdsv/R
partana
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray/curtis')
opt.5 <- optpart(5,dis.bc)
summary(opt.5)
Silhouette width is a measurement of the mean similarity of each object to the other objects in its cluster, compared to its mean similarity to the most similar cluster (see silhouette). Optsil is an iterative re-allocation algorithm to maximize the mean silhouette width of a clustering for a given number of clusters.
optsil(x,dist,maxitr)
x |
an integer, a vector of integers, an object of class ‘clustering’, ‘partana’, ‘partition’, or ‘stride’ |
dist |
an object of class ‘dist’ |
maxitr |
the maximum number of iterations to perform |
optsil produces a partition, or clustering, of items into clusters by iterative reallocation of items to clusters so as to maximize the mean silhouette width of the classification. At each iteration optsil ranks all possible re-allocations of an item from one cluster to another. The reallocation that maximizes the change in the mean silhouette width is performed. Because silhouette widths are not independent of clusters that are not modified, only a single reallocation can be performed in a single iteration. When no further re-allocations result in an improvement, or the maximum number of iterations is reached, the algorithm stops.
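For reference, the mean silhouette width that optsil maximizes can be computed by hand in base R. The sketch below follows the usual definition based on dissimilarities, s(i) = (b - a)/max(a, b), where a is the mean dissimilarity of item i to the other members of its cluster and b is the mean dissimilarity to the nearest other cluster. sil_width is a hypothetical helper, not part of the package; in practice function silhouette in package ‘cluster’ does this work.

```r
# Silhouette widths from a dissimilarity object and a membership vector
sil_width <- function(dis, clustering) {
  d <- as.matrix(dis)
  ids <- sort(unique(clustering))
  sapply(seq_along(clustering), function(i) {
    own <- setdiff(which(clustering == clustering[i]), i)
    if (length(own) == 0) return(0)   # singleton clusters get width 0
    a <- mean(d[i, own])              # mean dissimilarity to own cluster
    b <- min(sapply(setdiff(ids, clustering[i]),
                    function(k) mean(d[i, clustering == k])))  # nearest other cluster
    (b - a) / max(a, b)
  })
}

# Toy example: five points on a line in two clusters (hypothetical values)
x <- c(0, 1, 2, 10, 11)
cl <- c(1, 1, 1, 2, 2)
s <- sil_width(dist(x), cl)
mean_sil <- mean(s)
```

Moving a single item changes a and b for items in other clusters as well, which is why only one reallocation is made per iteration.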
Optsil is an unweighted algorithm, i.e. each of the objects is included in the calculation exactly once.
Optsil can be extremely slow to converge, and is best used to ‘polish’ an existing partition or clusterings resulting from slicing an hclust or from functions optpart, pam, diana or other initial clusterings. It is possible to run optsil from a random start, but it is EXTREMELY SLOW to converge, and should be done only with caution.
a list with elements:
clustering |
a vector of integers giving the cluster assignment for each object |
sils |
a vector of the silhouette widths achieved at each iteration |
numitr |
the number of iterations performed |
David W. Roberts [email protected]
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray/curtis')
opt.5 <- optpart(5,dis.bc)
sil.5 <- optsil(opt.5,dis.bc,100) # may take a few minutes
summary(silhouette(sil.5,dis.bc))
## Not run: plot(silhouette(sil.5,dis.bc))
opttdev is an iterative re-assignment classification algorithm that assigns samples to clusters to minimize the total deviance of a table with respect to the row-wise relative abundance of the elements.
opttdev(comm,clustering,maxitr=100,minsiz=5)
comm |
a vegetation or other taxon table with samples as rows and taxa as columns |
clustering |
an index of cluster membership for each sample. May be either a numeric vector of length equal to the number of samples, or an object that inherits from class ‘cluster’ |
maxitr |
the maximum number of iterations to attempt |
minsiz |
the minimum size cluster to consider reassigning a sample out of |
Iterative re-allocation algorithms temporarily re-assign each sample to each of the other possible clusters and calculate a goodness-of-clustering statistic for each re-assignment. The best of all possible re-assignments is then executed and the algorithm iterates until there are no more good re-assignments or the maximum number of iterations is reached. In opttdev, the goodness-of-clustering statistic is total table deviance as calculated by tabdev. See the help file for tabdev for more detail.
a list which inherits from class ‘opttdev’, ‘clustering’ with components:
numitr |
the number of iterations performed |
dev |
a vector of total table deviance at each iteration of length ‘numitr’ |
clustering |
the vector of cluster memberships (as integers) for each sample |
Like many iterative re-assignment algorithms, opttdev is likely to be VERY slow from a random start or poor initial condition. opttdev is probably better used to polish existing classifications.
David W. Roberts [email protected]
http://ecology.msu.montana.edu/labdsv/R
## Not run: data(shoshveg) # returns a data.frame of vegetation
## Not run: data(shoshsite)
## Not run: res <- opttdev(shoshveg, as.numeric(cut(shoshsite$elevation,5))) # likely to be VERY slow
Partition analysis evaluates the within-cluster to among-cluster similarity of classifications as a measure of cluster validity
partana(c,dist)
## S3 method for class 'partana'
summary(object, ...)
## S3 method for class 'partana'
plot(x,panel='all',zlim=range(x$ptc),col=heat.colors(12),...)
c |
an integer or factor vector, or an object of class ‘clustering’, ‘partana’, ‘partition’, or ‘stride’ |
dist |
an object of class ‘dist’ |
object |
an object of class ‘partana’ |
x |
an object of class ‘partana’ |
panel |
an integer switch to indicate which panel to draw |
zlim |
the min and max values for the color map |
col |
a color map name (heat.colors(12) is the default) |
... |
ancillary arguments to pass to summary or plot |
Calculates mean object-to-cluster similarity, mean cluster-to-cluster similarity, and mean within-cluster to among-cluster similarity. partana operates on partitions or clusterings produced by a wide range of algorithms, including specific methods for the products of functions optpart, slice, pam and diana.
summary produces a matrix of the mean cluster-to-cluster similarities, and the overall within-cluster/among-cluster similarity ratio.
plot plots two panels in sequence in the current device. The first shows the mean similarity of every object to each cluster, sorted by mean similarity to the other members of its own cluster, with objects as columns and clusters as rows. The second panel shows the mean similarity of every cluster to every other cluster and mean within-cluster similarity, ignoring cluster size. These plots are known as ‘Mondriaan’ plots, where the similarities are given by lines colored from min to max. If the ‘partana’ object was produced by optpart, a third panel is plotted showing the trace of the optimization.
an object of class ‘partana’ with components:
ptc |
matrix of mean object-to-cluster similarity |
ctc |
matrix of mean cluster-to-cluster similarity |
clustering |
vector of numeric cluster assignments |
ratio |
within-cluster to among-cluster similarity ratio |
David W. Roberts [email protected]
partition, optpart, plot.partana
data(shoshveg)
data(shoshsite)
dis.bc <- dsvdis(shoshveg,'bray/curtis')
demo.part <- partana(cut(shoshsite$elev,5),dis.bc)
summary(demo.part)
Convert an object of class ‘partana’ or class ‘clustering’ to an object of class ‘partition’.
partition(x, dist, ...)
x |
an object which inherits from class ‘clustering’ |
dist |
an object of class ‘dist’ |
... |
ancillary arguments to pass to ‘partition’ |
A ‘partition’ object is the output of several functions in package ‘cluster’. This utility function converts objects from package ‘optpart’ to ‘partitions’ so that functions in that library are available.
an object of class ‘partition’ with components (and possibly others):
clustering |
vector of numeric cluster assignments |
silinfo |
a list with all silhouette information, only available when
the number of clusters is non-trivial, i.e., |
David W. Roberts [email protected] http://ecology.msu.montana.edu/labdsv/R
silhouette, partition, optpart
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray/curtis')
part <- partana(sample(1:5,nrow(shoshveg),replace=TRUE),dis.bc)
result <- partition(part,dis.bc)
Calculates the $\phi$ statistic on a classified table of taxa.
phi(comm,clustering,minplt=10,p.adjust=FALSE)
comm |
a data.frame with samples as rows and attributes as columns |
clustering |
a vector of integers or an object of class ‘clustering’, ‘partition’, or ‘partana’ |
minplt |
the minimum number of samples a species must occur in to be included in the calculation |
p.adjust |
switch to control adjusting probabilities for simultaneous inference by Hochberg correction |
$\phi$ is a statistic of agreement between two vectors. In this case the function calculates the distribution of each species within clusters of a partition, and calculates the $\phi$ statistic for each species in each cluster:
$$\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$$
where:
a | sample is in specified type and species is present |
b | sample is not in type and species is present |
c | sample is in type but species is not present |
d | sample is not in type and species is not present |
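The four counts in the table above are enough to compute the statistic by hand. The following base R sketch uses the standard phi coefficient of a 2x2 table, (ad - bc)/sqrt((a+b)(c+d)(a+c)(b+d)); phi_coef and the example vectors are hypothetical, and the sketch makes no claim about how the package function handles ‘minplt’ or probability adjustment.

```r
# Phi coefficient for one species and one type (illustrative sketch)
phi_coef <- function(present, in_type) {
  n_a <- sum(in_type & present)     # in type, species present
  n_b <- sum(!in_type & present)    # not in type, species present
  n_c <- sum(in_type & !present)    # in type, species absent
  n_d <- sum(!in_type & !present)   # not in type, species absent
  (n_a * n_d - n_b * n_c) /
    sqrt((n_a + n_b) * (n_c + n_d) * (n_a + n_c) * (n_b + n_d))
}

# Eight hypothetical samples: presence of a species and membership in a type
present <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
in_type <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
phi_value <- phi_coef(present, in_type)

# For two binary vectors phi equals the Pearson correlation
phi_check <- cor(as.numeric(present), as.numeric(in_type))
```

Values near 1 indicate a species concentrated in the type, values near 0 indicate no association, and negative values indicate avoidance.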
A data.frame of values with species as rows and clusters as columns
David W. Roberts [email protected]
Tichy, L. and M. Chytry. 2006. Statistical determination of diagnostic species for site groups of unequal size. Journal of Vegetation Science 17:809-818.
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray')
opt.5 <- optpart(5,dis.bc)
phi(shoshveg,opt.5)
Refine allows you to re-assign specific elements of a classification from one class or cluster to another. In the default case, you simply interactively enter sample IDs and give a new cluster assignment. For PCO and NMDS ordinations, you do the assignments with a mouse.
## Default S3 method:
refine(comm,clustering,...)
## S3 method for class 'dsvord'
refine(x,clustering,ax=1,ay=2,...)
comm |
a community data.frame |
x |
an ordination of class ‘dsvord’ |
clustering |
a clustering identity or membership vector |
ax |
the X axis of the ordination |
ay |
the Y axis of the ordination |
... |
ancillary arguments to allow differing numbers of arguments |
a list object of class ‘clustering’ with one component.
clustering |
a numeric vector giving the cluster assignment for each sample |
There are many, many ways to produce classifications in R, including several in package ‘optpart’. refine is designed to take one of these classifications and polish it by making relatively few transfers. The ordination-based routines allow visual assessment of cluster validity, although in reduced dimensionality, which can be misleading.
David W. Roberts [email protected]
http://ecology.msu.montana.edu/labdsv/R
## Not run: data(shoshveg)
## Not run: dis.bc <- dsvdis(shoshveg,'bray')
## Not run: opt.5 <- optpart(5,dis.bc)
## Not run: nmds.bc <- nmds(dis.bc)
## Not run: plot(nmds.bc)
## Not run: res <- refine(nmds.bc,opt.5)
In its simplest form, reordclust reassigns cluster numbers in an existing classification to re-order tables and graphs. It can also be used to combine clusters into fewer clusters.
reordclust(clustering,from,to)
clustering |
a vector of (integer) cluster memberships, or an object of class ‘clustering’, ‘partana’, or ‘partition’ |
from |
an integer vector equal in length to the number of clusters that specifies the current clusters |
to |
an integer vector equal in length to the number of clusters that specifies the clusters the current clusters map to |
The function simply maps cluster numbers in the ‘from’ vector to the respective cluster number in the ‘to’ vector.
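The mapping itself amounts to a lookup, which can be sketched in base R with match. This is an illustration of the idea on a plain integer vector with made-up values, not the package's implementation.

```r
# Hypothetical membership vector with four clusters
clustering <- c(1, 2, 3, 4, 4, 2, 1, 3)
from <- 1:4

# Pure re-ordering: every cluster gets a new, distinct number
to_reorder <- c(4, 3, 2, 1)
reordered <- to_reorder[match(clustering, from)]

# Merging: several 'from' clusters map to the same 'to' cluster
to_merge <- c(1, 1, 2, 2)
merged <- to_merge[match(clustering, from)]
```

Here match finds the position of each membership in ‘from’, and that position indexes the new number in ‘to’.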
an object of class ‘clustering’
As demonstrated in the examples below, reordclust can also combine existing clusters into fewer clusters while reordering if more than one cluster in the ‘from’ vector maps to the same cluster in the ‘to’ vector.
David W. Roberts [email protected]
http://ecology.msu.montana.edu/labdsv/R
mergeclust
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray')
opt.10 <- optpart(10,dis.bc)
new <- reordclust(opt.10,1:10,c(1,3,5,7,9,2,4,6,8,10)) # simply re-ordered
new2 <- reordclust(opt.10,1:10,c(1,1,2,2,3,3,4,4,5,5)) # merge 1 and 2 into 1, 3 and 4 into 2, etc.
The coniferous forests of the Shoshone National Forest range from lower elevation woodlands dominated by Pinus flexilis, through forests of Pseudotsuga menziesii, Pinus contorta, Picea engelmannii, Abies lasiocarpa and Pinus albicaulis with increasing elevation (Steele et al. 1983). One hundred and fifty sample plots were chosen at random from a larger set for this data set; the larger set was stratified by elevation, exposure, surficial geology, and geographic distribution.
data(shoshsite)
A data.frame with sample plots as rows, and site variables as columns. Variables comprise:
elevation | elevation above sea level in meters |
aspect | compass orientation of the site in degrees |
slope | slope steepness in percent |
av | aspect value: (cosd(aspect-30)+1)/2 |
swb | site water balance: a tipping bucket model of water-year soil water |
sprppt | spring precipitation in cm |
sumppt | summer precipitation in cm |
autppt | autumn precipitation in cm |
winppt | winter precipitation in cm |
sprtmp | spring mean temperature degrees C |
sumtmp | summer mean temperature degrees C |
auttmp | autumn mean temperature degrees C |
wintmp | winter mean temperature degrees C |
sprpet | spring potential evapotranspiration in cm |
sumpet | summer potential evapotranspiration in cm |
autpet | autumn potential evapotranspiration in cm |
winpet | winter potential evapotranspiration in cm |
sprrad | spring direct and diffuse solar radiation (correcting for topographic shading) |
sumrad | summer direct and diffuse solar radiation (correcting for topographic shading) |
autrad | autumn direct and diffuse solar radiation (correcting for topographic shading) |
winrad | winter direct and diffuse solar radiation (correcting for topographic shading) |
ffd | frost free days |
dday | degree days heat sum |
tcol | mean monthly temperature of the coldest month |
The data were derived from a multi-year effort by numerous scientists and field technicians. The project was directed by Kent Houston, Soil Scientist and Ecologist, Shoshone National Forest. The site data were calculated primarily by Dr. Niklaus Zimmermann, WSL, Birmensdorf, Switzerland http://www.wsl.ch/staff/niklaus.zimmermann/biophys.html
Roberts, D.W. 2008. Statistical Analysis of Multidimensional Fuzzy Set Ordinations. Ecology 89:1246-1260
Percent cover (in codes) for 368 vascular plants on 150 sample plots
stratified across the Shoshone National Forest, Wyoming, USA. Plots were chosen at
random from a larger dataset.
data(shoshveg)
A data.frame with sample plots as rows and species as columns. Sample plots match the ‘shoshsite’ dataset.
The cover of all vascular plant species was recorded according to the following scale: present but < 1% = 0.1, 1-5% = 0.5, 5-15% = 1.0, 15-25% = 2.0, 25-35% = 3.0, 35-45% = 4.0, 45-55% = 5.0, 55-65% = 6.0, 65-75% = 7.0, 75-85% = 8.0. No species exhibited greater than 80% cover in the data set.
The data were derived from a multi-year effort by numerous scientists and field technicians. The project was directed by Kent Houston, Soil Scientist and Ecologist, Shoshone National Forest
Roberts, D.W. 2008. Statistical Analysis of Multidimensional Fuzzy Set Ordinations. Ecology 89:1246-1260.
Extracts components from a partana, clustering, or stride object, and passes the values to the silhouette function to produce an object of class silhouette.
## S3 method for class 'partana'
silhouette(x, dist, ...)
## S3 method for class 'clustering'
silhouette(x, dist, ...)
## S3 method for class 'stride'
silhouette(x, dist, ...)
testsil(sil)
x |
an object of class ‘partana’, ‘clustering’, or ‘stride’ |
dist |
an object of class dist |
... |
miscellaneous arguments to pass to function silhouette |
sil |
an object of class ‘silhouette’ |
For ‘partana’ and ‘clustering’ objects the advantage over calling silhouette directly is that the row.names of the resulting object are added to the results, as opposed to consecutive integers.
For objects of class ‘stride’ the function extracts the component ‘clustering’ for each level of a stride object, and calls function silhouette in library ‘cluster’, returning the mean silhouette width for each case.
testsil identifies ‘misfits’ in a partition, defined as plots with a negative silhouette width, and prints them out in a sorted list.
An object of class silhouette
This is a simple conversion routine to allow plotting a silhouette plot for an object of class partana.
David W. Roberts [email protected]
http://ecology.msu.montana.edu/labdsv/R
data(shoshveg) # produces a data frame of vegetation data, samples as rows, attributes as columns
dis.bc <- dsvdis(shoshveg,'bray/curtis') # produces a Bray/Curtis dissimilarity matrix
opt.5 <- optpart(5,dis.bc) # produces an optimal partitioning into 5 clusters
silhouette(opt.5,dis.bc) # calculates the silhouette values
## Not run: plot(silhouette(opt.5,dis.bc)) # produce silhouette plot on current device
Allows a simple classification of objects by slicing a dendrogram of a hierarchical cluster analysis graphically with a mouse, or by simply giving a number.
slice(clust, k=NULL)
clust |
an object of class ‘hclust’ produced by |
k |
a desired number of clusters. If null, the function waits on a mouse click |
an object of class ‘clustering’, a list with a vector of cluster memberships
This function is a simple wrapper for cutree that allows users to click their mouse at the height they desire to slice the dendrogram, and to establish the result with a class of ‘clustering’ for ease of use in other functions. If you want to use the mouse, the dendrogram must have been previously plotted in the current graphic device.
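The two ways of slicing correspond to the two ways of calling cutree in base R: by number of clusters or by height. The sketch below uses toy data and assumes that a mouse click simply supplies a height; it does not reproduce the interactive part of slice.

```r
# Toy data: two well separated groups on a line (hypothetical values)
x <- c(0, 0.5, 1, 5, 5.5, 6)
hc <- hclust(dist(x), "average")

# Slice by giving a number of clusters
by_k <- cutree(hc, k = 2)

# Slice by giving a height, as a click on the plotted dendrogram would
by_h <- cutree(hc, h = 2)

# Wrap the result in a list, mimicking the structure described above
# (the class is set by hand here purely for illustration)
result <- structure(list(clustering = by_k), class = "clustering")
```

Both calls return the same two clusters here because the two groups merge at a height of 5, well above the slicing height of 2.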
David W. Roberts [email protected] http://ecology.msu.montana.edu/labdsv/R
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray/curtis')
demo <- hclust(dis.bc,'ave')
ave.5 <- slice(demo,5)
## Not run: plot(demo)
## Not run: ave.clust <- slice(demo)
stride proceeds along a specified sequence creating clusterings or partitions of a dissimilarity matrix for each value of the sequence.
stride(seq,arg2,type='pam',numrep=10,maxitr=100)
## S3 method for class 'stride'
plot(x, dist, col2=4, ...)
seq |
a sequence, in either a:b or seq(a,b,c) form |
arg2 |
an object of class ‘dist’ from
|
type |
if arg2 is an object of class ‘dist’, type specifies the algorithm to produce clusters, and can be either ‘pam’ or ‘optpart’. |
numrep |
if arg2 is an object of class ‘dist’ and type =
‘optpart’, numrep specifies the number of replicates to run in function
|
maxitr |
if arg2 is an object of class ‘dist’ and type = ‘optpart’, maxitr specifies the maximum number of iterations per replicate in function
|
x |
an object of class ‘stride’ |
dist |
an object of class ‘dist’ |
col2 |
the color code for the second line in the graph |
... |
ancillary arguments to the plot function |
The specific action of function stride depends on the class of the second argument. If arg2 is of class ‘dist’, then clusters are generated by a fixed cluster algorithm. In this case, if type is ‘pam’, the function pam is called to produce the clusters. If type is ‘optpart’ the function optpart is called to produce the clusters.
If arg2 is of class ‘hclust’, then the hclust object is successively ‘sliced’ at levels specified by the sequence. The object of class ‘hclust’ can result from the function hclust using any of the methods provided, or by casting an object of class ‘partition’ to class ‘hclust’ with the as.hclust function.
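Successive slicing of a dendrogram can be sketched in base R, because cutree accepts a vector of cluster numbers and returns one column of memberships per value. This is only an illustration of the idea on toy data; the names below are hypothetical and the result is a plain data.frame, not an object of class ‘stride’.

```r
# Toy data on a line (hypothetical values)
x <- c(0, 0.5, 1, 5, 5.5, 6, 12, 12.5)
hc <- hclust(dist(x), "average")

# Slice the same dendrogram at every value of the sequence
seq_k <- 2:4
cuts <- as.data.frame(cutree(hc, k = seq_k))

# Items are rows; there is one column of cluster IDs per value of the sequence
n_clusters <- sapply(cuts, function(v) length(unique(v)))
```

Each column of cuts can then be evaluated separately, for example by its within/among similarity ratio or its mean silhouette width.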
The default plot method for a stride plots the partana ratios (see partana) of each partition of the sequence on the left Y axis, and the silhouette widths of the same partitions on the right Y axis.
an object of class ‘stride’, which is a list with components:
clustering |
a data.frame with items as rows, and cluster IDs as columns, with one column for each value of the sequence |
seq |
a copy of the sequence employed |
David W. Roberts [email protected]
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray')
results <- stride(2:5,dis.bc)
Table deviance is a method to assess the quality of classifications by calculating the clarity of the classification with respect to the original data, as opposed to a dissimilarity or distance matrix representation
## Default S3 method:
tabdev(x,clustering,nitr=999,...)
## S3 method for class 'stride'
tabdev(x,taxa,...)
## S3 method for class 'tabdev'
summary(object,p=0.05,...)
x |
a matrix or data.frame of multivariate observations, with objects as rows, and attributes as columns |
clustering |
a vector of integer cluster assignments, or an object of class ‘clustering’ or ‘partana’ |
nitr |
number of iterations to perform in calculating the probability of obtaining as effective a classification as observed |
taxa |
a data.frame with samples as rows and species as columns |
object |
an object of class ‘tabdev’ |
p |
the maximum probability threshold to list species in the summary table |
... |
ancillary arguments to maintain compatibility with generic summary function |
Tabdev calculates the concentration of values within clusters. For each column, tabdev calculates the sum of values within each class, and divides each within-class sum by the sum of that column to get fractional sums by class. These values are used to calculate the deviance of each column (attribute). Attributes that are widely dispersed among classes exhibit high deviance; attributes that are concentrated within a single class contribute zero deviance. An effective classification should exhibit low deviance.
Tabdev then permutes the values within columns and calculates the probability of observing a deviance as low as the observed one as $$\frac{m+1}{\mathit{nitr}+1}$$ where $m$ is the number of permutations with deviance as low as or lower than the observed value, and nitr is the number of permutations performed.
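The permutation test above can be sketched in base R for a single attribute. The deviance form used here is an assumption chosen only because it matches the description (zero when an attribute is confined to one class, larger when it is spread across classes); it is not taken from the package source, and the function names col_deviance and perm_prob are hypothetical.

```r
# Assumed deviance form (hypothetical): -2 * sum(s_c * log(s_c / S)), where s_c
# is the within-class sum of the attribute and S is its column total.
# Classes with s_c == 0 contribute nothing.
col_deviance <- function(values, clustering) {
  class_sums <- tapply(values, clustering, sum)
  class_sums <- class_sums[class_sums > 0]
  -2 * sum(class_sums * log(class_sums / sum(class_sums)))
}

# Permute the values of one column and apply (m + 1) / (nitr + 1)
perm_prob <- function(values, clustering, nitr = 999) {
  observed <- col_deviance(values, clustering)
  permuted <- replicate(nitr, col_deviance(sample(values), clustering))
  m <- sum(permuted <= observed)   # permutations at least as low as observed
  (m + 1) / (nitr + 1)
}

set.seed(42)
clustering   <- rep(1:3, each = 10)
concentrated <- c(rpois(10, 5) + 1, rep(0, 20))  # present only in cluster 1
dispersed    <- rpois(30, 5) + 1                 # spread over all clusters

col_deviance(concentrated, clustering)           # zero by construction
perm_prob(concentrated, clustering)
perm_prob(dispersed, clustering)
```

Adding one to both numerator and denominator counts the observed arrangement as one of the permutations, so the smallest attainable probability is 1/(nitr + 1) rather than zero.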
a list with components:
spcdev |
a data.frame with species, deviance, and probability as columns |
totdev |
the total deviance of the entire table |
David W. Roberts [email protected] http://ecology.msu.montana.edu/labdsv/R
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray/curtis')
opt.5 <- optpart(5,dis.bc)
tabdev(shoshveg,opt.5)
testpart compares the mean similarity of each sample to the cluster to which it is assigned with its mean similarity to all other clusters, and lists those samples that are more similar to another cluster than to the one to which they are assigned.
testpart(part,ord=TRUE)
part |
an object of class ‘partana’ from |
ord |
a switch to control whether the output is ordered |
Examines each sample plot, comparing the mean similarity of that sample to all other samples in the cluster to which it is assigned with its mean similarity to each of the other clusters. Samples that are more similar to another cluster than to the one to which they are assigned are listed in a table that gives their current cluster assignment, the cluster to which they are more similar, and the mean similarities of that sample to all clusters.
If ‘ord=TRUE’ then the output is ordered to reflect target clusters.
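The comparison described above can be sketched in base R on a toy similarity matrix. This is an illustration of the idea only, not the package's implementation; the helper mean_sim_by_cluster and the toy data are hypothetical.

```r
# Mean similarity of every sample to every cluster, excluding the sample
# itself from its own cluster. A singleton cluster would yield NaN for its
# only member.
mean_sim_by_cluster <- function(sim, clustering) {
  clusters <- sort(unique(clustering))
  out <- sapply(clusters, function(cl) {
    sapply(seq_len(nrow(sim)), function(i) {
      members <- setdiff(which(clustering == cl), i)
      mean(sim[i, members])
    })
  })
  colnames(out) <- clusters
  out
}

coords <- c(0, 1, 2, 10, 11, 12)     # six samples on a line
d <- as.matrix(dist(coords))
sim <- 1 - d / max(d)                # convert distance to similarity
clustering <- c(1, 1, 2, 2, 2, 2)    # sample 3 deliberately placed in cluster 2

means <- mean_sim_by_cluster(sim, clustering)
best <- sort(unique(clustering))[apply(means, 1, which.max)]
misfits <- which(best != clustering) # samples closer to another cluster

cbind(sample = misfits, assigned = clustering[misfits], closer_to = best[misfits])
means[misfits, , drop = FALSE]       # mean similarity of misfits to all clusters
```

In this toy case only the third sample is flagged, because it sits next to the samples of cluster 1 but was assigned to cluster 2.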
a table of values
David W. Roberts [email protected]
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray/curtis')
opt.5 <- optpart(5,dis.bc)
testpart(opt.5)
Identifies samples that typify clusters in a partition based on dissimilarity.
typal(clustering,dist,k=1)
clustering |
a vector of integers or an object of class ‘clustering’, ‘partition’, or ‘partana’ |
dist |
|
k |
number of typal samples per cluster to identify |
typal calculates two versions of typal samples, one based on silhouette analysis (see silhouette) and one based on partana analysis (see partana). With respect to silhouette analysis, the function returns the k samples with the largest positive silhouette widths for each cluster. With respect to the partana analysis, the function returns the k samples with the highest mean similarity to the cluster.
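The silhouette perspective can be sketched with the silhouette function from the cluster package. This is a minimal illustration under the assumption that the typal members of a cluster are simply its k members with the largest positive silhouette widths; typal_by_silhouette is a hypothetical helper, not the package's code.

```r
library(cluster)  # provides silhouette()

typal_by_silhouette <- function(clustering, dist, k = 1) {
  sil <- silhouette(clustering, dist)
  width <- sil[, "sil_width"]
  # For each cluster, keep members with positive width and take the top k
  lapply(split(seq_along(clustering), clustering), function(members) {
    members <- members[width[members] > 0]
    head(members[order(width[members], decreasing = TRUE)], k)
  })
}

coords <- c(0, 1, 2, 10, 11, 12)     # two well-separated groups on a line
clustering <- rep(1:2, each = 3)

typal_by_silhouette(clustering, dist(coords), k = 1)
```

In this toy case the middle sample of each group is returned, since it is closest on average to the other members of its own cluster.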
Returns a list with two data.frames. The first, partana, gives the clusters as rows and typal samples as columns from the perspective of the partana ratio. The second, silhouette, also gives the clusters as rows and typal samples as columns but from the perspective of silhouette widths.
David W. Roberts [email protected]
data(shoshveg)
dis.bc <- dsvdis(shoshveg,'bray')
opt.5 <- bestopt(dis.bc,5,20)
typal(opt.5,dis.bc,3)