Title: | Clustering Graphics |
---|---|
Description: | Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. |
Authors: | Catherine Hurley |
Maintainer: | Catherine Hurley <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.3.2 |
Built: | 2024-11-06 06:16:29 UTC |
Source: | CRAN |
Computes clustering coefficients using package cluster, where x and y give the object coordinates.
ac(x, y, ...)
sil(x, y, groups, ...)
x | is a numeric vector. |
y | is a numeric vector. |
groups | is a vector of group memberships, used by sil. |
... | are passed to dist. |
ac - Computes the clustering coefficient from agnes {cluster}.
sil - Computes the silhouette coefficient from package cluster.
The clustering coefficient is returned.
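As a rough illustration of what the two coefficients measure, the sketch below computes comparable quantities directly with package cluster. It assumes that ac corresponds to the ac component of an agnes fit on the (x, y) coordinates and that sil averages the individual silhouette widths. The helper names ac_sketch and sil_sketch are hypothetical, and this is not the package's own code.

```r
library(cluster)  # provides agnes() and silhouette()

# Hypothetical helper: agglomerative coefficient of the 2-d point cloud.
ac_sketch <- function(x, y, ...) {
  agnes(dist(cbind(x, y), ...))$ac
}

# Hypothetical helper: average silhouette width for a given grouping.
# Assumes at least two groups are present.
sil_sketch <- function(x, y, groups, ...) {
  codes <- as.integer(factor(groups))          # integer cluster codes
  s <- silhouette(codes, dist(cbind(x, y), ...))
  mean(s[, "sil_width"])
}

set.seed(1)
x <- runif(20)
y <- runif(20)
g <- rep(c("a", "b"), 10)
ac_sketch(x, y)      # lies between 0 and 1
sil_sketch(x, y, g)  # lies between -1 and 1
```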
Catherine B. Hurley
Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis . Wiley, New York.
agnes, silhouette, dist.
x <- runif(20)
y <- runif(20)
g <- rep(c("a","b"),10)
ac(x,y)
sil(x,y,g)
Data from "Multivariate Statistics: A Practical Approach", by Bernhard Flury and Hans Riedwyl, Chapman and Hall, 1988, Tables 1.1 and 1.2, pp. 5-8. Six measurements made on 100 genuine Swiss banknotes and 100 counterfeit ones.
data(bank)
This data frame contains the following columns:
0 = genuine, 1 = counterfeit
Length of bill, mm
Width of left edge, mm
Width of right edge, mm
Bottom margin width, mm
Top margin width, mm
Length of image diagonal, mm
Flury, B. and Riedwyl, H. (1988), Multivariate Statistics A Practical Approach, London: Chapman and Hall.
This dataset contains 21 body dimension measurements as well as age, weight, height, and gender on 507 individuals. The 247 men and 260 women were primarily individuals in their twenties and thirties, with a scattering of older men and women, all exercising several hours a week.
Measurements were initially taken by Grete Heinz and Louis J. Peterson - at San Jose State University and at the U.S. Naval Postgraduate School in Monterey, California. Later, measurements were taken at dozens of California health and fitness clubs by technicians under the supervision of one of these authors.
data(body)
This data frame contains the following columns:
Biacromial diameter (cm)
Biiliac diameter, or "pelvic breadth" (cm)
Bitrochanteric diameter (cm)
Chest depth between spine and sternum at nipple level, mid-expiration (cm)
Chest diameter at nipple level, mid-expiration (cm)
Elbow diameter, sum of two elbows (cm)
Wrist diameter, sum of two wrists (cm)
Knee diameter, sum of two knees (cm)
Ankle diameter, sum of two ankles (cm)
Shoulder girth over deltoid muscles (cm)
Chest girth, nipple line in males and just above breast tissue in females, mid-expiration (cm)
Waist girth, narrowest part of torso below the rib cage, average of contracted and relaxed position (cm)
Navel (or "Abdominal") girth at umbilicus and iliac crest, iliac crest as a landmark (cm)
Hip girth at level of bitrochanteric diameter (cm)
Thigh girth below gluteal fold, average of right and left girths (cm)
Bicep girth, flexed, average of right and left girths (cm)
Forearm girth, extended, palm up, average of right and left girths (cm)
Knee girth over patella, slightly flexed position, average of right and left girths (cm)
Calf maximum girth, average of right and left girths (cm)
Ankle minimum girth, average of right and left girths (cm)
Wrist minimum girth, average of right and left girths (cm)
in years
in kg
in cm
1 - male, 0 - female
Heinz, G., Peterson, L.J., Johnson, R.W. and Kerk, C.J. (2003), “Exploring Relationships in Body Dimensions”, Journal of Statistics Education , 11.
The data file is taken from http://jse.amstat.org/datasets/body.dat.txt. This information file is based on http://jse.amstat.org/datasets/body.txt.
Given an n x p matrix m and a function f, returns the p x p matrix obtained by applying f to all pairs of columns of m.
colpairs(m, f, diag = 0, na.omit = FALSE, ...)
m | a matrix |
f | a function of two vectors, which returns a single result. |
diag | if supplied, this value is placed on the diagonal of the result. |
na.omit | If TRUE, rows with missing values are omitted for each pair of columns. |
... | arguments are passed to f. |
A matrix obtained by applying f to all pairs of columns of m.
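The idea of applying f to every pair of columns can be sketched in base R as follows. This is only an illustration: colpairs_sketch is a hypothetical name, it assumes f is symmetric in its two arguments, and it ignores the na.omit option.

```r
# Hypothetical re-implementation, for illustration only.
colpairs_sketch <- function(m, f, diag = 0, ...) {
  p <- ncol(m)
  # Start from a p x p matrix filled with the diagonal value.
  out <- matrix(diag, nrow = p, ncol = p,
                dimnames = list(colnames(m), colnames(m)))
  for (i in seq_len(p - 1)) {
    for (j in (i + 1):p) {
      # Assumes f(a, b) equals f(b, a), so one call fills both triangles.
      out[i, j] <- out[j, i] <- f(m[, i], m[, j], ...)
    }
  }
  out
}

m <- cbind(a = c(1, 2, 3, 4), b = c(2, 4, 6, 8), c = c(4, 3, 2, 1))
r <- colpairs_sketch(m, cor, diag = 1)
all.equal(r, cor(m))  # the pairwise result matches the correlation matrix
```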
Catherine B. Hurley
gave, partition.crit, order.single, order.endlink
data(state)
state.m <- colpairs(state.x77, function(x,y) cor.test(x,y,"two.sided","kendall")$estimate, diag=1)
state.col <- dmat.color(state.m)
# This is equivalent to state.m <- cor(state.x77,method="kendall")
layout(matrix(1:2,nrow=1,ncol=2))
cparcoord(state.x77, panel.color= state.col)
# Get rid of the panels with lots of line crossings (yellow) by reordering
cparcoord(state.x77, order.endlink(state.m), state.col)
layout(matrix(1,1))
# m is a homogeneity measure of each pairwise variable plot
m <- -colpairs(scale(state.x77), gave)
o <- order.single(m)
pcols = dmat.color(m)
# Color panels by level of m and reorder variables so that
# pairs with high m are near the diagonal.
cpairs(state.x77,order=o, panel.colors=pcols)
# In this case panels showing either of Area or Population
# exhibit the most clumpiness because these variables
# are skewed.
This function draws a scatterplot matrix of data. Variables may be reordered and panels colored in the display.
cpairs(data, order = NULL, panel.colors = NULL, border.color = "grey70", show.points = TRUE, ...)
data | a numeric matrix |
order | the order of variables. Default is the order in data. |
panel.colors | a matrix of panel colors. If supplied, dimensions should match those of the pairs plot. Diagonal entries are ignored. |
border.color | used for panel border. |
show.points | If FALSE, no points are drawn. |
... | additional graphical parameters. |
Catherine B. Hurley
Hurley, Catherine B. “Clustering Visualisations of Multidimensional Data”, Journal of Computational and Graphical Statistics, vol. 13, (4), pp 788-806, 2004.
pairs, cparcoord, dmat.color, colpairs, order.single.
data(USJudgeRatings)
judge.cor <- cor(USJudgeRatings)
judge.color <- dmat.color(judge.cor)
# Colors variables by their correlation.
cpairs(USJudgeRatings,panel.colors=judge.color,pch=".",gap=.5)
judge.o <- order.single(judge.cor)
# Reorder variables so that those with highest correlation
# are close to the diagonal.
cpairs(USJudgeRatings,judge.o,judge.color,pch=".",gap=.5)
# Specify your own color scheme
judge.color <- dmat.color(judge.cor, breaks=c(-1,0,.5,.9,1), colors = cm.colors(4))
data(bank)
# m is a homogeneity measure of each pairwise variable plot
m <- -colpairs(scale(bank[,-1]), partition.crit,gfun=gave,groups=bank[,1])
# Color panels by level of m and reorder variables so that
# pairs with high m are near the diagonal. Panels shown
# in pink have the highest amount of group homogeneity, as measured by
# gave.
cpairs(bank[,-1],order=order.single(m), panel.colors=dmat.color(m), gap=.3,col=c("purple","black")[bank[,"Status"]+1], pch=c(5,3)[bank[,"Status"]+1])
This function draws a parallel coordinate plot of data. Variables may be reordered and panels colored in the display. It is a modified version of parcoord {MASS}.
cparcoord(data, order = NULL, panel.colors = NULL, col = 1, lty = 1, horizontal = FALSE, mar = NULL, ...)
data | a numeric matrix |
order | the order of variables. Default is the order in data. |
panel.colors | either a vector or a matrix of panel colors. If a vector is supplied, the ith color is used for the ith panel. If a matrix, dimensions should match those of the variables. Diagonal entries are ignored. |
col | a vector of colours, recycled as necessary for each observation. |
lty | a vector of line types, recycled as necessary for each observation. |
horizontal | If TRUE, orientation is horizontal. |
mar | margin parameters, passed to par. |
... | graphics parameters which are passed to matplot. |
If panel.colors is a matrix and order is supplied, panel.colors is reordered.
Catherine B. Hurley
Hurley, Catherine B. “Clustering Visualisations of Multidimensional Data”, Journal of Computational and Graphical Statistics, vol. 13, (4), pp 788-806, 2004.
cpairs, parcoord, dmat.color, colpairs, order.endlink.
data(state)
state.m <- colpairs(state.x77, function(x,y) cor.test(x,y,"two.sided","kendall")$estimate, diag=1)
# OR, Works only in R1.8, state.m <-cor(state.x77,method="kendall")
state.col <- dmat.color(state.m)
cparcoord(state.x77, panel.color= state.col)
# Get rid of the panels with lots of line crossings (yellow) by reordering:
cparcoord(state.x77, order.endlink(state.m), state.col)
# To get rid of the panels with lots of long line segments:
# use a different panel merit measure- pclen:
mins <- apply(state.x77,2,min)
ranges <- apply(state.x77,2,max) - mins
state.m <- -colpairs(scale(state.x77,mins,ranges), pclen)
cparcoord(state.x77, order.endlink(state.m), dmat.color(state.m))
Computes measures of cluster heterogeneity of 2-d data, where x and y give the object coordinates.
diameter(x, y, ...)
star(x, y, ...)
km2(x,y)
gtot(x,y, ...)
gave(x,y, ...)
x | is a numeric vector. |
y | is a numeric vector. |
... | are passed to dist. |
diameter computes the cluster diameter: the maximum distance between objects.
star computes the cluster star distance: the smallest total distance from one object to all the others.
km2 computes the kmeans distance.
gtot computes the sum of all inter-object distances.
gave computes the per-object average of all inter-object distances.
The cluster measure is returned.
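A small worked example may help fix the definitions. The sketch below computes three of the measures by hand with base R for a 3-4-5 right triangle, reading the star distance as the smallest row sum of the distance matrix (an interpretation of the description above, not code taken from the package). The variable names are illustrative.

```r
# Three points forming a 3-4-5 right triangle.
px <- c(0, 3, 0)
py <- c(0, 0, 4)
d <- dist(cbind(px, py))   # pairwise distances: 3, 4 and 5

diameter_val <- max(d)                      # largest inter-object distance
gtot_val <- sum(d)                          # sum of all inter-object distances
star_val <- min(rowSums(as.matrix(d)))      # smallest total distance from one object

c(diameter = diameter_val, gtot = gtot_val, star = star_val)
# diameter is 5, gtot is 12, star is 7 (from the right-angle vertex)
```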
Catherine B. Hurley
See Gordon, A. D. (1999). Classification. Second Edition. London: Chapman and Hall / CRC
colpairs, cpairs, order.single
x <- runif(20)
y <- runif(20)
diameter(x,y)
Accepts a dissimilarity matrix or dist m, and returns a matrix of colors. Values in m are cut into categories using breaks (ranked distances if byrank is TRUE) and categories are assigned the values in colors.
dmat.color(m, colors = default.dmat.color, byrank = NULL, breaks = length(colors))
m | a dissimilarity matrix or the result of dist. |
colors | a vector of colors. The default is default.dmat.color. |
byrank | boolean. If TRUE, values in m are ranked before they are categorized. |
breaks | the number of break points. |
breaks are passed to the function cut.
If byrank is TRUE, values in m are ranked before they are categorized.
If byrank is TRUE and breaks is an integer, then there are breaks equal-sized categories.
Returns a matrix of colors. The matrix is symmetric, with NAs on the diagonal.
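The difference between cutting raw values and cutting their ranks can be seen with base cut alone. The following is a minimal sketch of that idea, not dmat.color's own code; the values and the three-color palette are made up for illustration.

```r
vals <- c(0.1, 0.2, 0.9, 0.95, 0.5, 0.3)
palette3 <- c("yellow", "orange", "red")   # hypothetical palette

# Equal-length intervals on the raw values: group sizes can be uneven.
by_value <- cut(vals, breaks = 3, labels = FALSE)

# Equal-length intervals on the ranks: each category gets the same count.
by_rank <- cut(rank(vals), breaks = 3, labels = FALSE)

table(by_value)   # counts 3, 1, 2
table(by_rank)    # counts 2, 2, 2

# Map the category codes to colors.
cols <- palette3[by_rank]
```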
Catherine B. Hurley
data(longley)
longley.cor <- cor(longley)
# A matrix with equal (or nearly equal) number of entries of each color.
longley.color <- dmat.color(longley.cor)
# Plot the colors
plotcolors(longley.color,dlabels=rownames(longley.color))
# Try different color schemes
# A matrix where each color represents an equal-length interval.
longley.color <- dmat.color(longley.cor, byrank=FALSE)
# Specify colors and breaks
longley.color <- dmat.color(longley.cor, breaks=c(-1,0,.5,.8,1), cm.colors(4))
# Could also reorder variables prior to plotting:
longley.o <- order.single(longley.cor)
longley.color <- longley.color[longley.o,longley.o]
# The colors can be used in a scatterplot matrix or parallel
# coordinate display:
cpairs(longley, panel.color= longley.color)
cparcoord(longley, panel.color= longley.color)
Reorders objects so that similar (or high-merit) object pairs are adjacent. The clusters argument specifies (possibly ordered) groups, and objects within a group are kept together.
order.clusters(merit,clusters,within.order = order.single, between.order= order.single,...)
merit | is either a symmetric matrix of merit or similarity score, or a dist. |
clusters | specifies a partial grouping. It should either be a list whose ith element contains the indices of the objects in the ith cluster, or a vector of integers whose ith element gives the cluster membership of the ith object. Either representation may be used to specify grouping; the first is preferable for specifying adjacencies. |
within.order | is a function used to order the objects within each cluster. |
between.order | is a function used to order the clusters. |
... | additional arguments. |
within.order may be NULL, in which case objects within a cluster are assumed to be in order. Otherwise, within.order should be one of the ordering functions order.single, order.endlink or order.hclust.
between.order may be NULL, in which case cluster order is preserved. Otherwise, between.order should be one of the ordering functions that uses a partial ordering, order.single or order.endlink.
A permutation of the objects represented by merit is returned.
Catherine B. Hurley
order.single, order.endlink, order.hclust.
data(state)
state.d <- dist(state.x77)
# Order the states, keeping states in a division together.
state.o <- order.clusters(-state.d, as.numeric(state.division))
cmat <- dmat.color(as.matrix(state.d), rev(cm.colors(5)))
op <- par(mar=c(1,6,1,1))
rlabels <- state.name[state.o]
plotcolors(cmat[state.o,state.o], rlabels=rlabels)
par(op)
# Alternatively, use kmeans to place the states into 6 clusters
state.km <- kmeans(state.d,6)$cluster
# An ordering obtained from the kmeans clustering...
state.o <- unlist(memship2clus(state.km))
layout(matrix(1:2,nrow=1,ncol=2),widths=c(0.1,1))
op <- par(mar=c(1,1,1,.2))
state.colors <- cbind(state.km,state.km)
plotcolors(state.colors[state.o,])
par(mar=c(1,6,1,1))
rlabels <- state.name[state.o]
plotcolors(cmat[state.o,state.o], rlabels=rlabels)
par(op)
layout(matrix(1,1))
# In the ordering above, the ordering of clusters and the
# ordering of objects within the clusters is arbitrary.
# order.clusters gives an improved order but preserves the kmeans clusters.
state.o <- order.clusters(-state.d, state.km)
# and replot
layout(matrix(1:2,nrow=1,ncol=2),widths=c(0.1,1))
op <- par(mar=c(1,1,1,.2))
state.colors <- cbind(state.km,state.km)
plotcolors(state.colors[state.o,])
par(mar=c(1,6,1,1))
rlabels <- state.name[state.o]
plotcolors(cmat[state.o,state.o], rlabels=rlabels)
par(op)
layout(matrix(1,1))
Reorders objects so that similar (or high-merit) object pairs are adjacent. A permutation vector is returned.
order.single(merit,clusters=NULL)
order.endlink(merit,clusters=NULL)
order.hclust(merit, reorder=TRUE,...)
merit | is either a symmetric matrix of merit or similarity score, or a dist. |
clusters | if non-null, specifies a partial ordering. It should be a list whose ith element contains the indices of the objects in the ith ordered cluster. |
reorder | if TRUE, reorders the default ordering from hclust. |
... | arguments are passed to hclust. |
order.single performs a variation on single-link cluster analysis, devised by Gruvaeus and Wainer (1972). When two ordered clusters are merged, the new cluster is formed by placing the most similar endpoints of the joining clusters adjacent to each other. When applied to variables, the resulting order is useful for scatterplot matrices.
order.endlink is another variation on single-link cluster analysis, where the similarity between two ordered clusters is defined as the minimum distance between their endpoints. When two ordered clusters are merged, the new cluster is formed by placing the most similar endpoints of the joining clusters adjacent to each other. When applied to variables, the resulting order is useful for parallel coordinate displays.
order.hclust returns the order of objects from hclust if reorder is FALSE. Otherwise, it reorders the objects using reorder.hclust so that when two ordered clusters are merged, the new cluster is formed by placing the most similar endpoints of the joining clusters adjacent to each other.
order.hclust(m,method="single") is equivalent to order.single when clusters is NULL.
The default method of hclust is "complete"; see hclust for other possibilities.
A permutation of the objects represented by merit is returned.
Catherine B. Hurley
Hurley, Catherine B. “Clustering Visualisations of Multidimensional Data”, Journal of Computational and Graphical Statistics, vol. 13, (4), pp 788-806, 2004.
Gruvaeus, G. and Wainer, H. (1972), “Two Additions to Hierarchical Cluster Analysis”, British Journal of Mathematical and Statistical Psychology, 25, 200-206.
cpairs, cparcoord, plotcolors, reorder.hclust, order.clusters, hclust.
data(state)
state.cor <- cor(state.x77)
order.single(state.cor)
order.endlink(state.cor)
order.hclust(state.cor,method="average")
# Use for plotting...
cpairs(state.x77, panel.colors=dmat.color(state.cor), order.single(state.cor),pch=".",gap=.4)
cparcoord(state.x77, order.endlink(state.cor),panel.colors=dmat.color(state.cor))
# Order the states instead of the variables...
state.d <- dist(state.x77)
state.o <- order.single(-state.d)
op <- par(mar=c(1,6,1,1))
cmat <- dmat.color(as.matrix(state.d), rev(cm.colors(5)))
plotcolors(cmat[state.o,state.o], rlabels=state.name[state.o])
par(op)
This is the Ozone data discussed in Breiman and Friedman (JASA, 1985, p. 580). These data are for 330 days in 1976. All measurements are in the area of Upland, CA, east of Los Angeles.
data(ozone)
This data frame contains the following columns:
Ozone conc., ppm, at Sandburg AFB.
Temperature F. (max?).
Inversion base height, feet
Daggett pressure gradient (mm Hg)
Visibility (miles)
Vandenberg 500 millibar height (m)
Humidity, percent
Inversion base temperature, degrees F.
Wind speed, mph
Breiman, L and Friedman, J. (1985), “Estimating Optimal Transformations for Multiple Regression and Correlation”, Journal of the American Statistical Association, 80, 580-598.
Applies the function gfun to each group of x and y values and combines the results using the function cfun.
partition.crit(x, y, groups, gfun = gave, cfun = sum, ...)
x | is a numeric vector. |
y | is a numeric vector. |
groups | is a vector of group memberships. |
gfun | is applied to the x and y values of each group. |
cfun | combines the values returned by gfun. |
... | arguments are passed to gfun. |
The function gfun is applied to each group of x and y values. The function cfun is applied to the vector or matrix of gfun results.
The result of applying cfun.
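The split-apply-combine pattern described above can be sketched in base R as follows. The names partition_sketch and diam are hypothetical stand-ins used for illustration, and the code is not taken from the package.

```r
# Hypothetical helper mirroring the description: apply gfun within each
# group, then combine the per-group results with cfun.
partition_sketch <- function(x, y, groups, gfun, cfun = sum, ...) {
  idx <- split(seq_along(x), groups)   # row indices for each group
  per_group <- sapply(idx, function(i) gfun(x[i], y[i], ...))
  cfun(per_group)
}

# Stand-in group measure: the largest distance within the group.
diam <- function(x, y) max(dist(cbind(x, y)))

x <- c(0, 3, 0, 1)
y <- c(0, 4, 0, 0)
g <- c("a", "a", "b", "b")
partition_sketch(x, y, g, diam)             # 5 (group a) + 1 (group b) = 6
partition_sketch(x, y, g, diam, cfun = max) # 5
```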
Catherine B. Hurley
See Gordon, A. D. (1999). Classification. Second Edition. London: Chapman and Hall / CRC
x <- runif(20)
y <- runif(20)
g <- rep(c("a","b"),10)
partition.crit(x,y,g)
data(bank)
# m is a homogeneity measure of each pairwise variable plot
m <- -colpairs(scale(bank[,-1]), partition.crit,gfun=gave,groups=bank[,1])
# Color panels by level of m and reorder variables so that
# pairs with high m are near the diagonal. Panels shown
# in pink have the highest amount of group homogeneity, as measured by
# gave.
cpairs(bank[,-1],order=order.single(m), panel.colors=dmat.color(m), gap=.3,col=c("purple","black")[bank[,"Status"]+1], pch=c(5,3)[bank[,"Status"]+1])
# Try a different measure
m <- -colpairs(scale(bank[,-1]), partition.crit,gfun=diameter,groups=bank[,1])
cpairs(bank[,-1],order=order.single(m), panel.colors=dmat.color(m), gap=.3,col=c("purple","black")[bank[,"Status"]+1], pch=c(5,3)[bank[,"Status"]+1])
# Result is the same, in this case.
Computes measures of profile smoothness of 2-d data, where x and y give the object coordinates.
pclen(x, y)
pcglen(x, y)
x | is a numeric vector. |
y | is a numeric vector. |
pclen computes the total line length in a parallel coordinate plot of x and y.
pcglen computes the average (per object) line length in a parallel coordinate plot where all pairs of objects are connected.
Usually, the data is standardized prior to using these functions.
The panel measure is returned.
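To make the notion of total line length concrete, the sketch below treats the two variables as parallel axes a fixed distance apart, so each object contributes one straight segment. The axis spacing (gap = 1) and the name pclen_sketch are assumptions made for illustration; the package may scale or define the length differently.

```r
# Hypothetical helper: total length of the segments joining each object's
# value on the first axis to its value on the second axis.
# Assumes the two axes are `gap` units apart.
pclen_sketch <- function(x, y, gap = 1) {
  sum(sqrt(gap^2 + (y - x)^2))
}

# Two objects: one flat segment (length 1) and one rising by 1 (length sqrt(2)).
pclen_sketch(c(0, 0), c(0, 1))

# Rescaling each variable to [0, 1] first keeps the segments comparable.
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
set.seed(2)
a <- rescale01(rnorm(10))
b <- rescale01(rnorm(10))
pclen_sketch(a, b)
```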
Catherine B. Hurley
Hurley, Catherine B. “Clustering Visualisations of Multidimensional Data”, Journal of Computational and Graphical Statistics, vol. 13, (4), pp 788-806, 2004.
cparcoord, colpairs, order.endlink.
x <- runif(20)
y <- runif(20)
pclen(x,y)
data(state)
mins <- apply(state.x77,2,min)
ranges <- apply(state.x77,2,max) - mins
state.m <- -colpairs(scale(state.x77,mins,ranges), pclen)
state.col <- dmat.color(state.m)
cparcoord(state.x77, panel.color= state.col)
# Get rid of the panels with long line segments (yellow) by reordering:
cparcoord(state.x77, order.endlink(state.m), state.col)
plotcolors plots a matrix of colors as an image or as points.
imageinfo is a utility that, given a matrix of colors, returns a structure useful for the image function.
plotcolors(cmat, na.color = "white", dlabels = NULL, rlabels = FALSE, clabels = FALSE, ptype = "image", border.color = "grey70", pch = 15, cex = 3, label.cex = 0.6, ...)
imageinfo(cmat)
cmat | a matrix of numbers; NAs are allowed. |
na.color | used for NAs in cmat. |
dlabels | vector of labels for the diagonals. |
rlabels | vector of labels for the rows. |
clabels | vector of labels for the columns. |
ptype | should be "image" or "points" |
border.color | color of border drawn around the plot. |
pch | point type used when ptype="points". |
cex | point cex used when ptype="points". |
label.cex | cex parameter used for labels. |
... | graphical parameters |
imageinfo returns a list with components:
x | a vector of x coordinates. |
y | a vector of y coordinates. |
z | a matrix containing values to be plotted. |
col | the colors to be used. |
Catherine B. Hurley
plotcolors(matrix(1:20,nrow=4,ncol=5))
plotcolors(matrix(1:20,nrow=4,ncol=5),ptype="points",cex=6)
plotcolors(matrix(1:20,nrow=4,ncol=5),rlabels = c("a","b","c","d"))
data(longley)
longley.cor <- cor(longley)
# A matrix with equal (or nearly equal) number of entries of each color.
longley.color <- dmat.color(longley.cor)
plotcolors(longley.color, dlabels=rownames(longley.color))
# Could also reorder variables prior to plotting:
longley.o <- order.single(longley.cor)
longley.color <- longley.color[longley.o,longley.o]
op <- par(mar=c(1,6,6,1))
plotcolors(longley.color,rlabels=rownames(longley.color),clabels=rownames(longley.color) )
par(op)
Reorders objects so that nearby object pairs are adjacent.
## S3 method for class 'hclust'
reorder(x,dis,...)
x | is the result of hclust. |
dis | is a distance matrix or dist. |
... | additional arguments. |
In hierarchical cluster displays, a decision is needed at each merge to specify which subtree should go on the left and which on the right. This algorithm uses the order suggested by Gruvaeus and Wainer (1972). At a merge of clusters A and B, the new cluster is one of (A,B), (A',B), (A,B'),(A',B'), where A' denotes A in reverse order. The new cluster is chosen to minimize the distance between the object in A placed adjacent to an object from B.
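The choice among the four orientations at a single merge can be sketched as follows. This illustrates only the rule described above, not the full reordering of a tree; merge_ordered is a hypothetical helper, and ties are broken by taking the first candidate.

```r
# A, B: vectors of object indices giving two ordered clusters.
# d: symmetric distance matrix between objects.
merge_ordered <- function(A, B, d) {
  candidates <- list(
    c(A, B),            # (A, B):   last of A next to first of B
    c(rev(A), B),       # (A', B):  first of A next to first of B
    c(A, rev(B)),       # (A, B'):  last of A next to last of B
    c(rev(A), rev(B))   # (A', B'): first of A next to last of B
  )
  nA <- length(A)
  # Distance across the junction between the two clusters.
  gap <- sapply(candidates, function(o) d[o[nA], o[nA + 1]])
  candidates[[which.min(gap)]]
}

# Four objects on a line at positions 1, 0, 2 and 3.
pos <- c(1, 0, 2, 3)
d <- as.matrix(dist(pos))
merge_ordered(c(1, 2), c(4, 3), d)  # returns 2 1 3 4: objects 1 and 3 meet
```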
A permutation of the objects represented by dis is returned.
Catherine B. Hurley
Hurley, Catherine B. “Clustering Visualisations of Multidimensional Data”, Journal of Computational and Graphical Statistics, vol. 13, (4), pp 788-806, 2004.
Gruvaeus, G. and Wainer, H. (1972), “Two Additions to Hierarchical Cluster Analysis”, British Journal of Mathematical and Statistical Psychology, 25, 200-206.
data(eurodist)
dis <- as.dist(eurodist)
hc <- hclust(dis, "ave")
layout(matrix(1:2,nrow=2,ncol=1))
op <- par(mar=c(1,1,1,1))
plot(hc)
hc1 <- reorder.hclust(hc, dis)
plot(hc1)
par(op)
layout(matrix(1,1))
# Both dendrograms correspond to the same tree structure,
# but the second one shows that
# Paris is closer to Cherbourg than Munich, and
# Rome is closer to Gibraltar than to Barcelona.
# We can also compare both orderings with an
# image plot of the colors.
# The second ordering seems to place nearby cities
# closer to each other.
layout(matrix(1:2,nrow=2,ncol=1))
op <- par(mar=c(1,6,1,1))
cmat <- dmat.color(eurodist, rev(cm.colors(5)))
plotcolors(cmat[hc$order,hc$order], rlabels=labels(eurodist)[hc$order])
plotcolors(cmat[hc1$order,hc1$order], rlabels=labels(eurodist)[hc1$order])
layout(matrix(1,1))
par(op)
vec2distm converts a vector to a distance matrix.
vec2dist converts a vector to a dist structure.
lower2upper.tri.inds is the same as lower.to.upper.tri.inds from package cluster. It computes an index vector for extracting or reordering a lower triangular matrix that is stored as a contiguous vector.
diag.off returns a vector of off-diagonal elements of a matrix. off specifies the distance above the main (0) diagonal.
clus2memship converts a list whose ith element contains the indices of objects in the ith cluster into a vector whose ith element gives the cluster number of the ith object.
memship2clus converts a vector whose ith element gives the cluster number of the ith object into a list whose ith element contains the indices of objects in the ith cluster.
vec2distm(vec)
vec2dist(vec)
lower2upper.tri.inds(n)
diag.off(m,off=1)
clus2memship(clusters)
memship2clus(memship)
vec | is a vector. |
n | is an integer > 1. |
m | is a matrix. |
clusters | is a list whose ith element contains the indices of the objects belonging to the ith cluster. |
off | is an integer specifying the distance above the main (0) diagonal. |
memship | is a vector whose ith element gives the cluster number of the ith object. |
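The two cluster representations carry the same information, which a short base R round trip makes visible. This is a sketch of the conversion idea using split and rep rather than the package functions, and it assumes cluster numbers run from 1 to k with no gaps.

```r
memship <- c(1, 3, 4, 2, 1, 4, 2, 3, 2, 3)

# Membership vector -> list of index vectors, one per cluster.
clusters <- unname(split(seq_along(memship), memship))
clusters[[1]]   # objects 1 and 5 belong to cluster 1
clusters[[2]]   # objects 4, 7 and 9 belong to cluster 2

# List of index vectors -> membership vector.
back <- integer(length(memship))
back[unlist(clusters)] <- rep(seq_along(clusters), lengths(clusters))
back            # same cluster numbers as memship
```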
Catherine B. Hurley
vec <- 1:15
vec2distm(vec)
vec2dist(vec)
diag.off(vec2distm(vec))
lower2upper.tri.inds(5)
clus2memship(list(c(1,3,5),c(2,6),4))
memship2clus(c(1,3,4,2,1,4,2,3,2,3))
Data from the machine learning repository. A chemical analysis of 178 Italian wines from three different cultivars yielded 13 measurements. This dataset is often used to test and compare the performance of various classification algorithms.
data(wine)
This data frame contains the following columns:
There are 3 classes
Alcohol
Malic acid
Ash
Alcalinity of ash
Magnesium
Total phenols
Flavanoids
Nonflavanoid phenols
Proanthocyanins
Color intensity
Hue
OD280/OD315 of diluted wines
Proline
Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.
Blake, C.L. and Merz, C.J. (1998), UCI Repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA: University of California, Department of Information and Computer Science.
The database does not list the variable names. These were located at http://www.radwin.org/michael/projects/learning/about-wine.html.