The greatest value of a picture is when it forces us to notice what we never expected to see.
– John Tukey
Often the most effective way to describe, explore, and summarize a set of numbers – even a large set – is to look at a picture of those numbers.
– Edward Tufte
A central goal of most data analyses is to determine how one or more response variables relate to a set of covariates. In many such studies, the covariates take on only a few distinct values. In others, “shingling” continuous covarates (Cleveland 1993, in which it is referred to as slicing) yields a similar effect. Factorial designed experiments in which the design factors may occur at only 2 or 3 levels apiece in order to keep the experimental size manageable, are a common example of this kind of setup.
Effective visualization is an essential part of the data analysis process. Searching for and interpreting both expected and anomalous patterns, checking for unusual data that may signal possible problems, and clearly communicating the results are all facilitated by good graphics. Conditional plots (Cleveland 1993), coplots, an example of what Tufte called “small multiples” (Tufte 1983), are widely used in such efforts. Trellis Graphics was developed as an implementation of these ideas by William Cleveland, Richard Becker, and colleagues at AT&T Bell Labs and became part the S language system. Subsequently, Deepayan Sarkar developed the lattice package (Sarkar 2008) for R (R Core Team 2015), as an open source implementation. Other R packages, e.g. ggplot2, provide similar capabilities that differ somewhat in the details of how the plots are arranged and annotated. Stripless adds some additional features to lattice, but the same features could very likely be added to others.
I illustrate the concepts with the famous barley data that Cleveland extensively studied in his book and which is available as one of R’s stock data sets in the datasets package. The barley data frame has 4 columns, named “yield”,“variety”, “year”, and “site” giving barley yields from 10 varieties of barley grown at 6 sites in Minnesota in 1931 and 32 – 120 yields (bushels/acre) in all. The variety, site, and year variables were coded as factors. However, rather than using the default alphanumeric order for the factor levels, they were reordered by their median yields. Thus, for example, since the median of all sites and varieties in 1931 exceeded that of 1932, 1932 is the first level of “year” and 1931 is the second. The following R code does this reordering:
As will be clear in a moment for readers not already familiar with Cleveland’s book, arranging plots in such informative orderings can reveal important relationships that would otherwise be practically impossible to see. Using this ordering scheme, here is lattice’s dotplot of the results (but using “as.table = TRUE” so that median yields increase from top to bottom; Cleveland’s display used the default bottom to top ordering).
dotplot(variety~yield|year*site,data=barley,
col = "darkblue",
scales = list(alter = 1),
as.table = TRUE)
Cleveland’s concern was “the Morris anomaly,” which reverses the patterns of 1931 yields being higher at all locations and interrupts the parallel increasing trends of location effects for each year. Cleveland makes a convincing case through graphics and further analysis that the two years’ results were mistakenly reversed at Morris.
However, my point here is to focus on the architecture of the
display, not its content. In particular, replicated strips for location
and year labels take up a lot of room, resulting in a squashed data
area. While this could be alleviated by fiddling with various parameters
of the display like font size, strip height, and so forth – or just
enlarging the overall display – the fact is that the labels really
aren’t necessary: the regularity of the layout permits a separate legend
to convey the panel identities without the labeling excess. This is what
the strucplot()
function of stripless does. Here is the
strucplot
version:
strucplot(variety~ yield|year*site,data=barley, horizontal=TRUE,
panel=panel.dotplot,
col = "darkblue",
scales = list(alternat = 1),
spacings = list(x=0, y=.5))
##
## PLOT STRUCTURE
##
## _Horizontal_
## year: 1932 1931
##
## _Vertical_
## site: Grand Rapids Duluth University Farm Morris Crookston Waseca
Note that a legend giving the plot structure is printed on the console.
Additionally, there is a print.stuctured()
method that can
be explicitly invoked to add it to a side of the plot or on a separate
plot page in addition to the console printout. The “Horizontal” list
gives the horizontal factors (only one here) and their levels in order
from left to right. For more than one factor, the first varies fastest,
the second next fastest, and so forth, as will be described more fully
in what follows. The “Vertical” list does the same for the vertical
factors (again, only one here) from top to bottom. I think most would
agree that strip identifying labels aren’t really necessary with this
legend.
I think sometimes a bar chart version of the dotplot makes it easier
to discern patterns, even though Tufte (1983) considers the bars
“chartjunk” because only the ends of the bars convey the information.
The stripless panel function panel.bars()
, which is just a
thin wrapper for the lattice panel.barchart()
function with
a few added defaults, provides a quick way to do this.
strucplot(variety~yield|year*site,data=barley, horizontal = TRUE,
panel=panel.bars,
scales = list(alter = 1),
spacings = list(x=0, y = .5))
##
## PLOT STRUCTURE
##
## _Horizontal_
## year: 1932 1931
##
## _Vertical_
## site: Grand Rapids Duluth University Farm Morris Crookston Waseca
By removing the strips and putting their information in the legend, the
displays are more compact and, I think, somewhat clearer. Of course,
this can only be done when the displays form visual table, i.e. a
regular array where the marginals of the array comprise the levels of
each factor. But when this is the case, the identifying information in
the strips becomes redundant and somewhat ‘chart-junky,’ especially when
there are more than just one or two factors.
With only three conditioning factors, one of which provides the
categories for the dotplot panels, the simple 6 x 2 layout works with or
without the panel strips. Here is an example illustrating the
difficulties with more factors using the hsb data in the faraway
package, which you should first install from your favorite CRAN package
repository (there are many interesting data sets in this package). It
consists of observations of classification variables (gender, race,
public or private school, etc.) and scores on a battery of 5
standardized pre-high school tests for 200 high school students. There
was interest in determining which factors influenced the high school
program – academic, vocational, or general – that students chose. Rather
then do a fine-grained analysis, here, I merely sum the 5 scores for
each student and use that total instead of the separate scores. As
before, reordering the levels of the categorical factors in some
sensible way rather than using the default lexicographic ordering is
advisable. I use the median total score for all the factors except
ses
, socioeconomic-economic status, which has a natural
ordering of “low”<“middle”<“high”.
hsb$score <- rowSums(hsb[,7:11])
hsb[,c(2,3,5,6)] <-lapply(hsb[,c(2,3,5,6)],reorder,hsb$score,median)
hsb$ses <- factor(hsb$ses,levels = c("low","middle","high"))
I want to display how the total scores vary with chosen program using the other 4 factors, gender, race, ses, and schooltype as conditioning variables. This means that there will be 2x4x3x2 = 48 panels in all. I use an 8 row x 6 column layout, with gender and ses varying horizontally and schtyp and race varying vertically:
hsbprog <- strucplot(prog~score|gender*race*ses*schtyp,data=hsb,
main = "High School Program Choice",
panel= panel.dotplot,
horizontal = TRUE, cex=.5,
col = "darkblue",
scales = list(alternat = 1),
xyLay= list(x=c(1,3), y = c(4,2)))
plot(hsbprog)
##
## PLOT STRUCTURE
##
## _Horizontal_
## gender: female male
## ses: low middle high
##
## _Vertical_
## schtyp: public private
## race: hispanic african-amer asian white
Here’s how the the legend and spacings of the panels identifies the
factors and their levels. The first horizontal factor is gender, so the
first column of juxtaposed paired panels (= 0 spaces between) is for
“female” and the second is for “male”, with overall
median “male” scores greater than overall median
“female” scores (which might of course be reversed within any particular
pair or row of pairs). The second horizontal factor is ses, so the first
pair of columns of juxtaposed panels is for “low” ses, the second is for
“middle”, and the third is for “high”. The same is true vertically: the
juxtaposed pairs of rows are for “public” above and “private” below; the
four pairs or rows are the four different levels of race, from
“hispanic” on top to “white” on bottom. Median overall
scores for each factor increase down the array.
It is thus immediately clear from this plot that:
Essentially all data are from “public” schools, so including schtype as a covariate is unlikely to be useful
Almost all the data are for race = “white”, so including race as a covariate is unlikely to be useful
There appears to be a clear association between choice of program and total score on the exams in the way one would expect (highest scores tend to choose “academic”; lowest choose “vocation”; middle choose “general”)
It appears that those in the “high” ses group tend to choose “academic”, again as one might expect
One should certainly now consider how to improve this display, perhaps using statistical methods to help refine or modify it. However, I think it stands on its own as a reasonably informative first look at the data.
So the question is: how well would this work with strip labeling included? Here’s a default version with the same overall size.
dotplot(prog~score|gender*ses*schtyp*race,data=hsb,
main = "High School Program Choice",
layout = c(6,8),
col = "darkblue",
scales = list(alter = 1),
as.table = TRUE)
Clearly, this is useless. Of course it could be made to work by, for
example, reducing the font size and vertical width in the labels,
enlarging the display size, perhaps encoding gender by symbol to reduce
the number of factors, and so forth. But having to work so hard to get
something useful makes the point: the strip labels have become chartjunk
that hide rather than reveal what the data have to say. Removing and
replacing them by a simple legend, which is made possible by the
regularity of the layout, is a better solution.
#Features and Syntax
strucplot()
is a S3 generic that is a wrapper for the
lattice package’s xyplot()
function. The default formula
method in the previous examples has basically just 2 new arguments,
spacings
and xyLayout
(there is also a 3rd,
center
used for 2 level experimental designs that will be
discussed later). A few xyplot
arguments, such as ‘layout’,
‘between’, ‘skip’,‘strip’, ‘perm.cond’, ‘index.cond’, and
‘drop.unused.levels’, are set by strucplot
and are
therefore ignored in a strucplot
call; but most other
xyplot
arguments, especially the ‘panel’ function,
‘scales’, ‘groups’, and those like ‘cex’, ‘pch’, and ‘col’ that control
the plot characteristics, are passed to xyplot
and so work
as expected.
The formula
argument is as in xyplot()
(whose Help page should be consulted as needed), except for an added
convenience shortcut: a “.” on the righthand side after the “|”
conditioning symbol means “all other variables in the data.frame
data
argument in their column order.” Consequently, it is
an error to omit the data
argument when “.” is used. When
the conditioning variables are explicitly given, they will be looked up
in the usual way.
spacings
Argumentspacings
controls the spacings in the display. It must
be a list with “x” and “y” components, each of which must be a
nondecreasing sequence of positive numbers. These are used in order to
specify the spacing (in character heights) between the levels of the
factors in each direction. For example, in the high school students
plot, the default spacing of 0:9 means that in the x direction, the
levels of gender (“male”, “female”) are 0 spaces apart (i.e. juxtaposed)
and the levels of ses (“low”,“medium”,“high”), the pairs of plots in
each row, are 1 space apart. Similarly for schtype and race vertically.
The default is 0:9 in both directions and usually does not need to be
changed. Note that depending on the device, fractional values can also
be used.
xyLayout
ArgumentxyLayout
controls the layout of the plot – which factors
are used for conditioning and in what order – in the x and y directions.
In its most general form, it is again a list with “x” and “y”
components, each of which must be a vector of indices into the sequence
of conditioning variables, i.e. the variables to the right of the “|” in
the formula
argument. The union of these vectors must be
the sequence 1,2, … ,n, where ‘n’ is the number of conditioning
variables; and they must not have any values in common. So in the
example, n = 4, and
list(x = 1, y = 2:4), list(x = c(1,3), y = c(4,2))
are both
acceptable; while list(x = 1, y = 3:4)
would not be.
There are also several different ways of abbreviating this
specification. The help file for strucplot
should be
consulted for details, but the idea is that anything that allows the
full list specification to be “figured out” is acceptable. So, for
example, xyLayout = list(x = 1)
or even
xyLayout = 1
is the same as
list(x = 1, y = 2:4)
. But note that
list(x = c(1,3))
means
list(x = c(1,3), y = c(2,4))
not
list(x = c(1,3), y = c(4,2))
; for the latter,
list( y = c(4,2))
would work.
As indicated above, the x component gives the factors that vary from
left to right with levels of the first factor spaced the closest, the
second the next closest, etc.. When as.table
is TRUE, which
is the default for strucplot
(the default for
xyplot
is FALSE), the y component does the same for the y
factors from top to bottom. If as.table
is
FALSE, then the order is from bottom to top.
strucplot()
returns an S3 object that inherits from
class c("structured","trellis")
. There is a
print.structured
method whose default is to print the
legend to the console and then call the trellis print/plot method to
make the plot, passing to it all arguments given in the call.
Consequently, all parameters and options of print.trellis
are available to any device supported by
trellis.device
.
Some limited customization of the console legend can be done. Most
important, factor names and/or levels can be shortened to “fit”” using
the abbrevLength
argument, which is used as the
minlength
argument to R’s abbreviate
function.
Additional optional arguments can be used to change default legend
titles and headings.
However, it may be desirable to show the legend directly on the
trellis plot. The optional legendLoc
and
legend
arguments enable this. The first can be one of the
character strings “left”, “right”, “top”, “bottom”, or “newpage”, and
specifies in which margin of the trellis display the legend should be
placed. The second must be a function that returns a
grid “grob”, a “graphical object” as described in the grid
package(Murrell 2011). This function is
passed the “structure” attribute of the object to be plotted, the
legendLoc
value, the abbrevLength
argument,
and plots the returned grob in the specified legendLoc. This is done by
adding or extending the legend
argument of the plot object,
so that so long as long as there is no conflict with any existing
key
or legend
component of the
strucplot
call that created it, the new legend will
peacefully coexist with any that are already part of the display. A
default legend function, defaultStrucLegend
produces a
legend similar to the console’s, but appropriately formatted for
whichever margin it appears in. It’s Help page should be consulted for
the full argument list that controls its appearance, some of which it
shares with the console legend.
As an example, here is the hsbprog plot with the legend added at the top of the plot with the title omitted, the heading changed, and the race and schtyp levels abbreviated to 4 letters in smaller font. Note that the console legend retains its title, “PLOT STRUCTURE” but applies the other changes.
print (hsbprog,legendLoc = "t",
cex.font = .7,
head = c("Left-Right","Up-Down"),
abbrev = list(race=c(0,4), schtyp = c(0,4)))
##
## PLOT STRUCTURE
##
## _Left-Right_
## gender: female male
## ses: low middle high
##
## _Up-Down_
## schtyp: pblc prvt
## race: hspn afr- asin whit
Of course, there is a tradeoff here: If, to minimize the space used, the
legend text is abbreviated so much that it is difficult to decipher, it
defeats its purpose. This means that one must consider:
The display format for the plot: how much space overall is available for the display.
The level of detail that must be clearly shown. Generally, one wants all the information in the plot to be clear, but sometimes a more compact plot in which some detail is lost is appropriate.
The amount of space needed for a clear and intelligible legend.
This may depend on the familiarity that the expected viewer has with the
subject matter!
One of the purposes of trellis displays is to make manifest
structural relationships between response(s) and covariates. A natural
application of this capability is to display predicted values of a
fitted model in a trellis plot. strucplot.lm
does this
directly for models inheriting from the “lm” class without having to
explicitly call strucplot
with the predicted values in a
formula. In addition to the fitted model object, it can take an optional
‘newdata’ argument to plot predicted values at data other than those
used in the fit. This argument is merely passed down to the relevant
predict method. An optional list of additional named arguments,
predictArgs, can also be specified if the predict
method
requires them.
For example, here is the result of fitting a simple additive model after first “correcting”
barley <- within(barley,{changed_year <- year
changed_year[site=="Morris" & year == 1932] <- 1931
changed_year[site=="Morris" & year == 1931] <- 1932}
)
barleyFit <- lm(yield ~ variety + site + changed_year, data=barley)
strucplot(barleyFit, pch=16, col = "darkblue",
panel= function(...){panel.grid(h=3,v=0); panel.xyplot(...)} )
##
## PLOT STRUCTURE
##
## _Horizontal_
## variety: Svansota No. 462 Manchuria No. 475 Velvet Peatland Glabron No. 457 Wisconsin No. 38 Trebi
## site: Grand Rapids Duluth University Farm Morris Crookston Waseca
##
## _Vertical_
## changed_year: 1932 1931
Note that the strucplot
method for a fitted model is of the
form strucplot(~ fitted(modelFit) | v1*v2*...)
, where the
conditioning variables after the “|” are the covariates used in the
fitted model. As in any strucplot
call, the plot layout can
be changed via the xyLayout
argument and other panel
functions can be used:
##
## PLOT STRUCTURE
##
## _Horizontal_
## variety: Svansota No. 462 Manchuria No. 475 Velvet Peatland Glabron No. 457 Wisconsin No. 38 Trebi
## changed_year: 1932 1931
##
## _Vertical_
## site: Grand Rapids Duluth University Farm Morris Crookston Waseca
However, if a different formula or only a subset of the variables is to be plotted, then the fitted values must be explicitly used in the formula. For example, to reproduce the original display plotting the fitted values in place of the actual values (with the corrected years), one would use:
strucplot(variety~ fitted(barleyFit)|changed_year*site,data=barley, horizontal=TRUE,
panel=panel.dotplot,
col = "darkblue",
scales = list(alternat = 1),
spacings = list(x=0, y=.5))
##
## PLOT STRUCTURE
##
## _Horizontal_
## changed_year: 1932 1931
##
## _Vertical_
## site: Grand Rapids Duluth University Farm Morris Crookston Waseca
data.frame
MethodOne of the motivations for this package was to allow better
visualization of factorial designs, especially the 2-level (fractional)
factorial designs widely used in industrial and business applications
(Box, Hunter, and Hunter 2005). Standard
approaches such as marginal plots of data or fits, and so-called
cube plots (e.g. as in Box, et. al. or the
cubePlot()
function in the R FrF2 package) may limit
insight into the full multivariate nature of the designs, because they
project onto lower dimensional subspaces – 1 or 2 dimensions for
marginals, 3 or 4 for cube plots. When there are more experimental
factors, it can be useful to extend the visualization to 5 or more
variables. The data.frame
method does this.
As an illustration, let’s compare a a 7 factor, 16 run, 2 level fractional factorial design with a 7 factor, 12 run Plackett-Burman design. To keep this vignette reasonably self-contained, the code below generates the design manually; but there are many R experimental design packages that will do this – and much more, of course! Consult the experimental design task view at https://cran.r-project.org/web/views/ExperimentalDesign.html for details.
## A 7 factor 12 run PB design matrix (using first 7 columns of PB 12 matrix)
pb12 <- matrix(1, nrow=12,ncol=11,dimnames = list(NULL,LETTERS[1:11]))
pb12[1,]<- c(-1,1,-1,rep(1,3),rep(-1,3),1,-1)
for(i in 1:10)pb12[i+1,] <- c(pb12[i,11],pb12[i,-11])
pbex <- data.frame(pb12[,1:7])
## A 7 factor 16 run ff of Resolution IV
ffex<- expand.grid(rep(list(c(-1,1)),4))
names(ffex) <- LETTERS[1:4]
ffex <- within(ffex,{G <- A*B*C; F <- B*C*D; E <- A*B*D})
strucplot(pbex)
##
## PLOT STRUCTURE
##
## _Horizontal_
## A: -1 1
## B: -1 1
## C: -1 1
## D: -1 1
##
## _Vertical_
## E: -1 1
## F: -1 1
## G: -1 1
##
## PLOT STRUCTURE
##
## _Horizontal_
## A: -1 1
## B: -1 1
## C: -1 1
## D: -1 1
##
## _Vertical_
## E: -1 1
## F: -1 1
## G: -1 1
As usual, the xyLayout
argument can be used to change the
plot layout.
I think these plots make several useful points.
center
argumentTwo level designs are geometrically a subset of the 2n corners of an n-dimensional hypercube, where n is the number of design variables. If all the variables are continuous, there is an awful lot of space in the middle where there is no information, i.e. design points. When this is the case, the designs are sometimes supplemented by adding a (often replicated) point in the “center” in which all the variables are set at a value between their extremes (though not always the midpoint). Plotting such a design as is would usually be a disaster: for instance, the 7 factor example would be a trellis plot with 37 = 2187 separate panels, most of which would be empty, too small to render properly, and would take forever to display.
To deal with this, a more compact display that omits all the empty
panels and places the single centerpoint panel in the middle of the plot
can be produced by adding the center = TRUE
argument to the
call. As an example, augment the Plackett-Burman design with a center
point and set the center
argument to TRUE.
##
## PLOT STRUCTURE
##
## _Horizontal_
## A: -1 0 1
## B: -1 -0.5 1
## C: -1 0.5 1
## D: -1 -0.5 1
##
## _Vertical_
## E: -1 0.5 1
## F: -1 -0.5 1
## G: -1 0.5 1
Note that the “center” point is shown in the exact center of the
display, even though the values of the individual factors are mostly not
at the midpoints of the extremes.
Let’s now use these plots to analyze some real data. The weldstrength
data set in the farwaway
package is a 9 factor, 16 run ff
(a 29-5) exploring the effect of 9 factors on the strengths
of welds. The low and high levels of the factors are coded as 0 and 1. A
barplot of the response with all 9 factors using the default xyLayout of
5 factors across the rows and 4 factors down the columns would be
produced by the following code, but the display is not shown: there
would be 2^9 = 512 separate panels, only a few of which contain
data.
##
## PLOT STRUCTURE
##
## _Horizontal_
## Rods: 0 1
## Drying: 0 1
## Material: 0 1
## Thickness: 0 1
## Angle: 0 1
##
## _Vertical_
## Opening: 0 1
## Current: 0 1
## Method: 0 1
## Preheating: 0 1
A reasonable alternative is to first run a regression to see if the
number of factors to display can be reduced to those few that may reveal
interesting structure.
## (Intercept) Rods Drying Material Thickness Angle
## 43.54 0.40 2.15 -3.10 0.13 0.05
## Opening Current Method Preheating
## -0.40 0.15 -0.15 -0.37
We see that the Drying and Material factors have by far the largest magnitude (standardized) coefficients, that Rods, Opening, and Preheating have marginal and about equal in magnitude effects, and that the remaining 4 factors seem likely to be nothing more than experimental noise. It seems reasonable to view a projection onto the 5 largest factors. Because the Drying and Material effects should be easy to see, I display these vertically and the remaining 3 smaller effects horizontally, so that the results can be more easily compared across the rows.
strucplot(~Strength|Rods*Drying*Material*Opening*Preheating, xyLay = c(1,4,5),
data = weldstrength, panel= panel.bars,
main = "Weldstrength Results for the 5 Largest Effects")
##
## PLOT STRUCTURE
##
## _Horizontal_
## Rods: 0 1
## Opening: 0 1
## Preheating: 0 1
##
## _Vertical_
## Drying: 0 1
## Material: 0 1
The positive Drying effect is evident in the consistent increase in
strength from the first to second row and third to fourth. Similary for
the negative Material effect in the reduction from the top 2 rows
(Material = 0) to the bottom 2 (Material =1).
The remaining 3 factors seem to have negligible effect except for the top right corner of the design. This is where the Material is “0” and Preheating is “1”. At this condition, there are (at least) 2 equivalent ways of describing what is seen in the display: either weld strength decreases by about 2 units (about 5%) when Opening increases and/or when Rods decreases. The inability to say which – or even to say whether it could just be experimental noise – is due to the design’s “parsimony” and the resulting inherent “confounding” of various effects. See (Box, Hunter, and Hunter 2005) for a fuller discussion of such matters.
However, as that discussion emphasizes, screening designs such as this are not meant nor expected to be conclusive. Rather, they enlist the aid of Mother Nature in whittling down a large list of possible sources of variability in a phenomenon under study to those few with greatest impact. These then would need to be further studied in more focused designs. The interpretation here seems reasonable and consistent with such a program. What is relevant for our purposes is that it was simply derived without recourse to complex statistical arguments from a minimal statistical analysis and effective data visualization. In many applications of experimental design especially in industry, such simplicity is a virtue.
One can also preserve real estate and/or more clearly show patterns
by using the groups
parameter to code a factor via color or
line type instead of as a conditioning factor of the display. I use the
sono data set in the faraway package to illustrate this. This is a 7
factor, 2 level fractional factorial. First, I show the design using a
grouping variable to color the panels by one of the factors. Note that
to do this, the groups must be explicitly given in the call since there
is no data argument in which they can be looked up for the data.frame
method.
strucplot(sono[,-c(1,8)],
groups=sono$Flask,
col=c("red","blue"),
main = "sono design, colored by Flask (red = down)"
)
##
## PLOT STRUCTURE
##
## _Horizontal_
## Molarity: down up
## Solute: down up
## pH: down up
##
## _Vertical_
## Gas: down up
## Water: down up
## Horn: down up
Using color nicely portrays the balance in the design. It also reduces
the number of panels by half, from 128 to 64. When plotting the
response, putting one of the variables on the x axis effects a
further reduction by half from 64 to 32. To illustrate this, the
following code plots the response (“Intensity”) versus Molarity in each
panel, still coloring by Flask. A key is used to give the group coding.
Note that one may have to fiddle a bit to do this, as lattice uses
different sets of parameters to control key features from those encoding
the graphical elements when the defaults are not used, as here.
strucplot(Intensity~Molarity|Solute*pH*Gas*Water*Horn, groups =Flask,
data = sono,
panel=panel.superpose,
par.settings = list(superpose.polygon = list(col=c("red","blue"))),
fill = c("red","blue"),
auto.key = list(points=FALSE,rectangles=TRUE,
title = "Flask",lines.title =1.5, border =TRUE,cex.title = 1.5),
panel.groups = function(x,y,fill,...) panel.bars(x=x,y=y,col=fill))
##
## PLOT STRUCTURE
##
## _Horizontal_
## Solute: down up
## pH: down up
## Gas: down up
##
## _Vertical_
## Water: down up
## Horn: down up
This plot clearly shows that except for the two combinations when
Molarity was “up”,Solute was “down”, pH was “up”, and Flask was “down”,
Intensity was about 80; for these two combinations, Intensity was about
350. Unfortunately, due to confounding in the design, it is not possible
to unequivocally untangle the effects of these four factors. This can be
seen by showing the full 16 panel plot in just these four factors.
##
## PLOT STRUCTURE
##
## _Horizontal_
## Molarity: down up
## Solute: down up
##
## _Vertical_
## pH: down up
## Flask: down up
There are only 8 unique combinations (it’s a half-fraction in these 4
factors) and only one combination shows an effect. This means that any
three of the factors suffice to characterize the effect (i.e., main
effects are confounded with 3 factor interactions). So for example,
dropping Flask:
##
## PLOT STRUCTURE
##
## _Horizontal_
## Molarity: down up
## Solute: down up
##
## _Vertical_
## pH: down up
So maybe Flask has no effect, and it’s just this combination of
Molarity, Solute, and pH that causes the high intensity. Only subject
matter expertise and/or further experimentation can resolve this.
Incidentally, it is perhaps worth again noting that while all of this can be elucidated in a formal statisical analysis, say by multiple regression, the visualization is both simple and sufficient to reveal what the data have to say. Of course, once one has gotten matters down to three or four factors, strip labels no longer take up so much space, so there is not as compelling a reason to remove them. Even so, some may prefer these “cleaner” displays.
When there are relatively few panels containing complex displays,
removing the strips may still reclaim visual real estate that faciltates
comparisons and better engages our pattern recognition machinery. To
illustrate, I show how to get strucplot
to make a somewhat
more complex display that requires a bit of “fiddling”: a histogram of
the data with an overlaid kernel density plot.1 The fiddling is
required because strucplot
is a wrapper for
xyplot
, which requires a y response versus an
x covariate. Axis scales, labels, etc. for the panels are all
constructed from the x, y data based on this paradigm.
However histograms and densities are constructed from a single
x variable by panel functions, so there is no y
variable. So to construct a display like a histogram in which the panel
function constructs the “y” values, one must use a formula of the form
x~x
and set y limits explicitly, which typically requires
several attempts to get it right.2 Here’s the code illustrating this, using
constructed data.
## Create a data frame with 4 factors, 2 at 3 levels each and 2 and 2 levels each,
## with 200 random beta(4,4) responses for each of the 36 settings
##Set a random seed for reproducibility and generate the data
set.seed(2222)
d <- data.frame(x=rbeta(7200,4,4),
expand.grid(a= 0:1, b= 1:3, c=letters[1:3], d = c("A","B"))[rep(1:36,e=200),])
## Construct the display
strucplot(x~x|.,data = d, ylim = c(0,3.2), ylab = "Density",
col= "lightblue", col.line = "red",lwd=2,
panel = function(x,y,col, col.line, ...){
panel.histogram(x, col=col,breaks=NULL, ...)
panel.densityplot(x, plot.points= FALSE,col= col.line,... )
}
)
##
## PLOT STRUCTURE
##
## _Horizontal_
## a: 0 1
## b: 1 2 3
##
## _Vertical_
## c: a b c
## d: A B
The absence of strip labels allows the eye to easily compare the complex
patterns among the panels (although there’s nothing much to see here, of
course). Strip labels within the panels would make this much more
difficult, aside from reducing the amount of visual real estate.
Using regular layouts and simple legends instead of layers of strip
labels to encode factor settings in trellis graphics often helps to
better reveal and apprehend informative patterns in the data. The
strucplot
methods described here are a simple extension of
the lattice package’s xyplot
framework that allows this to
be easily done. The examples show how this may be useful, especially for
the factorial designs widely used in industrial applications. But
multipanel conditioning – aka small multiples – is a broadly applicable
(and underutilized) data display paradigm; strucplot
is
meant to make stripless versions available for most situations in which
such displays are contemplated. I hope that this will not only be useful
in its own right, but that it also encourages greater application of
trellis-type multivariate visualization for the many data analysis tasks
where they can be of value.
Note that lattice’s densityplot()
and
histogram()
functions – and their corresponding panel
functions – are already perfectly capable of producing such displays.
But these are high level interfaces that bypass xyplot
.↩︎
The ~x
formula works only when no
x axis is needed for the display, e.g. for strip plots or box
plots of a vector of responses, or barplots of mean responses.↩︎