Package 'psych'

Title: Procedures for Psychological, Psychometric, and Personality Research
Description: A general purpose toolbox developed originally for personality, psychometric theory and experimental psychology. Functions are primarily for multivariate analysis and scale construction using factor analysis, principal component analysis, cluster analysis and reliability analysis, although others provide basic descriptive statistics. Item Response Theory is done using factor analysis of tetrachoric and polychoric correlations. Functions for analyzing data at multiple levels include within and between group statistics, including correlations and factor analysis. Validation and cross validation of scales developed using basic machine learning algorithms are provided, as are functions for simulating and testing particular item and test structures. Several functions serve as a useful front end for structural equation modeling. Graphical displays of path diagrams, including mediation models, factor analysis and structural equation models are created using basic graphics. Some of the functions are written to support a book on psychometric theory as well as publications in personality research. For more information, see the <https://personality-project.org/r/> web page.
Authors: William Revelle [aut, cre]
Maintainer: William Revelle <[email protected]>
License: GPL (>= 2)
Version: 2.4.3
Built: 2024-05-18 07:59:12 UTC
Source: CRAN

Help Index


A package for personality, psychometric, and psychological research

Description

Overview of the psych package.

The psych package has been developed at Northwestern University to include functions most useful for personality and psychological research. Some of the functions (e.g., read.file, read.clipboard, describe, pairs.panels, error.bars and error.dots) are useful for basic data entry and descriptive analyses. Use help(package="psych") or objects("package:psych") for a list of all functions. Two vignettes are included as part of the package. The intro vignette tells how to install psych and the overview vignette provides examples of using psych in many applications. In addition, there is a growing set of tutorials available on the https://personality-project.org/r/ web pages.

A companion package psychTools includes larger data set examples and four more vignettes.

Psychometric applications include routines for factor analysis (fa) using maximum likelihood (fm="mle"), minimum residual (fm="minres"), minimum rank (fm="minrank"), principal axes (fm="pa"), and weighted least squares (fm="wls") estimation, as well as functions to do Schmid Leiman transformations (schmid) to transform a hierarchical factor structure into a bifactor solution. Principal Components Analysis (pca) is also available. Rotations may be done using factor or component transformations to a target matrix, including the standard Promax transformation (Promax), a transformation to a cluster target, or to any simple target matrix (target.rot), as well as the ability to call many of the GPArotation functions (e.g., oblimin, quartimin, varimax, geomin, ...). Functions for determining the number of factors in a data matrix include Very Simple Structure (VSS) and Minimum Average Partial correlation (MAP).
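
For example, a minimal sketch of these calls, using the Thurstone correlation matrix supplied with the package (the particular argument values are illustrative, not recommendations):

f3 <- fa(Thurstone, nfactors = 3, n.obs = 213, fm = "minres", rotate = "oblimin")
f3.pa <- fa(Thurstone, nfactors = 3, n.obs = 213, fm = "pa", rotate = "varimax")
pc3 <- pca(Thurstone, nfactors = 3)    #principal components for comparison
sl <- schmid(Thurstone, nfactors = 3)  #Schmid Leiman transformation to a bifactor solution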

An alternative approach to factor analysis is Item Cluster Analysis (ICLUST). This function is particularly appropriate for exploratory scale construction.

There are a number of functions for finding various reliability coefficients (see Revelle and Condon, 2019). These include the traditional alpha (found for multiple scales, and with more useful output, by scoreItems and score.multiple.choice), beta (ICLUST), and both of McDonald's omega coefficients (omega, omegaSem and omega.diagram), as well as Guttman's six estimates of internal consistency reliability (guttman) and the six measures of intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss.
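
A brief sketch of these reliability functions, using a simulated correlation matrix purely for illustration:

R <- sim.hierarchical()    #a 9 variable correlation matrix with a general factor
alpha(R)                   #coefficient alpha and Guttman's Lambda 6
guttman(R)                 #Guttman's six estimates (and related coefficients)
omega(R, nfactors = 3)     #omega hierarchical and omega total
splitHalf(R)               #the distribution of split half reliabilities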

Multilevel analyses may be done by statsBy and multilevel.reliability.

The scoreItems and score.multiple.choice functions may be used to form single or multiple scales from sets of dichotomous, multilevel, or multiple choice items by specifying scoring keys. scoreOverlap corrects interscale correlations for overlapping items, so that it is possible to examine hierarchical or nested structures.
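
A minimal scoring sketch using the bfi data set from psychTools (the scales and keying directions shown here are merely illustrative):

if(require("psychTools")) {
  keys <- list(agree = c("-A1","A2","A3","A4","A5"),
               extraversion = c("-E1","-E2","E3","E4","E5"))
  scores <- scoreItems(keys, psychTools::bfi)  #scale scores, alpha, and scale intercorrelations
  scoreOverlap(keys, psychTools::bfi)          #scale correlations corrected for item overlap
}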

Scales can be formed that best predict (after cross validation) particular criteria using bestScales with either unit weights or correlation weights. This procedure, also called the BISCUIT algorithm (Best Items Scales that are Cross validated, Unit weighted, Informative, and Transparent), is a simple alternative to more complicated supervised machine learning algorithms.

Additional functions provide more convenient descriptions of item characteristics, including 1 and 2 parameter Item Response measures. The tetrachoric, polychoric and irt.fa functions are used to find 2 parameter descriptions of item functioning. scoreIrt, scoreIrt.1pl and scoreIrt.2pl do basic IRT based scoring.
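
A small IRT sketch on simulated dichotomous items (a demonstration only; the simulation parameters are arbitrary):

set.seed(17)
d9 <- sim.irt(nvar = 9, n = 500)    #simulate 500 responses to 9 dichotomous items
irt <- irt.fa(d9$items)             #difficulty and discrimination from tetrachoric factor analysis
th <- scoreIrt(irt, d9$items)       #IRT based scores (compare with simple totals)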

A number of procedures have been developed as part of the Synthetic Aperture Personality Assessment (SAPA https://www.sapa-project.org/) project. These routines facilitate forming and analyzing composite scales equivalent to using the raw data but doing so by combining within and between cluster/scale item correlations. These functions include extracting clusters from factor loading matrices (factor2cluster), synthetically forming clusters from correlation matrices (cluster.cor), and finding multiple (lmCor) and partial (partial.r) correlations from correlation matrices.

If forming empirical scales, or testing out multiple regressions, it is important to cross validate the results. crossValidation will do this on a different data set.

lmCor and mediate allow regressions and mediation analyses to be done from either raw data or from correlation matrices. If raw data are provided, these functions can also do moderation analyses.
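
For example, a regression sketch using the sat.act data set; the same model is then refit from a correlation matrix by supplying n.obs (a sketch only, assuming correlation matrix input may be passed through the data argument):

lmCor(SATV + SATQ ~ education + age, data = sat.act)        #standardized regression from raw data
R <- cor(sat.act[c("SATV","SATQ","education","age")], use = "pairwise")
lmCor(SATV + SATQ ~ education + age, data = R, n.obs = 700) #the same analysis from the correlations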

Functions to generate simulated data with particular structures include sim.circ (for circumplex structures), sim.item (for general structures) and sim.congeneric (for a specific demonstration of congeneric measurement). The functions sim.congeneric and sim.hierarchical can be used to create data sets with particular structural properties. A more general form for all of these is sim.structural for generating general structural models. These are discussed in more detail in the vignette (psych_for_sem).

Functions to apply various standard statistical tests include p.rep and its variants for testing the probability of replication, r.con for the confidence intervals of a correlation, and r.test to test single, paired, or sets of correlations.

In order to study diurnal or circadian variations in mood, it is helpful to use circular statistics. Functions to find the circular mean (circadian.mean), circular (phasic) correlations (circadian.cor) and the correlation between linear variables and circular variables (circadian.linear.cor) supplement a function to find the best fitting phase angle (cosinor) for measures taken with a fixed period (e.g., 24 hours).
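
A toy illustration of why the circular mean matters for time of day data (the peak times below are hypothetical):

peak.hours <- c(22, 23, 1, 2, 0, 23)  #hypothetical times (hours) of peak positive affect
mean(peak.hours)                      #the arithmetic mean is misleading for clock data
circadian.mean(peak.hours)            #the circular mean correctly falls near midnight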

A dynamic model of personality and motivation (the Cues-Tendency-Actions model) is included as cta.

A number of useful helper functions allow for data input (read.file) and data manipulation (cs and dfOrder).

The most recent development version of the package is always available for download as a source file from the repository at the PMC lab:

install.packages("psych", repos = "https://personality-project.org/r/", type="source").

This will provide the most recent version for PCs and Macs.

Details

Two vignettes (intro.pdf and scoring.pdf) are useful introductions to the package. They may be found as vignettes in R or may be downloaded from https://personality-project.org/r/psych/intro.pdf, https://personality-project.org/r/psych/overview.pdf and https://personality-project.org/r/psych/psych_for_sem.pdf. In addition, there are a number of "HowTo"s available at https://personality-project.org/r/

The more important functions in the package are for the analysis of multivariate data, with an emphasis upon those functions useful in scale construction of item composites. However, there are a number of very useful functions for basic data manipulation (read.file, read.clipboard, describe, pairs.panels, error.bars and error.dots) which are useful for basic data entry and descriptive analyses.

When given a set of items from a personality inventory, one goal is to combine these into higher level item composites. This leads to several questions:

1) What are the basic properties of the data? describe reports basic summary statistics (mean, sd, median, mad, range, minimum, maximum, skew, kurtosis, standard error) for vectors, columns of matrices, or data.frames. describeBy provides descriptive statistics, organized by one or more grouping variables. statsBy provides even more detail for data structured by groups including within and between correlation matrices, ICCs for group differences, as well as basic descriptive statistics organized by group.
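
For example, using the sat.act data set included in psych:

describe(sat.act)                                          #basic descriptive statistics
describeBy(sat.act, group = sat.act$gender)                #the same statistics, broken down by gender
sb <- statsBy(sat.act, group = "education", cors = TRUE)   #within and between group statistics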

pairs.panels shows scatter plot matrices (SPLOMs) as well as histograms and the Pearson correlation for scales or items. error.bars will plot variable means with associated confidence intervals. errorCircles will plot confidence intervals for both the x and y coordinates. corr.test will find the significance values for a matrix of correlations. error.dots creates a dot chart with confidence intervals.
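
Continuing with sat.act, a few of these graphical and inferential helpers:

pairs.panels(sat.act)   #scatter plot matrix with histograms and correlations
corr.test(sat.act)      #correlations, sample sizes, and p values
error.bars(sat.act)     #means with confidence intervals
error.dots(sat.act)     #a dot chart of means with confidence intervals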

2) What is the most appropriate number of item composites to form? After finding either standard Pearson correlations, or finding tetrachoric or polychoric correlations, the dimensionality of the correlation matrix may be examined. The number of factors/components problem is a standard question of factor analysis, cluster analysis, or principal components analysis. Unfortunately, there is no agreed upon answer. The Very Simple Structure (VSS) set of procedures has been proposed as one answer to the question of the optimal number of factors. Other procedures (VSS.scree, VSS.parallel, fa.parallel, and MAP) also address this question. nfactors combines several of these approaches into one convenient function. Unfortunately, there is no best answer to the problem.
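
A sketch applying several of these procedures to the same correlation matrix:

fa.parallel(Thurstone, n.obs = 213)  #scree plot plus parallel analysis
VSS(Thurstone, n.obs = 213)          #Very Simple Structure and the MAP criterion
nfactors(Thurstone, n.obs = 213)     #several of these criteria in one call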

3) What are the best composites to form? Although this may be answered using principal components (principal, aka pca), principal axis (factor.pa) or minimum residual (factor.minres) factor analysis (all part of the fa function) and to show the results graphically (fa.diagram), it is sometimes more useful to address this question using cluster analytic techniques. Previous versions of ICLUST (e.g., Revelle, 1979) have been shown to be particularly successful at forming maximally consistent and independent item composites. Graphical output from ICLUST.graph uses the Graphviz dot language and allows one to write files suitable for Graphviz. If Rgraphviz is available, these graphs can be done in R.
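
For example, a minimal ICLUST call (the diagram is drawn with base graphics, so Rgraphviz is not required):

ic <- iclust(Thurstone)                               #hierarchical item clustering; beta is reported for each cluster
fa.diagram(fa(Thurstone, nfactors = 3, n.obs = 213))  #graphical display of a comparable factor solution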

Graphical organizations of cluster and factor analysis output can be done using cluster.plot which plots items by cluster/factor loadings and assigns items to that dimension with the highest loading.

4) How well does a particular item composite reflect a single construct? This is a question of reliability and general factor saturation. Multiple solutions for this problem result in (Cronbach's) alpha (alpha, scoreItems), (Revelle's) Beta (ICLUST), and (McDonald's) omega (both omega hierarchical and omega total). Additional reliability estimates may be found in the guttman function.

This can also be examined by applying irt.fa Item Response Theory techniques using factor analysis of the tetrachoric or polychoric correlation matrices and converting the results into the standard two parameter parameterization of item difficulty and item discrimination. Information functions for the items suggest where they are most effective.

5) For some applications, data matrices are synthetically combined from sampling different items for different people. So called Synthetic Aperture Personality Assessment (SAPA) techniques allow the formation of large correlation or covariance matrices even though no one person has taken all of the items. To analyze such data sets, it is easy to form item composites based upon the covariance matrix of the items, rather than the original data set. These matrices may then be analyzed using a number of functions (e.g., cluster.cor, fa, ICLUST, pca, mat.regress, and factor2cluster).

6) More typically, one has a raw data set to analyze. alpha will report several reliability estimates as well as item-whole correlations for items forming a single scale, score.items will score data sets on multiple scales, reporting the scale scores, item-scale and scale-scale correlations, as well as coefficient alpha, alpha-1 and G6+. Using a ‘keys’ matrix (created by make.keys or by hand), scales can have overlapping or independent items. score.multiple.choice scores multiple choice items or converts multiple choice items to dichotomous (0/1) format for other functions.

If the scales have overlapping items, then scoreOverlap will give similar statistics, but correcting for the item overlap.

7) The reliability function combines the output from several different ways to estimate reliability including omega and splitHalf.

8) In addition to classical test theory (CTT) based scores of either totals or averages, 1 and 2 parameter IRT based scores may be found with scoreIrt.1pl, scoreIrt.2pl or more generally scoreIrt. Although highly correlated with CTT estimates, these scores take advantage of different item difficulties and are particularly appropriate for the problem of missing data.

9) If the data has a multilevel structure (e.g, items nested within time nested within subjects) the multilevel.reliability aka mlr function will estimate generalizability coefficients for data over subjects, subjects over time, etc. mlPlot will provide plots for each subject of items over time. mlArrange takes the conventional wide output format and converts it to the long format necessary for some multilevel functions. Other functions useful for multilevel data include statsBy and faBy.

An additional set of functions generates simulated data to meet certain structural properties. sim.anova produces data simulating a 3 way analysis of variance (ANOVA) or linear model with or without repeated measures. sim.item creates simple structure data, sim.circ will produce circumplex structured data, sim.dichot produces circumplex or simple structured data for dichotomous items. These item structures are useful for understanding the effects of skew and differential item endorsement on factor and cluster analytic solutions. sim.structural will produce correlation matrices and data matrices to match general structural models. (See the vignette).
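
A brief sketch of the simulation functions (all values chosen only for illustration):

set.seed(42)
simple <- sim.item(nvar = 12, nsub = 500)  #two independent factors with simple structure
circ <- sim.circ(nvar = 12, nsub = 500)    #circumplex structure
R.hier <- sim.hierarchical()               #a correlation matrix with a hierarchical structure
fa.parallel(simple)                        #e.g., check how many factors are recovered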

When examining personality items, some people like to discuss them as representing items in a two dimensional space with a circumplex structure. Tests of circumplex fit circ.tests have been developed. When representing items in a circumplex, it is convenient to view them in polar coordinates.

Additional functions test the difference between two independent or dependent correlations (r.test), find the phi or Yule coefficients from a two by two table, or find the confidence interval of a correlation coefficient.

Many data sets are included: bfi represents 25 personality items thought to represent five factors of personality, ability has 14 multiple choice iq items, and sat.act has data on self reported test scores by age and gender. galton is Galton's data set of the heights of parents and their children. peas recreates the original Galton data set of the genetics of sweet peas. heights and cubits provide even more Galton data, and vegetables provides the Guilford preference matrix of vegetables. cities provides airline miles between 11 US cities (demo data for multidimensional scaling).

Partial Index (to see the entire index, see the link at the bottom of every help page)

psych A package for personality, psychometric, and psychological research.

Useful data entry and descriptive statistics

read.file search for, find, and read from file
read.clipboard shortcut for reading from the clipboard
read.clipboard.csv shortcut for reading comma delimited files from clipboard
read.clipboard.lower shortcut for reading lower triangular matrices from the clipboard
read.clipboard.upper shortcut for reading upper triangular matrices from the clipboard
describe Basic descriptive statistics useful for psychometrics
describe.by Find summary statistics by groups
statsBy Find summary statistics by a grouping variable, including within and between correlation matrices.
mlArrange Change multilevel data from wide to long format
headtail combines the head and tail functions for showing data sets
pairs.panels SPLOM and correlations for a data matrix
corr.test Correlations, sample sizes, and p values for a data matrix
cor.plot graphically show the size of correlations in a correlation matrix
multi.hist Histograms and densities of multiple variables arranged in matrix form
skew Calculate skew for a vector, each column of a matrix, or data.frame
kurtosi Calculate kurtosis for a vector, each column of a matrix or dataframe
geometric.mean Find the geometric mean of a vector or columns of a data.frame
harmonic.mean Find the harmonic mean of a vector or columns of a data.frame
error.bars Plot means and error bars
error.bars.by Plot means and error bars for separate groups
error.crosses Two way error bars
interp.median Find the interpolated median, quartiles, or general quantiles.
rescale Rescale data to specified mean and standard deviation
table2df Convert a two dimensional table of counts to a matrix or data frame

Data reduction through cluster and factor analysis

fa Combined function for principal axis, minimum residual, weighted least squares,
and maximum likelihood factor analysis
factor.pa Do a principal Axis factor analysis (deprecated)
factor.minres Do a minimum residual factor analysis (deprecated)
factor.wls Do a weighted least squares factor analysis (deprecated)
fa.graph Show the results of a factor analysis or principal components analysis graphically
fa.diagram Show the results of a factor analysis without using Rgraphviz
fa.sort Sort a factor or principal components output
fa.extension Apply the Dwyer extension for factor loadings
principal Do an eigen value decomposition to find the principal components of a matrix
fa.parallel Scree test and Parallel analysis
fa.parallel.poly Scree test and Parallel analysis for polychoric matrices
factor.scores Estimate factor scores given a data matrix and factor loadings
guttman 8 different measures of reliability (6 from Guttman (1945))
irt.fa Apply factor analysis to dichotomous items to get IRT parameters
iclust Apply the ICLUST algorithm
ICLUST.diagram The base R graphics output function called by iclust
ICLUST.graph Graph the output from ICLUST using the dot language
ICLUST.rgraph Graph the output from ICLUST using rgraphviz
kaiser Apply kaiser normalization before rotating
reliability A wrapper function to find alpha, omega, split half, etc.
polychoric Find the polychoric correlations for items and find item thresholds
poly.mat Find the polychoric correlations for items (uses J. Fox's hetcor)
omega Calculate the omega estimate of factor saturation (requires the GPArotation package)
omega.graph Draw a hierarchical or Schmid Leiman orthogonalized solution (uses Rgraphviz)
partial.r Partial variables from a correlation matrix
predict Predict factor/component scores for new data
schmid Apply the Schmid Leiman transformation to a correlation matrix
scoreItems Combine items into multiple scales and find alpha
score.multiple.choice Combine items into multiple scales and find alpha and basic scale statistics
scoreOverlap Find item and scale statistics (similar to score.items) but correct for item overlap
lmCor Find Cohen's set correlation between two sets of variables (formerly called setCor)
smc Find the Squared Multiple Correlation (used for initial communality estimates)
tetrachoric Find tetrachoric correlations and item thresholds
polyserial Find polyserial and biserial correlations for item validity studies
mixed.cor Form a correlation matrix from continuous, polytomous, and dichotomous items
VSS Apply the Very Simple Structure criterion to determine the appropriate number of factors.
VSS.parallel Do a parallel analysis to determine the number of factors for a random matrix
VSS.plot Plot VSS output
VSS.scree Show the scree plot of the factor/principal components
MAP Apply the Velicer Minimum Absolute Partial criterion for number of factors

Functions for reliability analysis (some are listed above as well).

alpha Find coefficient alpha and Guttman Lambda 6 for a scale (see also score.items)
guttman 8 different measures of reliability (6 from Guttman (1945))
omega Calculate the omega estimates of reliability (requires the GPArotation package)
omegaSem Calculate the omega estimates of reliability using a Confirmatory model (requires the sem package)
ICC Intraclass correlation coefficients
score.items Combine items into multiple scales and find alpha
glb.algebraic The greatest lower bound found by an algebraic solution (requires Rcsdp). Written by Andreas Moeltner

Procedures particularly useful for Synthetic Aperture Personality Assessment

alpha Find coefficient alpha and Guttman Lambda 6 for a scale (see also score.items)
bestScales A bootstrap aggregation function for choosing most predictive unit weighted items
make.keys Create the keys file for score.items or cluster.cor
correct.cor Correct a correlation matrix for unreliability
count.pairwise Count the number of complete cases when doing pairwise correlations
cluster.cor find correlations of composite variables from larger matrix
cluster.loadings find correlations of items with composite variables from a larger matrix
eigen.loadings Find the loadings when doing an eigen value decomposition
fa Do a minimal residual or principal axis factor analysis and estimate factor scores
fa.extension Extend a factor analysis to a set of new variables
factor.pa Do a Principal Axis factor analysis and estimate factor scores
factor2cluster extract cluster definitions from factor loadings
factor.congruence Factor congruence coefficient
factor.fit How well does a factor model fit a correlation matrix
factor.model Reproduce a correlation matrix based upon the factor model
factor.residuals Fit = data - model
factor.rotate "Hand rotate" factors
guttman 8 different measures of reliability
lmCor standardized multiple regression from raw or correlation matrix input. Formerly called setCor
mat.regress standardized multiple regression from raw or correlation matrix input
polyserial polyserial and biserial correlations with massive missing data
tetrachoric Find tetrachoric correlations and item thresholds

Functions for generating simulated data sets

sim The basic simulation functions
sim.anova Generate 3 independent variables and 1 or more dependent variables for demonstrating ANOVA
and lm designs
sim.circ Generate a two dimensional circumplex item structure
sim.item Generate a two dimensional simple structure with particular item characteristics
sim.congeneric Generate a one factor congeneric reliability structure
sim.minor Simulate nfact major and nvar/2 minor factors
sim.structural Generate a multifactorial structural model
sim.irt Generate data for a 1, 2, 3 or 4 parameter logistic model
sim.VSS Generate simulated data for the factor model
phi.demo Create artificial data matrices for teaching purposes
sim.hierarchical Generate simulated correlation matrices with hierarchical or any structure
sim.spherical Generate three dimensional spherical data (generalization of circumplex to 3 space)

Graphical functions (require Rgraphviz) – deprecated

structure.graph Draw a sem or regression graph
fa.graph Draw the factor structure from a factor or principal components analysis
omega.graph Draw the factor structure from an omega analysis (either with or without the Schmid Leiman transformation)
ICLUST.graph Draw the tree diagram from ICLUST

Graphical functions that do not require Rgraphviz

diagram A general set of diagram functions.
structure.diagram Draw a sem or regression graph
fa.diagram Draw the factor structure from a factor or principal components analysis
omega.diagram Draw the factor structure from an omega analysis (either with or without the Schmid Leiman transformation)
ICLUST.diagram Draw the tree diagram from ICLUST
plot.psych A call to plot various types of output (e.g., from irt.fa, fa, omega, iclust)
cor.plot A heat map display of correlations
scatterHist Bivariate scatter plot and histograms
spider Spider and radar plots (circular displays of correlations)

Circular statistics (for circadian data analysis)

circadian.cor Find the correlation with e.g., mood and time of day
circadian.linear.cor Correlate a circular value with a linear value
circadian.mean Find the circular mean of each column of a data set
cosinor Find the best fitting phase angle for a circular data set

Miscellaneous functions

comorbidity Convert base rate and comorbidity to phi, Yule and tetrachoric
df2latex Convert a data.frame or matrix to a LaTeX table
dummy.code Convert categorical data to dummy codes
fisherz Apply the Fisher r to z transform
fisherz2r Apply the Fisher z to r transform
ICC Intraclass correlation coefficients
cortest.mat Test for equality of two matrices (see also cortest.normal, cortest.jennrich )
cortest.bartlett Test whether a matrix is an identity matrix
paired.r Test for the difference of two paired or two independent correlations
r.con Confidence intervals for correlation coefficients
r.test Test of significance of r, differences between rs.
p.rep The probability of replication given a p, r, t, or F
phi Find the phi coefficient of correlation from a 2 x 2 table
phi.demo Demonstrate the problem of phi coefficients with varying cut points
phi2poly Given a phi coefficient, what is the polychoric correlation
phi2poly.matrix Given a phi coefficient, what is the polychoric correlation (works on matrices)
polar Convert 2 dimensional factor loadings to polar coordinates.
scaling.fits Compares alternative scaling solutions and gives goodness of fits
scrub Basic data cleaning
tetrachor Finds tetrachoric correlations
thurstone Thurstone Case V scaling
tr Find the trace of a square matrix
wkappa weighted and unweighted versions of Cohen's kappa
Yule Find the Yule Q coefficient of correlation
Yule.inv What is the two by two table that produces a Yule Q with set marginals?
Yule2phi What is the phi coefficient corresponding to a Yule Q with set marginals?
Yule2tetra Convert one or a matrix of Yule coefficients to tetrachoric coefficients.

Functions that are under development and not recommended for casual use

irt.item.diff.rasch IRT estimate of item difficulty with assumption that theta = 0
irt.person.rasch Item Response Theory estimates of theta (ability) using a Rasch like model

Data sets included in the psych or psychTools package

bfi represents 25 personality items thought to represent five factors of personality
Thurstone 8 different data sets with a bifactor structure
cities The airline distances between 11 cities (used to demonstrate MDS)
epi.bfi 13 personality scales
iqitems 14 multiple choice iq items
msq 75 mood items
sat.act Self reported ACT and SAT Verbal and Quantitative scores by age and gender
Tucker Correlation matrix from Tucker
galton Galton's data set of the heights of parents and their children
heights Galton's data set of the relationship between height and forearm (cubit) length
cubits Galton's data table of height and forearm length
peas Galton's data set of the diameters of 700 parent and offspring sweet peas
vegetables Guilford's preference matrix of vegetables (used for thurstone)

A debugging function that may also be used as a demonstration of psych.

test.psych Run a test of the major functions on 5 different data sets. Primarily for development purposes,
although the output can be used as a demo of the various functions.

Note

Development versions (source code) of this package are maintained at the repository https://personality-project.org/r/ along with further documentation. Specify that you are downloading a source package.
Some functions require other packages. Specifically, omega and schmid require the GPArotation package; ICLUST.rgraph and fa.graph require Rgraphviz but have alternatives using the diagram functions, i.e.:

function requires
omega GPArotation
schmid GPArotation
ICLUST.rgraph Rgraphviz
fa.graph Rgraphviz
structure.graph Rgraphviz
glb.algebraic Rcsdp

Author(s)

William Revelle

References

A general guide to personality theory and research may be found at the personality-project https://personality-project.org/. See also the short guide to R at https://personality-project.org/r/. In addition, see

Revelle, W. (in preparation) An Introduction to Psychometric Theory with applications in R. Springer. at https://personality-project.org/r/book/

Revelle, W. and Condon, D.M. (2019) Reliability from alpha to omega: A tutorial. Psychological Assessment, 31, 12, 1395-1411. https://doi.org/10.1037/pas0000754. https://osf.io/preprints/psyarxiv/2y3w9/ Preprint available from PsyArxiv

Examples

#See the separate man pages and the complete index.
#to test most of the psych package run the following
#test.psych()

Find two estimates of reliability: Cronbach's alpha and Guttman's Lambda 6.

Description

Internal consistency measures of reliability range from \omega_h to \alpha to \omega_t. This function reports two estimates: Cronbach's coefficient \alpha and Guttman's \lambda_6. Also reported are item-whole correlations, \alpha if an item is omitted, and item means and standard deviations.

Usage

alpha(x, keys=NULL,cumulative=FALSE, title=NULL, max=10,na.rm = TRUE,
   check.keys=FALSE,n.iter=1,delete=TRUE,use="pairwise",warnings=TRUE,
   n.obs=NULL,impute=NULL, discrete=TRUE
   )
alpha.ci(alpha,n.obs,n.var=NULL,p.val=.05,digits=2) #returns an invisible object
alpha2r(alpha, n.var)

Arguments

x

A data.frame or matrix of data, or a covariance or correlation matrix

keys

If some items are to be reverse keyed, then either specify the direction of all items or just a vector of which items to reverse

title

Any text string to identify this run

cumulative

Should scores reflect the sum of the items or the mean of the items? The default (cumulative=FALSE) is to report means.

max

the number of categories/item to consider if reporting category frequencies. Defaults to 10; passed to response.frequencies

na.rm

The default is to remove missing values and find pairwise correlations

check.keys

if TRUE, then find the first principal component and reverse key items with negative loadings. Give a warning if this happens.

n.iter

Number of iterations if bootstrapped confidence intervals are desired

delete

Delete items with no variance and issue a warning

use

Options to pass to the cor function: "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs". The default is "pairwise"

warnings

By default print a warning and a message that items were reversed. Suppress the message if warnings = FALSE

alpha

The value to use for confidence intervals

n.obs

If using correlation matrices as input, specifying the number of observations allows confidence intervals to be found

impute

How should we impute missing data? Not at all, medians, or means

discrete

If TRUE, then report frequencies by categories.

n.var

Number of items in the scale (to find r.bar)

p.val

width of the confidence interval (p.val/2 to 1 - p.val/2)

digits

How many digits to use for alpha.ci

Details

Alpha is one of several estimates of the internal consistency reliability of a test.

Surprisingly, more than a century after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's \alpha (1951) underestimates the reliability of a test and overestimates the first factor saturation.

\alpha (Cronbach, 1951) is the same as Guttman's \lambda_3 (Guttman, 1945) and may be found by

\lambda_3 = \frac{n}{n-1}\Bigl(1 - \frac{tr(\vec{V}_x)}{V_x}\Bigr) = \frac{n}{n-1} \frac{V_x - tr(\vec{V}_x)}{V_x} = \alpha

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. Alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is “lumpy”) coefficients \beta (Revelle, 1979; see ICLUST) and \omega_h (see omega) are more appropriate estimates of the general factor saturation. \omega_t (see omega) is a better estimate of the reliability of the total test.
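
As a small numerical check of the formula (for a correlation matrix, V_x is the sum of all of its elements and tr(V_x) is the number of items; the simulated matrix is only for illustration):

R <- sim.hierarchical()            #a 9 x 9 correlation matrix
n <- ncol(R)
(n/(n - 1)) * (1 - tr(R)/sum(R))   #lambda 3 (alpha) computed directly from the formula
alpha(R)                           #the raw_alpha reported here should agree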

Guttman's Lambda 6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e_j^2, and is

\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum(1-r_{smc}^2)}{V_x}

The squared multiple correlation is a lower bound for the item communality and as the number of items increases, becomes a better estimate.

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. alpha is a generalization of an earlier estimate of reliability for tests with dichotomous items developed by Kuder and Richardson, known as KR20, and a shortcut approximation, KR21. (See Revelle, in prep).
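
The same kind of check can be done for lambda 6, estimating the item error variances from the squared multiple correlations (smc):

R <- sim.hierarchical()
1 - sum(1 - smc(R))/sum(R)   #lambda 6 from the formula (standardized items, so V_x = sum(R))
alpha(R)                     #the G6(smc) value reported here should agree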

Alpha and G6 are both positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.

A useful index of the quality of the test that is linear with the number of items and the average correlation is the Signal/Noise ratio where

s/n = \frac{n \bar{r}}{1-\bar{r}}

(Cronbach and Gleser, 1964; Revelle and Condon, 2019).
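
For example, with an average inter-item correlation of .3 and 10 items (numbers chosen arbitrarily):

r.bar <- .3
n <- 10
n * r.bar/(1 - r.bar)                #signal/noise ratio (about 4.3)
(n * r.bar)/(1 + (n - 1) * r.bar)    #for comparison, the standardized alpha implied by the same n and r.bar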

More complete reliability analyses of a single scale can be done using the omega function which finds \omega_h and \omega_t based upon a hierarchical factor analysis.

Alternative functions score.items and cluster.cor will also score multiple scales and report more useful statistics. “Standardized” alpha is calculated from the inter-item correlations and will differ from raw alpha.

Four alternative item-whole correlations are reported; three are conventional, one is unique. raw.r is the correlation of the item with the entire scale, not correcting for item overlap. std.r is the correlation of the item with the entire scale, if each item were standardized. r.drop is the correlation of the item with the scale composed of the remaining items. Although each of these is a conventional statistic, they have the disadvantage that a) item overlap inflates the first and b) the scale is different for each item when an item is dropped. Thus, the fourth alternative, r.cor, corrects for the item overlap by subtracting the item variance but then replaces this with the best estimate of common variance, the smc. This is similar to a suggestion by Cureton (1966).

If some items are to be reversed keyed then they can be specified by either item name or by item location. (Look at the 3rd and 4th examples.) Automatic reversal can also be done, and this is based upon the sign of the loadings on the first principal component (Example 5). This requires the check.keys option to be TRUE. Previous versions defaulted to have check.keys=TRUE, but some users complained that this made it too easy to find alpha without realizing that some items had been reversed (even though a warning was issued!). Thus, I have set the default to be check.keys=FALSE with a warning that some items need to be reversed (if this is the case). To suppress these warnings, set warnings=FALSE.

Scores are based upon the simple averages (or totals) of the items scored. Thus, if some items are missing, the scores reflect just the items answered. This is particularly problematic if using total scores (with the cumulative=TRUE option). To impute missing data using either means or medians, use the scoreItems function. Reversed items are subtracted from the maximum + minimum item response for all the items.

When using raw data, standard errors for the raw alpha are calculated using equations 2 and 3 from Duhachek and Iacobucci (2004). This is problematic because some simulations suggest these values are too small. It is probably better to use bootstrapped values.

alpha.ci finds confidence intervals using the Feldt et al. (1987) procedure. This procedure does not consider the internal structure of the test the way that the Duhachek and Iacobucci (2004) procedure does. That is, the latter considers the variance of the covariances, while the Feldt procedure is based upon just the mean covariance. In March, 2022, alpha.ci was finally fixed to follow the Feldt procedure. The confidence intervals reported by alpha use both the Feldt and the Duhachek and Iacobucci procedures. Note that these match for large values of N, but differ for smaller values.

Because both of these procedures use normal theory, if you really care about confidence intervals, using the boot option (n.iter > 1) is recommended.

Bootstrapped resamples are found if n.iter > 1. These are returned as the boot object. They may be plotted or described. The 2.5% and 97.5% values are returned in the boot.ci object.
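
A bootstrap sketch (raw data are needed for the resampling; the simulated congeneric items are only for illustration):

set.seed(42)
x <- sim.congeneric(N = 500, short = FALSE)$observed  #raw responses to four congeneric items
a <- alpha(x, n.iter = 200)                           #200 bootstrap resamples
a$boot.ci                                             #2.5%, median, and 97.5% values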

Value

total

a list containing

raw_alpha

alpha based upon the covariances

std.alpha

The standardized alpha based upon the correlations

G6(smc)

Guttman's Lambda 6 reliability

average_r

The average interitem correlation

median_r

The median interitem correlation

mean

For data matrices, the mean of the scale formed by averaging or summing the items (depending upon the cumulative option)

sd

For data matrices, the standard deviation of the total score

alpha.drop

A data frame with all of the above for the case of each item being removed one by one.

item.stats

A data frame including

n

number of complete cases for the item

raw.r

The correlation of each item with the total score, not corrected for item overlap.

std.r

The correlation of each item with the total score (not corrected for item overlap) if the items were all standardized

r.cor

Item whole correlation corrected for item overlap and scale reliability

r.drop

Item whole correlation for this item against the scale without this item

mean

for data matrices, the mean of each item

sd

For data matrices, the standard deviation of each item

response.freq

For data matrices, the frequency of each item response (if fewer than 20 categories). May be suppressed by specifying discrete=FALSE.

scores

Scores are by default simply the average response for all items that a participant took. If cumulative=TRUE, then these are sum scores. Note, this is dangerous if there are lots of missing values.

boot.ci

The lower, median, and upper ranges of the 95% confidence interval based upon the bootstrap.

boot

a 6 column by n.iter matrix of bootstrapped resampled values

Unidim

An index of unidimensionality

Fit

The fit of the off diagonal matrix

Note

If check.keys=TRUE, items that correlate negatively with the overall scale will be reverse coded (the default is check.keys=FALSE, which merely warns if such items are found). If items are reversed, then each item is subtracted from the minimum item response + maximum item response where min and max are taken over all items. Thus, if the items intentionally differ in range, the scores will be off by a constant. See scoreItems for a solution.

Two item level statistics are worth comparing: the mean interitem r and the median interitem r. If these differ very much, that is a sign that the scale is not particularly homogeneous.

Variables without variance do not contribute to reliability but do contribute to total score. They are dropped with a warning that they had no variance and were thus dropped. However the scores found still include these values in the calculations.

If the data have been preprocessed by the dplyr package, a strange error can occur. alpha expects either data.frames or matrix input. data.frames returned by dplyr have had three extra classes added to them which causes alpha to break. The solution is merely to change the class of the input to "data.frame".

Two experimental measures of Goodness of Fit are returned in the output: Unidim and Fit. They are not printed or displayed, but are available for analysis. The first is an index of how well the modeled average correlations actually reproduce the original correlation matrix. The second is how well the modeled correlations reproduce the off diagonal elements of the matrix. Both are indices of squared residuals compared to the squared original correlations. These two measures are under development and might well be modified or dropped in subsequent versions.

Author(s)

William Revelle

References

Cronbach, L.J. (1951) Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Cureton, E. (1966). Corrected item-test correlations. Psychometrika, 31(1):93-96.

Cronbach, L.J. and Gleser, G.C. (1964) The signal/noise ratio in the comparison of reliability coefficients. Educational and Psychological Measurement, 24 (3) 467-480.

Duhachek, A. and Iacobucci, D. (2004). Alpha's standard error (ase): An accurate and precise confidence interval estimate. Journal of Applied Psychology, 89(5):792-808.

Feldt, L. S., Woodruff, D. J., & Salih, F. A. (1987). Statistical inference for coefficient alpha. Applied Psychological Measurement (11) 93-103.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10 (4), 255-282.

Revelle, W. (in preparation) An introduction to psychometric theory with applications in R. Springer. (Available online at https://personality-project.org/r/book/).

Revelle, W. (1979) Hierarchical cluster analysis and the internal structure of tests. Multivariate Behavioral Research, 14, 57-74.

Revelle, W. and Condon, D.M. (2019) Reliability from alpha to omega: A tutorial. Psychological Assessment, 31, 12, 1395-1411. https://doi.org/10.1037/pas0000754. https://osf.io/preprints/psyarxiv/2y3w9 Preprint available from PsyArxiv

Revelle, W. and Condon, D.M. (2018) Reliability. In Irwing, P., Booth, T. and Hughes, D. (Eds). the Wiley-Blackwell Handbook of Psychometric Testing: A multidisciplinary reference on survey, scale, and test development.

Revelle, W. and Zinbarg, R. E. (2009) Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74 (1) 145-154.

See Also

omega, ICLUST, guttman, scoreItems, cluster.cor

Examples

set.seed(42) #keep the same starting values
#four congeneric measures
r4 <- sim.congeneric()
alpha(r4)
#nine hierarchical measures -- should actually use omega
r9 <- sim.hierarchical()
alpha(r9)

# examples of two independent factors that produce reasonable alphas
#this is a case where alpha is a poor indicator of unidimensionality

two.f <- sim.item(8)
#specify which items to reverse key by name
 alpha(two.f,keys=c("V3","V4","V5","V6"))
 cov.two <- cov(two.f)
 alpha(cov.two,check.keys=TRUE)
 #automatic reversal based upon the first principal component
alpha(two.f,check.keys=TRUE)    #note that the median is much less than the average R
#this suggests (correctly) that the 1 factor model is probably wrong 
#an example with discrete item responses  -- show the frequencies
items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,
        categorical=TRUE) #500 responses to 4 discrete items with 5 categories
a4 <- alpha(items$observed)  #item response analysis of congeneric measures
a4
#summary just gives Alpha
summary(a4)

alpha2r(alpha = .74,n.var=4)

#because alpha.ci returns an invisible object, you need to print it
print(alpha.ci(.74, 100,p.val=.05,n.var=4))

Model comparison for regression, mediation, and factor analysis

Description

When doing regressions from the data or from a correlation matrix using lmCor (formerly setCor) or doing a mediation analysis using mediate, it is useful to compare alternative models. Since these are both regression models, the appropriate test is an Analysis of Variance. Similar tests, using Chi Square, may be done for factor analytic models.

Usage

## S3 method for class 'psych'
anova(object,...)

Arguments

object

An object from setCor, mediate, omega, or fa.

...

More objects of the same type may be supplied here

Details

setCor returns the SE.residual and degrees of freedom. These are converted to SSR and then an analysis of variance is used to compare two (or more) models. For omega or fa the change in the ML chisquare statistic as a function of change in df is reported.

Value

An ANOVA table comparing the models.

Note

The code has been adapted from the anova.lm function in stats and the anova.sem by John Fox.

Author(s)

William Revelle

See Also

setCor, mediate, omega, fa

Examples

if(require("psychTools")) {
m1 <- lmCor(reaction ~ import, data = Tal_Or,std=FALSE)
m2 <- lmCor(reaction ~ import+pmi, data = Tal_Or,std=FALSE)
m3 <- lmCor(reaction ~ import+pmi + cond, data = Tal_Or,std=FALSE)
anova(m1,m2,m3)
}


#Several interesting test cases are taken from analyses of the Spengler data set
#Although the sample sizes are actually very large in the first wave,  I use the
#sample sizes from the last wave 
#This data set is actually in psychTools but is copied here until we can update psychTools
#We set the n.iter to be 50 instead of the default value of 5,000
if(require("psychTools")) {

 mod1 <- mediate(Income.50 ~ IQ + Parental+ (Ed.11) ,data=Spengler,
    n.obs = 1952, n.iter=50)
 mod2 <- mediate(Income.50 ~ IQ + Parental+ (Ed.11)  + (Income.11)
  ,data=Spengler,n.obs = 1952, n.iter=50)

#Now, compare these models
anova(mod1,mod2)
}

f3 <- fa(Thurstone,3,n.obs=213)  #we need to specify the n.obs for the test to work
f2 <- fa(Thurstone,2, n.obs=213)
anova(f2,f3)

Decision Theory measures of specificity, sensitivity, and d prime

Description

In many fields, decisions and outcomes are categorical even though the underlying phenomena are probably continuous. E.g., students are accepted to graduate school or not, and they finish or not. X-rays are read as showing that patients have cancer or not. Outcomes of such decisions are usually labeled as Valid Positives, Valid Negatives, False Positives and False Negatives. In hypothesis testing, False Positives are known as Type I errors, while False Negatives are Type II errors. The relationship between these four cells depends upon the correlation between the decision rule and the outcome as well as the level of evidence needed for a decision (the criterion). Signal Detection Theory and Decision Theory have a number of related measures of performance: accuracy (VP + VN), Sensitivity (VP/(VP + FN)), Specificity (VN/(FP + VN)), d prime (d'), and the area under the Receiver Operating Characteristic curve (AUC). More generally, these are examples of correlations based upon dichotomous data. The AUC function addresses some of these questions.

Usage

AUC(t=NULL,BR=NULL,SR=NULL,Phi=NULL,VP=NULL,labels=NULL,plot="b",zero=TRUE,correct=.5, 
     col=c("blue","red"))

Arguments

t

a 4 x 1 vector or a 2 x 2 table of TP, FP, FN, TN values (see below). May be counts or proportions.

BR

Base Rate of successful outcomes or actual symptom (if t is not specified)

SR

Selection Rate for candidates or diagnoses (if t is not specified)

Phi

The Phi correlation coefficient between the predictor and the outcome variable (if t is not specified)

VP

The number of Valid Positives (selected applicants who succeed; correct diagnoses).(if t and Phi are not specified)

labels

Names of variables 1 and 2

plot

"b" (both), "d" (decision theory), "a" (auc), or "n" neither

zero

If TRUE, then the noise distribution is centered at zero

correct

Cell values of 0 are replaced with correct. (See tetrachoric for a discussion of why this is needed.)

col

The color choice for the VP and FP; defaults to c("blue","red") but could be c("grey","black") if we want to avoid colors

Details

The problem of making binary decisions about the state of the world is ubiquitous. We see this in Null Hypothesis Significance Testing (NHST), medical diagnoses, and selection for occupations. Variously known as NHST, Signal Detection Theory, clinical Assessment, or college admissions, all of these domains share the same two x two decision task.

Although the underlying phenomena are probably continuous, a typical decision or diagnostic situation makes dichotomous decisions: Accept or Reject, correctly identified, incorrectly identified. In Signal Detection Theory, the world has two states: Noise versus Signal + Noise. The decision is whether there is a signal or not.

In diagnoses, it is whether to diagnose an illness or not given some noisy signal (e.g., an X-Ray, a set of diagnostic tests).

In college admissions, we accept some students and reject others. Four or five years later we observe who "succeeds" or graduates.

All of these decisions lead to four cells based upon a two x two categorization. Given the true state of the world is Positive or Negative, and a rater assigns positive or negative ratings, then the resulting two by two table has True (Valid) Positives and True (Valid) Negatives on the diagonal and False Positives and False Negatives off the diagonal.

When expressed as percentages of the total, then Base Rates (BR) depend upon the state of the world, but Selection Ratios (SR) are under the control of the person making the decision and affect the number of False Positives and the number of Valid Positives.

Given a two x two table of counts or percentages

             Decide +           Decide -
True +       Valid Positive     False Negative         Base Rate
True -       False Positive     Valid Negative         1 - Base Rate
             Selection Ratio    1 - Selection Ratio    (Total N)

Unfortunately, although this way of categorizing the data is typical in assessment (e.g., Wiggins 1973), and everything is expressed as percentages of the total, in some decision papers, VP are expressed as the ratio of VP to total positive decisions (e.g., Wickens, 1984). This requires dividing through by the column totals (and represented as VP* and FP* in the table below).

The relationships implied by these data can be summarized as a phi or tetrachoric correlation between the raters and the world, or as a decision process with several alternative measures. If we make the assumption that the two dimensions are continuous and were artificially dichotomised, then the tetrachoric correlation is an estimate of the continuous correlation between these two latent dimensions. If we think of the data as truly representing two states, e.g., vaccinated or not vaccinated, dead or alive, then the phi coefficient is more appropriate.

Sensitivity, Specificity, Accuracy, Area Under the Curve (AUC), and d' (d prime) may be defined as

Measure Definition
Sensitivity VP/(VP+ FN)
Specificity VN/(FP + VN)
Accuracy VP + VN
VP* VP/(VP + FP)
FP* FP/(VP + FP)
d' z(VP*) - z(FP*)
d' sqrt(2) z(AUC)
beta prob(X/S)/(prob(X/N))

Although only one point is found, we can form a graphical display of VP versus FP as a smooth curve as a function of the decision criterion. The smooth curve assumes normality, whereas the alternative is simply the two line segments connecting the points (0,0), (FP,VP), and (1,1). The resulting correlation between the inferred continuous state of the world and the dichotomous decision process is a biserial correlation.

When using table input, the values can be counts and thus greater than 1 or merely probabilities which should add up to 1. Base Rates and Selection Ratios are proportions and thus less than 1.
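
As a worked example of these definitions, using the counts from the first Metz example below, read in the TP, FP, FN, TN order described for the t argument (the d' line uses the conventional z(hit rate) - z(false alarm rate) form, which need not match the VP*/FP* convention in the table above exactly):

VP <- 140; FP <- 60; FN <- 100; VN <- 900
VP/(VP + FN)                               #Sensitivity
VN/(FP + VN)                               #Specificity
(VP + VN)/(VP + FP + FN + VN)              #Accuracy, as a proportion of all cases
qnorm(VP/(VP + FN)) - qnorm(FP/(FP + VN))  #d' from z(hit rate) - z(false alarm rate)
AUC(c(VP, FP, FN, VN))                     #compare with the full AUC output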

Value

phi

Phi coefficient of the two by two table

tetra

Tetrachoric (latent) coefficient inferred from the two by two table

r.bis

Biserial correlation of continuous state of world with decision

observed

The observed input (as a check)

probabilities

Observed values/ total number of observations

conditional

prob / rowSums(prob)

Accuracy

percentage of True Positives + True Negatives

Sensitivity

VP/(VP + FN)

Specificity

VN/(FP + VN)

d.prime

the standardized (z) difference between the Valid Positive and False Positive rates (see the definition of d' above)

beta

ratio of ordinates at the decision point

Author(s)

William Revelle

References

Metz, C.E. (1978) Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8, 283-298.

Wiggins, Jerry S. (1973) Personality and Prediction: Principles of Personality Assessment. Addison-Wesley.

Wickens, Christopher D. (1984) Engineering Psychology and Human Performance. Merrill.

See Also

phi, phi2tetra ,Yule, Yule.inv Yule2phi, tetrachoric and polychoric, comorbidity

Examples

AUC(c(30,20,20,30))  #specify the table input
AUC(c(140,60,100,900)) #Metz example with colors
AUC(c(140,60,100,900),col=c("grey","black"))  #Metz example 1 no colors
AUC(c(80,120,40, 960)) #Metz example 2  Note how the accuracies are the same but d's differ
AUC(c(49,40,79,336)) #Wiggins p 249
AUC(BR=.05,SR=.254,Phi = .317) #Wiggins 251 extreme Base Rates

The Bass-Ackward factoring algorithm discussed by Goldberg

Description

Goldberg (2006) described a hierarchical factor structure organization from the “top down". The original idea was to do successive factor analyses from 1 to nf factors, organized by factor score correlations from one level to the next. Waller (2007) discussed a simple way of doing this for components without finding the scores. Using the factor correlations (following Gorsuch) to organize the factors hierarchically, results may be organized at many different levels. The algorithm may be applied to principal components (pca) or to true factor analysis.

Usage

bassAckward(r, nfactors = 1, fm = "minres", rotate = "oblimin", scores = "tenBerge",
   adjust=TRUE, plot=TRUE,cut=.3, use = "pairwise", cor = "cor", weight = NULL,
   correct = 0.5,...)
bassAckward.diagram(x,digits=2,cut = .3,labels=NULL,marg=c(1.5,.5,1.0,.5),
   main="BassAckward",items=TRUE,sort=TRUE,lr=TRUE,curves=FALSE,organize=TRUE,
   values=FALSE,...)

Arguments

r

A correlation matrix or a data matrix suitable for factoring

nfactors

Factors from 1 to nfactors will be extracted. If nfactors is a vector, then just the number of factors specified in the vector will be extracted. (See examples.)

fm

Factor method. The default is 'minres' factoring. To be consistent with the original Goldberg article, principal components may also be used (fm="pca").

rotate

What type of rotation to apply. The default for factors is oblimin. Unlike the normal call to pca where the default is varimax, in bassAckward the default for pca is oblimin.

scores

What factor scoring algorithm should be used. The default is "tenBerge", other possibilities include "regression", or "bartlett"

adjust

If using any scoring procedure other than "tenBerge", should we adjust the correlations for the lack of factor score fit?

plot

By default draw a bassAckward diagram

use

How to treat missing data. use="pairwise" finds pairwise complete correlations.

cor

What kind of correlation to find. The default is Pearson.

weight

Should cases be weighted? Default, no.

correct

If finding tetrachoric or polychoric correlations, what correction should be applied to empty cells (defaults to .5)

x

The object returned by bassAckward

digits

Number of digits to display on each path

cut

Values greater than the abs(cut) will be displayed in a path diagram.

labels

Labels may be taken from the output of the bassAckward function or can be specified as a list.

marg

Margins are set to be slightly bigger than normal to allow for a cleaner diagram

main

The main title for the figure

items

if TRUE, show the items associated with the factors

sort

if TRUE, sort the items by factor loadings

lr

Should the graphic be drawn left to right or top to bottom

curves

Should we show the correlations between factors at the same level

organize

Rename and sort the factors at two lowest levels for a more pleasing figure

values

If TRUE, then show the percent variance accounted for by this factor.

...

Other graphic parameters (e.g., cex)

Details

This is essentially a wrapper that calls fa (or pca) together with faCor. The functions are called repeatedly and then the weights from the resulting solutions are used to find the factor/component correlations.

Although the default is to do all factor solutions from 1 to nfactors, this can be simplified by specifying just some of the factor solutions. Thus, for the 135 items of the spi, it is more reasonable to ask for the 3, 5, and 27 factor solutions.

The function bassAckward.diagram may be called using the diagram function or may be called directly.

The output from bassAckward.diagram is the sorted factor structure suitable for using fa.lookup.

Although not particularly pretty, it is possible to do Schmid-Leiman rotations at each level. Specify the rotation as rotate="schmid".
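As a rough, hedged sketch of what this wrapper does (assuming the GPArotation package is available for the default oblimin rotation), the successive solutions and their cross-level correlations may be approximated by calling fa and faCor directly:

f2 <- fa(Thurstone, 2)                 #a two factor solution
f3 <- fa(Thurstone, 3)                 #a three factor solution
faCor(Thurstone, nfactors = c(2, 3))   #correlate the factors across the two levels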

Value

Call

Echo the call

fm

Echoes the factor method used

fa

A list of the factor loadings at each level

bass.ack

A list of the factor correlations at each level

summary

The factors at each level

sumnames

Summary of the factor names

labels

Factor labels including items for each level

r

The correlation matrix analyzed

Phi

The factor correlations at each level

fa

The factor analysis loadings at each level, now includes Phi values

fa.vac

The variance accounted for by each factor at each level

Note

Goldberg calculated factor/component scores and then correlated these. Waller suggests just looking at the unrotated components and then examining the correlations when rotating different numbers of components. I do not follow the Waller procedure, but rather find successive factors and then find factor/component correlations following Gorsuch.

It is important to note that the BassAckward solution is not a hierarchical solution in the standard meaning. The factors are not factors of factors as is found in a hierarchical model (e.g., sim.hierarchical or omega); rather, the solution is merely a way of organizing solutions with different numbers of factors. In each case, unlike omega, the factors are factors of the original variables, not of the lower level factors. Thus, detailed statistics for any level of the hierarchy may be found by doing a factoring with that particular number of factors.

To find basic statistics for the multiple factorings, examine the fa object. For more detail just do the factoring at that level.

To see the items associated with the lowest level factors, use fa.lookup. For the items associated with other levels, use fa.lookup specifying the level. (See examples.)

Author(s)

William Revelle

References

Goldberg, L.R. (2006) Doing it all Bass-Ackwards: The development of hierarchical factor structures from the top down. Journal of Research in Personality, 40, 4, 347-358.

Gorsuch, Richard, (1983) Factor Analysis. Lawrence Erlebaum Associates.

Revelle, William. (in prep) An introduction to psychometric theory with applications in R. Springer. Working draft available at https://personality-project.org/r/book/

Waller, N. (2007), A general method for computing hierarchical component structures by Goldberg's Bass-Ackwards method, Journal of Research in Personality, 41, 4, 745-752, DOI: 10.1016/j.jrp.2006.08.005

See Also

fa, pca, omega and iclust for alternative hierarchical solutions. fa.lookup to show the items in the lowest level of the solution.

Examples

bassAckward(Thurstone,4,main="Thurstone data set")
f.labels <- list(level1=cs(Approach,Avoid),level2=cs(PosAffect,NegAffect,Constraint), 
level3 = cs(Extraversion,Agreeableness,Neuroticism,Conscientiousness,Openness))

ba <- bassAckward(psychTools::bfi[1:25],c(2,3,5),labels=f.labels,
        main="bfi data set from psychTools", values=TRUE)
print(ba,short=FALSE)
#show the items associated with the 5 level solution
fa.lookup(ba,dictionary=psychTools::bfi.dictionary)
#now show the items associated with the 3 level solution
fa.lookup(ba$fa[[2]],dictionary=psychTools::bfi.dictionary)
# compare the 3 factor solution to what we get by extracting 3 factors directly
f3 <- fa(psychTools::bfi[1:25],3)
f3$loadings - ba$fa[[2]]$loadings   # these are the same
 #do pca instead of factors  just summarize, don't plot
summary(bassAckward(psychTools::bfi[1:25],c(1,3,5,7),fm="pca",
       main="Components",plot=FALSE))
##not run, but useful example
  
f.labels <- list(level1 = cs(Neg,Pos,Constrain),level2 = cs(Extra,Neuro,Cons,Open,Agree),
level3 = cs(attnseeking,sociability,impulsivity,
   charisma,sensationseek,emotexpr,humor,anxiety,
   emotstab,irritability,wellbeing,industry,order,author,honesty,perfect,easygoing,
   selfcontrol,conservatism,creativity,introspect,art,
   intellect,conform,adaptability,compassion,trust))
 
sp5 <- bassAckward(psychTools::spi[11:145], c(3,5,27),labels=f.labels,
           main="spi data set from psychTools")

Seven data sets showing a bifactor solution.

Description

Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger data sets are the original 14 x 14 matrix from their paper as well as a 9 x 9 matrix used as an example by Joreskog. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

Usage

data(Thurstone)
data(Thurstone.33)
data(Thurstone.33G)
data(Thurstone.9)
data(Holzinger)
data(Holzinger.9)
data(Bechtoldt)
data(Bechtoldt.1)
data(Bechtoldt.2)
data(Reise)

Details

Holzinger and Swineford (1937) introduced the bifactor model (one general factor and several group factors) for mental abilities. This is a nice demonstration data set of a hierarchical factor structure that can be analyzed using the omega function or using sem. The bifactor model is typically used in measures of cognitive ability.

There are several ways to analyze such data. One is to use the omega function to do a hierarchical factoring using the Schmid-Leiman transformation. This can then be done as an exploratory and then as a confirmatory model using omegaSem. Another way is to do a regular factor analysis and use either a bifactor or biquartimin rotation. These latter two functions implement the Jennrich and Bentler (2011) bifactor and biquartimin transformations. The bifactor rotation suffers from the problem of local minima (Mansolf and Reise, 2016) and thus a mixture of exploratory and confirmatory analysis might be preferred.
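A minimal sketch of these alternatives (assuming GPArotation is installed; see also the Examples section below):

om <- omega(Holzinger, 4)                    #exploratory Schmid-Leiman solution
bf <- fa(Holzinger, 4, rotate = "bifactor")  #Jennrich-Bentler analytic bifactor rotation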

The 14 variables are ordered to reflect 3 spatial tests, 3 mental speed tests, 4 motor speed tests, and 4 verbal tests. The sample size is 355.

Another data set from Holzinger (Holzinger.9) represents 9 cognitive abilities (Holzinger, 1939) and is used as an example by Karl Joreskog (2003) for factor analysis by the MINRES algorithm; it also appears in the LISREL manual as example NPV.KM. This data set represents the scores from the Grant-White middle school for 9 tests: "t01_visperc" "t02_cubes" "t04_lozenges" "t06_paracomp" "t07_sentcomp" "t09_wordmean" "t10_addition" "t12_countdot" and "t13_sccaps", and appears as variables x1 ... x9 (for the Grant-White school) in the lavaan package.

Another classic data set is the 9 variable Thurstone problem which is discussed in detail by R. P. McDonald (1985, 1999) and is used as an example in the sem package as well as in the PROC CALIS manual for SAS. These nine tests were grouped by Thurstone and Thurstone, 1941 (based on other data) into three factors: Verbal Comprehension, Word Fluency, and Reasoning. The original data came from Thurstone and Thurstone (1941) but were reanalyzed by Bechtoldt (1961) who broke the data set into two. McDonald, in turn, selected these nine variables from the larger set of 17 found in Bechtoldt.2. The sample size is 213.

Another set of 9 cognitive variables attributed to Thurstone (1933) is the data set of 4,175 male students reported by Carl Brigham of Princeton to the College Entrance Examination Board (Brigham, 1932, Table VIII, p 352). This set does not show a clear bifactor solution but is included as a demonstration of the differences between a maximum likelihood factor analysis solution and a principal axis factor solution. On page 352, we are given reliability estimates of .86, .73, .83, .97, .94, .85, .92, .92, and .90. These are based upon sampling 743 boys.

A parallel set (also from Brigham) is the correlation matrix of the same 9 variables for 2899 girls. (Thurstone.33G)

Tucker (1958) uses 9 variables from Thurstone and Thurstone (1941) for his example of interbattery factor analysis.

More recent applications of the bifactor model are to the measurement of psychological status. The Reise data set is a correlation matrix based upon >35,000 responses to the Consumer Assessment of Health Care Providers and Systems survey instrument. Reise, Morizot, and Hays (2007) describe a bifactor solution based upon 1,000 cases.

The five factors from Reise et al. reflect Getting care quickly (1-3), Doctor communicates well (4-7), Courteous and helpful staff (8,9), Getting needed care (10-13), and Health plan customer service (14-16).

The two Bechtoldt data sets are two samples from Thurstone and Thurstone (1941). They include 17 variables, 9 of which were used by McDonald to form the Thurstone data set. The sample sizes are 212 and 213 respectively. The six proposed factors reflect memory, verbal, words, space, number and reasoning, with three markers for all except the rote memory factor. 9 variables from this set appear in the Thurstone data set.

Two more data sets with similar structures are found in the Harman data set. This includes another 9 variables (with 696 subjects) from Holzinger used by Harman (Harman.Holzinger) as well as 8 affective variables from burt.

Another data set that is worth examining for tests of bifactor structure is the holzinger.swineford data set which includes the original data from Holzinger and Swineford (1939) supplied by Keith Widaman. This is in the psychTools package.

  • Bechtoldt.1: 17 x 17 correlation matrix of ability tests, N = 212.

  • Bechtoldt.2: 17 x 17 correlation matrix of ability tests, N = 213.

  • Holzinger: 14 x 14 correlation matrix of ability tests, N = 355

  • Holzinger.9: 9 x 9 correlation matrix of ability tests, N = 145

  • Reise: 16 x 16 correlation matrix of health satisfaction items. N = 35,000

  • Thurstone: 9 x 9 correlation matrix of ability tests, N = 213

  • Thurstone.33: Another 9 x 9 correlation matrix of ability tests, N=4175

  • Thurstone.9: And yet another 9 x 9 correlation matrix of ability tests, N = 710

Note

Note that these are tests, not items. Thus, it was possible to find the reliabilities of each test.

For the Holzinger 14 tests these were found from 1 - t^2 where t = c(.332, .517, .360, .382, .354, .249, .444, .393, .455, .424, .393, .487, .534, .382) (page 53) and thus the reliabilities were 0.890, 0.733, 0.870, 0.854, 0.875, 0.938, 0.803, 0.846, 0.793, 0.820, 0.846, 0.763, 0.715, 0.854.

For the Holzinger.9 tests, the reliabilities for the Grant-White tests were: .76, .57, .94, .65, .75, .87, .95, .84 and .89 (Keith Widaman, personal communication, 2020).

Source

Holzinger: Holzinger and Swineford (1937)
Reise: Steve Reise (personal communication)
sem help page (for Thurstone); Brigham (for Thurstone.33)

References

Bechtoldt, Harold, (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26, 405-432.

Brigham, Carl C. (1932) A study of errors. College Entrance Examination Board.

Holzinger, Karl and Swineford, Frances (1937) The Bi-factor method. Psychometrika, 2, 41-54

Holzinger, K., & Swineford, F. (1939). A study in factor analysis: The stability of a bifactor solution. Supplementary Educational Monograph, no. 48. Chicago: University of Chicago Press.

McDonald, Roderick P. (1999) Test theory: A unified treatment. L. Erlbaum Associates. Mahwah, N.J.

Mansolf, Maxwell and Reise, Steven P. (2016) Exploratory Bifactor Analysis: The Schmid-Leiman Orthogonalization and Jennrich-Bentler Analytic Rotations, Multivariate Behavioral Research, 51:5, 698-717, DOI: 10.1080/00273171.2016.1215898

Reise, Steven and Morizot, Julien and Hays, Ron (2007) The role of the bifactor model in resolving dimensionality issues in health outcomes measures. Quality of Life Research. 16, 19-31.

Thurstone, Louis Leon (1933) The theory of multiple factors. Edwards Brothers, Inc. Ann Arbor.

Thurstone, Louis Leon and Thurstone, Thelma (Gwinn). (1941) Factorial studies of intelligence. The University of Chicago Press. Chicago, Il.

Tucker, Ledyard (1958) An inter-battery method of factor analysis, Psychometrika, 23, 111-136.

Examples

if(!require(GPArotation)) {message("I am sorry, to run omega requires GPArotation") 
        } else {
#holz <- omega(Holzinger,4, title = "14 ability tests from Holzinger-Swineford")
#bf <- omega(Reise,5,title="16 health items from Reise") 
#omega(Reise,5,labels=colnames(Reise),title="16 health items from Reise")
thur.om <- omega(Thurstone,title="9 variables from Thurstone") #compare with
thur.bf   <- fa(Thurstone,3,rotate="biquartimin")
factor.congruence(thur.om,thur.bf)
}

A bootstrap aggregation function for choosing most predictive unit weighted items

Description

bestScales forms scales from the items/scales most correlated with a particular criterion and then cross validates on a hold out sample using unit weighted scales. This may be repeated n.iter times using either basic bootstrap aggregation (bagging) techniques or K-fold cross validation. Thus, the technique is known as BISCUIT (Best Items Scales that are Cross validated, Unit weighted, Informative, and Transparent). Given a dictionary of item content, bestScales will sort by criteria correlations and display the item content. Options for bagging (bootstrap aggregation) are included. An alternative to unit weighting is to weight items by their zero order correlations (cross validated) with the criteria. This weighted version is called BISCWIT and is an optional output.

Usage

bestScales(x,criteria,min.item=NULL,max.item=NULL, delta = 0,
           cut=.1, n.item =10, wtd.cut=0, wtd.n=10, 
          n.iter =1, folds=1, p.keyed=.9,
          overlap=FALSE, dictionary=NULL, check=TRUE, impute="none",log.p=FALSE,digits=2)

bestItems(x,criteria=1,cut=.1, n.item=10, abs=TRUE, 
   dictionary=NULL,check=FALSE,digits=2,use="pairwise",method="pearson")

Arguments

x

A data matrix or data frame depending upon the function.

criteria

Which variables (by name or location) should be the empirical target for bestScales and bestItems. May be a separate object.

min.item

Find unit weighted and correlation weighted scales from min.item to max.item

max.item

These are all summarized in the final.multi.valid object

delta

Return items where the predicted r + delta * se of r < max value

cut

Return all values in abs(x[,c1]) > cut.

wtd.cut

When finding the weighted scales, use all items with zero order correlations > wtd.cut

wtd.n

When finding the weighted scales, use the wtd.n items that are > than wtd.cut

abs

if TRUE, sort by absolute value in bestItems

dictionary

a data.frame with rownames corresponding to rownames in the f$loadings matrix or colnames of the data matrix or correlation matrix, and entries (may be multiple columns) of item content.

check

if TRUE, delete items with no variance

n.item

How many items make up an empirical scale, or (for bestItems) how many of the best items to show.

overlap

Are the correlations with other criteria fair game for bestScales

impute

When finding the best scales, and thus the correlations with the criteria, how should we handle missing data? The default is to drop missing items. (That is to say, to use pairwise complete correlations.)

n.iter

How many times to perform a bootstrap estimate. Replicate the best item function n.iter times, sampling roughly 1-1/e of the cases each time, and validating on the remaining 1/e of the cases for each iteration.

folds

If folds > 1, this is k-fold cross validation. Note: set n.iter > 1 to do bootstrap aggregation, or set folds > 1 to do k-fold cross validation.

p.keyed

The proportion of replications needed to include items in the final best keys.

log.p

Select items based upon the log of the probability of the correlations. This will only have an effect if the number of pairwise cases differs drastically from pair to pair.

digits

round to digits when showing output.

use

How to handle missing data. Defaults to "pairwise"

method

Which correlation to find. Defaults to "pearson"

Details

There are a number of procedures that can be used for predicting criteria from a set of predictors. The generic term for this is "machine learning" or "statistical learning". The basic logic of these procedures is to find a set of items that best predict a criterion according to some fit statistic and then cross validate these items numerous times. "lasso" regression (least absolute shrinkage and selection) is one such example. bestScales differs from these procedures by unit weighting items chosen from their zero order correlations with the criteria rather than weighting the partial correlations a la regression. This is an admittedly simple procedure that takes into account the well known finding (Wilks, 1938; Wainer, 1976; Dawes, 1979; Waller, 2008) that although regression weights are optimal for any particular data set, unit weights are almost as good (fungible) and more robust across sample variation.

Following some suggestions, we have added the ability to find scales where items are weighted by their zero order correlations with the criteria. This is effectively a compromise between unit weighting and regression weights (where the weights are the zero order correlations times the inverse of the correlation matrix). This weighted version may be thought of as BISCWIT in contrast to the unit weighted version BISCUIT.

To be comparable to other ML algorithms, we now consider multiple solutions (for numbers of items from min.item to max.item). The final scale consists of the number of items which maximizes the validity, or at least is within delta * standard error of r of the maximum.

Thus, bestScales will find up to n.item items per criterion that have absolute correlations with that criterion greater than cut. If the overlap option is FALSE (the default) the other criteria are not used. This is an example of "dust bowl empiricism" in that there is no latent construct being measured, just those items that most correlate with a set of criteria. The empirically identified items are then formed into scales (ignoring concepts of internal consistency) which are then correlated with the criteria.

Clearly, bestScales is capitalizing on chance associations. Thus, we should validate the empirical scales by deriving them on a fraction of the total number of subjects, and cross validating on the remaining subjects. (This is known both as K-fold cross validation and bagging. Both may be done). If folds > 1, then a k-fold cross validation is done. This removes 1/k (a fold) from the sample for the derivation sample and validates on that remaining fold. This is done k-folds times. Traditional cross validation would thus be a k-fold with k =2. More modern applications seem to prefer k =10 to have 90% derivation sample and a 10% cross validation sample.

The alternative, known as 'bagging' is to do a bootstrap sample (which because it is sampling with replacement will typically extract 1- 1/e = 63.2% of the sample) for the derivation sample (the bag) and then validate it on the remaining 1/e of the sample (the out of bag). This is done n.iter times. This should be repeated multiple times (n.iter > 1, e.g. n.iter=1000) to get stable cross validations.

One can compare the validity of these two approaches by trying each. The average predictability of the n.iter samples are shown as are the average validity of the cross validations. This can only be done if x is a data matrix/ data.frame, not a correlation matrix. For very large data sets (e.g., those from SAPA) these scales seem very stable.
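A compact, hedged sketch of these two resampling options (both calls can be slow; compare with the Examples section):

bagged <- bestScales(psychTools::bfi, criteria = cs(gender, age, education),
                     n.iter = 10)    #bootstrap aggregation (bagging)
kfolds <- bestScales(psychTools::bfi, criteria = cs(gender, age, education),
                     folds = 10)     #10-fold cross validation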

bestScales is effectively a straightforward application of 'bagging' (bootstrap aggregation) and machine learning as well as k-fold validation.

The criteria can be the colnames of elements of x, or can be a separate data.frame.

bestItems and lookup are simple helper functions to summarize correlation matrices or factor loading matrices. bestItems will sort the specified column (criteria) of x on the basis of the (absolute) value of the column. The return as a default is just the rownames of the variables with absolute values > cut. If there is a dictionary of item content and item names, then include the contents as a two column matrix with rownames corresponding to the item names and then as many fields as desired for item content. (See the example dictionary bfi.dictionary.)
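A minimal sketch of bestItems with a dictionary (assuming psychTools is installed; the choice of criterion and cut value is arbitrary):

bestItems(psychTools::bfi, criteria = "A1", cut = .2,
          dictionary = psychTools::bfi.dictionary[1:2])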

The derived model can be further validated against yet another hold out sample using the predict.psych function if given the best scale object and the new data set.

Value

bestScales returns the correlation of the empirically constructed scale with each criteria and the items used in the scale. If a dictionary is specified, it also returns a list (value) that shows the item content. Also returns the keys.list so that scales can be found using cluster.cor or scoreItems. If using replications (bagging or kfold) then it also returns the best.keys, a list suitable for scoring.

There are actually four keys lists reported.

best.keys are all the items used to form unit weighted scales with the restriction of n.item.

weights may be used in the scoreWtd function to find scales based upon the raw correlation weights.

If the min.item and max.item options are used, then two more sets of weights are provided.

optimal.keys are a subset of the best.keys, taking just those items that increase the cross validation values up to the delta * se of the maximum. This is a slightly more parsimonious set.

optimal.weights is analogous to optimal.keys, but supplies weights for just those items that are used to predict cross validation values up to delta * se of the maximum.

The best.keys object is a list of items (with keying information) that may be used in subsequent analyses. These “best.keys" are formed into scale scores for the “final.valid" object which reports how well the best.keys work on the entire sample. This is, of course, not cross validated. Further cross validation can be done using the predict.psych function.

scores

Are the unit weighted scores from the original items

best.keys

A key list of those items that were used in the unit weighting.

wtd.scores

Are the zero-order correlation based scores.

weights

the scoring weights used

final.multi.valid

An object with the unit weighted and correlation weighted correlations from low.step to high.step

The print and summary output list a number of summary statistics for each criteria. This is given for the default case (number of items fixed) and then if requested, the optimal values chosen from min.item to max.item:

The default statistics:

derivation mean

Mean correlation of fixed length scale with the criteria, derivation sample

derivation.sd

The standard deviation of these estimates

validation.m

The mean cross validated correlations with the criteria

validation.sd

The standard deviations of these estimates

final.valid

The correlation of the pooled models with all the subjects

final.wtd

The correlation of the pooled weighted model with all subjects

N.wtd

Number of items used in the final weighted model

The optimal number of items statistics:

n

The mean number of items meeting the criteria

unit

The mean derivation predictive validity

n.wtd

the mean number of items used in the wtd scales

wtd

The mean derivation weighted correlation

valid.n

the mean number of items in the cross validation sample

valid.unit

The mean cross validated unit weighted correlations

valid.wtd.n

The mean number of items used in the cross validated correlation weighted scale

valid.wtd

The mean cross validated weighted correlation with criteria

n.final

The optimal number of items on the final cross validation sample

n.wtd.final

The optimal number of weighted items on the final cross validation.

bestItems returns a sorted list of factor loadings or correlations with the labels as provided in the dictionary. If given raw data, then the correlations with the criteria variables are found first according to "use" and "method".

The stats object can be used to create error.dots plots to show the mean estimate and the standard error of the estimates. See the examples.

The resulting best scales can be cross validated on a different hold out sample using crossValidation. See the last example.

Note

Although bestScales was designed to form the best unit weighted scales, for large data sets, there seems to be some additional information in weighting by the average zero-order correlation.

To create a dictionary, create an object with row names as the item numbers, and the columns as the item content. See the bfi.dictionary as an example.

This is a very computationally intensive function which can be sped up considerably by using multiple cores via the parallel package. The number of cores to use when doing bestScales may be specified using the options command. The greatest step in speed is going from 1 core to 2, which is roughly a 50% reduction in run time.
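For example (the number of cores is, of course, machine dependent):

options(mc.cores = 4)   #ask for 4 cores before calling bestScales
#bs <- bestScales(psychTools::bfi, criteria = cs(gender, age, education), n.iter = 100)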

Note

Although empirical scale construction is appealing, it has the basic problem of capitalizing on chance. Thus, be careful of over interpreting the results unless working with large samples. Iteration and bootstrapping aggregation (bagging) gives information on the stability of the solutions.

It is also important to realize that the variables that are most related to the criterion might be because of other, trivial reasons (e.g. height predicts gender)

Author(s)

William Revelle

References

Dawes, R.M. (1979) The robust beauty of improper linear models in decision making, American Psychologist, 34, 571-582.

Elleman, L. G., McDougald, S. K., Condon, D. M., & Revelle, W. 2020 (in press). That takes the BISCUIT: A comparative study of predictive accuracy and parsimony of four statistical learning techniques in personality data, with data missingness conditions. European Journal of Psychological Assessment. (Preprint available at https://psyarxiv.com/tuqap/)

Revelle, W. (in preparation) An introduction to psychometric theory with applications in R. Springer. (Available online at https://personality-project.org/r/book/).

Wainer, H. (1976) Estimating coefficients in linear models: It don't make no nevermind. Psychological Bulletin, 83, 213-217.

Waller, N.G. (2008), Fungible weights in multiple regression. Psychometrika, 73, 691-703.

Wilks, S. S. (1938), Weighting systems for linear functions of correlated variables when there is no dependent variable. Psychometrika. 3. 23-40.

See Also

fa, iclust,principal, error.dots

Examples

#This is an example of 'bagging' (bootstrap aggregation)
#not run in order to pass the timing tests for Debian at CRAN
#bestboot <- bestScales(psychTools::bfi,criteria=cs(gender,age,education), 
# n.iter=10,dictionary=psychTools::bfi.dictionary[1:3])
#bestboot
#compare with 10 fold cross validation 
#don't test for purposes of speed in installation to pass the debian CRAN test
tenfold <- bestScales(psychTools::bfi,criteria=cs(gender,age,education),folds=10,
           dictionary= psychTools::bfi.dictionary[1:3])
tenfold

#Then, to display the results graphically
#Note that we scale the two graphs with the same x.lim values

 error.dots(tenfold,eyes=TRUE,pch=16,xlim=c(0,.4))
 # error.dots(bestboot,add=TRUE,xlim=c(0,.4))
#do this again, but this time display the scale fits from 1 to 15 Items
# tenfold <- bestScales(psychTools::bfi,criteria=cs(gender,age,education),folds=10,
# dictionary= psychTools::bfi.dictionary[1:3],min.item=1,max.item=15)
# matplot(tenfold$multi.validities$wtd.deriv,typ="b",
# xlab="Number of Items",main="Fit as a function of number of items")   #the wtd weights
# matpoints(tenfold$multi.validities$unit.deriv,typ="b",lwd=2) #the unit weights 

#an example of using crossValidation with bestScales
set.seed(42)
ss <- sample(1:2800,1400)
model <- bestScales(bfi[ss,],criteria=cs(gender,education,age))
original.fit <- crossValidation(model,bfi[ss,]) #the derivation set
cross.fit <- crossValidation(model,bfi[-ss,])  #the cross validation set
summary(original.fit)
summary(cross.fit)

25 Personality items representing 5 factors

Description

25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analysis. Three additional demographic variables (sex, education, and age) are also included. This data set is deprecated and users are encouraged to use bfi.

Usage

data(bfi)

Format

A data frame with 2800 observations on the following 28 variables. (The q numbers are the SAPA item numbers).

A1

Am indifferent to the feelings of others. (q_146)

A2

Inquire about others' well-being. (q_1162)

A3

Know how to comfort others. (q_1206)

A4

Love children. (q_1364)

A5

Make people feel at ease. (q_1419)

C1

Am exacting in my work. (q_124)

C2

Continue until everything is perfect. (q_530)

C3

Do things according to a plan. (q_619)

C4

Do things in a half-way manner. (q_626)

C5

Waste my time. (q_1949)

E1

Don't talk a lot. (q_712)

E2

Find it difficult to approach others. (q_901)

E3

Know how to captivate people. (q_1205)

E4

Make friends easily. (q_1410)

E5

Take charge. (q_1768)

N1

Get angry easily. (q_952)

N2

Get irritated easily. (q_974)

N3

Have frequent mood swings. (q_1099)

N4

Often feel blue. (q_1479)

N5

Panic easily. (q_1505)

O1

Am full of ideas. (q_128)

O2

Avoid difficult reading material. (q_316)

O3

Carry the conversation to a higher level. (q_492)

O4

Spend time reflecting on things. (q_1738)

O5

Will not probe deeply into a subject. (q_1964)

gender

Males = 1, Females =2

education

1 = HS, 2 = finished HS, 3 = some college, 4 = college graduate, 5 = graduate degree

age

age in years

Details

This data set is deprecated and users are encouraged to use the bfi data set in the psychTools package. It is kept here for backward compatibility for one more release.

The first 25 items are organized by five putative factors: Agreeableness, Conscientiousness, Extraversion, Neuroticism, and Openness. The scoring key is created using make.keys, and the scores are found using score.items.

These five factors are a useful example of using irt.fa to do Item Response Theory based latent factor analysis of the polychoric correlation matrix. The endorsement plots for each item, as well as the item information functions reveal that the items differ in their quality.
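A brief, hedged sketch of that analysis for the five Neuroticism items (this can take a little while to run):

irt.n <- irt.fa(psychTools::bfi[16:20])   #IRT via factor analysis of the polychoric correlations
irt.n                                     #show the item and test information
#plot(irt.n, type = "ICC")                #item characteristic curves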

The item data were collected using a 6 point response scale: 1 Very Inaccurate 2 Moderately Inaccurate 3 Slightly Inaccurate 4 Slightly Accurate 5 Moderately Accurate 6 Very Accurate

as part of the Synthetic Aperture Personality Assessment (SAPA https://www.sapa-project.org/) project. To see an example of the data collection technique, visit https://www.sapa-project.org/ or the International Cognitive Ability Resource at https://icar-project.org/. The items given were sampled from the International Personality Item Pool of Lewis Goldberg using the sampling technique of SAPA. This is a sample data set taken from the much larger SAPA data bank.

Note

This data set is deprecated and users are encouraged to use the bfi data set in the psychTools package. It is kept here for backward compatibility for one more release.

The bfi data set and items should not be confused with the BFI (Big Five Inventory) of Oliver John and colleagues (John, O. P., Donahue, E. M., & Kentle, R. L. (1991). The Big Five Inventory–Versions 4a and 54. Berkeley, CA: University of California,Berkeley, Institute of Personality and Social Research.)

Source

The items are from the ipip (Goldberg, 1999). The data are from the SAPA project (Revelle, Wilt and Rosenthal, 2010), collected Spring, 2010 (https://www.sapa-project.org/).

References

Goldberg, L.R. (1999) A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models. In Mervielde, I. and Deary, I. and De Fruyt, F. and Ostendorf, F. (eds) Personality psychology in Europe. 7. Tilburg University Press. Tilburg, The Netherlands.

Revelle, W., Wilt, J., and Rosenthal, A. (2010) Individual Differences in Cognition: New Methods for examining the Personality-Cognition Link In Gruszka, A. and Matthews, G. and Szymura, B. (Eds.) Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, Springer.

Revelle, W., Condon, D.M., Wilt, J., French, J.A., Brown, A., and Elleman, L.G. (2016) Web and phone based data collection using planned missing designs. In Fielding, N.G., Lee, R.M. and Blank, G. (Eds). SAGE Handbook of Online Research Methods (2nd Ed), Sage Publications.

See Also

bi.bars to show the data by age and gender, irt.fa for item factor analysis applying the irt model.

Examples

data(bfi)
psych::describe(bfi)
# create the bfi.keys (actually already saved in the data file)
keys <-
  list(agree=c("-A1","A2","A3","A4","A5"),conscientious=c("C1","C2","C3","-C4","-C5"),
extraversion=c("-E1","-E2","E3","E4","E5"),neuroticism=c("N1","N2","N3","N4","N5"),
openness = c("O1","-O2","O3","O4","-O5")) 

 scores <- psych::scoreItems(keys,bfi,min=1,max=6) #specify the minimum and maximum values
 scores
 #show the use of the fa.lookup with a dictionary
 #psych::keys.lookup(bfi.keys,bfi.dictionary[,1:4])   #deprecated  -- use psychTools

Draw pairs of bargraphs based on two groups

Description

When showing e.g., age or education distributions for two groups, it is convenient to plot them back to back. bi.bars will do so.

Usage

bi.bars(x,var=NULL,grp=NULL,horiz,color,label=NULL,zero=FALSE,xlab,ylab,...)

Arguments

x

The data frame or matrix from which we specify the data

var

The variable to plot

grp

a grouping variable.

horiz

horizontal (default) or vertical bars

color

colors for the two groups – defaults to blue and red

label

If specified, labels for the dependent axis

zero

If TRUE, subtract the minimum value to make the numbers range from 0 to max -min. This is useful if showing heights

xlab

x axis label if appropriate

ylab

y axis label otherwise

...

Further parameters to pass to the graphing program

Details

A trivial, if useful, function to draw back to back histograms/barplots, one for each group.

Value

a graphic

Author(s)

William Revelle

See Also

describe, describeBy, and statsBy for descriptive statistics; error.bars, error.bars.by, densityBy, and violinBy for graphic displays

Examples

#data(bfi)
bi.bars(psychTools::bfi,"age","gender" ,ylab="Age",main="Age by males and females")
 bi.bars(psychTools::bfi,"education","gender",xlab="Education",
     main="Education by gender",horiz=FALSE)

Find large correlation matrices by stitching together smaller ones found more rapidly

Description

When analyzing many subjects (i.e., 100,000 or more) with many variables (i.e., 1,000 or more), core R can take a long time and sometimes exceed memory limits (e.g., with 600K subjects and 6K variables). bigCor runs (in parallel if multiple cores are available) by breaking the variables into subsets (of size=size), finding all subset correlations, and then stitching the resulting matrices into one large matrix. This gives noticeable improvements in speed compared to cor.

Usage

bigCor(x, size = NULL, use = "pairwise",cor="pearson",correct=.5)

Arguments

x

A data set of numeric variables

size

What should the size of the subsets be? Defaults to NCOL(x)/20

use

The standard correlation option. "pairwise" allows for missing data

cor

Defaults to Pearson correlations; alternatives are polychoric and spearman

correct

Correction for continuity for polychoric correlations. (see polychoric)

Details

The data are divided into subsets of size=size. Correlations are then found for each subset and pairs of subsets.

Time is roughly linear with the number of cases and increases by the square of the number of variables. The benefit of more cores is noticeable. It seems as if with 4 cores, we should use sizes to split it into 8 or 12 sets. Otherwise we don't actually use all cores efficiently.

There is some overhead in using multiple cores. So for smaller problems (e.g., the 4,000 cases of the 145 items of the psychTools::spi data set), the timings are roughly .14 seconds for bigCor (default size) and .10 for normal cor. For small problems, this actually gets worse as we use more cores. The cross over point seems to be at roughly 5K subjects. (These timings were updated to reflect the M1 Max chip, an increase of 4x in speed; they had been .44 and .36.)

The basic loop loops over the subsets. When the number of subsets (the number of variables divided by size) is an integer and a multiple of the number of cores, the cores are used more efficiently. Notice the benefit of 660/80 versus 660/100, but this breaks down if we try 660/165. Further notice the benefit when using a smaller subset size (55), which led to the 4 cores being used more.
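A small, hedged sketch of the tuning logic (the bfi data set is far too small to show a real speed benefit; the call merely illustrates the parameters):

options(mc.cores = 4)                    #use 4 cores, if available
R <- bigCor(psychTools::bfi, size = 7)   #28 variables broken into 4 subsets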

The following timings are included to help users tinker with parameters:

Timings (in seconds) for various problems with 645K subjects on an 8 core MacBook Pro with a 2.4 GHz Intel Core i9.

options(mc.cores=4)   (Because we have 8 cores, we can do other work at the same time as we run these tests.)

First test it with 644,495 subjects and 1/10 of the number of possible variables. Then test it for somewhat fewer variables.

Variables size 2 cores 4 cores compared to normal cor function
660 100 430 434 430
660 80 600 348 notice the improvement with 8ths
660 165 666 (Stitching seems to have been very slow)
660 55 303 Even better if we break it into 12ths!
500 100 332 322 secs
480 120 408 365 315 Better to change the size
480 60 358 206 This leads to 8 splits

We also test it with fewer subjects. Time is roughly linear with number of subjects.

Further comparisons with fewer subjects (100K):

Variables  size  2 cores  4 cores  normal cor
480        60    57       31       47     (note the effect of the number of subjects!)
200        50    19.9     13.6     27.13
100        25    4.6      3.5      5.85

One last comparison, 10,000 subjects, showing the effect of getting the proper size value. You can tune on these smaller sets of subjects before trying large problems.

Variables size 2 cores 4 cores compared to normal cor function
480 120 5.2 5.1 4.51
480 60 2.9 2.88 4.51
480 30 2.65 2.691
480 20 2.73 2.77
480 10 2.82 2.97 too many splits?
200 50 2.18 1.39 2.47 for normal cor (1.44 with 8 cores 2.99 with 1 core)
200 25 1.2 1.17 2.47 for normal cor
(1.16 with 8 cores, 1.17 with 1 core)
100 25 .64 .52 .56

Timings updated in 2/23 using a MacBook Pro with an M1 Max chip (10,000 subjects, 953 variables) suggest that a very small size (e.g., 20) is probably optimal.

Variables  size  2 cores  4 cores  8 cores  normal cor
953        20    7.92     4.55     2.88     11.04
953        30    7.98     4.88     3.15     11.04
953        40    8.22     5.14     3.63     11.16
953        60    8.51     5.59     3.93     11.16
953        80    8.31     5.59     4.14     11.16
953        120   8.33     6.22     4.75     11.16

Value

The correlation matrix

Note

Does not seem to work with data.tables

Author(s)

William Revelle

References

Examples of large data sets with massively missing data are taken from the SAPA project. e.g.,

William Revelle, Elizabeth M. Dworak, and David M. Condon (2021) Exploring the persome: The power of the item in understanding personality structure. Personality and Individual Differences, 169, doi:10.1016/j.paid.2020.109905

David Condon (2018) The SAPA Personality Inventory: an empirically-derived, hierarchically-organized self-report personality assessment model. PsyArXiv /sc4p9/ doi:10.31234/osf.io/sc4p9

See Also

pairwiseCountBig which will do the same, but find the count of observations per cell.

Examples

R <- bigCor(psychTools::bfi,10)
#compare the results with 
r.bfi <- cor(psychTools::bfi,use="pairwise")
all.equal(R,r.bfi)

Draw biplots of factor or component scores by factor or component loadings

Description

Extends the biplot function to the output of fa, fa.poly or principal. Will plot factor scores and factor loadings in the same graph. If the number of factors > 2, then all pairs of factors are plotted. Factor score histograms are plotted on the diagonal. The input is the resulting object from fa, principal, or fa.poly with the scores=TRUE option. Points may be colored according to other criteria.

Usage

## S3 method for class 'psych'
biplot(x, labels=NULL,cex=c(.75,1),main="Biplot from fa",
hist.col="cyan",xlim.s=c(-3,3),ylim.s=c(-3,3),xlim.f=c(-1,1),ylim.f=c(-1,1),
maxpoints=100,adjust=1.2,col,pos, arrow.len = 0.1,pch=16,choose=NULL,
cuts=1,cutl=.0,group=NULL,smoother=FALSE,vars=TRUE,...)

Arguments

x

The output from fa, fa.poly or principal with the scores=TRUE option

labels

if NULL, draw the points with the plot character (pch) specified. To identify the data points, specify labels= 1:n where n is the number of observations, or labels =rownames(data) where data was the data set analyzed by the factor analysis.

cex

A vector of plot sizes of the data labels and of the factor labels

main

A main title for a two factor biplot

hist.col

If plotting more than two factors, the color of the histogram of the factor scores

xlim.s

x limits of the scores. Defaults to plus/minus three sigma

ylim.s

y limits of the scores. Defaults to plus/minus three sigma

xlim.f

x limits of the factor loadings. Defaults to plus/minus 1.0

ylim.f

y limits of the factor loadings. Defaults to plus/minus 1.0

maxpoints

When plotting 3 (or more) dimensions, at what size should we switch from plotting "o" to plotting "."

adjust

an adjustment factor in the histogram

col

a vector of colors for the data points and for the factor loading labels

pos

If plotting labels, what position should they be in? 1=below, 2=left, 3 top, 4 right. If missing, then the assumption is that labels should be printed instead of data points.

arrow.len

the length of the arrow head

pch

The plotting character to use. pch=16 gives reasonable size dots. pch="." gives tiny points. If adding colors, use pch between 21 and 25. (see examples).

choose

Plot just the specified factors

cuts

Do not label cases with abs(factor scores) < cuts. (Actually, the distance of the x and y scores from 0.)

cutl

Do not label variables with communalities in the two space < cutl

group

A vector of a grouping variable for the scores. Show a different color and symbol for each group.

smoother

If TRUE then do a smooth scatter plot (which shows the density rather than the data points). Only useful for large data sets.

vars

If TRUE, draw arrows for the variables, and plot the scores. If FALSE, then draw arrows for the scores and plot the variables.

...

more options for graphics

Details

Uses the generic biplot function to take the output of a factor analysis fa, fa.poly or principal components analysis principal and plot the factor/component scores along with the factor/component loadings.

This is an extension of the generic biplot function to allow more control over plotting points in a two space and also to plot three or more factors (two at time).

This will work for objects produced by fa, fa.poly or principal if they were applied to the original data matrix. If, however, one has a correlation matrix (e.g., based upon the output from tetrachoric or polychoric), and has done either fa or principal on the correlations, then obviously we cannot do a biplot.

However, both of those functions produce a weights matrix, which, in combination with the original data can be used to find the scores by using factor.scores. Since biplot.psych is looking for two elements of the x object: x$loadings and x$scores, you can create the appropriate object to plot, or add it to the factor object See the third and fourth examples.

In order to just plot the loadings, use fa.plot. Or, if we want to show the loadings as vectors, use pch = "".

Author(s)

William Revelle

See Also

fa, fa.poly, principal, fa.plot, pairs.panels

Examples

#the standard example
data(USArrests)
fa2 <- fa(USArrests,2,scores=TRUE)
biplot(fa2,labels=rownames(USArrests))

# plot the 3 factor solution
#data(bfi)
fa3 <- fa(psychTools::bfi[1:200,1:15],3,scores=TRUE)
biplot(fa3)
#just plot factors 1 and 3 from that solution
biplot(fa3,choose=c(1,3))

#
fa2 <- fa(psychTools::bfi[16:25],2)  #factor analysis
fa2$scores <- fa2$scores[1:100,]  #just take the first 100
#now plot with different colors and shapes for males and females
biplot(fa2,pch=c(24,21)[psychTools::bfi[1:100,"gender"]],
group =psychTools::bfi[1:100,"gender"],
   main="Biplot of Openness and Neuroticism by gender")

## An example from the correlation matrix
r <- cor(psychTools::bfi[1:200,1:10], use="pairwise") #find the correlations
f2 <- fa(r,2) 
#biplot(f2)  #this throws an error    (not run)

#f2 does not have scores, but we can find them
f2$scores <- factor.scores(psychTools::bfi[1:200,1:10],f2)

biplot(f2,main="biplot from correlation matrix and factor scores")


#or create a new object with the scores
 #find the correlations for all subjects
r <- cor(psychTools::bfi[1:10], use="pairwise") 
f2 <- fa(r,2) 
x <- list() 
#find the scores for just the first 200 subjects
x$scores <- factor.scores(psychTools::bfi[1:200,1:10],f2)
x$loadings <- f2$loadings
class(x) <- c('psych','fa')
biplot(x,main="biplot from correlation matrix combined with factor scores")

Create a block randomized structure for n independent variables

Description

Random assignment of n subjects with an equal number in each of N conditions may be done by block randomization, where the block size is the number of experimental conditions. The number of Independent Variables and the number of levels in each IV are specified as input. The output is the block randomized design.

Usage

block.random(n, ncond = NULL)

Arguments

n

The number of subjects to randomize. Must be a multiple of the number of experimental conditions

ncond

The number of conditions for each IV. Defaults to 2 levels for one IV. If more than one IV, specify as a vector. If names are provided, they are used, otherwise the IVs are labeled as IV1 ... IVn

Value

blocks

A matrix of subject numbers, block number, and randomized levels for each IV

Note

Prepared for a course on Research Methods in Psychology https://personality-project.org/courses/205/205.syllabus.html

Author(s)

William Revelle

Examples

br <- block.random(n=24,c(2,3))
pairs.panels(br)
br <- block.random(96,c(time=4,drug=3,sex=2))
pairs.panels(br)

Bock and Lieberman (1970) data set of 1000 observations of the LSAT

Description

An example data set used by McDonald (1999) as well as other discussions of Item Response Theory makes use of a data table on 10 items (two sets of 5) from the Law School Admissions Test (LSAT). Included in this data set is the original table as well as the responses for 1000 subjects on the first set (Figure Classification) and second set (Debate).

Usage

data(bock)

Format

A data frame with 32 observations on the following 8 variables.

index

32 response patterns

Q1

Responses to item 1

Q2

Responses to item 2

Q3

Responses to item 3

Q4

Responses to item 4

Q5

Responses to item 5

Ob6

count of observations for the section 6 test

Ob7

count of observations for the section 7 test

Two other data sets are derived from the bock dataset. These are converted using the table2df function.

lsat6

responses to 5 items for 1000 subjects on section 6

lsat7

responses to 5 items for 1000 subjects on section 7

Details

The lsat6 data set is analyzed in the ltm package as well as by McDonald (1999). lsat7 is another 1000 subjects on part 7 of the LSAT. Both sets are described by Bock and Lieberman (1970). Both sets are useful examples of testing out IRT procedures and showing the use of tetrachoric correlations and item factor analysis using the irt.fa function.
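A minimal sketch of such an analysis (irt.fa first finds the tetrachoric correlations and then factors them):

lsat6.irt <- irt.fa(lsat6)   #item factor analysis of the section 6 responses
lsat6.irt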

Source

R. Darrell Bock and M. Lieberman (1970). Fitting a response model for dichotomously scored items. Psychometrika, 35(2):179-197.

References

R.P. McDonald. Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, N.J., 1999.

Examples

data(bock)
responses <- table2df(bock.table[,2:6],count=bock.table[,7],
        labs= paste("lsat6.",1:5,sep=""))
describe(responses)

12 cognitive variables from Cattell (1963)

Description

Rindskopf and Rose (1988) use this data set to demonstrate confirmatory second order factor models. It is a nice example data set to explore hierarchical structure and alternative factor solutions. It contains measures of fluid and crystallized intelligence.

Usage

data("cattell")

Format

A correlation matrix of the following 12 variables from 277 7th and 8th graders

Verbal

A verbal ability test from Thurstone

Verbal2

A verbal ability test from Thurstone

Space1

A Spatial ability test from Thurstone

Space2

A Spatial ability test from Thurstone

Reason1

A reasoning test from Thurstone

Reason2

A reasoning test from Thurstone

Number1

A Numerical ability test from Thurstone

Number2

A Numerical ability test from Thurstone

IPATSer

A "culture fair" series from the IPAT

IPATCLAS

A "culture fair" classification test from the IPAT

IPATMatr

A "culture fair" matrix reasoning test from the IPAT

IPATTop

A "culture fair" topology test from the IPAT

Details

Cattell (1963) reported on 8 cognitive variables from Thurstone and four from the Institute for Personality Assessment Test (IPAT). Rindskopf and Rose (1988) use this data set as an example of second order factor analysis. It is thus a nice set for examining alternative solutions such as bifactor rotation, omega hierarchical, as well as esem and interbattery factor analysis.
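A brief, hedged sketch of such explorations (assuming GPArotation is installed; the choice of 2 and 4 factors is merely illustrative):

f2 <- fa(cattell, 2, n.obs = 277)     #a two factor (oblique) solution
om <- omega(cattell, 4, n.obs = 277)  #omega hierarchical with 4 group factors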

Source

David Rindskopf and Tedd Rose, (1988) Some Theory and Applications of Confirmatory Second- Order Factor Analysis, Multivariate Behavioral Research, 23, 51-67.

References

Cattell, R. B. (1963).Theory of fluid and crystallized intelligence: A critical experiment. Journal of Educational Psychology, 54, 1-22.

David Rindskopf and Tedd Rose, (1988) Some Theory and Applications of Confirmatory Second- Order Factor Analysis, Multivariate Behavioral Research, 23, 51-67.

Examples

data(cattell)
corPlot(cattell,numbers=TRUE,upper=FALSE,diag=FALSE,
             main="12 cognitive variables from Cattell (1963)",xlas=2)

Apply four tests of circumplex versus simple structure

Description

Rotations of factor analysis and principal components analysis solutions typically try to represent correlation matrices as simple structured. An alternative structure, appealing to some, is a circumplex structure where the variables are uniformly spaced on the perimeter of a circle in a two dimensional space. Generating these data is straightforward, and is useful for exploring alternative solutions to affect and personality structure.

Usage

circ.tests(loads, loading = TRUE, sorting = TRUE)

Arguments

loads

A matrix of loadings

loading

Are these loadings or a correlation matrix?

sorting

Should the variables be sorted?

Details

"A common model for representing psychological data is simple structure (Thurstone, 1947). According to one common interpretation, data are simple structured when items or scales have non-zero factor loadings on one and only one factor (Revelle & Rocklin, 1979). Despite the commonplace application of simple structure, some psychological models are defined by a lack of simple structure. Circumplexes (Guttman, 1954) are one kind of model in which simple structure is lacking.

"A number of elementary requirements can be teased out of the idea of circumplex structure. First, circumplex structure implies minimally that variables are interrelated; random noise does not a circumplex make. Second, circumplex structure implies that the domain in question is optimally represented by two and only two dimensions. Third, circumplex structure implies that variables do not group or clump along the two axes, as in simple structure, but rather that there are always interstitial variables between any orthogonal pair of axes (Saucier, 1992). In the ideal case, this quality will be reflected in equal spacing of variables along the circumference of the circle (Gurtman, 1994; Wiggins, Steiger, & Gaelick, 1981). Fourth, circumplex structure implies that variables have a constant radius from the center of the circle, which implies that all variables have equal communality on the two circumplex dimensions (Fisher, 1997; Gurtman, 1994). Fifth, circumplex structure implies that all rotations are equally good representations of the domain (Conte & Plutchik, 1981; Larsen & Diener, 1992)." (Acton and Revelle, 2004)

Acton and Revelle reviewed the effectiveness of 10 tests of circumplex structure and found that four did a particularly good job of discriminating circumplex structure from simple structure, or circumplexes from ellipsoidal structures. Unfortunately, their work was done in Pascal and is not easily available. Here we release R code to do the four most useful tests:

1 The Gap test of equal spacing

2 Fisher's test of equality of axes

3 A test of indifference to Rotation

4 A test of equal Variance of squared factor loadings across arbitrary rotations.

To interpret the values of these various tests, it is useful to compare the particular solution to simulated solutions representing pure cases of circumplex and simple structure. See the example output from circ.simulation and compare these plots with the results of circ.tests.

Value

A list of four items is returned. These are the gap, fisher, rotation and variance test results.

gaps

gap.test

fisher

fisher.test

RT

rotation.test

VT

variance.test

Note

Of the 10 criteria discussed in Acton and Revelle (2004), these tests operationalize the four most useful.

Author(s)

William Revelle

References

Acton, G. S. and Revelle, W. (2004) Evaluation of Ten Psychometric Criteria for Circumplex Structure. Methods of Psychological Research Online, Vol. 9, No. 1 https://personality-project.org/revelle/publications/acton.revelle.mpr110_10.pdf

See Also

To understand the results of circ.tests it is best to compare them to simulated values. Thus, see circ.simulation, sim.circ

Examples

circ.data <- circ.sim(24,500)
circ.fa <- fa(circ.data,2)
plot(circ.fa,title="Circumplex Structure")
ct <- circ.tests(circ.fa)
#compare with non-circumplex data
simp.data <- item.sim(24,500)
simp.fa <- fa(simp.data,2)
plot(simp.fa,title="Simple Structure")
st <- circ.tests(simp.fa)
res <- rbind(ct[1:4],st[1:4])
rownames(res) <- c("circumplex","Simple")
print(res,digits=2)

cluster Fit: fit of the cluster model to a correlation matrix

Description

How well does the cluster model found by ICLUST fit the original correlation matrix? A similar algorithm factor.fit is found in VSS. This function is internal to ICLUST but has more general use as well.

In general, the cluster model is a Very Simple Structure model of complexity one. That is, every item is assumed to represent only one factor/cluster. Cluster fit is an analysis of how well this model reproduces a correlation matrix. Two measures of fit are given: cluster fit and factor fit. Cluster fit assumes that variables that define different clusters are orthogonal. Factor fit takes the loadings generated by a cluster model, finds the cluster loadings on all clusters, and measures the degree of fit of this somewhat more complicated model. Because the cluster loadings are similar to, but not identical to factor loadings, the factor fits found here and by factor.fit will be similar.

Usage

cluster.fit(original, load, clusters, diagonal = FALSE)

Arguments

original

The original correlation matrix being fit

load

Cluster loadings – that is, the correlation of individual items with the clusters, corrected for item overlap

clusters

The cluster structure

diagonal

Should we fit the diagonal as well?

Details

The cluster model is similar to the factor model: R is fitted by C'C, where C is the product of the cluster definition matrix and the loading matrix. How well does this model approximate the original correlation matrix, and how does this compare to a factor model?

The fit statistic is a comparison of the original (squared) correlations to the residual correlations. Fit = 1 - sum(r*^2)/sum(r^2), where r* is the residual correlation of data - model and model = C'C.
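A rough, hedged illustration of the idea behind this fit index (not the exact internal computation, which also distinguishes cluster fit from factor fit):

r.mat <- Harman74.cor$cov
ic <- ICLUST(r.mat, nclusters = 2, plot = FALSE)
model <- ic$loadings %*% t(ic$loadings)    #reproduced correlations
resid <- r.mat - model                     #residual correlations
1 - sum(resid[lower.tri(resid)]^2) / sum(r.mat[lower.tri(r.mat)]^2)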

Value

clusterfit

The cluster model is a reduced form of the factor loading matrix. That is, it is the product of the elements of the cluster matrix * the loading matrix.

factorfit

How well does the complete loading matrix reproduce the correlation matrix?

Author(s)

Maintainer: William Revelle [email protected]

References

https://personality-project.org/r/r.ICLUST.html

See Also

VSS, ICLUST, factor2cluster, cluster.cor, factor.fit

Examples

r.mat<- Harman74.cor$cov
 iq.clus <- ICLUST(r.mat,nclusters =2)
 fit <- cluster.fit(r.mat,iq.clus$loadings,iq.clus$clusters)
 fit

Find item by cluster correlations, corrected for overlap and reliability

Description

Given an n x n correlation matrix and an n x c matrix of -1,0,1 cluster weights for those n items on c clusters, find the correlation of each item with each cluster. If the item is part of the cluster, correct for item overlap. Part of the ICLUST set of functions, but useful for many item analysis problems.

Usage

cluster.loadings(keys, r.mat, correct = TRUE,SMC=TRUE)

Arguments

keys

Cluster keys: a matrix of -1,0,1 cluster weights

r.mat

A correlation matrix

correct

Correct for reliability

SMC

Use the squared multiple correlation as a communality estimate, otherwise use the greatest correlation for each variable

Details

Given a set of items to be scored as (perhaps overlapping) clusters and the intercorrelation matrix of the items, find the clusters and then the correlations of each item with each cluster. Correct for item overlap by replacing the item variance with its average within cluster inter-item correlation.

Although part of ICLUST, this may be used in any SAPA (https://www.sapa-project.org/) application where we are interested in item-whole correlations of items and composite scales.

For information about SAPA see Revelle et al, 2010, 2016. For information about SAPA based measures of ability, see https://icar-project.org.

These loadings are particularly interpretable when sorted by absolute magnitude for each cluster (see ICLUST.sort).

Value

loadings

A matrix of item-cluster correlations (loadings)

cor

Correlation matrix of the clusters

corrected

Correlation matrix of the clusters, raw correlations below the diagonal, alpha on diagonal, corrected for reliability above the diagonal

sd

Cluster standard deviations

alpha

alpha reliabilities of the clusters

G6

G6*: a modified estimate of Guttman's Lambda 6

count

Number of items in the cluster

Note

Although part of ICLUST, this may be used in any SAPA application where we are interested in item-whole correlations of items and composite scales.

Author(s)

Maintainer: William Revelle [email protected]

References

ICLUST: https://personality-project.org/r/r.ICLUST.html

Revelle, W., Wilt, J., and Rosenthal, A. (2010) Individual Differences in Cognition: New Methods for examining the Personality-Cognition Link In Gruszka, A. and Matthews, G. and Szymura, B. (Eds.) Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, Springer.

Revelle, W, Condon, D.M., Wilt, J., French, J.A., Brown, A., and Elleman, L.G. (2016) Web and phone based data collection using planned missing designs. In Fielding, N.G., Lee, R.M. and Blank, G. (Eds). SAGE Handbook of Online Research Methods (2nd Ed), Sage Publications.

See Also

ICLUST, factor2cluster, cluster.cor

Examples

r.mat<- Harman74.cor$cov
 clusters <- matrix(c(1,1,1,rep(0,24),1,1,1,1,rep(0,17)),ncol=2)
 cluster.loadings(clusters,r.mat)

Plot factor/cluster loadings and assign items to clusters by their highest loading.

Description

Cluster analysis and factor analysis are procedures for grouping items in terms of a smaller number of (latent) factors or (observed) clusters. Graphical presentations of clusters typically show tree structures, although they can be represented in terms of item by cluster correlations.

Cluster.plot plots items by their cluster loadings (taken, e.g., from ICLUST) or factor loadings (taken, e.g., from fa). Cluster membership may be assigned a priori or may be determined in terms of the highest (absolute) cluster loading for each item.

If the input is an object of class "kmeans", then the cluster centers are plotted.

Usage

cluster.plot(ic.results, cluster = NULL, cut = 0, labels=NULL,
          title = "Cluster plot",pch=18,pos,show.points=TRUE,choose=NULL,...)
fa.plot(ic.results, cluster = NULL, cut = 0, labels=NULL,title, 
    jiggle=FALSE,amount=.02,pch=18,pos,show.points=TRUE,choose=NULL,main=NULL,...)
factor.plot(ic.results, cluster = NULL, cut = 0, labels=NULL,title,jiggle=FALSE,
                  amount=.02,pch=18,pos,show.points=TRUE,...)  #deprecated

Arguments

ic.results

A factor analysis or cluster analysis output including the loadings, or a matrix of item by cluster correlations. Or the output from a kmeans cluster analysis.

cluster

A vector of cluster membership

cut

Assign items to clusters if the absolute loadings are > cut

labels

If row.names exist they will be added to the plot, or, if they don't, labels can be specified. If labels=NULL and there are no row names, then variables are labeled by row number.

title

Any title

jiggle

When plotting with factor loadings that are almost identical, it is sometimes useful to "jiggle" the points by jittering them. The default is to not jiggle.

amount

if jiggle=TRUE, then how much should the points be jittered?

pch

factor and clusters are shown with different pch values, starting at pch+1

pos

Position of the text for labels for two dimensional plots. 1=below, 2 = left, 3 = above, 4= right

show.points

When adding labels to the points, should we show the points as well as the labels? For many points, it is better to show just the labels.

choose

Specify the factor/clusters to plot

main

Any title – redundant with title

...

Further options to plot

Details

Results of either a factor analysis or cluster analysis are plotted. Each item is assigned to its highest loading factor, and then identified by variable name as well as cluster (by color). The cluster assignments can be specified to override the automatic clustering by loading. Both of these functions may be called directly or by calling the generic plot function. (see example).

Value

Graphical output is presented.

Author(s)

William Revelle

See Also

ICLUST, ICLUST.graph, fa.graph, plot.psych

Examples

circ.data <- circ.sim(24,500)
circ.fa <- fa(circ.data,2)
plot(circ.fa,cut=.5)
f5 <- fa(psychTools::bfi[1:25],5) 
plot(f5,labels=colnames(psychTools::bfi)[1:25],show.points=FALSE)
plot(f5,labels=colnames(psychTools::bfi)[1:25],show.points=FALSE,choose=c(1,2,4))

Convert a cluster vector (from e.g., kmeans) to a keys matrix suitable for scoring item clusters.

Description

The output of the kmeans clustering function produces a vector of cluster membership. The score.items and cluster.cor functions require a matrix of keys. cluster2keys does this.

May also be used to take the output of an ICLUST analysis and find a keys matrix (by doing a call to the factor2cluster function).

Usage

cluster2keys(c)

Arguments

c

A vector of cluster assignments or an object of class "kmeans" that contains a vector of clusters.

Details

Note that because kmeans will not reverse score items, the clusters defined by kmeans will not necessarily match those of ICLUST with the same number of clusters extracted.
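For a plain vector of cluster assignments, the conversion amounts to building an indicator matrix; a minimal sketch (with made-up assignments) follows.

clusters <- c(1, 1, 2, 2, 3)                              # hypothetical kmeans cluster assignments
keys <- 1 * outer(clusters, sort(unique(clusters)), "==") # one column of 0/1 weights per cluster
colnames(keys) <- paste0("C", sort(unique(clusters)))
keys                                                      # suitable for score.items or cluster.cor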

Value

keys

A matrix of keys suitable for score.items or cluster.cor

Author(s)

William Revelle

See Also

cluster.cor,score.items, factor2cluster, make.keys

Examples

test.data <- Harman74.cor$cov
kc <- kmeans(test.data,4)
keys <- cluster2keys(kc)
keys  #these match those found by ICLUST
cluster.cor(keys,test.data)

Find Cohen d and confidence intervals

Description

Given a data.frame or matrix, find the standardized mean difference (Cohen's d) and confidence intervals for each variable depending upon a grouping variable. Convert the d statistic to the r equivalent, report the student's t statistic and associated p values, and return statistics for both values of the grouping variable. The Mahalanobis distance between the centroids of the two groups in the space defined by all the variables is also found. Confidence intervals for Cohen d for one group (difference from 0) may also be found. Several measures of distributional overlap (e.g. OVL, OVL2, etc.) are available.

Usage

cohen.d(x, group,alpha=.05,std=TRUE,sort=NULL,dictionary=NULL,MD=TRUE,data=NULL)
d.robust(x,group,trim=.2)
cohen.d.ci(d,n=NULL,n2=NULL,n1=NULL,alpha=.05)
d.ci(d,n=NULL,n2=NULL,n1=NULL,alpha=.05)
cohen.d.by(x,group,group2,alpha=.05,MD=TRUE)
d2r(d)
r2d(rho)
d2t(d,n=NULL,n2=NULL,n1=NULL)
t2d(t,n=NULL,n2=NULL,n1=NULL)
m2t(m1,m2,s1,s2,n1=NULL,n2=NULL,n=NULL,pooled=TRUE)  #returns d invisibly
m2d(m1,m2,s1,s2,n1=NULL,n2=NULL,n=NULL,pooled=TRUE)
d2OVL(d)  #Percent overlap for 1 distribution
d2OVL2(d)  #Percent overlap of the joint distribution
d2CL(d)   #Common language effect size
d2U3(d)   #Proportion in higher group exceeding median of lower group
cd.validity(d, keys, abs=TRUE)

Arguments

x

A data frame or matrix (can be specified in formula mode)

group

Some dichotomous grouping variable (may be specified using formula input (see example))

group2

Apply cohen.d for each of the subgroups defined by group2 (may be specified by formula as well)

data

If using formula mode and specifying a particular variable (see example)

d

An effect size

keys

A list of scoring keys (similar to scoreItems)

abs

When finding average cd validities, should we take absolute values (TRUE)

trim

The amount of trimming used in finding the means and sds in d.robust

n

Total sample size (of groups 1 and 2)

n1

Sample size of group 1 (if only one group)

n2

Sample size of group 2

pooled

Pool the two variances

t

Student's t statistic

alpha

1-alpha is the width of the confidence interval

std

Find the correlation rather than the covariance matrix

rho

A correlation to be converted to an effect size

m1

Mean of group 1

m2

Mean of group 2

s1

Standard deviation of group 1

s2

Standard deviation of group 2

sort

Should we sort (and if so, in which direction), the results of cohen.d? Directions are "decreasing" or "increasing". If TRUE, sorts in a decreasing order.

dictionary

What are the items being described?

MD

Find Mahalanobis distance in cohen.d.

Details

There are many ways of reporting how two groups differ. Cohen's d statistic is just the difference of the means expressed in terms of the pooled within group standard deviation. This is insensitive to sample size. r is a universal measure of effect size that is a simple function of d, but is bounded -1 to 1. The t statistic is merely d * sqrt(n)/2 and thus reflects sample size.

d = \frac{M_2 - M_1}{S_p}

where Sp is the pooled standard deviation.

S_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{N}}

Cohen's d uses N as the divisor for the pooled sums of squares. Hedges' g uses N-2.
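As a small numeric sketch of the difference between the two divisors (all values here are made up for illustration):

m1 <- 10; m2 <- 12; s1 <- 4; s2 <- 5; n1 <- 30; n2 <- 40
N <- n1 + n2
Sp.cohen  <- sqrt(((n1 - 1)*s1^2 + (n2 - 1)*s2^2)/N)        # Cohen's divisor (N)
Sp.hedges <- sqrt(((n1 - 1)*s1^2 + (n2 - 1)*s2^2)/(N - 2))  # Hedges' divisor (N - 2)
(m2 - m1)/Sp.cohen    # Cohen's d
(m2 - m1)/Sp.hedges   # Hedges' g (slightly smaller)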

Confidence intervals for Cohen's d are found by converting the d to a t, finding the confidence intervals for t, and then converting those back to ds. This takes advantage of the uniroot function and the non-centrality parameter of the t distribution.

The results of cohen.d may be displayed using the error.dots function. This will include the labels provided in the dictionary.

In the case of finding the confidence interval (using cohen.d.ci) for a comparison against 0 (the one sample case), specify n1. This will yield d = t/sqrt(n1), whereas in the case of the difference between two samples, d = 2*t/sqrt(n) (for equal sample sizes, n = n1 + n2) or d = t*sqrt(1/n1 + 1/n2) for the case of unequal sample sizes.
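A minimal arithmetic sketch of these conversions (the numbers are arbitrary):

tval <- 2.5; n <- 64              # two samples of equal size, n = n1 + n2
d.two <- 2*tval/sqrt(n)           # two-sample d from t
d.two*sqrt(n)/2                   # and back to t
n1 <- 10
tval/sqrt(n1)                     # one-sample d from t (comparison against 0)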

Since we find d and then convert this to t, using d2t, the question is how to pool the variances. Until 7/14/21 I was using the total n to estimate the t and thus the p values. In response to a query (see news), I switched to using the actual sample sizes (n1 and n2) and then finding t based upon the Hedges' g value. This produces t values as reported by t.test with the var.equal = TRUE option.

It is probably useful to comment that the various confidence intervals reported are based upon normal theory and should be interpreted cautiously.

cohen.d.by will find Cohen's d for groups for each subset of the data defined by group2. The summary of the output produces a simplified listing of the d values for each variable for each group. May be called directly from cohen.d by using formula input and specifying two grouping variables.

d.robust follows Algina et al. (2005) to find trimmed means (trim = .2) and Winsorized variances (trim = .2). Supposedly, this provides a more robust estimate of effect sizes.

m2t reports Student's t.test for two groups given their means, standard deviations, and sample size. This is convenient when checking statistics where those estimates are provided, but the raw data are not available. By default, it gives the pooled estimate of variance, but if pooled is FALSE, it applies Welch's correction.

The Mahalanobis Distance combines the individual ds and weights them by their unique contribution: D = \sqrt{d' R^{-1} d}. By default, cohen.d will find the Mahalanobis distance between the two groups (if there is more than one DV). This requires finding the correlation of all of the DVs and can fail if that matrix is not invertible because some pairs do not exist. Thus, setting MD=FALSE will prevent the Mahalanobis calculation.
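A minimal sketch of the D formula, using a made-up vector of ds and a made-up correlation matrix of the dependent variables:

d <- c(.3, .5, .2)                                      # hypothetical ds for three DVs
R <- matrix(c(1, .4, .3,
              .4, 1, .5,
              .3, .5, 1), 3, 3)                         # hypothetical DV correlations
sqrt(t(d) %*% solve(R) %*% d)                           # Mahalanobis D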

Marco Del Giudice (2019) has a very helpful paper discussing how to interpret d and MD in terms of various overlap coefficients. These may be found by the use of d2OVL (percent overlap for one distribution), d2OVL2 (percent overlap of the joint distribution), d2CL (the common language effect size), and d2U3 (proportion in the higher group exceeding the median of the lower group).

OVL = 2\phi(-d/2)

is the proportion of overlap (and gets smaller the larger the d), where \phi is the cumulative distribution function of the normal distribution.

OVL_2 = \frac{OVL}{2 - OVL}

The proportion of individuals in one group above the median of the other group is U3

U_3 = \phi_d

The Common Language Effect size

CL = \phi(d \sqrt{2})

These last two get larger with abs(d). For graphic displays of Cohen's d and Mahalanobis D, see the scatterHist examples, or the example from the psychTools::GERAS data set.
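A quick numeric sketch of the overlap formulas above for d = 1 (the first three follow the formulas directly; the last simply calls the package function):

d <- 1
OVL <- 2*pnorm(-d/2); OVL        # proportion of overlap for one distribution
OVL/(2 - OVL)                    # OVL2, overlap relative to the joint distribution
pnorm(d)                         # U3, proportion in the higher group above the lower group's median
d2CL(d)                          # common language effect size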

Value

d

Cohen's d statistic, including the upper and lower confidence levels

hedges.g

Hedge's g statistic, including the upper and lower confidence levels

M.dist

Mahalanobis distance between the two groups

t

Student's t statistic

r

The point biserial r equivalent of d

n

sample size used for each analysis

p

The probability of abs(t)>0

descriptive

The descriptive statistics for each group. This is useful to show means and then the d values.

OVL

etc.: some of the measures of overlap discussed by Del Giudice (2019)

Note

Cohen and Hedges differ in the way they calculate the pooled within group standard deviation. I find the treatment by McGrath and Meyer to be most helpful in understanding the differences.

Author(s)

William Revelle

References

Cohen, Jacob (1988) Statistical Power Analysis for the Behavioral Sciences. 2nd Edition, Lawrence Erlbaum Associates.

Algina, James and Keselman, H. J. and Penfield, Randall D. (2005) An Alternative to Cohen's Standardized Mean Difference Effect Size: A Robust Parameter and Confidence Interval in the Two Independent Groups Case. Psychological Methods. 10, 317-328.

Goulet-Pelletier, Jean-Christophe and Cousineau, Denis.(2018) A review of effect sizes and their confidence intervals, Part I: The Cohen's d family. The Quantitative Methods for Psychology, 14, 242-265.

Marco Del Giudice (2019) Measuring Sex Differences and Similarities, (in VanderLaan and Wong (ed. ) Gender and sexuality development: Contemporary theory and research.)

McGrath, Robert E and Meyer, Gregory J. (2006) When effect sizes disagree: the case of r and d. Psychological methods, 11, 4, 386-401.

See Also

describeBy, describe, error.dots to display the results. scatterHist to show d and MD for pairs of variables. (See in particular the use of scatterHist on the psychTools::GERAS data set.)

Examples

cohen.d(sat.act,"gender")
#robust version
round(d.robust(sat.act,"gender")$robust.d,2)

#formula input is nicer
cohen.d(sat.act ~ gender) #formula input version
#if we want to report the group means, we grab the data in descriptive
cd <- cohen.d(sat.act ~ gender)
cd.df <- data.frame(d = cd$cohen.d[,"effect"], male = cd$descriptive$mean[1,cd$order[-1]],
      female = cd$descriptive$mean[2, cd$order[-1]])
#report cohen.d by another group
cd <- cohen.d.by(sat.act,"gender","education")
cohen.d(SATV + SATQ ~ gender, data=sat.act) #just choose two variables
summary(cd)  #summarize the output

#formula version combines these functions
cd <- cohen.d(sat.act ~ gender + education)  #find d by gender for each level of education
summary(cd)

#now show several examples of confidence intervals
#one group (d vs 0)
#consider the t from the cushny data set
t2d( -4.0621,n1=10)
d.ci(-1.284549,n1=10)  #the confidence interval of the effect of drug on sleep
#two groups
d.ci(.62,n=64)  #equal group size
d.ci(.62,n1=35,n2=29) #unequal group size
#several examples of d and t from data
m2d(52.58,-70.65,49.9,47.5) #Terman and Miles 1936

#graphically show the various overlap statistics
curve(d2OVL2(x),0,3,xlab="d",ylab="",lty="dashed",
       main="Four representations of effect size (d) ")
curve(d2OVL(x),0,3,xlab="d",add=TRUE)
curve(d2CL(x),0,3,add=TRUE)
curve(d2U3(x), add=TRUE,lty="dotted")
text(1,.37,"OVL2")
text(2,.37,"OVL")
text(1,.88,"U3")
text(2, .88,"CL")

Find Cohen's kappa and weighted kappa coefficients for correlation of two raters

Description

Cohen's kappa (Cohen, 1960) and weighted kappa (Cohen, 1968) may be used to find the agreement of two raters when using nominal scores. Light's kappa is just the average cohen.kappa if using more than 2 raters.

weighted.kappa is (probability of observed matches - probability of expected matches)/(1 - probability of expected matches). Kappa just considers the matches on the main diagonal. Weighted kappa considers off diagonal elements as well.
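As a check on the definition just given, unweighted kappa can be computed by hand from a p x p table of proportions (here using the proportions from the Cohen example shown in the Examples section below):

tab <- matrix(c(.44, .07, .09,
                .05, .20, .05,
                .01, .03, .06), ncol = 3, byrow = TRUE)
po <- sum(diag(tab))                     # probability of observed matches
pe <- sum(rowSums(tab) * colSums(tab))   # probability of expected matches
(po - pe)/(1 - pe)                       # unweighted kappa, about .49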

Usage

cohen.kappa(x, w=NULL,n.obs=NULL,alpha=.05,levels=NULL,w.exp=2)  
wkappa(x, w = NULL)    #deprecated

Arguments

x

Either a two column data array (or data.frame) with categorical values from 1 to p, or a p x p table. If a data array, a table will be found.

w

A p x p matrix of weights. If not specified, they are set to be 0 on the diagonal and (distance from the diagonal)^2 off the diagonal.

n.obs

Number of observations (if the input is a square matrix).

alpha

Probability level for confidence intervals

levels

Specify the levels if some levels of x or y are completely missing. See Examples

w.exp

Exponent to apply to weights matrix – see examples

Details

When categorical judgments are made with two categories, a measure of relationship is the phi coefficient. However, some categorical judgments are made using more than two outcomes. For example, two diagnosticians might be asked to categorize patients three ways (e.g., Personality disorder, Neurosis, Psychosis) or to categorize the stages of a disease. Just as base rates affect observed cell frequencies in a two by two table, they need to be considered in the n-way table (Cohen, 1960).

Kappa considers the matches on the main diagonal. A penalty function (weight) may be applied to the off diagonal matches. If the weights increase by the square of the distance from the diagonal, weighted kappa is similar to an Intra Class Correlation (ICC).

Derivations of weighted kappa are sometimes expressed in terms of similarities, and sometimes in terms of dissimilarities. In the latter case, the weights on the diagonal are 1 and the weights off the diagonal are less than one. In this case, if the weights are 1 - squared distance from the diagonal / k, then the result is similar to the ICC (for any positive k).

cohen.kappa may use either similarity weighting (diagonal = 0) or dissimilarity weighting (diagonal = 1) in order to match various published examples.
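A small sketch of the two weighting conventions just described, for p = 4 categories (the object names are illustrative):

p <- 4
w.dissim <- outer(1:p, 1:p, "-")^2      # 0 on the diagonal, squared distance off it (the default)
w.sim    <- 1 - w.dissim/max(w.dissim)  # a similarity-style version (one choice of k), 1 on the diagonal
w.dissim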

The input may be a two column data.frame or matrix with columns representing the two judges and rows the subjects being rated. Alternatively, the input may be a square n x n matrix of counts or proportion of matches. If proportions are used, it is necessary to specify the number of observations (n.obs) in order to correctly find the confidence intervals.

The confidence intervals are based upon the variance estimates discussed by Fleiss, Cohen, and Everitt who corrected the formulae of Cohen (1968) and Blashfield.

Some data sets will include data with numeric categories with some category values missing completely. In the sense that kappa is a measure of category relationship, this should not matter. But when finding weighted kappa, the number of categories weighted will be less than the number of categories potentially in the data. This can be remedied by specifying the levels parameter. This is a vector of the levels potentially in the data (even if some are missing). See the examples.

If there are more than 2 raters, then the average of all raters is known as Light's kappa. (Conger, 1980).

Following a request from Julius Pfadt, the weights matrix may be formed by squared distances (the default) or linear distances (w.exp=1)

Value

kappa

Unweighted kappa

weighted.kappa

The default weights are quadratic.

var.kappa

Variance of kappa

var.weighted

Variance of weighted kappa

n.obs

number of observations

weight

The weights used in the estimation of weighted kappa

confid

The alpha/2 confidence intervals for unweighted and weighted kappa

plevel

The alpha level used in determining the confidence limits

Note

As is true of many R functions, there are alternatives in other packages. The Kappa function in the vcd package estimates unweighted and weighted kappa and reports the variance of the estimate. The input is a square matrix. The ckappa and wkappa functions in the psy package take raw data matrices. The kappam.light function from the irr package finds Light's average kappa.

To avoid confusion with Kappa (from vcd) or the kappa function from base, the function was originally named wkappa. With additional features modified from psy::ckappa to allow input with a different number of categories, the function has been renamed cohen.kappa.

Unfortunately, to make it more confusing, the weights described by Cohen are a function of the reciprocals of those discussed by Fleiss and Cohen. The cohen.kappa function uses the appropriate formula for Cohen or Fleiss-Cohen weights.

There are some cases where the large sample size approximation of Fleiss et al. will produce confidence intervals exceeding +/- 1. Clearly, for these cases, the upper (or lower for negative values) bound should be set to 1. Bootstrap resampling shows the problem is that the values are not symmetric. See the last (unrun) example.

It is also possible to have more than 2 raters. In this case, cohen.kappa is reported for all pairs of raters (e.g. R1 and R2, R1 and R3, ... R3 and R4). To see the confidence intervals for these cohen.kappas, use the print command with the all=TRUE option. (See the example of multiple raters.)

Author(s)

William Revelle

References

Banerjee, M., Capozzoli, M., McSweeney, L and Sinha, D. (1999) Beyond Kappa: A review of interrater agreement measures The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 27, 3-23

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20 37-46

Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.

Conger, A. J. (1980) Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88, 322-328.

Fleiss, J. L., Cohen, J. and Everitt, B.S. (1969) Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 72, 323-327.

Light, R. J. (1971) Measures of response agreement for qualitative data: Some generalizations and alternatives. Psychological Bulletin, 76, 365-377.

Zwick, R. (1988) Another look at interrater agreement. Psychological Bulletin, 103, 374 - 378.

Examples

#rating data (with thanks to Tim Bates)
rater1 = c(1,2,3,4,5,6,7,8,9) # rater one's ratings
rater2 = c(1,3,1,6,1,5,5,6,7) # rater two's ratings
cohen.kappa(x=cbind(rater1,rater2))

#data matrix taken from Cohen
cohen <- matrix(c(
0.44, 0.07, 0.09,
0.05, 0.20, 0.05,
0.01, 0.03, 0.06),ncol=3,byrow=TRUE)

#cohen.weights  weight differences
cohen.weights <- matrix(c(
0,1,3,
1,0,6,
3,6,0),ncol=3)


cohen.kappa(cohen,cohen.weights,n.obs=200)
#cohen reports .492 and .348 

#another set of weights
#what if the weights are non-symmetric
wc <- matrix(c(
0,1,4,
1,0,6,
2,2,0),ncol=3,byrow=TRUE)
cohen.kappa(cohen,wc)  
#Cohen reports kw = .353

cohen.kappa(cohen,n.obs=200)  #this uses the squared weights

fleiss.cohen <- 1 - cohen.weights/9
cohen.kappa(cohen,fleiss.cohen,n.obs=200)

#however, Fleiss, Cohen and Everitt weight similarities
fleiss <- matrix(c(
106, 10,4,
22,28, 10,
2, 12,  6),ncol=3,byrow=TRUE)

#Fleiss weights the similarities
weights <- matrix(c(
 1.0000, 0.0000, 0.4444,
 0.0000, 1.0000, 0.6667,
 0.4444, 0.6667, 1.0000),ncol=3)
 
 cohen.kappa(fleiss,weights,n.obs=200)
 
 #another example is comparing the scores of two sets of twins
 #data may be a 2 column matrix
 #compare weighted and unweighted
 #also look at the ICC for this data set.
 twins <- matrix(c(
    1, 2, 
    2, 3,
    3, 4,
    5, 6,
    6, 7), ncol=2,byrow=TRUE)
  cohen.kappa(twins)
  
#data may be explicitly categorical
x <- c("red","yellow","blue","red")
y <- c("red",  "blue", "blue" ,"red") 
xy.df <- data.frame(x,y)
ck <- cohen.kappa(xy.df)
ck
ck$agree

#Example for specifying levels
#The problem of missing categories (from Amy Finnegan)
#We need to specify all the categories possible using the levels option
numbers <- data.frame(rater1=c(6,3,7,8,7),
                      rater2=c(6,1,8,5,10))
cohen.kappa(numbers)  #compare with the next analysis
cohen.kappa(numbers,levels=1:10)  #specify the number of levels 
              #   these leads to slightly higher weighted kappa
  
#finally, input can be a data.frame of ratings from more than two raters
ratings <- matrix(rep(1:5,4),ncol=4)
ratings[1,2] <- ratings[2,3] <- ratings[3,4] <- NA
ratings[2,1] <- ratings[3,2] <- ratings[4,3] <- 1
ck <- cohen.kappa(ratings)
ck  #just show the raw and weighted kappas
print(ck, all=TRUE)  #show the confidence intervals as well

 
 #In the case of confidence intervals being artificially truncated to +/- 1, it is 
 #helpful to compare the results of a boot strap resample
 #ck.boot <-function(x,s=1:nrow(x)) {cohen.kappa(x[s,])$kappa}
 #library(boot)
 #ckb <- boot(x,ck.boot,R=1000)
 #hist(ckb$t)

Convert base rates of two diagnoses and their comorbidity into phi, Yule, and tetrachorics

Description

In medicine and clinical psychology, diagnoses tend to be categorical (someone is depressed or not, someone has an anxiety disorder or not). Cooccurrence of both of these symptoms is called comorbidity. Diagnostic categories vary in their degree of comorbidity with other diagnostic categories. From the point of view of correlation, comorbidity is just a name applied to one cell in a four fold table. It is thus possible to analyze comorbidity rates by considering the probability of the separate diagnoses and the probability of the joint diagnosis. This gives the two by two table needed for a phi, Yule, or tetrachoric correlation.
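A minimal sketch of the arithmetic: the three proportions imply a fourfold table whose cells can then be handed to phi (or tetrachoric). The exact layout used internally is an implementation detail; the values here are those used in the Examples below.

d1 <- .2; d2 <- .15; com <- .1
twobytwo <- matrix(c(com,      d1 - com,
                     d2 - com, 1 - d1 - d2 + com), 2, 2, byrow = TRUE)
twobytwo     # both, category 1 only, category 2 only, neither
phi(twobytwo)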

Usage

comorbidity(d1, d2, com, labels = NULL)

Arguments

d1

Proportion of diagnostic category 1

d2

Proportion of diagnostic category 2

com

Proportion of comorbidity (diagnostic category 1 and 2)

labels

Names of categories 1 and 2

Value

twobytwo

The two by two table implied by the input

phi

Phi coefficient of the two by two table

Yule

Yule coefficient of the two by two table

tetra

Tetrachoric coefficient of the two by two table

Author(s)

William Revelle

See Also

phi, phi2tetra, Yule, Yule.inv, Yule2phi, tetrachoric and polychoric, as well as AUC for graphical displays

Examples

comorbidity(.2,.15,.1,c("Anxiety","Depression"))

Matrix and profile congruences and distances

Description

The congruence coefficient of two matrices is just the cross product of their respective values divided by the square root of their sums of squares. If the columns are zero centered, this is just the correlation. If the columns are centered around the scale neutral point, this is Cohen's profile correlation. A set of distances (city block, euclidean, Minkowski) may be found by the distance function.

Usage

congruence(x,y=NULL)
cohen.profile(x,y=NULL ,M=NULL)
distance(x,y=NULL,r=2)

Arguments

x

A matrix of factor loadings or a list of matrices of factor loadings

y

A second matrix of factor loadings (if x is a list, then y may be empty)

M

The midpoint of the items which should be used for reflection. NULL if to be found from the data.

r

The exponent for the generalized distance (r=1 is city block, r=2 is euclidean, r=100 or larger emphasizes the largest distance).

Details

Congruences are the cosines of pairs of vectors defined by a matrix and based at the origin. Thus, for values that differ only by a scalar the congruence will be 1.

For two matrices, F1 and F2, the measure of congruence, phi, is

\phi = \frac{\sum F_1 F_2}{\sqrt{\sum F_1^2 \sum F_2^2}} .

It is an interesting exercise to compare congruences with the correlations of the respective values. Congruences are based upon the raw cross products, while correlations are based upon centered cross products. That is, correlations of items are cosines of the vectors based at the mean loading for each column.

\phi = \frac{\sum (F_1 - a)(F_2 - b)}{\sqrt{\sum (F_1 - a)^2 \sum (F_2 - b)^2}} .

For congruence coefficients, a = b = 0. For correlations, a = mean(F1), b = mean(F2).
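A tiny sketch of both computations for a pair of made-up loading vectors:

F1 <- c(.6, .5, .4); F2 <- c(.55, .45, .50)
sum(F1*F2)/sqrt(sum(F1^2)*sum(F2^2))          # congruence (a = b = 0)
f1 <- F1 - mean(F1); f2 <- F2 - mean(F2)
sum(f1*f2)/sqrt(sum(f1^2)*sum(f2^2))          # correlation (a, b = column means)
cor(F1, F2)                                   # the same value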

Input may either be one or two matrices.

For congruences of factor or component loading matrices, use factor.congruence.

Normally, all factor loading matrices should be complete (have no missing loadings). In the case where some loadings are missing, if the use option is specified, then variables with missing loadings are dropped.

If the data are zero centered, this is the correlation; if the data are centered around the scale midpoint (M), this is Cohen's similarity coefficient. See examples. If M is not specified, it is found as the midpoint of the items in x and y.

cohen.profile applies the congruence function to data centered around M. M may be specified, or found from the data. The last example is taken from Cohen (1969).

distance finds the generalized distance as a function of r: city block (r=1), Euclidean (r=2), or weighted towards the maximum (r > 2).
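A minimal sketch of the generalized (Minkowski) distance, using the ira and jim profiles from the example below:

x <- c(2, 6, 5, 6, 4); y <- c(1, 3, 5, 4, 4)
sum(abs(x - y))                 # city block, r = 1
sqrt(sum((x - y)^2))            # euclidean, r = 2
sum(abs(x - y)^100)^(1/100)     # very large r approaches the single largest difference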

Value

A matrix of congruences or distances.

Author(s)

William Revelle

References

Burt, Cyril (1948) Methods of factor-analysis with and without successive approximation. British Journal of Educational Psychology, 7 (2) 172-195.

Burt, Cyril (1948) The factorial study of temperamental traits. British Journal of Statistical Psychology, 1(3) 178-203.

Cohen, Jacob (1969), rc: A profile similarity coefficient invariant over variable reflection. Psychological Bulletin, 71 (4) 281-284.

See Also

factor.congruence.

Examples

#cohen's example
# a and b have reversed one item around the midpoint
co <-  data.frame(ira=c(2,6,5,6,4),
       jim=c(1,3,5,4,4),
      a=c(5,6,5,6,4),b=c(6,3,5,4,4))

lowerMat(congruence(co-3.5)) # congruence around the midpoint is insensitive to reflection
lowerCor(co)   #the correlation is not
lowerMat(congruence(scale(co,scale=FALSE))) #zero centered congruence is  r
cohen.profile(co)

Smooth a non-positive definite correlation matrix to make it positive definite

Description

Factor analysis requires positive definite correlation matrices. Unfortunately, with pairwise deletion of missing data or if using tetrachoric or polychoric correlations, not all correlation matrices are positive definite. cor.smooth does an eigenvector (principal components) smoothing. Negative eigenvalues are replaced with 100 * eig.tol, the matrix is reproduced and forced to a correlation matrix using cov2cor.

Usage

cor.smooth(x,eig.tol=10^-12)
cor.smoother(x,cut=.01)

Arguments

x

A correlation matrix or a raw data matrix.

eig.tol

the minimum acceptable eigenvalue


cut

Report all abs(residuals) > cut

Details

The smoothing is done by eigenvalue decomposition. Eigenvalues < eig.tol are changed to 100 * eig.tol. The positive eigenvalues are rescaled to sum to the number of items. The matrix is recomputed (eigen.vectors %*% diag(eigen.values) %*% t(eigen.vectors)) and forced to a correlation matrix using cov2cor. (See Bock, Gibbons and Muraki, 1988 and Wothke, 1993).
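A rough sketch of that sequence of steps (not the psych code itself; the function name is illustrative and details such as exactly which eigenvalues are rescaled may differ):

smooth.sketch <- function(R, eig.tol = 1e-12) {
  ev   <- eigen(R)
  vals <- ev$values
  vals[vals < eig.tol] <- 100 * eig.tol            # replace the offending eigenvalues
  vals <- vals * nrow(R)/sum(vals)                 # rescale to sum to the number of items
  cov2cor(ev$vectors %*% diag(vals) %*% t(ev$vectors))
}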

This does not implement the Knol and ten Berge (1989) solution, nor do nearcor and posdefify in sfsmisc, nor does nearPD in Matrix. As Martin Maechler puts it in the posdefify function, "there are more sophisticated algorithms to solve this and related problems."

cor.smoother examines all of the nvar minors of rank nvar-1 by systematically dropping one variable at a time and finding the eigenvalue decomposition. It reports those variables which, when dropped, produce a positive definite matrix. It also reports the number of negative eigenvalues when each variable is dropped. Finally, it compares the original correlation matrix to the smoothed correlation matrix and reports those items with absolute deviations greater than cut. These are all hints as to what might be wrong with a correlation matrix.

Value

The smoothed matrix with a warning reporting that smoothing was necessary (if smoothing was in fact necessary).

Author(s)

William Revelle

References

R. Darrell Bock, Robert Gibbons and Eiji Muraki (1988) Full-Information Item Factor Analysis. Applied Psychological Measurement, 12 (3), 261-280.

Werner Wothke (1993), Nonpositive definite matrices in structural modeling. In Kenneth A. Bollen and J. Scott Long (Editors),Testing structural equation models, Sage Publications, Newbury Park.

D.L. Knol and JMF ten Berge (1989) Least squares approximation of an improper correlation matrix by a proper one. Psychometrika, 54, 53-61.

See Also

tetrachoric, polychoric, fa and irt.fa, and the burt data set.

See also nearcor and posdefify in the sfsmisc package and nearPD in the Matrix package.

Examples

burt <- psychTools::burt
bs <- cor.smooth(psychTools::burt)  #burt data set is not positive definite
plot(burt[lower.tri(burt)],bs[lower.tri(bs)],ylab="smoothed values",xlab="original values")
abline(0,1,lty="dashed")

round(burt - bs,3) 
fa(burt,2) #this throws a warning that the matrix yields an improper solution
#Smoothing first throws a warning that the matrix was improper, 
#but produces a better solution 
fa(cor.smooth(burt),2)  

#this next example is a correlation matrix from DeLeuw used as an example 
#in Knol and ten Berge.  
#the example is also used in the nearcor documentation
 cat("pr is the example matrix used in Knol DL, ten Berge (1989)\n")
 pr <- matrix(c(1,     0.477, 0.644, 0.478, 0.651, 0.826,
		0.477, 1,     0.516, 0.233, 0.682, 0.75,
		0.644, 0.516, 1,     0.599, 0.581, 0.742,
		0.478, 0.233, 0.599, 1,     0.741, 0.8,
		0.651, 0.682, 0.581, 0.741, 1,     0.798,
		0.826, 0.75,  0.742, 0.8,   0.798, 1),
	      nrow = 6, ncol = 6)
	      
sm <- cor.smooth(pr)
resid <- pr - sm
# several goodness of fit tests
# from Knol and ten Berge
tr(resid %*% t(resid)) /2

# from nearPD
sum(resid^2)/2

The sample size weighted correlation may be used in correlating aggregated data

Description

If using aggregated data, the correlation of the means does not reflect the sample size used for each mean. cov.wt in core R does this and returns a covariance matrix or the correlation matrix. The cor.wt function weights by sample size or by standard errors and by default returns correlations.

Usage

cor.wt(data,vars=NULL, w=NULL,sds=NULL, cor=TRUE)

Arguments

data

A matrix or data frame

vars

Variables to analyze

w

A set of weights (e.g., the sample sizes)

sds

Standard deviations of the samples (used if weighting by standard errors)

cor

Report correlations (the default) or covariances

Details

A weighted correlation is just r_{ij} = \frac{\sum wt_k x_{ik} x_{jk}}{\sqrt{\sum wt_k x_{ik}^2 \sum wt_k x_{jk}^2}} where x_{ik} is a deviation from the weighted mean.

The weighted correlation is appropriate for correlating aggregated data, where individual data points might reflect the means of a number of observations. In this case, each point is weighted by its sample size (or alternatively, by the standard error). If the weights are all equal, the correlation is just a normal Pearson correlation.
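A minimal sketch of this weighting for two variables (the values are made up; cov.wt from core R gives the same correlation):

x  <- c(1, 2, 3, 5); y <- c(2, 2, 4, 6)                    # e.g., group means
wt <- c(10, 20, 30, 40)                                    # e.g., the group sample sizes
dx <- x - sum(wt*x)/sum(wt); dy <- y - sum(wt*y)/sum(wt)   # deviations from the weighted means
sum(wt*dx*dy)/sqrt(sum(wt*dx^2) * sum(wt*dy^2))            # the weighted correlation
cov.wt(cbind(x, y), wt = wt, cor = TRUE)$cor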

Used when finding correlations of group means found using statsBy.

Value

cor

The weighted correlation

xwt

The data as weighted deviations from the weighted mean

wt

The weights used (calculated from the sample sizes).

mean

The weighted means

xc

Unweighted, centered deviation scores from the weighted mean

xs

Deviation scores weighted by the standard error of each sample mean

Note

A generalization of cov.wt in core R

Author(s)

William Revelle

See Also

See also cov.wt, statsBy

Examples

means.by.age <- statsBy(sat.act,"age")
wt.cors <- cor.wt(means.by.age)
lowerMat(wt.cors$r)  #show the weighted correlations
unwt <- lowerCor(means.by.age$mean)
mixed <- lowerUpper(unwt,wt.cors$r)  #combine both results
cor.plot(mixed,TRUE,main="weighted versus unweighted correlations")
diff <- lowerUpper(unwt,wt.cors$r,TRUE)
cor.plot(diff,TRUE,main="differences of weighted versus unweighted correlations")

Convert correlations to distances (necessary to do multidimensional scaling of correlation data)

Description

A minor helper function to convert correlations (ranging from -1 to 1) to distances (ranging from 0 to 2): d = \sqrt{2(1-r)}.
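For instance, a correlation of .5 corresponds to a distance of 1:

r <- .5
sqrt(2*(1 - r))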

Usage

cor2dist(x)

Arguments

x

If square, then assumed to be a correlation matrix, otherwise the correlations are found first.

Value

dist: a square matrix of distances.

Note

For an example of doing multidimensional scaling on data that are normally factored, see Revelle (in prep)

Author(s)

William Revelle

References

Revelle, William. (in prep) An introduction to psychometric theory with applications in R. Springer. Working draft available at https://personality-project.org/r/book/


Bootstrapped and normal confidence intervals for raw and composite correlations

Description

Although normal theory provides confidence intervals for correlations, this is particularly problematic with Synthetic Aperture Personality Assessment (SAPA) data where the individual items are Massively Missing at Random. Bootstrapped confidence intervals are found for Pearson, Spearman, Kendall, tetrachoric, or polychoric correlations and for scales made from those correlations. If given a correlation matrix and sample size(s), normal theory confidence intervals are provided.

Usage

corCi(x, keys = NULL, n.iter = 100,  p = 0.05,overlap = FALSE, 
 poly = FALSE, method = "pearson", plot=TRUE,minlength=5,n=NULL,...)
 
cor.ci(x, keys = NULL, n.iter = 100,  p = 0.05,overlap = FALSE, 
 poly = FALSE, method = "pearson", plot=TRUE,minlength=5,n=NULL,...)

Arguments

x

The raw data, or a correlation matrix if not doing bootstrapping

keys

If NULL, then the confidence intervals of the raw correlations are found. Otherwise, composite scales are formed from the keys applied to the correlation matrix (in a logic similar to cluster.cor but without the bells and whistles) and the confidence intervals of those composite scales' intercorrelations are found.

n.iter

The number of iterations to bootstrap over. This will be very slow if using tetrachoric/or polychoric correlations.

p

The upper and lower confidence region will include 1-p of the distribution.

overlap

If true, the correlation between overlapping scales is corrected for item overlap.

poly

if FALSE, then find the correlations using the method specified (defaults to Pearson). If TRUE, the polychoric correlations will be found (slowly). Because the polychoric function uses multicores (if available), and corCi does as well, the number of cores used is options("mc.cores")^2.

method

"pearson","spearman", "kendall"

plot

Show the correlation plot with correlations scaled by the probability values. To show the matrix in terms of the confidence intervals, use cor.plot.upperLowerCi.

minlength

What is the minlength to use in abbreviations of the cis? Defaults to 5

n

If finding confidence intervals from a correlation matrix, specify the n

...

Other parameters for axis (e.g., cex.axis to change the font size, srt to rotate the numbers in the plot)

Details

If given a correlation matrix, then confidence intervals are found based upon the sample sizes using the conventional r to z Fisher transformation (fisherz) and the normal distribution.

If given raw data, correlations are found. If keys are specified (the normal case), then composite scales based upon the correlations are found and reported. This is the same procedure as done using cluster.cor or scoreItems.

Then (with raw data) the data are recreated n.iter times by sampling subjects (rows) with replacement and the correlations (and composite scales) are found again (and again and again). Mean and standard deviations of these values are calculated based upon the Fisher Z transform of the correlations. Summary statistics include the original correlations and their confidence intervals. For those who want the complete set of replications, those are available as an object in the resulting output.
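A minimal sketch of that bootstrap logic for a single Pearson correlation (100 iterations and the seed are arbitrary):

set.seed(17)
x <- sat.act[, c("SATV", "SATQ")]
boot.r <- replicate(100, {
   s <- sample(nrow(x), replace = TRUE)          # resample subjects (rows) with replacement
   cor(x[s, 1], x[s, 2], use = "pairwise")
  })
z <- fisherz(boot.r)                             # summarize on the Fisher z metric
fisherz2r(mean(z) + c(-1.96, 1.96) * sd(z))      # approximate 95% bounds back in r units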

Although particularly useful for SAPA (https://www.sapa-project.org/) type data where we have lots of missing data, this will work for any normal data set as well.

Although the correlations are shown automatically as a cor.plot, it is possible to show the upper and lower confidence intervals by using cor.plot.upperLowerCi. This will also return, invisibly, a matrix for printing with the lower and upper bounds of the correlations shown below and above the diagonal (see the first example).

Value

rho

The original (composite) correlation matrix.

means

Mean (of Fisher transformed) correlation retransformed back to the r units

sds

Standard deviation of Fisher transformed correlations

ci

Mean +/- alpha/2 of the z scores as well as the alpha/2 and 1-alpha/2 quantiles. These are labeled as lower.emp(irical), lower.norm(al), upper.norm and upper.emp.

replicates

The observed replication values so one can do one's own estimates

Author(s)

William Revelle

References

For SAPA type data, see Revelle, W., Wilt, J., and Rosenthal, A. (2010) Personality and Cognition: The Personality-Cognition Link. In Gruszka, A. and Matthews, G. and Szymura, B. (Eds.) Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, Springer.

See Also

make.keys, cluster.cor, and scoreItems for forming synthetic correlation matrices from composites of item correlations. See scoreOverlap for correcting for item overlap in scales. See also corr.test for standard significance testing of correlation matrices. See also lowerCor for finding and printing correlation matrices, as well as lowerMat for displaying them. Also see cor.plot.upperLowerCi for displaying the confidence intervals graphically.

Examples

#find confidence intervals of a correlation matrix with specified sample size
ci <- corCi(Thurstone[1:6,1:6],n=213)
ci  #show them
R <- cor.plot.upperLowerCi(ci)  #show them graphically
R #show them as a matrix 


#confidence intervals by bootstrapping requires raw data
corCi(psychTools::bfi[1:200,1:10])  # just the first 10 variables
#The keys have overlapping scales
keys <- list(agree=c("-A1","A2","A3","A4","A5"), conscientious= c("C1", 
  "C2","C3","-C4","-C5"),extraversion=c("-E1","-E2","E3","E4","E5"), neuroticism= 
  c("N1", "N2", "N3","N4","N5"), openness = c("O1","-O2","O3","O4","-O5"), 
  alpha=c("-A1","A2","A3","A4","A5","C1","C2","C3","-C4","-C5","N1","N2","N3","N4","N5"),
beta = c("-E1","-E2","E3","E4","E5","O1","-O2","O3","O4","-O5") )

  
#do not correct for item overlap
rci <-  corCi(psychTools::bfi[1:200,],keys,n.iter=10,main="correlation with overlapping scales") 
#also shows the graphic -note the overlap
#correct for overlap
rci <-  cor.ci(psychTools::bfi[1:200,],keys,overlap=TRUE, n.iter=10,main="Correct for overlap") 
#show the confidence intervals
ci <- cor.plot.upperLowerCi(rci)  #to show the upper and lower confidence intervals
ci   #print the confidence intervals in matrix form

Find a Full Information Maximum Likelihood (FIML) correlation or covariance matrix from a data matrix with missing data

Description

Makes use of functions adapted from the lavaan package to find FIML covariance/correlation matrices. FIML can be much slower than the normal pairwise deletion option of cor, but provides slightly more precise estimates.

Usage

corFiml(x, covar = FALSE,show=FALSE)

Arguments

x

A data.frame or data matrix

covar

By default, just return the correlation matrix. If covar is TRUE, return a list containing the covariance matrix and the ML fit function.

show

If show=TRUE, then just show the patterns of missingness, but don't do the FIML. Useful for understanding the process of fiml.

Details

In the presence of missing data, Full Information Maximum Likelihood (FIML) is an alternative to simply using the pairwise correlations. The implementation in the lavaan package for structural equation modeling has been adapted for the simpler case of just finding the correlations or covariances.

The pairwise solution for any pair of variables is insensitive to other variables included in the matrix. On the other hand, the ML solution depends upon the entire set of items being correlated. This will lead to slightly different solutions for different subsets of variables.

The basic FIML algorithm is to find the pairwise ML solution for covariances and means for every pattern of missingness and then to weight the solution by the size of every unique pattern of missingness.
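A quick way to see the unique patterns of missingness (and how many observations share each), which is what the FIML solution weights by; this is just a sketch of the idea, not the corFiml code:

x <- psychTools::bfi[20:27]
patterns <- apply(is.na(x), 1, paste, collapse = "")   # one string per missingness pattern
sort(table(patterns), decreasing = TRUE)               # size of every unique pattern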

Value

cor

The correlation matrix found using FIML

cov

The covariance matrix found using FIML

fx

The ML fit function

Note

The functions used in lavaan are not exported and so have been copied (and simplified) to the psych package.

Author(s)

William Revelle

See Also

To use the resulting correlations, see fa. To see the pairwise pattern of missingness, see count.pairwise.

Examples

rML <- corFiml(psychTools::bfi[20:27])
rpw <- cor(psychTools::bfi[20:27],use="pairwise") 
round(rML - rpw,3)
mp <- corFiml(psychTools::bfi[20:27],show=TRUE)
mp

Create an image plot for a correlation or factor matrix

Description

Correlation matrices may be shown graphically by using the image function to emphasize structure. This is a particularly useful tool for showing the structure of correlation matrices with a clear structure. Partially meant for the pedagogical value of the graphic for teaching or discussing factor analysis and other multivariate techniques. The sort option uses iclust to sort the matrix before plotting.

Usage

corPlot(r,numbers=TRUE,colors=TRUE,n=51,main=NULL,zlim=c(-1,1),
  show.legend=TRUE, labels=NULL,n.legend=10,keep.par=TRUE,select=NULL, pval=NULL, 
  digits=2, trailing=TRUE, cuts=c(.001,.01),scale=TRUE,cex,MAR,upper=TRUE,diag=TRUE, 
  symmetric=TRUE,stars=FALSE, adjust="holm",xaxis=1, xlas=0, ylas=2,ysrt=0,xsrt=0, 
   gr=NULL, alpha=.75,  min.length=NULL,sort=FALSE,n.obs=NULL, ...)

corPlotUpperLowerCi(R,numbers=TRUE,cuts=c(.001,.01,.05),select=NULL,
      main="Upper and lower confidence intervals of correlations",adjust=FALSE,...)
 
 cor.plot(r,numbers=TRUE,colors=TRUE,n=51,main=NULL,zlim=c(-1,1),
 show.legend=TRUE, labels=NULL,n.legend=10,keep.par=TRUE,select=NULL, pval=NULL, 
  digits=2, trailing=TRUE, cuts=c(.001,.01),scale=TRUE,cex,MAR,upper=TRUE,diag=TRUE, 
  symmetric=TRUE,stars=FALSE, adjust="holm",xaxis=1, xlas=0, ylas=2,ysrt=0,xsrt=0,
     gr=NULL, alpha=.75, min.length=NULL, sort=FALSE, n.obs=NULL,...)     #deprecated
           
cor.plot.upperLowerCi(R,numbers=TRUE,cuts=c(.001,.01,.05),select=NULL,
      main="Upper and lower confidence intervals of correlations",adjust=FALSE,...)
      #deprecated

Arguments

r

A correlation matrix or the output of fa, principal or omega, or a raw data matrix.

R

The object returned from cor.ci

numbers

Display the numeric value of the correlations. (As of September, 2019) Defaults to TRUE.

colors

Defaults to TRUE and colors use colors from the colorRampPalette from red through white to blue, but colors=FALSE will use a grey scale

n

The number of levels of shading to use. Defaults to 51

main

A title. Defaults to "correlation plot"

zlim

The range of values to color – defaults to -1 to 1. If specified as NULL, then defaults to min and max observed correlation.

show.legend

A legend (key) to the colors is shown on the right hand side

labels

if NULL, use column and row names, otherwise use labels

n.legend

How many categories should be labelled in the legend?

sort

If true, then sort the variables using the iclust algorithm

keep.par

restore the graphic parameters when exiting

pval

scale the numbers by their pvals, categorizing them based upon the values of cuts

digits

Round off to digits. Defaults to 2.

trailing

Show trailing zeros.

cuts

Scale the numbers by the categories defined by pval < cuts

scale

Should the size of the numbers be scaled by the significance level?

select

Select the subset of variables to plot

cex

Character size. Should be reduced a bit for large numbers of variables.

MAR

Allows for adjustment of the margins if using really long labels or big fonts

upper

Should the upper off diagonal matrix be drawn, or left blank?

diag

Should we show the diagonal?

symmetric

By default, if given a non-symmetric matrix, we find the correlations using pairwise complete observations and then show them. If wanting to display a non-symmetric matrix, then specify that symmetric is FALSE

stars

For those people who like to show the 'significance' of correlations by using magic asterisks, set stars=TRUE

n.obs

If you want to show "stars" for symmetric input matrices (i.e. correlations), specify the number of observations

adjust

If showing significance, should we adjust for multiple tests? The default is to show zero order probabilities below the diagonal and adjust these using the 'holm' correction above the diagonal. Use adjust = "none" if no adjustment is desired. adjust is also used in corPlotUpperLowerCI to show the nominal alpha confidence intervals (adjust=FALSE) or the Bonferroni adjusted confidence intervals (adjust=TRUE).

xlas

Orientation of the x axis labels (0 = parallel to the axis, 1 = horizontal, 2 = perpendicular to the axis)

ylas

Orientation of the y axis labels (0 = parallel to the axis, 1 = horizontal, 2 = perpendicular to the axis)

ysrt

Rotation of y labels in degrees

xsrt

Rotation of x labels in degrees

xaxis

By default, draw this below the figure. If xaxis=3, then it will be drawn above the figure

gr

A color gradient: e.g., gr <- colorRampPalette(c("#B52127", "white", "#2171B5")) will produce slightly more pleasing (to some) colors. See next to last example.

alpha

The degree of transparency (0 = completely, 1 = not). The default value of .75 makes somewhat more pleasing plots when using numbers.

min.length

If not NULL, then the maximum number of characters to use in row/column labels

...

Other parameters for axis (e.g., cex.axis to change the font size, srt to rotate the numbers in the plot)

Details

When summarizing the correlations of large data bases or when teaching about factor analysis or cluster analysis, it is useful to graphically display the structure of correlation matrices. This is a simple graphical display using the image function.

The difference between this and a regular image plot is that the primary diagonal goes from the top left to the lower right. zlim defines how to treat the range of possible values: the default of -1 to 1 makes the color choice more reasonable, while setting it to c(0,1) will lead to negative correlations being treated as zero. This is advantageous when showing general factor structures, because it makes the 0 white.

There is an interesting case when plotting correlations corrected for attenuation. Some of these might exceed 1. In this case, either set zlim = NULL (to use the observed maximum and minimum values) or all values above 1 will be given a slightly darker shade than 1, but will not differ from each other.

The default shows a legend for the color coding on the right hand side of the figure.

Inspired, in part, by a paper by S. Dray (2008) on the number of components problem.

Modified following suggestions by David Condon and Josh Wilt to use a more meaningful color choice ranging from dark red (-1) through white (0) to dark blue (1). Further modified to allow for color choices using the gr option (suggested by Lorien Elleman). Further modified to include the numerical value of the correlation. (Inspired by the corrplot package). These values may be scaled according the the probability values found in cor.ci or corTest.

Unless specified, the font size is dynamically scaled to have a cex = 10/max(nrow(r), ncol(r)). This can produce fairly small fonts for large problems. The font size of the labels may be adjusted using cex.axis, which defaults to one.

By default cor.ci calls corPlotUpperLowerCi and scales the correlations based upon "significance" values. The correlations plotted are the upper and lower confidence boundaries. To show the correlations themselves, call corPlot directly.

If using the output of corTest, the upper off diagonal will be scaled by the corrected probability, the lower off diagonal the scaling is the uncorrected probabilities.

If given raw data or correlation matrix, corPlotUpperLowerCi will automatically call corTest or cor.ci.

If using the output of corTest or cor.ci as input to corPlotUpperLowerCi, the upper off diagonal will be the upper bounds and the lower off diagonal the lower bounds of the confidence intervals. If adjust=TRUE, these will use the Holm or Bonferroni adjusted values (depending upon corTest).

To compare the elements of two correlation matrices, corPlot the results from lowerUpper.

To do multiple corPlot on the same plot, specify that show.legend=FALSE and keep.par=FALSE. See the last examples.

Care should be taken when selecting rows and columns from a non-symmetric matrix (e.g., the corrected correlations from scoreItems or scoreOverlap).

To show a factor loading matrix (or any non-symmetric matrix), set symmetric=FALSE. Otherwise the input will be treated as raw data and correlations will be found.

The sort option will sort the matrix using the output from iclust. To sort the matrix using another order, use mat.sort first. To find correlations other than Pearson, plot the output from e.g., mixed.cor.

Author(s)

William Revelle

References

Dray, Stephane (2008) On the number of principal components: A test of dimensionality based on measurements of similarity between matrices. Computational Statistics & Data Analysis. 52, 4, 2228-2237.

See Also

fa, mat.sort, cor.ci, corTest lowerUpper.

Examples

corPlot(Thurstone,main="9 cognitive variables from Thurstone") 
#just blue implies positive manifold
#select just some variables to plot
corPlot(Thurstone, zlim=c(0,1),main="9 cognitive variables from Thurstone",select=c(1:3,7:9))
#now show a non-symmetric plot
corPlot(Thurstone[4:9,1:3], zlim=c(0,1),main="9 cognitive variables
 from Thurstone",numbers=TRUE,symmetric=FALSE)

#Two ways of including stars to show significance
#From the raw data
corPlot(sat.act,numbers=TRUE,stars=TRUE)
#from a correlation matrix with pvals
cp <- corTest(sat.act)  #find the correlations and pvals
r<- cp$r
p <- cp$p
corPlot(r,numbers=TRUE,diag=FALSE,stars=TRUE, pval = p,main="Correlation plot
 with Holm corrected 'significance'")

#now red means less than .5
corPlot(mat.sort(Thurstone),TRUE,zlim=c(0,1), 
       main="9 cognitive variables from Thurstone (sorted by factor loading) ")
simp <- sim.circ(24)
corPlot(cor(simp),main="24 variables in a circumplex")

#scale by raw and adjusted probabilities
rs <- corTest(sat.act[1:200,] ) #find the probabilities of the correlations
corPlot(r=rs$r,numbers=TRUE,pval=rs$p,main="Correlations scaled by probability values") 
 #Show the upper and lower confidence intervals
corPlotUpperLowerCi(R=rs,numbers=TRUE) 

#now do this again, but with lighter colors
gr <- colorRampPalette(c("#B52127", "white", "#2171B5"))
corPlot(r=rs$r,numbers=TRUE,pval=rs$p,main="Correlations scaled by probability values",gr=gr) 

corPlotUpperLowerCi(R=rs,numbers=TRUE,gr=gr) 




#do multiple plots 
#Also show the xaxis option
op <- par(mfrow=c(2,2))
corPlot(psychTools::ability,show.legend=FALSE,keep.par=FALSE,upper=FALSE)
f4 <- fa(psychTools::ability,4)
corPlot(f4,show.legend=FALSE,keep.par=FALSE,numbers=TRUE,xlas=3)
om <- omega(psychTools::ability,4)
corPlot(om,show.legend=FALSE,keep.par=FALSE,numbers=TRUE,xaxis=3)
par(op)


corPlotUpperLowerCi(rs,adjust=TRUE,main="Holm adjusted confidence intervals",gr=gr)

Find dis-attenuated correlations given correlations and reliabilities

Description

Given a raw correlation matrix and a vector of reliabilities, report the disattenuated correlations above the diagonal.

Usage

correct.cor(x, y)

Arguments

x

A raw correlation matrix

y

Vector of reliabilities

Details

Disattenuated correlations may be thought of as correlations between the latent variables measured by a set of observed variables. That is, what would the correlation between two (unreliably measured) variables be if both were measured perfectly reliably?
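The arithmetic is the classic correction for attenuation; a minimal sketch with illustrative values (not the internals of correct.cor):

r.xy <- .40                 #observed correlation between two scales
r.xx <- .70; r.yy <- .80    #their reliabilities
r.xy/sqrt(r.xx * r.yy)      #disattenuated correlation, about .53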

This function is mainly used if importing correlations and reliabilities from somewhere else. If the raw data are available, use score.items, or cluster.loadings or cluster.cor.

Examples of the output of this function are seen in cluster.loadings and cluster.cor

Value

Raw correlations below the diagonal, reliabilities on the diagonal, disattenuated above the diagonal.

Author(s)

Maintainer: William Revelle [email protected]

References

Revelle, W. (in preparation) An Introduction to Psychometric Theory with applications in R. Springer. at https://personality-project.org/r/book/

See Also

cluster.loadings and cluster.cor

Examples

# attitude from the datasets package
#example 1 is a rather clunky way of doing things

a1 <- attitude[,c(1:3)]
a2 <- attitude[,c(4:7)]
x1 <- rowSums(a1)  #find the sum of the first 3 attitudes
x2 <- rowSums(a2)   #find the sum of the last 4 attitudes
alpha1 <- alpha(a1)
alpha2 <- alpha(a2)
x <- matrix(c(x1,x2),ncol=2)
x.cor <- cor(x)
alpha <- c(alpha1$total$raw_alpha,alpha2$total$raw_alpha)
round(correct.cor(x.cor,alpha),2)
#
#much better - although uses standardized alpha 
clusters <- matrix(c(rep(1,3),rep(0,7),rep(1,4)),ncol=2)
cluster.loadings(clusters,cor(attitude))
# or 
clusters <- matrix(c(rep(1,3),rep(0,7),rep(1,4)),ncol=2)
cluster.cor(clusters,cor(attitude))
#
#best
keys <- make.keys(attitude,list(first=1:3,second=4:7))
scores <- scoreItems(keys,attitude)
scores$corrected

#However, to do the more general case of correcting correlations for reliability
#corrected <- cor2cov(x.cor,1/alpha)
#diag(corrected) <- 1

Chi square tests of whether a single matrix is an identity matrix, or a pair of matrices are equal.

Description

Steiger (1980) pointed out that the sum of the squared elements of a correlation matrix, or the Fisher z score equivalents, is distributed as chi square under the null hypothesis that the values are zero (i.e., elements of the identity matrix). This is particularly useful for examining whether correlations in a single matrix differ from zero or for comparing two matrices. Jennrich (1970) also examined tests of differences between matrices.

Usage

cortest(R1,R2=NULL,n1=NULL,n2 = NULL, fisher = TRUE,cor=TRUE,method="pearson",
           use ="pairwise") #same as cortest.normal this does the steiger test
cortest.normal(R1, R2 = NULL, n1 = NULL, n2 = NULL, fisher = TRUE)      #the steiger test

cortest.jennrich(R1,R2,n1=NULL, n2=NULL)  #the Jennrich test
cortest.mat(R1,R2=NULL,n1=NULL,n2 = NULL) #an alternative test

Arguments

R1

A correlation matrix. (If R1 is not square, i.e., is a rectangular data matrix, and cor=TRUE, the correlations are found first.)

R2

A correlation matrix. If R2 is not square, i.e., is a rectangular data matrix, and cor=TRUE, the correlations are found first. If R2 is NULL, then the test is just whether R1 is an identity matrix.

n1

Sample size of R1

n2

Sample size of R2

fisher

Fisher z transform the correlations?

cor

By default, if the input matrices are not symmetric, they are converted to correlation matrices. That is, they are treated as if they were the raw data. If cor=FALSE, then the input matrices are taken to be correlation matrices.

method

Which type of correlation to find ("pearson", "spearman","kendall")

use

How to handle missing data (defaults to pairwise)

Details

There are several ways to test if a matrix is the identity matrix. The most well known is the chi square test of Bartlett (1951) and Box (1949). A very straightforward test, discussed by Steiger (1980), is to find the sum of the squared correlations or the sum of the squared Fisher transformed correlations. Under the null hypothesis that all the correlations are zero, this sum is distributed as chi square. This is implemented in cortest and cortest.normal.

Yet another test is the Jennrich (1970) test of the equality of two matrices. This compares the differences between two matrices to the averages of two matrices using a chi square test. This is implemented in cortest.jennrich.

Yet another option, cortest.mat, is to compare the two matrices using an approach analogous to that used in evaluating the adequacy of a factor model. In factor analysis, the maximum likelihood fit statistic is
f = trace((FF' + U2)^{-1} R) - log(|(FF' + U2)^{-1} R|) - n.items.

This in turn is converted to a chi square

chi^2 = (n.obs - 1 - (2 * p + 5)/6 - (2 * factors)/3) * f (see fa).

That is, the model (M = FF' + U2) is compared to the original correlation matrix (R) by a function of M^{-1} R. By analogy, in the case of two matrices, A and B, cortest.mat finds the chi squares associated with A^{-1} B and A B^{-1}. The sum of these two chi^2 values will also be a chi^2, but with twice the degrees of freedom.
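To make the logic of the single matrix (Steiger) test concrete, the statistic can be found by hand. The following is a sketch of that logic and is not necessarily identical to the internals of cortest.normal:

set.seed(17)
x <- matrix(rnorm(500), ncol = 5)
R <- cor(x)
n <- nrow(x)
z <- fisherz(R[lower.tri(R)])      #Fisher z of the unique correlations
chi2 <- (n - 3) * sum(z^2)         #approximately chi square if R is an identity matrix
df <- ncol(R) * (ncol(R) - 1)/2
pchisq(chi2, df, lower.tail = FALSE)
#compare with cortest(x)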

Value

chi2

The chi square statistic

df

Degrees of freedom for the Chi Square

prob

The probability of observing the Chi Square under the null hypothesis.

Note

Both cortest.jennrich and cortest.normal are probably overly stringent. The chi square values for pairs of random samples from the same population are larger than would be expected. This is a good test for rejecting the null of no differences.

Author(s)

William Revelle

References

Steiger, James H. (1980) Testing pattern hypotheses on correlation matrices: alternative statistics and some empirical results. Multivariate Behavioral Research, 15, 335-352.

Jennrich, Robert I. (1970) An Asymptotic χ2\chi^2 Test for the Equality of Two Correlation Matrices. Journal of the American Statistical Association, 65, 904-912.

See Also

cortest.bartlett, corr.test

Examples

set.seed(42)
x <- matrix(rnorm(1000),ncol=10)
cortest.normal(x)  #just test if this matrix is an identity
#now create two correlation matrices that should be equal
x <- sim.congeneric(loads =c(.9,.8,.7,.6,.5),N=1000,short=FALSE)
y <- sim.congeneric(loads =c(.9,.8,.7,.6,.5),N=1000,short=FALSE)

cortest(x$r,y$r,n1=1000,n2=1000) #The Steiger test
cortest.jennrich(x$r,y$r,n1=1000,n2=1000) # The Jennrich test

cortest.mat(x$r,y$r,n1=1000,n2=1000)   #twice the degrees of freedom as the Jennrich

#create a new matrix that is different
z <- sim.congeneric(loads=c(.8,.8,.7,.7, .6), N= 1000, short=FALSE)
cortest(x$r,z$r,n1=1000)  #these should be different

Find the correlations, sample sizes, and probability values between elements of a matrix or data.frame.

Description

Although the cor function finds the correlations for a matrix, it does not report probability values. cor.test does, but only for one pair of variables at a time. corr.test uses cor to find the correlations for either complete or pairwise data and reports the sample sizes and probability values as well. For symmetric matrices, raw probabilities are reported below the diagonal and probabilities adjusted for multiple comparisons above the diagonal. In the case of different x and y variables, the default is to adjust the probabilities for multiple tests. Both corr.test and corr.p return raw and adjusted confidence intervals for each correlation.

Usage

corTest(x, y = NULL, use = "pairwise",method="pearson",adjust="holm", 
    alpha=.05,ci=TRUE,minlength=5,normal=TRUE)
corr.test(x, y = NULL, use = "pairwise",method="pearson",adjust="holm", 
    alpha=.05,ci=TRUE,minlength=5,normal=TRUE)
corr.p(r,n,adjust="holm",alpha=.05,minlength=5,ci=TRUE)

Arguments

x

A matrix or dataframe

y

A second matrix or dataframe with the same number of rows as x

use

use="pairwise" is the default value and will do pairwise deletion of cases. use="complete" will select just complete cases.

method

method="pearson" is the default value. The alternatives to be passed to cor are "spearman" and "kendall". These last two are much slower, particularly for big data sets.

adjust

What adjustment for multiple tests should be used? ("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"). See p.adjust for details about why to use "holm" rather than "bonferroni".

alpha

alpha level of confidence intervals

r

A correlation matrix

n

Number of observations if using corr.p. May be either a matrix (as returned from corr.test) or a scalar. Set to n - np if finding the significance of partial correlations. (See below.)

ci

By default, confidence intervals are found. However, this leads to a noticeable slowdown, particularly for large problems. So, for just the rs, ts and ps, set ci=FALSE.

minlength

What is the minimum length for abbreviations. Defaults to 5.

normal

By default, probabilities for method="spearman" and method="kendall" are found by normal theory. If normal=FALSE, then repetitive calls are made to cor.test. This is much slower, but gives more accurate p values. exact is set to FALSE, which means that exact p values for small samples are not found, given the problem of ties.

Details

corr.test uses the cor function to find the correlations, and then applies a t-test to the individual correlations using the formula

t = r * sqrt(n - 2)/sqrt(1 - r^2)

se = sqrt((1 - r^2)/(n - 2))

The t and standard errors are returned as objects in the result, but are not normally displayed. Confidence intervals are found and printed if using the print(short=FALSE) option. These are found by applying the Fisher z transform to the correlation and then taking the range z +/- qnorm(alpha/2) * se, where the standard error of the z transform is

se = sqrt(1/(n - 3))

These values are then back transformed to be in correlation units. They are returned in the ci object.
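For a single correlation, the arithmetic may be sketched directly (illustrative values; corr.test does this for the whole matrix):

r <- .3; n <- 100; alpha <- .05
t <- r * sqrt(n - 2)/sqrt(1 - r^2)                    #the t test
p <- 2 * pt(abs(t), df = n - 2, lower.tail = FALSE)   #two tailed probability
z <- fisherz(r)                                       #Fisher z transform
se.z <- 1/sqrt(n - 3)                                 #standard error of z
ci <- fisherz2r(z + c(-1, 1) * qnorm(1 - alpha/2) * se.z)   #back transformed CI
round(c(t = t, p = p, lower = ci[1], upper = ci[2]), 3)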

Note that in the case of method="kendall", since these are normal theory confidence intervals, they are slightly too big.

The probability values may be adjusted using the Holm (or other) correction. If the matrix is symmetric (no y data), then the original p values are reported below the diagonal and the adjusted above the diagonal. Otherwise, all probabilities are adjusted (unless adjust="none"). This is made explicit in the output. Confidence intervals are shown for raw and adjusted probabilities in the ci object.

For those who like the conventional use of "magic asterisks" (stars) to represent conventional levels of significance, the object stars is returned (but not shown). See the examples.

corr.p may be applied to the results of partial.r if n is set to n - s (where s is the number of variables partialed out) (Fisher, 1924).

Value

r

The matrix of correlations

n

Number of cases per correlation

t

value of t-test for each correlation

p

two tailed probability of t for each correlation. For symmetric matrices, p values adjusted for multiple tests are reported above the diagonal.

se

standard error of the correlation

ci

the alpha/2 lower and upper values.

ci2

ci but with the adjusted pvalues as well. This was added after tests showed we were breaking some packages that were calling the ci object without bothering to check for its dimensions.

ci.adj

These are the adjusted (Holm or Bonferroni) confidence intervals. If asking to not adjust, the Holm adjustments for the confidence intervals are shown anyway, but the probability values are not adjusted and the appropriate confidence intervals are shown in the ci object.

stars

For those people who like to show magic asterisks denoting "statistical significance" the stars object flags those correlation values that are unlikely given normal theory. See the last example for how to print these neatly.

Note

For very large matrices (> 200 x 200), there is a noticeable speed improvement if confidence intervals are not found.

That adjusted confidence intervals are shown even when asking for no adjustment might be confusing. If you don't want adjusted intervals, just use the ci object. The adjusted values are given in the ci.adj object.

See Also

cor.test for tests of a single correlation, Hmisc::rcorr for an equivalent function, r.test to test the difference between correlations, and cortest.mat to test for equality of two correlation matrices.

Also see cor.ci for bootstrapped confidence intervals of Pearson, Spearman, Kendall, tetrachoric or polychoric correlations. In addition cor.ci will find bootstrapped estimates of composite scales based upon a set of correlations (ala cluster.cor).

In particular, see p.adjust for a discussion of p values associated with multiple tests.

Other useful functions related to finding and displaying correlations include corPlot to graphically display the correlation matrix, and lowerCor for finding the correlations and then displaying the lower off diagonal using the lowerMat function. lowerUpper will compare two correlation matrices. Also see pairs.panels to show the correlations and scatter plots.

Examples

ct  <- corTest(attitude)
#ct <- corr.test(attitude)  #find the correlations and give the probabilities
ct #show the results


cts <- corr.test(attitude[1:3],attitude[4:6]) #reports all values corrected for multiple tests

#corr.test(sat.act[1:3],sat.act[4:6],adjust="none")  #don't adjust the probabilities

#take correlations and show the probabilities as well as the confidence intervals
print(corr.p(cts$r,n=30),short=FALSE)  

#don't adjust the probabilities
print(corr.test(sat.act[1:3],sat.act[4:6],adjust="none"),short=FALSE)  

#print out the stars object without showing quotes
print(corr.test(attitude)$stars,quote=FALSE)  #note that the adjusted ps are given as well


kendall.r <- corr.test(bfi[1:40,4:6], method="kendall", normal=FALSE)
#compare with 
cor.test(x=bfi[1:40,4],y=bfi[1:40,6],method="kendall", exact=FALSE)
print(kendall.r,digits=6)

Bartlett's test that a correlation matrix is an identity matrix

Description

Bartlett (1951) proposed that -ln(det(R)) * (N - 1 - (2p + 5)/6) was distributed as chi square if R were an identity matrix. A useful test that the residual correlations are all zero. Contrast with the Kaiser-Meyer-Olkin test.

Usage

cortest.bartlett(R, n = NULL,diag=TRUE)

Arguments

R

A correlation matrix. (If R is not square, correlations are found and a warning is issued.)

n

Sample size (if not specified, 100 is assumed).

diag

Will replace the diagonal of the matrix with 1s to make it a correlation matrix.

Details

More useful for pedagogical purposes than actual applications. The Bartlett test is asymptotically chi square distributed.

Note that if applied to residuals from factor analysis (fa) or principal components analysis (principal) that the diagonal must be replaced with 1s. This is done automatically if diag=TRUE. (See examples.)

An alternative way of testing whether a correlation matrix is factorable (i.e., the correlations differ from 0) is the Kaiser-Meyer-Olkin KMO test of factorial adequacy.
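The statistic itself is easily found by hand; a minimal sketch using the complete-case attitude data from the datasets package:

R <- cor(attitude)
n <- nrow(attitude)
p <- ncol(R)
chi2 <- -log(det(R)) * (n - 1 - (2 * p + 5)/6)
df <- p * (p - 1)/2
c(chisq = chi2, df = df, p.value = pchisq(chi2, df, lower.tail = FALSE))
#compare with cortest.bartlett(R, n = n)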

Value

chisq

Asymptotically distributed as chi square

p.value

Of chi square

df

The degrees of freedom

Author(s)

William Revelle

References

Bartlett, M. S., (1951), The Effect of Standardization on a chi square Approximation in Factor Analysis, Biometrika, 38, 337-344.

See Also

cortest.mat, cortest.normal, cortest.jennrich

Examples

set.seed(42)   
x <- matrix(rnorm(1000),ncol=10)
r <- cor(x)
cortest.bartlett(r)      #random data don't differ from an identity matrix
#data(bfi)
cortest.bartlett(psychTools::bfi[1:200,1:10])    #not an identity matrix
f3 <- fa(Thurstone,3)
f3r <- f3$resid
cortest.bartlett(f3r,n=213,diag=FALSE)  #incorrect

cortest.bartlett(f3r,n=213,diag=TRUE)  #correct (by default)

Functions for analysis of circadian or diurnal data

Description

Circadian data are periodic with a phase of 24 hours. These functions find the best fitting phase angle (cosinor), the circular mean, circular correlation with circadian data, and the linear by circular correlation

Usage

cosinor(angle,x=NULL,code=NULL,data=NULL,hours=TRUE,period=24,
            plot=FALSE,opti=FALSE,na.rm=TRUE)
cosinor.plot(angle,x=NULL,data = NULL, IDloc=NULL, ID=NULL,hours=TRUE, period=24,
 na.rm=TRUE,ylim=NULL,ylab="observed",xlab="Time (double plotted)",
 main="Cosine fit",add=FALSE,multi=FALSE,typ="l",...)
 cosinor.period(angle,x=NULL,code=NULL,data=NULL,hours=TRUE,period=seq(23,26,1),
            plot=FALSE,opti=FALSE,na.rm=TRUE)          
circadian.phase(angle,x=NULL,code=NULL,data=NULL,hours=TRUE,period=24,
            plot=FALSE,opti=FALSE,na.rm=TRUE)
circadian.mean(angle,data=NULL, hours=TRUE,na.rm=TRUE)
circadian.sd(angle,data=NULL,hours=TRUE,na.rm=TRUE)
circadian.stats(angle,data=NULL,hours=TRUE,na.rm=TRUE)
circadian.F(angle,group,data=NULL,hours=TRUE,na.rm=TRUE)
circadian.reliability(angle,x=NULL,code=NULL,data = NULL,min=16,   
         oddeven=FALSE, hours=TRUE,period=24,plot=FALSE,opti=FALSE,na.rm=TRUE) 
circular.mean(angle,na.rm=TRUE) #angles in radians
circadian.cor(angle,data=NULL,hours=TRUE,na.rm=TRUE)  #angles in radians
circular.cor(angle,na.rm=TRUE) #angles in radians
circadian.linear.cor(angle,x=NULL,data=NULL,hours=TRUE)

Arguments

angle

A data frame or matrix of observed values with the time of day as the first value (unless specified in code). The angle can be specified either as hours or as radians.

code

A subject identification variable

data

A matrix or data frame of data. If specified, then angle and code are variable names (or locations). See examples.

group

If doing comparisons by groups, specify the group code.

min

The minimum number of observations per subject to use when finding split half reliabilities.

oddeven

Reliabilities are based upon odd and even items (TRUE) or first vs. last half (FALSE). Default is first and last half.

period

Although time of day is assumed to have a 24 hour rhythm, other rhythms may be fit. If calling cosinor.period, a range may be specified.

IDloc

Which column number is the ID field

ID

What specific subject number should be plotted for one variable

plot

if TRUE, then plot the first variable (angle)

opti

opti=TRUE: iterative optimization (slow) or opti=FALSE: linear fitting (fast)

hours

If TRUE, measures are in hours (24 hours to the day), otherwise, in radians

x

A set of external variables to correlate with the phase angles

na.rm

Should missing data be removed?

ylim

Specify the range of the y axis if the defaults don't work

ylab

The label of the yaxis

xlab

Labels for the x axis

main

the title of the graphic

add

If doing multiple (spaghetti) plots, set add=TRUE for the second and subsequent plots

multi

If doing multiple (spaghetti) plots, set multi=TRUE for the first and subsequent plots

typ

Pass the line type to graphics

...

any other graphic parameters to pass

Details

When data represent angles (such as the hours of peak alertness or peak tension during the day), we need to apply circular statistics rather than the more normal linear statistics (see Jammalamadaka (2006) for a very clear set of examples of circular statistics). The generalization of the mean to circular data is to convert each angle into a vector, average the x and y coordinates, and convert the result back to an angle. A statistic that represents the compactness of the observations is R which is the (normalized) vector length found by adding all of the observations together. This will achieve a maximum value (1) when all the phase angles are the same and a minimum (0) if the phase angles are distributed uniformly around the clock.
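A minimal illustration of the circular mean for times of day (toy values; circadian.mean adds missing data handling and matrix input):

hours <- c(23, 1, 2)                  #three times clustered around midnight
mean(hours)                           #the linear mean (8.67) is misleading
theta <- hours * pi/12                #convert hours to radians
rad <- atan2(mean(sin(theta)), mean(cos(theta)))    #circular mean in radians
(rad * 12/pi) %% 24                   #back to hours: about 0.7 (roughly 40 minutes past midnight)
sqrt(mean(cos(theta))^2 + mean(sin(theta))^2)       #R, the compactness of the observations (0-1)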

The generalization of Pearson correlation to circular statistics is straight forward and is implemented in cor.circular in the circular package and in circadian.cor here. Just as the Pearson r is a ratio of covariance to the square root of the product of two variances, so is the circular correlation. The circular covariance of two circular vectors is defined as the average product of the sines of the deviations from the circular mean. The variance is thus the average squared sine of the angular deviations from the circular mean. Circular statistics are used for data that vary over a period (e.g., one day) or over directions (e.g., wind direction or bird flight). Jammalamadaka and Lund (2006) give a very good example of the use of circular statistics in calculating wind speed and direction.

The code from CircStats and circular was adapted to allow for analysis of data from various studies of mood over the day. Those two packages do not seem to handle missing data, nor do they take matrix input, but rather emphasize single vectors.

The cosinor function will either iteratively fit cosines of the angle to the observed data (opti=TRUE) or use circular by linear regression to estimate the best fitting phase angle. If cos.t <- cos(time) and sin.t <- sin(time) (expressed in hours), then beta.c and beta.s may be found by regression and the phase is sign(beta.c) * acos(beta.c/sqrt(beta.c^2 + beta.s^2)) * 12/pi.

Simulations (see examples) suggest that with incomplete times, the optimization procedure perhaps yields slightly better fits with the correct phase than does the linear model, but the differences are very small. In the presence of noisy data, these advantages seem to reverse. The recommendation thus seems to be to use the linear model approach (the default). The fit statistic reported for cosinor is the correlation of the data with the model [ cos(time - acrophase) ].
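A toy sketch of the circular by linear regression idea for a single variable, recovering the phase from the two regression weights (simulated data with a known acrophase of 17):

time <- 0:23
y <- 3 * cos((time - 17) * pi/12) + rnorm(24, 0, .2)   #true acrophase at 17:00
fit <- lm(y ~ cos(time * pi/12) + sin(time * pi/12))
beta.c <- coef(fit)[2]
beta.s <- coef(fit)[3]
(atan2(beta.s, beta.c) * 12/pi) %% 24                  #recovers a phase of about 17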

The circadian.reliability function splits the data for each subject into a first and second half (by default, or into odd and even items) and then finds the best fitting phase for each half. These are then correlated (using circadian.cor) and this correlation is then adjusted for test length using the conventional Spearman-Brown formula. Returned as object in the output are the statistics for the first and second part, as well as an ANOVA to compare the two halves.

circular.mean and circular.cor are just circadian.mean and circadian.cor but with input given in radians rather than hours.

The circadian.linear.cor function will correlate a set of circular variables with a set of linear variables. The first (angle) variables are circular, the second (x) set of variables are linear.

The circadian.F will compare two or more groups in terms of their mean position. This is adapted from the equivalent function in the circular package. The test is more powerful the more compact each group is around its mean (large values of R).

Value

phase

The phase angle that best fits the data (expressed in hours if hours=TRUE).

fit

Value of the correlation of the fit. This is just the correlation of the data with the phase adjusted cosine.

mean.angle

A vector of mean angles

n, mean, sd

The appropriate circular statistic.

correl

A matrix of circular correlations or linear by circular correlations

R

R is the vector length (0-1) of the mean vector when finding circadian statistics using circadian.stats

z, p

z is the number of observations times R^2. p is the probability of that z.

phase.rel

The reliability of the phase measures. This is the circular correlation between the two halves adjusted using the Spearman-Brown correction.

fit.rel

The split half reliability of the fit statistic.

split.F

Do the two halves differ from each other? One would hope not.

group1, group2

The statistics from each half

splits

The individual data from each half.

Note

These functions have been adapted from the circular package to allow for ease of use with circadian data, particularly for data sets with missing data and multiple variables of interest.

Author(s)

William Revelle

References

See circular statistics Jammalamadaka, Sreenivasa and Lund, Ulric (2006), The effect of wind direction on ozone levels: a case study, Environmental and Ecological Statistics, 13, 287-298.

See Also

See the circular and CircStats packages.

Examples

time <- seq(1:24) #create a 24 hour time
pure <- matrix(time,24,18) 
colnames(pure) <- paste0("H",1:18)
pure <- data.frame(time,cos((pure - col(pure))*pi/12)*3 + 3)
    #18 different phases but scaled to 0-6  match mood data
matplot(pure[-1],type="l",main="Pure circadian arousal rhythms",
    xlab="time of day",ylab="Arousal") 
op <- par(mfrow=c(2,2))
 cosinor.plot(1,3,pure)
 cosinor.plot(1,5,pure)
 cosinor.plot(1,8,pure)
 cosinor.plot(1,12,pure)

p <- cosinor(pure) #find the acrophases (should match the input)

#now, test finding the acrophases for  different subjects on 3 variables
#They should be the first 3, second 3, etc. acrophases of pure
pp <- matrix(NA,nrow=6*24,ncol=4)
pure <- as.matrix(pure)
pp[,1] <- rep(pure[,1],6)
pp[1:24,2:4] <- pure[1:24,2:4] 
pp[25:48,2:4] <- pure[1:24,5:7] *2   #to test different variances
pp[49:72,2:4] <- pure[1:24,8:10] *3
pp[73:96,2:4] <- pure[1:24,11:13]
pp[97:120,2:4] <- pure[1:24,14:16]
pp[121:144,2:4] <- pure[1:24,17:19]
pure.df <- data.frame(ID = rep(1:6,each=24),pp)
colnames(pure.df) <- c("ID","Time",paste0("V",1:3))
cosinor("Time",3:5,"ID",pure.df)

op <- par(mfrow=c(2,2))
 cosinor.plot(2,3,pure.df,IDloc=1,ID="1")
 cosinor.plot(2,3,pure.df,IDloc=1,ID="2")
 cosinor.plot(2,3,pure.df,IDloc=1,ID="3")
 cosinor.plot(2,3,pure.df,IDloc=1,ID="4")
 
 #now, show those in one panel as spagetti plots
op <- par(mfrow=c(1,1))
cosinor.plot(2,3,pure.df,IDloc=1,ID="1",multi=TRUE,ylim=c(0,20),ylab="Modeled")
 cosinor.plot(2,3,pure.df,IDloc=1,ID="2",multi=TRUE,add=TRUE,lty="dotdash")
 cosinor.plot(2,3,pure.df,IDloc=1,ID="3",multi=TRUE,add=TRUE,lty="dashed")
 cosinor.plot(2,3,pure.df,IDloc=1,ID="4",multi=TRUE,add=TRUE,lty="dotted")

set.seed(42)   #what else?
noisy <- pure
noisy[,2:19]<- noisy[,2:19] + rnorm(24*18,0,.2)

n <- cosinor(time,noisy) #add a bit of noise

small.pure <- pure[c(8,11,14,17,20,23),]
small.noisy <- noisy[c(8,11,14,17,20,23),]
small.time <- c(8,11,14,17,20,23)

cosinor.plot(1,3,small.pure,multi=TRUE)
cosinor.plot(1,3,small.noisy,multi=TRUE,add=TRUE,lty="dashed")

         
# sp <- cosinor(small.pure)
# spo <- cosinor(small.pure,opti=TRUE) #iterative fit
# sn <- cosinor(small.noisy) #linear
# sno <- cosinor(small.noisy,opti=TRUE) #iterative
# sum.df <- data.frame(pure=p,noisy = n, small=sp,small.noise = sn, 
#         small.opt=spo,small.noise.opt=sno)
# round(sum.df,2)
# round(circadian.cor(sum.df[,c(1,3,5,7,9,11)]),2)  #compare alternatives 
# 
# #now, lets form three "subjects" and show how the grouping variable works
# mixed.df <- rbind(small.pure,small.noisy,noisy)
# mixed.df <- data.frame(ID=c(rep(1,6),rep(2,6),rep(3,24)),
#           time=c(rep(c(8,11,14,17,20,23),2),1:24),mixed.df)
# group.df <- cosinor(angle="time",x=2:20,code="ID",data=mixed.df)
# round(group.df,2)  #compare these values to the sp,sn,and n values done separately

Simulate the C(ues) T(endency) A(ction) model of motivation

Description

Dynamic motivational models such as the Dynamics of Action (Atkinson and Birch, 1970, Revelle, 1986) may be reparameterized as a simple pair of differential (matrix) equations (Revelle, 1986, 2008). This function simulates the dynamic aspects of the CTA. The CTA model is discussed in detail in Revelle and Condon (2015).

Usage

cta (n=3,t=5000, cues = NULL, act=NULL, inhibit=NULL,expect = NULL, consume = NULL, 
tendency = NULL,tstrength=NULL, type="both", fast=2,compare=FALSE,learn=TRUE,reward=NULL) 
cta.15(n = 3, t = 5000, cues = NULL,stim=NULL, act = NULL, inhibit = NULL, consume = NULL, 
   ten = NULL,  type = "both", fast = 2)

Arguments

n

number of actions to simulate

t

length of time to simulate

cues

a vector of cue strengths

stim

A vector of the environmental stimuli

act

matrix of associations between cues and action tendencies

inhibit

inhibition matrix

consume

Consummation matrix

ten

Initial values of action tendencies

type

show actions, tendencies, both, or state diagrams

fast

display every fast-th time point (skipping intermediate points speeds up the display)

expect

A matrix of expectations

tendency

starting values of tendencies

tstrength

a vector of starting values of tendency strengths

compare

Allows a two x two graph to compare two plots

learn

Allow the system to learn (self reinforce) over time

reward

The strength of the reward for doing an action

Details

A very thorough discussion of the CTA model is available from Revelle (2008). An application of the model is discussed in Revelle and Condon (2015).

cta.15 is the version used to produce the figures and analysis in Revelle and Condon (2015). cta is the most recent version and includes a learning function developed in collaboration with Luke Smillie at the University of Melbourne.

The dynamics of action (Atkinson and Birch, 1970) was a model of how instigating forces elicited action tendencies which in turn elicited actions. The basic concept was that action tendencies had inertia. That is, a wish (action tendency) would persist until satisfied and would not change without an instigating force. The consummatory strength of doing an action was thought in turn to reduce the action tendency. Forces could either be instigating or inhibitory (leading to "negaction").

Perhaps the simplest example is the action tendency (T) to eat a pizza. The instigating forces (F) to eat the pizza include the smell and look of the pizza, and once eating it, the flavor and texture. However, if eating the pizza, there is also a consummatory force (C) which was thought to reflect both the strength (gusto) of eating the pizza as well as some constant consummatory value of the activity (c). If not eating the pizza, but in a pizza parlor, the smells and visual cues combine to increase the tendency to eat the pizza. Once eating it, however, the consummatory effect is no longer zero, and the change in action tendency will be a function of both the instigating forces and the consummatory forces. These will achieve a balance when instigating forces are equal to the consummatory forces. The asymptotic strength of eating the pizza reflects this balance and does not require a “set point" or “comparator".

To avoid the problems of instigating and consummatory lags and the need for a decision mechanism, it is possible to reparameterize the original DOA model in terms of action tendencies and actions (Revelle, 1986). Rather than specifying inertia for action tendencies and a choice rule of always expressing the dominant action tendency, it is useful to distinguish between action tendencies (t) and the actions (a) themselves and to have actions as well as tendencies having inertial properties. By separating tendencies from actions, and giving them both inertial properties, we avoid the necessity of a lag parameter, and by making the decision rule one of mutual inhibition, the process is perhaps easier to understand. In an environment which affords cues for action (c), cues enhance action tendencies (t) which in turn strengthen actions (a). This leads to two differential equations, one describing the growth and decay of action tendencies (t), the other of the actions themselves (a).

dt = Sc - Ca

and

da = Et - Ia.

(See Revelle and Condon (2015) for an extensive discussion of this model.)

cta simulates this model, with the addition of a learning parameter such that activities strengthen the connection between cues and tendencies. The learning part of the cta model is still under development. cta.15 represents the state of the cta model as described in the Revelle and Condon (2015) article.
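As a toy illustration of the two equations (made up parameter values and a simple Euler step; the cta function itself handles cues, inhibition, consummation, expectancies and learning in far more detail):

n <- 2; steps <- 2000; delta <- .05
S <- diag(1, n)                       #cue to tendency weights
C <- diag(.3, n)                      #consummatory weights
E <- diag(.2, n)                      #tendency to action (expectancy) weights
I <- matrix(.1, n, n); diag(I) <- 0   #mutual inhibition between actions
cue <- c(1, 2)
tend <- rep(0, n); act <- rep(0, n)
history <- matrix(NA, steps, n)
for(i in 1:steps) {
  d.t <- S %*% cue - C %*% act        #dt = Sc - Ca
  d.a <- E %*% tend - I %*% act       #da = Et - Ia
  tend <- tend + delta * as.vector(d.t)
  act <- pmax(act + delta * as.vector(d.a), 0)   #actions cannot go negative
  history[i, ] <- act
}
matplot(history, type = "l", xlab = "time", ylab = "action strength")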

Value

graphical output unless type="none"

cues

echo back the cue input

inhibition

echo back the inhibitory matrix

time

time spent in each activity

frequency

Frequency of each activity

tendencies

average tendency strengths

actions

average action strength

Author(s)

William Revelle

References

Atkinson, John W. and Birch, David (1970) The dynamics of action. John Wiley, New York, N.Y.

Revelle, William (1986) Motivation and efficiency of cognitive performance in Brown, Donald R. and Veroff, Joe (ed). Frontiers of Motivational Psychology: Essays in honor of J. W. Atkinson. Springer. (Available as a pdf at https://personality-project.org/revelle/publications/dynamicsofmotivation.pdf.)

Revelle, W. (2008) Cues, Tendencies and Actions. The Dynamics of Action revisted. https://personality-project.org/revelle/publications/cta.pdf

Revelle, W. and Condon, D. (2015) A model for personality at three levels. Journal of Research in Personality https://www.sciencedirect.com/science/article/pii/S0092656615000318

Examples

#not run 
#cta()   #default values, running over time 
#cta(type="state") #default values, in a state space  of tendency 1 versus tendency 2
#these next are examples without graphic output
#not run
#two introverts
#c2i <- c(.95,1.05)
#cta(n=2,t=10000,cues=c2i,type="none")
#two extraverts
#c2e <- c(3.95,4.05)
#cta(n=2,t=10000,cues=c2e,type="none")
#three introverts
#c3i <-  c(.95,1,1.05)
#cta(3,t=10000,cues=c3i,type="none")
#three extraverts
#c3i <- c(3.95,4, 4.05)
#cta(3,10000,c3i,type="none")
#mixed
#c3 <- c(1,2.5,4)
#cta(3,10000,c3,type="none")

Create a 'violin plot' or density plot of the distribution of a set of variables

Description

Among the many ways to describe a data set, one is a density plot for each value of a grouping variable and another is a violin plot of multiple variables. A density plot shows the density for different groups so that effect sizes may be seen. A violin plot is similar to a box plot but shows the actual distribution. Median and 25th and 75th percentile lines are added to the display. If a grouping variable is specified, violinBy will draw violin plots for each variable and for each group. Data points may be drawn as well, in what is known as a "raincloud plot".

Usage

violin(x,data=NULL, var=NULL, grp=NULL, grp.name=NULL, xlab=NULL, ylab=NULL,
	main="Density plot", vertical=TRUE, dots=FALSE, rain=FALSE, jitter=.05, alpha=1,
	errors=FALSE, eyes=TRUE, adjust=1, restrict=TRUE, xlim=NULL, add=FALSE, 
	col=NULL, pch=20, scale=NULL,...) 
 
violinBy(x, var=NULL, grp=NULL, data=NULL, grp.name=NULL, xlab=NULL, ylab=NULL,
	main="Density plot", vertical=TRUE, dots=FALSE, rain=FALSE, jitter=.05, alpha= 1,
	errors=FALSE, eyes=TRUE, adjust=1, restrict=TRUE, xlim=NULL, add=FALSE, 
	col=NULL, pch=20, scale=NULL,...) 
 
densityBy(x, var=NULL, grp=NULL,data=NULL, freq=FALSE, col=c("blue","red","black"), 
	alpha=.5, adjust=1, ylim=NULL, xlim=NULL, xlab="Variable", ylab="Density",
     main="Density Plot",legend=NULL)

Arguments

x

A matrix or data.frame (can be expressed in formula input)

var

The variable(s) to display

grp

The grouping variable(s)

data

The name of the data object if using formula input

grp.name

If the grouping variable is specified, then what names should be given to the groups? Defaults to 1:ngrp

ylab

The y label

xlab

The x label

main

Figure title

vertical

If TRUE, plot the violins vertically, otherwise, horizontally

dots

if TRUE, add a stripchart with the data points

rain

If TRUE, draw a half violin with rain drops

jitter

If doing a stripchart, then jitter the points this much

errors

If TRUE, add error bars or cats eyes to the violins

eyes

if TRUE and errors=TRUE, then draw cats eyes

alpha

A degree of transparency (0=transparent ... 1 not transparent)

adjust

Allows smoothing of density histograms when plotting variables like height

freq

if TRUE, then plot frequencies (n * density)

restrict

Restrict the density to the observed max and min of the data

xlim

if not specified, will be .5 beyond the number of variables

ylim

If not specified, determined by the data

add

Allows overplotting

col

Allows for specification of colours. The default for 2 groups is blue and red, for more group levels, rainbows.

pch

The plot character for the mean is by default a small filled circle. To not show the mean, use pch=NA

scale

If NULL, scale the widths by the square root of sample size, otherwise scale by the value supplied.

legend

If not NULL, draw a legend at c(topleft,topright,top,left,right)

...

Other graphic parameters

Details

Describe the data using a violin plot. Change alpha to modify the shading. The grp variable may be used to draw separate violin plots for each of multiple groups.

For relatively smallish data sets (< 500-1000), it is informative to also show the actual data points. This is done with the dots=TRUE option. The jitter value is arbitrarily set to .05, but making it larger (say .1 or .2) will display more points.

Perhaps even prettier is to draw "raincloud" plots (half violins with rain drops), using the rain=TRUE option.

Value

The density (y axis) by value (x axis) of the data (for densityBy) or a violin plot for each variable (perhaps broken down by groups)

Note

Formula input added July 12, 2020

Author(s)

William Revelle

See Also

describe, describeBy and statsBy for descriptive statistics and error.bars, error.bars.by and bi.bars, histBy and scatterHist for other graphic displays

Examples

violin(psychTools::bfi[4:8])
violin(SATV + SATQ ~ gender, data=sat.act, grp.name =cs(MV,FV,MQ,FQ)) #formula input
violinBy(psychTools::bfi,var=4:7,grp ="gender",grp.name=c("M","F"))
#rain does not work for multiple DVs, just up to 2 IVs 
violinBy(SATV  ~ education,data =sat.act, rain=TRUE, pch=".", vertical=FALSE)  #rain 

densityBy(SATV  ~ gender,data =sat.act,legend=1)  #formula input

Basic descriptive statistics useful for psychometrics

Description

There are many summary statistics available in R; this function provides the ones most useful for scale construction and item analysis in classic psychometrics. Range is most useful for the first pass in a data set, to check for coding errors. Parallelizes if multiple cores are available.

Usage

describe(x, na.rm = TRUE, interp=FALSE,skew = TRUE, ranges = TRUE,trim=.1,
              type=3,check=TRUE,fast=NULL,quant=NULL,IQR=FALSE,omit=FALSE,data=NULL,
              size=50)
describeData(x,head=4,tail=4)
describeFast(x)

Arguments

x

A data frame or matrix

na.rm

The default is to delete missing data and find the statistics on the remaining values for each variable. na.rm=FALSE will delete any case with missing data.

interp

Should the median be standard or interpolated

skew

Should the skew and kurtosis be calculated?

ranges

Should the range be calculated?

trim

trim=.1: find trimmed means by dropping the top and bottom trim fraction

type

Which estimate of skew and kurtosis should be used? (See details.)

check

Should we check for non-numeric variables? Slower but helpful.

fast

if TRUE, will do n, means, sds, min, max, ranges for an improvement in speed. If NULL, will switch to fast mode for large (ncol * nrow > 10^7) problems, otherwise defaults to fast = FALSE

quant

if not NULL, will find the specified quantiles (e.g. quant=c(.25,.75) will find the 25th and 75th percentiles)

IQR

If TRUE, show the interquartile range

omit

Do not convert non-numerical variables to numeric, omit them instead

head

show the first 1:head cases for each variable in describeData

tail

Show the last nobs-tail cases for each variable in describeData

data

Allows formula input for specific grouping variables

size

For very big problems (in terms of nvar) set size to some reasonable value.

Details

In basic data analysis it is vital to get basic descriptive statistics. Procedures such as summary and Hmisc::describe do so. The describe function in the psych package is meant to produce the most frequently requested stats in psychometric and psychology studies, and to produce them in an easy to read data.frame. If a grouping variable is specified in formula mode, it will also call describeBy to do the processing. The results from describe can be used in graphics functions (e.g., error.crosses).

The range statistics (min, max, range) are most useful for data checking to detect coding errors, and should be found in early analyses of the data.

Although describe will work on data frames as well as matrices, it is important to realize that for data frames, descriptive statistics will be reported only for those variables where this makes sense (i.e., not for alphanumeric data).

If the check option is TRUE, variables that are categorical or logical are converted to numeric and then described. These variables are marked with an * in the row name. This is somewhat slower. Note that in the case of categories or factors, the numerical ordering is not necessarily the one expected. For instance, if education is coded "high school", "some college" , "finished college", then the default coding will lead to these as values of 2, 3, 1. Thus, statistics for those variables marked with * should be interpreted cautiously (if at all).

In a typical study, one might read the data in from the clipboard (read.clipboard), show the splom plot of the correlations (pairs.panels), and then describe the data.

na.rm=FALSE is equivalent to describe(na.omit(x))

When finding the skew and the kurtosis, there are three different options available. These match the choices available in skewness and kurtosis found in the e1071 package (see Joanes and Gill (1998) for the advantages of each one).

If we define m_r = [sum(X - mx)^r]/n, then

Type 1 finds skewness and kurtosis by g_1 = m_3/(m_2)^{3/2} and g_2 = m_4/(m_2)^2 - 3.

Type 2 is G_1 = g_1 * sqrt(n(n-1))/(n-2) and G_2 = (n-1) * [(n+1) g_2 + 6]/((n-2)(n-3)).

Type 3 is b_1 = [(n-1)/n]^{3/2} m_3/m_2^{3/2} and b_2 = [(n-1)/n]^{3/2} m_4/m_2^2.
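A sketch of the three skew estimates written out from these definitions (toy data; compare with the skew function):

set.seed(42)
x <- rchisq(500, 2)                       #some positively skewed data
n <- length(x)
m <- function(r) sum((x - mean(x))^r)/n   #central moments with divisor n
g1 <- m(3)/m(2)^(3/2)                     #type 1
G1 <- g1 * sqrt(n * (n - 1))/(n - 2)      #type 2
b1 <- g1 * ((n - 1)/n)^(3/2)              #type 3, the default in describe
round(c(type1 = g1, type2 = G1, type3 = b1, psych = skew(x, type = 3)), 4)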

The additional helper function describeData just scans the data array and reports on whether the data are all numerical, logical/factorial, or categorical. This is a useful check to run if trying to get descriptive statistics on very large data sets where to improve the speed, the check option is FALSE.

An even faster overview of the data is describeFast which reports the number of total cases, number of complete cases, number of numeric variables and the number which are factors.

The fast=TRUE option will lead to a speed up of about 50% for larger problems by not finding all of the statistics (see NOTE)

To describe the data for different groups, see describeBy or specify the grouping variable(s) in formula mode (see the examples).

Value

A data.frame of the relevant statistics:
item name
item number
number of valid cases
mean
standard deviation
trimmed mean (with trim defaulting to .1)
median (standard or interpolated)
mad: median absolute deviation (from the median).
minimum
maximum
skew
kurtosis
standard error

Note

For very large data sets that are data.frames, describe can be rather slow. Converting the data to a matrix first is recommended. However, if the data are of different types, (factors or logical), this is not possible. If the data set includes columns of character data, it is also not possible. Thus, a quick pass with describeData is recommended. Even faster is a quick pass with describeFast which just counts number of observations per variable and reports the type of data (numerical, factor, logical).

For the greatest speed, at the cost of losing information, do not ask for ranges or for skew and turn off check. This is done automatically if the fast option is TRUE or for large data sets.

Note that by default, fast=NULL. But if the number of cases x number of variables exceeds (ncol * nrow > 10^7), fast will be set to TRUE. This will provide just n, mean, sd, min, max, range, and standard errors. To get all of the statistics (but at a cost of greater time) set fast=FALSE.

The problem seems to be a memory limitation in that the time taken is an accelerating function of nvars * nobs. Thus, for a largish problem (72,000 cases with 1680 variables) which might take 330 seconds, doing it as two sets of 840 variables cuts the time down to 80 seconds.

The object returned is a data frame with the normal precision of R. However, to control the number of digits displayed, you can set digits in a print command, rather than losing precision at the descriptive stats level. See the last two examples. One just sets the number of digits; the other uses signif to make 'prettier' output where all numbers are displayed to the same number of digits.

The MAD (median absolute deviation from the median) is calculated using the mad function from the stats package in Core-R. Note that by default, the MAD is adjusted by a scaling factor (1.4826) that will give the expectation of the MAD to be the same as the standard deviation for normal data.

An interesting problem with describe is that a function with the same name is in the Hmisc package. Hmisc is loaded by qgraph, which in turn is loaded by semPlot. So even if not directly loading Hmisc, if you load semPlot after loading psych, describe will not work; the reverse order of loading should work.

Author(s)

https://personality-project.org/revelle.html

Maintainer: William Revelle [email protected]

References

Joanes, D.N. and Gill, C.A (1998). Comparing measures of sample skewness and kurtosis. The Statistician, 47, 183-189.

See Also

describeBy, skew, kurtosi, interp.median, read.clipboard. Then, for graphic output, see error.crosses, pairs.panels, error.bars, error.bars.by and densityBy, or violinBy

Examples

data(sat.act)
describe(sat.act)
describe(sat.act ~ gender) #formula mode option calls describeBy for the entire data frame
describe(SATV + SATQ ~ gender, data=sat.act) #formula mode specifies just two variables

describe(sat.act,skew=FALSE)
describe(sat.act,IQR=TRUE) #show the interquartile Range
describe(sat.act,quant=c(.1,.25,.5,.75,.90) ) #find the 10th, 25th, 50th, 
                   #75th and 90th percentiles
                   
                   
 
describeData(sat.act) #the fast version just  gives counts and head and tail

print(describeFast(sat.act),short=FALSE)  #even faster is just counts  (just less information)  

#now show how to adjust the displayed number of digits
 des <- describe(sat.act)  #find the descriptive statistics.  Keep the original accuracy
 des  #show the normal output, which is rounded to 2 decimals
 print(des,digits=3)  #show the output, but round to 3 (trailing) digits
 print(des, signif=3) #round all numbers to the 3 significant digits

Basic summary statistics by group

Description

Report basic summary statistics by a grouping variable. Useful if the grouping variable is some experimental variable and data are to be aggregated for plotting. Partly a wrapper for by and describe

Usage

describeBy(x, group=NULL,mat=FALSE,type=3,digits=15,data,...)
describe.by(x, group=NULL,mat=FALSE,type=3,...)  # deprecated

Arguments

x

a data.frame or matrix. See note for statsBy.

group

a grouping variable or a list of grouping variables. (may be ignored if calling using the formula mode.)

mat

provide a matrix output rather than a list

type

Which type of skew and kurtosis should be found

digits

When giving matrix output, how many digits should be reported?

data

Needed if using formula input

...

parameters to be passed to describe

Details

To get descriptive statistics for several different grouping variables, make sure that group is a list. In the case of matrix output with multiple grouping variables, the grouping variable values are added to the output.

As of July, 2020, the grouping variable(s) may be specified in formula mode (see the examples).

The type parameter specifies which version of skew and kurtosis should be found. See describe for more details.

An alternative function (statsBy) returns a list of means, n, and standard deviations for each group. This is particularly useful if finding weighted correlations of group means using cor.wt. More importantly, it does a proper within and between group decomposition of the correlation.

cohen.d will work for two groups. It converts the data into mean differences and pools the within group standard deviations. Returns cohen.d statistic as well as the multivariate generalization (Mahalanobis D).

Value

A data.frame of the relevant statistics broken down by group:
item name
item number
number of valid cases
mean
standard deviation
median
mad: median absolute deviation (from the median)
minimum
maximum
skew
standard error

Author(s)

William Revelle

See Also

describe, statsBy, densityBy and violinBy, cohen.d, cohen.d.by, and cohen.d.ci as well as error.bars and error.bars.by for other graphical displays.

Examples

data(sat.act)
describeBy(sat.act,sat.act$gender) #just one grouping variable
describeBy(sat.act ~ gender)   #describe the entire set	formula input
describeBy(SATV + SATQ ~ gender,data =sat.act)  #specify the data set if using formula
#describeBy(sat.act,list(sat.act$gender,sat.act$education))  #two grouping variables
describeBy(sat.act ~ gender +  education) #two grouping variables
des.mat <- describeBy(age ~ education,mat=TRUE,data = sat.act) #matrix (data.frame) output 
des.mat <- describeBy(age ~ education + gender, data=sat.act,
               mat=TRUE,digits=2)  #matrix output  rounded to 2 decimals

Helper functions for drawing path model diagrams

Description

Path models are used to describe structural equation models or cluster analytic output. These functions provide the primitives for drawing path models. Used as a substitute for some of the functionality of Rgraphviz.

Usage

diagram(fit,...)
dia.rect(x, y = NULL, labels = NULL,  cex = 1,  xlim = c(0, 1), ylim = c(0, 1),
    draw=TRUE, ...)
dia.ellipse(x, y = NULL, labels = NULL, cex=1,e.size=.05, xlim=c(0,1), 
       ylim=c(0,1),draw=TRUE,  ...) 
dia.triangle(x, y = NULL, labels =NULL,  cex = 1, xlim=c(0,1),ylim=c(0,1),...)
dia.ellipse1(x,y,e.size=.05,xlim=c(0,1),ylim=c(0,1),draw=TRUE,...)
dia.shape(x, y = NULL, labels = NULL, cex = 1, 
         e.size=.05, xlim=c(0,1), ylim=c(0,1), shape=1, ...)
dia.arrow(from,to,labels=NULL,scale=1,cex=1,adj=2,both=FALSE,pos=NULL,l.cex,
        gap.size,draw=TRUE,col="black",lty="solid",...)
dia.curve(from,to,labels=NULL,scale=1,...)
dia.curved.arrow(from,to,labels=NULL,scale=1,both=FALSE,dir=NULL,draw=TRUE,...)
dia.self(location,labels=NULL,scale=.8,side=2,draw=TRUE,...)
dia.cone(x=0, y=-2, theta=45, arrow=TRUE,curves=TRUE,add=FALSE,labels=NULL,
      xlim = c(-1, 1), ylim=c(-1,1),... ) 
multi.self(self.list,...)
multi.arrow(arrows.list,...)
multi.curved.arrow(curved.list,l.cex=NULL,...)
multi.rect(rect.list,...)

Arguments

fit

The results from a factor analysis fa, components analysis principal, omega reliability analysis, omega, cluster analysis iclust, topdown (bassAckward) bassAckward or confirmatory factor analysis, cfa, or structural equation model, sem, using the lavaan package.

x

x coordinate of a rectangle or ellipse

y

y coordinate of a rectangle or ellipse

e.size

The size of the ellipse (scaled by the number of variables)

labels

Text to insert in rectangle, ellipse, or arrow

cex

adjust the text size

col

line color (normal meaning for plot figures)

lty

line type

l.cex

Adjust the text size in arrows, defaults to cex which in turn defaults to 1

gap.size

Tweak the gap in an arrow to allow the label to sit in the gap

adj

Where to put the label along the arrows (values are then divided by 4)

both

Should the arrows have arrow heads on both ends?

scale

modifies the size of rectangles and ellipses as well as the curvature of curves. (For curvature, positive numbers are concave down and to the left.)

from

arrows and curves go from

to

arrows and curves go to

location

where is the rectangle?

shape

Which shape to draw

xlim

default ranges

ylim

default ranges

draw

Draw the text box

side

Which side of boxes should errors appear

theta

Angle in degrees of vectors

arrow

draw arrows for edges in dia.cone

add

if TRUE, plot on previous plot

curves

if TRUE, draw curves between arrows in dia.cone

pos

The position of the text label. Follows the text positions of 1, 2, 3, 4 or NULL (as in the pos argument to text)

dir

Should the direction of the curve be calculated dynamically, or set as "up" or "left"

...

Most graphic parameters may be passed here

self.list

list saved from dia.self

arrows.list

list saved from dia.arrow

curved.list

list saved from dia.curved.arrow

rect.list

list saved from dia.rect

Details

The diagram function calls fa.diagram, omega.diagram, ICLUST.diagram, lavaan.diagram or bassAckward.diagram depending upon the class of the fit input. See those functions for particular parameter values.

The remaining functions are the graphic primitives used by fa.diagram, structure.diagram, omega.diagram, ICLUST.diagram and het.diagram

They create rectangles, ellipses or triangles surrounding text, connect them to straight or curved arrows, and can draw an arrow from and to the same rectangle.

To speed up the plotting, dia.rect and dia.arrow can suppress the actual drawing and return the locations and values to plot. These values can then be directly called by text or rect with matrix input. This leads to an impressive increase in speed when doing many variables.

The functions multi.rect, multi.self, multi.arrow and multi.curved.arrow will take the saved output from the appropriate primitives and then draw them all at once.

Each shape (ellipse, rectangle or triangle) has a left, right, top and bottom and center coordinate that may be used to connect the arrows.

Curves are double-headed arrows. By default they go from one location to another and curve either left or right (if going up or down) or up or down (going left to right). The direction of the curve may be set by dir="up" for left right curvature.

The helper functions were developed to get around the infelicities associated with trying to install Rgraphviz and graphviz.

These functions form the core of fa.diagram, het.diagram.

Better documentation will be added as these functions get improved. Currently the helper functions are just a work around for Rgraphviz.

dia.cone draws a cone with (optionally) arrows as sides and centers to show the problem of factor indeterminacy.

Value

Graphic output

Author(s)

William Revelle

See Also

The diagram functions that use the dia functions: fa.diagram, structure.diagram, omega.diagram, and ICLUST.diagram.

Examples

#first, show the primitives
xlim=c(-2,10)
ylim=c(0,10)
plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of diagram functions",axes=FALSE,xlab="",ylab="")
ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
ml <- dia.rect(1,6,"middle left",xlim=xlim,ylim=ylim)
ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
bl <- dia.rect(1,1,"bottom left",xlim=xlim,ylim=ylim)
lr <- dia.ellipse(7,3,"lower right",xlim=xlim,ylim=ylim,e.size=.07)
ur <- dia.ellipse(7,9,"upper right",xlim=xlim,ylim=ylim,e.size=.07)
mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim,e.size=.07)
lm <- dia.triangle(4,1,"Lower Middle",xlim=xlim,ylim=ylim)
br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim) 
dia.curve(from=ul$left,to=bl$left,"double headed",scale=-1)

dia.arrow(from=lr,to=ul,labels="right to left")
dia.arrow(from=ul,to=ur,labels="left to right")
dia.curved.arrow(from=lr,to=ll,labels ="right to left")
dia.curved.arrow(to=ur,from=ul,labels ="left to right")
dia.curve(ll$top,ul$bottom,"right")  #for rectangles, specify where to point 

dia.curve(ll$top,ul$bottom,"left",scale=-1)  #for rectangles, specify where to point 
dia.curve(mr,ur,"up")  #but for ellipses, you may just point to it.
dia.curve(mr,lr,"down")
dia.curve(mr,ur,"up")
dia.curved.arrow(mr,ur,"up")  #but for ellipses, you may just point to it.
dia.curved.arrow(mr,lr,"down")  #but for ellipses, you may just point to it.

dia.curved.arrow(ur$right,mr$right,"3")
dia.curve(ml,mr,"across")
dia.curve(ur$right,lr$right,"top down",scale =2)
dia.curved.arrow(br$top,lr$right,"up")
dia.curved.arrow(bl,br,"left to right")
dia.curved.arrow(br,bl,"right to left",scale=-1)
dia.arrow(bl,ll$bottom)
dia.curved.arrow(ml,ll$right)
dia.curved.arrow(mr,lr$top)

#now, put them together in a factor analysis diagram
v9 <- sim.hierarchical()
f3 <- fa(v9,3,rotate="cluster")
fa.diagram(f3,error=TRUE,side=3)

Draw a correlation ellipse and two normal curves to demonstrate tetrachoric correlation

Description

A graphic of a correlation ellipse divided into 4 regions based upon x and y cutpoints on two normal distributions. This is also an example of using the layout function. Draw a bivariate density plot to show how tetrachorics work.

Usage

draw.tetra(r, t1, t2,shade=TRUE)
draw.cor(r=.5,expand=10,theta=30,phi=30,N=101,nbcol=30,box=TRUE,
main="Bivariate density  rho = ",cuts=NULL,all=TRUE,ellipses=TRUE,ze=.15)

Arguments

r

the underlying Pearson correlation defines the shape of the ellipse

t1

X is cut at tau

t2

Y is cut at tau

shade

shade the diagram (default is TRUE)

expand

The relative height of the z axis

theta

The angle to rotate the x-y plane

phi

The angle above the plane to view the graph

N

The grid resolution

nbcol

The color resolution

box

Draw the axes

main

The main title

cuts

Should the graphic show cuts (e.g., cuts=c(0,0))

all

Show all four parts of the tetrachoric

ellipses

Draw a correlation ellipse

ze

height of the ellipse if requested

Details

A graphic demonstration of the tetrachoric correlation. Used for teaching purposes. The default values are for a correlation of .5 with cuts at 1 and 1. Any other values are possible. The code is also a demonstration of how to use the layout function for complex graphics using base graphics.

Author(s)

William Revelle

See Also

tetrachoric to find tetrachoric correlations, irt.fa and fa.poly to use them in factor analyses, scatter.hist to show correlations and histograms.

Examples

#if(require(mvtnorm)) {
#draw.tetra(.5,1,1)
#draw.tetra(.8,2,1)} else {print("draw.tetra requires the mvtnorm package")
#draw.cor(.5,cuts=c(0,0))}

draw.tetra(.5,1,1)
draw.tetra(.8,2,1)
draw.cor(.5,cuts=c(0,0))

Create dummy coded variables

Description

Given a variable x with n distinct values, create n new dummy coded variables coded 0/1 for the presence (1) or absence (0) of each value. A typical application would be to create dummy coded college majors from a vector of college majors. Can also combine categories by group. By default, NA values of x are returned as NA (added 10/20/17)

Usage

dummy.code(x,group=NULL,na.rm=TRUE,top=NULL,min=NULL)

Arguments

x

A vector to be transformed into dummy codes

group

A vector of categories to be coded as 1, all others coded as 0.

na.rm

If TRUE, return NA for all codes with NA in x

top

If specified, then just dummy code the top (most frequent) values, and make the rest NA

min

If specified, then dummy code all values >= min

Details

When coding demographic information, it is typical to create one variable with multiple categorical values (e.g., ethnicity, college major, occupation). dummy.code will convert these categories into n distinct dummy coded variables.

If there are many possible values (e.g., country in the SAPA data set) then specifying top will assign dummy codes to just a subset of the data.

If using dummy coded variables as predictors, remember to use n-1 variables.

If group is specified, then all values of x that are in group are given the value of 1, otherwise, 0. (Useful for combining a range of science majors into STEM or not. The example forms a dummy code of any smoking at all.)

Value

A matrix of dummy coded variables

Author(s)

William Revelle

Examples

new <- dummy.code(sat.act$education)
new.sat <- data.frame(new,sat.act)
round(cor(new.sat,use="pairwise"),2)
#dum.smoke <- dummy.code(spi$smoke,group=2:9)
#table(dum.smoke,spi$smoke)
#dum.age <- dummy.code(round(spi$age/5)*5,top=5)  #the most frequent five year blocks
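
#A further minimal sketch using a small made-up (hypothetical) vector rather than a package data set
majors <- c("Biology","History","Physics","English","Chemistry",NA)
dummy.code(majors)    #one 0/1 column per major; the NA case stays NA
dummy.code(majors,group=c("Biology","Physics","Chemistry"))   #a single STEM vs not-STEM code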

8 cognitive variables used by Dwyer for an example.

Description

Dwyer (1937) introduced a technique for factor extension and used 8 cognitive variables from Thurstone. This is the example data set used in his paper.

Usage

data(Dwyer)

Format

The format is: an 8 x 8 numeric correlation matrix with row and column names "V1" through "V8".

Source

Data matrix retyped from the original publication.

References

Dwyer, Paul S. (1937), The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3), 173-178

Examples

data(Dwyer)
Ro <- Dwyer[1:7,1:7]
Roe <- Dwyer[1:7,8]
fo <- fa(Ro,2,rotate="none")
fa.extension(Roe,fo)

Convert eigen vectors and eigen values to the more normal (for psychologists) component loadings

Description

The default procedures for principal components return values not immediately equivalent to the loadings from a factor analysis. eigen.loadings translates them into the more typical metric of eigen vectors multiplied by the square root of the eigenvalues. This lets us find pseudo factor loadings if we have used princomp or eigen.
If we use principal to do our principal components analysis, then we do not need this routine.

Usage

eigen.loadings(x)

Arguments

x

the output from eigen or an object of class princomp produced by princomp

Value

A matrix of Principal Component loadings more typical for what is expected in psychometrics. That is, they are scaled by the square root of the eigenvalues.

Note

Useful for SAPA analyses

Author(s)

[email protected]
https://personality-project.org/revelle.html

Examples

x <- eigen(Harman74.cor$cov)
x$vectors[1:8,1:4]  #as they appear from eigen
y <- princomp(covmat=Harman74.cor$cov) 
y$loadings[1:8,1:4] #as they appear from princomp
eigen.loadings(x)[1:8,1:4] # rescaled by the square roots of the eigen values
z <- pca(Harman74.cor$cov,4,rotate="none")
z$loadings[1:8,1:4]  #as they appear in pca
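
#An illustrative check (not part of the original example): the same rescaling done by hand
by.hand <- x$vectors %*% diag(sqrt(x$values))
round(by.hand[1:8,1:4] - eigen.loadings(x)[1:8,1:4],8)  #should be (essentially) all zero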

Plot data and 1 and 2 sigma correlation ellipses

Description

For teaching correlation, it is useful to draw ellipses around the mean to reflect the correlation. This variation of the ellipse function from John Fox's car package does so. Input may be either two vectors or a matrix or data.frame. In the latter cases, if the number of variables >2, then the ellipses are done in the pairs.panels function. Ellipses may be added to existing plots. The minkowski function is included as a generalized ellipse.

Usage

ellipses(x, y = NULL, add = FALSE, smooth=TRUE, lm=FALSE,data=TRUE, n = 2,
 span=2/3, iter=3, col = "red", xlab =NULL,ylab= NULL,size=c(1,2), ...)
minkowski(r=2,add=FALSE,main=NULL,xl=1,yl=1)

Arguments

x

a vector, matrix, or data.frame

y

Optional second vector

add

Should a new plot be created, or should the ellipse be added to an existing plot?

smooth

smooth = TRUE -> draw a loess fit

lm

lm=TRUE -> draw the linear fit

data

data=TRUE implies draw the data points

n

Should 1 or 2 ellipses be drawn

span

averaging window parameter for the lowess fit

iter

iteration parameter for lowess

col

color of ellipses (default is red)

xlab

label for the x axis

ylab

label for the y axis

size

The size of ellipses in sd units (defaults to 1 and 2)

...

Other parameters for plotting

r

r=1 draws a city block, r=2 is a Euclidean circle, r > 2 tends towards a square

main

title to use when drawing Minkowski circles

xl

stretch the x axis

yl

stretch the y axis

Details

Ellipse dimensions are calculated from the correlation between the x and y variables and are scaled as sqrt(1+r) and sqrt(1-r). They are then scaled as size[1] and size[2] standard deviation units. To scale for 95 and 99 percent confidence use c(1.64,2.32).
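
For example (illustrative arithmetic only, not the internal code), the implied axis lengths for r = .5 and the default sizes are:

r <- .5
size <- c(1,2)        #the default 1 and 2 sd ellipses
sqrt(1+r) * size      #axis lengths (in sd units) along the major direction
sqrt(1-r) * size      #axis lengths along the minor direction
#for approximate 95 and 99 percent regions use size = c(1.64,2.32)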

Value

A single plot (for 2 vectors or data frames with fewer than 3 variables). Otherwise a call is made to pairs.panels.

Note

Adapted from John Fox's ellipse and data.ellipse functions.

Author(s)

William Revelle

References

Galton, Francis (1888), Co-relations and their measurement. Proceedings of the Royal Society of London, 45, 135-145.

See Also

pairs.panels

Examples

data(galton, package="psychTools")
galton <- psychTools::galton
ellipses(galton,lm=TRUE)

ellipses(galton$parent,galton$child,xlab="Mid Parent Height",
                  ylab="Child Height") #input are two vectors
data(sat.act)
ellipses(sat.act)  #shows the pairs.panels ellipses
minkowski(2,main="Minkowski circles")
minkowski(1,TRUE)
minkowski(4,TRUE)

Plot means and confidence intervals

Description

One of the many functions in R to plot means and confidence intervals. Can be done using barplots if desired. Can also be combined with such functions as boxplot or violin to summarize distributions. Means and standard errors are calculated from the raw data using describe. Alternatively, plots of means +/- one standard deviation may be drawn.

Usage

error.bars(x,stats=NULL,data=NULL,group=NULL, ylab = "Dependent Variable",
	xlab="Independent Variable", main=NULL,eyes=TRUE, ylim = NULL, xlim=NULL,
	alpha=.05, sd=FALSE, labels = NULL, pos = NULL,  arrow.len = 0.05,
	arrow.col="black",add = FALSE,bars=FALSE, within=FALSE, 
	col=c("black","blue","red"), density=-10,...)

error.bars.tab(t,way="columns",raw=FALSE,col=c('blue','red'),...)

Arguments

x

A data frame or matrix of raw data OR, a formula of the form DV ~ IV. If formula input is specified, error.bars.by is called.

t

A table of frequencies

stats

Alternatively, a data.frame of descriptive statistics (e.g., from describe). If specified, the means, sd, n and perhaps the se of a data set to be plotted

data

If using formula input, specify the object where the data may be found

group

If not NULL, then the error.bars.by syntax is used to draw error bars by group

ylab

y label

xlab

x label

main

title for figure

ylim

if specified, the y limits for the plot, otherwise based upon the data

xlim

if specified, the x limits for the plot, otherwise c(.5,nvar + .5)

eyes

should 'cats eyes' plots be drawn

alpha

alpha level of confidence interval – defaults to 95% confidence interval

sd

if TRUE, draw one standard deviation instead of standard errors at the alpha level

labels

X axis label

pos

where to place text: below, left, above, right

arrow.len

How long should the top of the error bars be?

arrow.col

What color should the error bars be?

add

add=FALSE, new plot, add=TRUE, just points and error bars

bars

bars=TRUE will draw a bar graph if you really want to do that

within

should the error variance of a variable be corrected by 1-SMC?

col

color(s) of the catseyes. Defaults to blue.

density

If negative, solid colors, if positive, how many lines to draw

way

Percentages are based upon the row totals (the default), column totals, or the grand total of the data table

raw

If raw is FALSE, display the graphs in terms of probabilities; if raw is TRUE, display the data in terms of raw counts

...

other parameters to pass to the plot function, e.g., type="b" to draw lines, lty="dashed" to draw dashed lines

Details

Drawing the mean +/- a confidence interval is a frequently used function when reporting experimental results. By default, the confidence interval is 1.96 standard errors (adjusted for the t-distribution).

If within=TRUE, the error bars are corrected for the correlation with the other variables by reducing the variance by a factor of (1-smc). This allows for comparisons between variables.

The error bars are normally calculated from the data using the describe function. If, alternatively, a matrix of statistics is provided with column headings of values, means, and se, then those values will be used for the plot (using the stats option). If n is included in the matrix of statistics, then the distribution is drawn for a t distribution for n-1 df. If n is omitted (NULL) or is NA, then the distribution will be a normal distribution.

If sd is TRUE, then the error bars will represent one standard deviation from the mean rather than be a function of alpha and the standard errors.

See the last two examples for the case of plotting data with statistics from another function.

Alternatively, error.bars.tab will take tabulated data and convert to either row, column or overall percentages, and then plot these as percentages with the equivalent standard error (based upon sqrt(pq/N)).
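
For example (illustrative arithmetic only), a cell proportion of p = .4 based upon N = 100 observations has a standard error of sqrt(p*(1-p)/N):

p <- .4; N <- 100
sqrt(p*(1-p)/N)    #approximately .049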

In August, 2018, the functionality of error.bars and error.bars.by were combined so that if groups are specified, then the error bars are done by group. Furthermore, if the x variable is a formula of the form DV ~ IV, then error.bars.by is called to do the plotting.

Value

Graphic output showing the means +/- x% confidence intervals.

These confidence regions are based upon normal theory and do not take into account any skew in the variables. More accurate confidence intervals could be found by resampling.

The error.bars.tab function will return (invisibly) the cell means and standard errors.

Author(s)

William Revelle

See Also

error.crosses for two way error bars, error.bars.by for error bars for different groups as well as error.dots.

The error.bars.by function is useful for showing the results of one or two way ANOVAs in that it will display means and CIs for one or more DVs for one or two IVs.

In addition, as pointed out by Jim Lemon on the R-help news group, error bars or confidence intervals may be drawn using

function (package)
bar.err (agricolae)
plotCI (gplots)
xYplot (Hmisc)
dispersion (plotrix)
plotCI (plotrix)

For advice on why not to draw bar graphs with error bars, see the page at biostat.mc.vanderbilt.edu/wiki/Main/DynamitePlots.

Examples

set.seed(42)
x <- matrix(rnorm(1000),ncol=20)
boxplot(x,notch=TRUE,main="Notched boxplot with error bars")
error.bars(x,add=TRUE)
abline(h=0)

#show 50% confidence regions and color each variable separately
error.bars(attitude,alpha=.5,
   main="50 percent confidence limits",col=rainbow(ncol(attitude)) )
   
error.bars(attitude,bars=TRUE)  #show the use of bar graphs


#combine with a strip chart and boxplot
stripchart(attitude,vertical=TRUE,method="jitter",jitter=.1,pch=19,
           main="Stripchart with 95 percent confidence limits")
boxplot(attitude,add=TRUE)
error.bars(attitude,add=TRUE,arrow.len=.2)

#use statistics from somewhere else
#by specifying n, we are using the t distribution for confidences
#The first example allows the variables to be spaced along the x axis
my.stats <- data.frame(values=c(1,2,8),mean=c(10,12,18),se=c(2,3,5),n=c(5,10,20))
 error.bars(stats=my.stats,type="b",main="data with confidence intervals")
#don't connect the groups
my.stats <- data.frame(values=c(1,2,8),mean=c(10,12,18),se=c(2,3,5),n=c(5,10,20))
      error.bars(stats=my.stats,main="data with confidence intervals")
#by not specifying value, the groups are equally spaced
my.stats <- data.frame(mean=c(10,12,18),se=c(2,3,5),n=c(5,10,20))
rownames(my.stats) <- c("First", "Second","Third")
error.bars(stats=my.stats,xlab="Condition",ylab="Score")


#Consider the case where we get stats from describe
temp <- describe(attitude)
error.bars(stats=temp)

#show these do not differ from the other way by overlaying the two
error.bars(attitude,add=TRUE,col="red")

#n is omitted
#the error distribution is a normal distribution
my.stats <- data.frame(mean=c(2,4,8),se=c(2,1,2))
rownames(my.stats) <- c("First", "Second","Third")
error.bars(stats=my.stats,xlab="Condition",ylab="Score")
#n is specified
#compare this with small n which shows larger confidence regions
my.stats <- data.frame(mean=c(2,4,8),se=c(2,1,2),n=c(10,10,3))
rownames(my.stats) <- c("First", "Second","Third")
error.bars(stats=my.stats,xlab="Condition",ylab="Score")


#example of arrest rates (as percentage of condition)
arrest <- data.frame(Control=c(14,21),Treated =c(3,23))
rownames(arrest) <- c("Arrested","Not Arrested")
error.bars.tab(arrest,ylab="Probability of Arrest",xlab="Control vs Treatment",
main="Probability of Arrest varies by treatment")


#Show the raw  rates 
error.bars.tab(arrest,raw=TRUE,ylab="Number Arrested",xlab="Control vs Treatment",
main="Count of Arrest varies by treatment")

#If a grouping variable is specified, the function calls error.bars.by
#Use error.bars.by to have more control over the output.
#Show  how to use grouping variables
error.bars(SATV + SATQ ~ gender, data=sat.act) #one grouping variable, formula input
error.bars(SATV + SATQ ~ education + gender,data=sat.act)#two  grouping variables

Plot means and confidence intervals for multiple groups

Description

One of the many functions in R to plot means and confidence intervals. Meant mainly for demonstration purposes, showing the probability of replication from multiple samples. Can also be combined with such functions as boxplot to summarize distributions. Means and standard errors for each group are calculated using describeBy.

Usage

error.bars.by(x,group,data=NULL, by.var=FALSE,x.cat=TRUE,ylab =NULL,xlab=NULL, main=NULL, 
	ylim= NULL, xlim=NULL, 
	eyes=TRUE, alpha=.05,sd=FALSE,labels=NULL,v.labels=NULL,v2.labels=NULL, 
	add.labels=NULL,pos=NULL, arrow.len=.05, min.size=1,add=FALSE, 
	bars=FALSE,within=FALSE,
	colors=c("black","blue","red"), lty,lines=TRUE, 
	legend=0,pch=16,density=-10,stats=NULL,...)

Arguments

x

A data frame or matrix

group

A grouping variable

data

If using formula input, the data file must be specified

by.var

A different line for each group (default) or each variable

x.cat

Is the grouping variable categorical (TRUE) or continuous (FALSE)

ylab

y label

xlab

x label

main

title for figure

ylim

if specified, the y limits for the plot, otherwise based upon the data

xlim

if specified, the x limits for the plot, otherwise based upon the data

eyes

Should 'cats eyes' be drawn?

alpha

alpha level of the confidence interval. The default (alpha = .05) gives a 95% confidence interval

sd

sd=TRUE will plot Standard Deviations instead of standard errors

labels

X axis label

v.labels

For a bar plot legend, these are the variable labels, for a line plot, the labels of the grouping variable.

v2.labels

the names for the 2nd grouping variable, if there is one

add.labels

if !NULL, then add the v2.labels to the left/right of the lines (add.labels="right")

pos

where to place text: below, left, above, right

arrow.len

How long should the top of the error bars be?

min.size

Draw error bars for groups > min.size

add

add=FALSE, new plot, add=TRUE, just points and error bars

bars

Draw a barplot with error bars rather than a simple plot of the means

within

Should the s.e. be corrected by the correlation with the other variables?

colors

groups will be plotted in different colors (mod n.groups). See the note for how to make them transparent.

lty

line type may be specified in the case of not plotting by variables

lines

By default, when plotting different groups, connect the groups with a line of type = lty. If lines is FALSE, then do not connect the groups

legend

Where should the legend be drawn: 0 (do not draw it), 1= lower right corner, 2 = bottom, 3 ... 8 continue clockwise, 9 is the center

pch

The first plot symbol to use. Subsequent groups are pch + group

density

How many lines/inch should fill the cats eyes. If missing, non-transparent colors are used. If negative, transparent colors are used. May be a vector for different values.

stats

if specified, the means, sd, n and perhaps the se of a data set to be plotted

...

other parameters to pass to the plot function e.g., lty="dashed" to draw dashed lines

Details

Drawing the mean +/- a confidence interval is a frequently used function when reporting experimental results. By default, the confidence interval is 1.96 standard errors (adjusted for the t-distribution).

Improved/modified in August, 2018 to allow formula input (see examples) as well as to more properly handle multiple groups.

Following a request for better labeling of the grouping variables, the v.lab option is implemented for line graphs as well as bar graphs. Note that if using multiple grouping variables, the labels are for the variable with the most levels (which should be the first one.)

This function was originally just a wrapper for error.bars but has been written to allow groups to be organized either as the x axis or as separate lines.

If desired, a barplot with error bars can be shown. Many find this type of plot to be uninformative (e.g., https://biostat.mc.vanderbilt.edu/DynamitePlots ) and recommend the more standard dot plot.

Note in particular, if choosing to draw barplots, the starting value is 0.0 and setting the ylim parameter can lead to some awkward results if 0 is not included in the ylim range. Did you really mean to draw a bar plot in this case?

For up to three groups, the colors are by default "black", "blue" and "red". For more than 3 groups, they are by default rainbow colors with an alpha factor (transparency) of .5.

To make colors semitransparent, set the density to a negative number. See the last example.

Value

Graphic output showing the means + x% confidence intervals for each group. For ci=1.96, and normal data, this will be the 95% confidence region. For ci=1, the 68% confidence region.

These confidence regions are based upon normal theory and do not take into account any skew in the variables. More accurate confidence intervals could be found by resampling.

The results of describeBy are reported invisibly.

See Also

See also error.crosses, histBy, scatterHist, error.bars and error.dots

Examples

data(sat.act)
#The generic plot of variables by group
error.bars.by( SATV + SATQ ~ gender,data=sat.act)  #formula input 
error.bars.by( SATV + SATQ ~ gender,data=sat.act,v.lab=cs(male,female)) #labels 
error.bars.by(SATV + SATQ ~ education + gender, data =sat.act) #see below
error.bars.by(sat.act[1:4],sat.act$gender,legend=7) #specification of variables
error.bars.by(sat.act[1:4],sat.act$gender,legend=7,labels=cs(male,female))

#a bar plot
error.bars.by(sat.act[5:6],sat.act$gender,bars=TRUE,labels=c("male","female"),
    main="SAT V and SAT Q by gender",ylim=c(0,800),colors=c("red","blue"),
    legend=5,v.labels=c("SATV","SATQ"))  #draw a barplot
#a bar plot of SAT by age -- not recommended, see the next plot
error.bars.by(SATV + SATQ ~ education,data=sat.act,bars=TRUE,xlab="Education",
   main="95 percent confidence limits of Sat V and Sat Q", ylim=c(0,800),
   v.labels=c("SATV","SATQ"),colors=c("red","blue") )
#a better graph uses points not bars
#use formula input
  #plot SAT V and SAT Q by education
error.bars.by(SATV + SATQ ~ education,data=sat.act,TRUE, xlab="Education",
    legend=5,labels=colnames(sat.act[5:6]),ylim=c(525,700),
     main="self reported SAT scores by education",
     v.lab =c("HS","in coll", "< 16", "BA/BS", "in Grad", "Grad/Prof"))
#make the cats eyes semi-transparent by specifying a negative density

error.bars.by(SATV + SATQ ~ education,data=sat.act, xlab="Education",
    legend=5,labels=c("SATV","SATQ"),ylim=c(525,700),
     main="self reported SAT scores by education",density=-10,
     v.lab =c("HS","in coll", "< 16", "BA/BS", "in Grad", "Grad/Prof"))

#use labels to specify the 2nd grouping variable, v.lab to specify the first
error.bars.by(SATV  ~ education  + gender,data=sat.act, xlab="Education",
    legend=5,labels=cs(male,female),ylim=c(525,700),
     main="self reported SAT scores by education",density=-10,
     v.lab =c("HS","in coll", "< 16", "BA/BS", "in Grad", "Grad/Prof"),
     colors=c("red","blue"))


#now for a more complicated examples using 25 big 5 items scored into 5 scales
#and showing age trends by decade 
#this shows how to convert many levels of a grouping variable (age) into more manageable levels.
data(bfi)   #The Big 5 data
#first create the keys 
 keys.list <- list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
        Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),Openness = c(21,-22,23,24,-25))
 keys <- make.keys(psychTools::bfi,keys.list)
 #then create the scores for those older than 10 and less than 80
 bfis <- subset(psychTools::bfi,((psychTools::bfi$age > 10) & (psychTools::bfi$age < 80)))

 scores <- scoreItems(keys,bfis,min=1,max=6) #set the right limits for item reversals
 #now draw the results by age
#specify the particular colors to use
 error.bars.by(scores$scores,round(bfis$age/10)*10,by.var=TRUE,
      main="BFI age trends",legend=3,labels=colnames(scores$scores),
        xlab="Age",ylab="Mean item score",
        colors=cs(green,yellow,black,red,blue),
        v.labels =cs(10-14,15-24,25-34,35-44,45-54,55-64,65-74))
#show transparency 
 error.bars.by(scores$scores,round(bfis$age/10)*10,by.var=TRUE,
      main="BFI age trends",legend=3,labels=colnames(scores$scores),
        xlab="Age",ylab="Mean item score", density=-10, 
        colors=cs(green,yellow,black,red,blue),
        v.labels =cs(10-14,15-24,25-34,35-44,45-54,55-64,65-74))

Plot x and y error bars

Description

Given two vectors of data (X and Y), plot the means and show standard errors in both X and Y directions.

Usage

error.crosses(x,y,labels=NULL,main=NULL,xlim=NULL,ylim= NULL, 
xlab=NULL,ylab=NULL,pos=NULL,offset=1,arrow.len=.2,alpha=.05,sd=FALSE,add=FALSE,
colors=NULL,col.arrows=NULL,col.text=NULL,...)

Arguments

x

A vector of data or summary statistics (from describe)

y

A second vector of data or summary statistics (also from describe)

labels

the names of each pair – defaults to rownames of x

main

The title for the graph

xlim

xlim values if desired – defaults to min and max mean(x) +/- 2 se

ylim

ylim values if desired – defaults to min and max mean(y) +/- 2 se

xlab

label for x axis – grouping variable 1

ylab

label for y axis – grouping variable 2

pos

Where should the labels be located with respect to the mean?

offset

Labels are then offset from this location

arrow.len

Arrow length

alpha

alpha level of error bars

sd

if sd is TRUE, then draw means +/- 1 sd

add

if TRUE, overlay the values with a prior plot

colors

What color(s) should be used for the plot character? Defaults to black

col.arrows

What color(s) should be used for the arrows – defaults to colors

col.text

What color(s) should be used for the text – defaults to colors

...

Other parameters for plot

Details

For an example of two way error bars describing the effects of mood manipulations upon positive and negative affect, see https://personality-project.org/revelle/publications/happy-sad-appendix/FIG.A-6.pdf

The second example shows how error crosses can be done for multiple variables where the grouping variable is found dynamically. The errorCircles example shows how to do this in one step.

Author(s)

William Revelle
[email protected]

See Also

To draw error bars for single variables error.bars, or by groups error.bars.by, or to find descriptive statistics describe or descriptive statistics by a grouping variable describeBy and statsBy.

A much improved version is now called errorCircles.

Examples

#just draw one pair of variables
desc <- describe(attitude)
x <- desc[1,]
y <- desc[2,]
error.crosses(x,y,xlab=rownames(x),ylab=rownames(y))   

#now for a bit more complicated plotting 
data(bfi, package="psychTools")
desc <- describeBy(psychTools::bfi[1:25],psychTools::bfi$gender) #select a high and low group
error.crosses(desc$'1',desc$'2',ylab="female scores",
   xlab="male scores",main="BFI scores by gender")
 abline(a=0,b=1)
 
#do it from summary statistics  (using standard errors) 
g1.stats <- data.frame(n=c(10,20,30),mean=c(10,12,18),se=c(2,3,5))
g2.stats <- data.frame(n=c(15,20,25),mean=c(6,14,15),se =c(1,2,3))
error.crosses(g1.stats,g2.stats)

#Or, if you prefer to draw +/- 1 sd. instead of 95% confidence
g1.stats <- data.frame(n=c(10,20,30),mean=c(10,12,18),sd=c(2,3,5))
g2.stats <- data.frame(n=c(15,20,25),mean=c(6,14,15),sd =c(1,2,3))
error.crosses(g1.stats,g2.stats,sd=TRUE)

#and see some even fancier plotting: This is taken from a study of mood
#four films were given (sad, horror, neutral, happy)
#with a pre and post test
data(affect, package="psychTools")
colors <- c("black","red","green","blue")
films <- c("Sad","Horror","Neutral","Happy")
affect.mat <- describeBy(psychTools::affect[10:17],psychTools::affect$Film,mat=TRUE)
 error.crosses(affect.mat[c(1:4,17:20),],affect.mat[c(5:8,21:24),],
    labels=films[affect.mat$group1],xlab="Energetic Arousal",
     ylab="Tense Arousal",colors = 
     colors[affect.mat$group1],pch=16,cex=2)

Show a dot.chart with error bars for different groups or variables

Description

Yet one more of the graphical ways of showing data with error bars for different groups. A dot.chart with error bars for different groups or variables is drawn using output from describe, describeBy, statsBy, corCi, corr.test or bestScales.

Usage

error.dots(x=NULL, var = NULL, se = NULL, group = NULL, sd = FALSE, effect=NULL,
stats=NULL, head = 12, tail = 12, sort = TRUE, decreasing = TRUE, main = NULL,
 alpha = 0.05, eyes = FALSE,items=FALSE, min.n = NULL, max.labels = 40, labels = NULL,
 label.width=NULL, select=NULL,
 groups = NULL, gdata = NULL, cex =  par("cex"),  pt.cex = cex, pch = 21, gpch = 21, 
 bg = par("bg"), fg=par("fg"), color = par("fg"), gcolor = par("fg"), 
 lcolor = "gray", xlab = NULL, ylab = NULL, xlim = NULL,add=FALSE,order=NULL, ...)

Arguments

x

A data frame or matrix of raw data, or the resulting object from describe, describeBy, statsBy, bestScales, corCi, corr.test, or cohen.d

var

The variable to show (particularly if doing describeBy or statsBy plots).

se

Source of a standard error

group

A grouping variable, if desired. Will group the data on group for one variable (var)

sd

if FALSE, confidence intervals in terms of standard errors, otherwise draw one standard deviation

effect

Should the data be compared to a specified group (with mean set to 0) in effect size units?

stats

A matrix of means and se to use instead of finding them from the data

head

The number of largest values to report

tail

The number of smallest values to report

sort

Sort the groups/variables by value

decreasing

Should they be sorted in increasing or decreasing order (from top to bottom)

main

The caption for the figure

alpha

p value for confidence intervals

eyes

Draw catseyes for error limits

items

If showing results from best scales, show the items for a specified dv

min.n

If using describeBy or statsBy, what should be the minimum sample size to draw

max.labels

Length of labels (truncate after this value)

labels

Specify the labels versus find them from the row names

label.width

Truncate after label.width characters

select

Scale the plot for all the variables, but just show the select variables

groups

ignored: to be added eventually

gdata

ignored

cex

The standard meaning of cex for graphics

pt.cex

ignored

pch

Plot character of the mean

gpch

ignored

bg

background color (of the dots showing the means)

fg

foreground color (of the line segments)

color

Color of the text labels

gcolor

ignored

lcolor

ignored until groups are implemented

xlab

Label the x axis, if NULL, the variable name is used

ylab

If NULL, then the group rownames are used

xlim

If NULL, then calculated to show nice values

add

If TRUE, will add the plot to a previous plot (e.g., from dotchart)

order

if sort=TRUE, if order is NULL, sort on values, otherwise, if order is returned from a previous figure, use that order.

...

And any other graphic parameters we have forgotten

Details

Adapted from the dot.chart function to include error bars and to use the output of describe, describeBy, statsBy, fa, bestScales or cohen.d. To speed up multiple plots, the function can work from the output of a previous run. Thus describeBy will be done and the results can be shown for multiple variables.

If using the add=TRUE option to add an error.dots plot to a dotplot, note that the order of variables in dot plots goes from last to first (highest y value is actually the last value in a vector.) Also note that the xlim parameter should be set to make sure the plots line up correctly.

Value

Returns (invisibly) either a describeBy or describe object as well as the order if sorted

Author(s)

William Revelle

References

Used in particular for showing https://sapa-project.org output.

See Also

describe, describeBy, or statsBy as well as error.bars, error.bars.by, statsBy, bestScales or cohen.d

Examples

temp <- error.dots(psychTools::bfi[1:25],sort=TRUE, 
xlab="Mean score for the item, sorted by difficulty")
error.dots(psychTools::bfi[1:25],sort=TRUE, order=temp$order,
 add=TRUE, eyes=TRUE) #over plot with eyes

error.dots(psychTools::ability,eyes=TRUE, xlab="Mean score for the item")

cd <- cohen.d(psychTools::bfi[1:26],"gender")
temp <- error.dots(cd, select=c(1:15,21:25),head=12,tail=13,
      main="Cohen d and confidence intervals of BFI by gender")
error.dots(cd,select=c(16:20),head=13,tail=12,col="blue",add=TRUE,fg="red" ,main="")
abline(v=0)
#now show cis for correlations
R <- corCi(attitude,plot=FALSE)
error.dots(R, sort=FALSE)
#the asymmetric case
R <- corr.test(attitude[,1:2],attitude[,3:7])
error.dots(R, sort=FALSE)

Two way plots of means, error bars, and sample sizes

Description

Given a matrix or data frame, data, find statistics based upon a grouping variable and then plot x and y means with error bars for each value of the grouping variable. If the data are paired (e.g. by gender), then plot means and error bars for the two groups on all variables.

Usage

errorCircles(x, y, data, ydata = NULL, group=NULL, paired = FALSE, labels = NULL, 
main = NULL, xlim = NULL, ylim = NULL, xlab = NULL, ylab = NULL,add=FALSE, pos = NULL,
 offset = 1, arrow.len = 0.2, alpha = 0.05, sd = FALSE, bars = TRUE, circles = TRUE,
 colors=NULL,col.arrows=NULL,col.text=NULL,circle.size=1, ...)

Arguments

x

The x variable (by name or number) to plot

y

The y variable (name or number) to plot

data

The matrix or data.frame to use for the x data

ydata

If plotting data from two data.frames, then the y variable of the ydata frame will be used.

group

If specified, then statsBy is called first to find the statistics by group

paired

If TRUE, plot all x and y variables for the two values of the grouping variable.

labels

Variable names

main

Main title for plot

xlim

xlim values if desired – defaults to min and max mean(x) +/- 2 se

ylim

ylim values if desired – defaults to min and max mean(y) +/- 2 se

xlab

label for x axis – grouping variable 1

ylab

label for y axis – grouping variable 2

add

If TRUE, add to the prior plot

pos

Where should the labels be located with respect to the mean?

offset

Labels are then offset from this location

arrow.len

Arrow length

alpha

alpha level of error bars

sd

if sd is TRUE, then draw means +/- 1 sd

bars

Should error.bars be drawn for both x and y

circles

Should circles representing the relative sample sizes be drawn?

colors

Plot the points using colors – default is black

col.text

What color for the text labels (defaults to colors)

col.arrows

What color should the arrows and circles be? Defaults to colors

circle.size

A scaling parameter for the error.circles. Defaults to 1, but can be adjusted downwards to make them less intrusive.

...

Other parameters for plot

Details

When visualizing the effect of an experimental manipulation or the relationship of multiple groups, it is convenient to plot their means as well as their confidence regions in a two dimensional space.

The diameter of the enclosing circle (ellipse) scales as 1/sqrt(N) * the maximum standard error of all variables. That is to say, the area of the ellipse reflects sample size.

Value

If the group variable is specified, then the statistics from statsBy are (invisibly) returned.

Note

Basically this is a combination (and improvement) of statsBy with error.crosses. Can also serve some of the functionality of error.bars.by (see the last example).

Author(s)

William Revelle

See Also

statsBy, describeBy, error.crosses

Examples

#BFI scores for males and females
errorCircles(1:25,1:25,data=psychTools::bfi,group="gender",paired=TRUE,ylab="female scores",
      xlab="male scores",main="BFI scores by gender")
 abline(a=0,b=1)
#drop the circles since all samples are the same sizes
errorCircles(1:25,1:25,data=psychTools::bfi,group="gender",paired=TRUE,circles=FALSE,
     ylab="female scores",xlab="male scores",main="BFI scores by gender")
 abline(a=0,b=1)
 
 data(affect, package="psychTools")
colors <- c("black","red","white","blue")
films <- c("Sad","Horror","Neutral","Happy")
affect.stats <- errorCircles("EA2","TA2",data=psychTools::affect[-c(1,20)],
group="Film",labels=films,
      xlab="Energetic Arousal",ylab="Tense Arousal",ylim=c(10,22),xlim=c(8,20), 
      pch=16,cex=2,colors=colors, main ="EA and TA pre and post affective movies")
#now, use the stats from the prior run 
errorCircles("EA1","TA1",data=affect.stats,labels=films,pch=16,cex=2,colors=colors,add=TRUE)

#show sample size with the size of the circles
errorCircles("SATV","SATQ",sat.act,group="education")

#Can also provide error.bars.by functionality
 errorCircles(2,5,group=2,data=sat.act,circles=FALSE,pch=16,colors="blue",
      ylim= c(200,800),main="SATV by education",labels="")
 #just do the breakdown and then show the points
# errorCircles(3,5,group=3,data=sat.act,circles=FALSE,pch=16,colors="blue",
#         ylim= c(200,800),main="SATV by age",labels="",bars=FALSE)

Perform an Exploratory Structural Equation Model (ESEM) by using factor extension techniques

Description

Structural Equation Modeling (SEM) is a powerful tool for confirming multivariate structures and is well done by the lavaan, sem, or OpenMx packages. Because they are confirmatory, SEM models test specific models. Exploratory Structural Equation Modeling (ESEM), on the other hand, takes a more exploratory approach. By using factor extension, it is possible to extend the factors of one set of variables (X) into the variable space of another set (Y). Using this technique, it is then possible to estimate the correlations between the two sets of latent variables, much the way normal SEM would do. Based upon exploratory factor analysis (EFA), this provides a quick and easy approach to exploratory structural equation modeling.

Usage

esem(r, varsX, varsY, nfX = 1, nfY = 1, n.obs = NULL, fm = "minres", 
      rotate = "oblimin", rotateY="oblimin", plot = TRUE, cor = "cor",
       use = "pairwise",weight=NULL, ...)
esemDiagram(esem=NULL,labels=NULL,cut=.3,errors=FALSE,simple=TRUE,
	regression=FALSE,lr=TRUE, digits=1,e.size=.1,adj=2,
     main="Exploratory Structural Model", ...)
esem.diagram(esem=NULL,labels=NULL,cut=.3,errors=FALSE,simple=TRUE,
	regression=FALSE,lr=TRUE, digits=1,e.size=.1,adj=2,
     main="Exploratory Structural Model", ...) #deprecated
cancorDiagram(fit, cut=.1,digits=2, nfX = NULL, nfY = NULL,simple=FALSE,e.size=.1,
     main="Canonical Correlation",...)    
interbattery(r, varsX, varsY, nfX = 1, nfY = 1, n.obs = NULL,cor = "cor", 
       use = "pairwise",weight=NULL)

Arguments

r

A correlation matrix or a raw data matrix suitable for factor analysis

varsX

The variables defining set X

varsY

The variables defining set Y

nfX

The number of factors to extract for the X variables

nfY

The number of factors to extract for the Y variables

n.obs

Number of observations (needed for eBIC and chi square), can be ignored.

fm

The factor method to use, e.g., "minres", "mle" etc. (see fa for details)

rotate

Which rotation to use. (see fa for details)

rotateY

Which rotation to use for the Y variables.

plot

If TRUE, draw the esemDiagram

cor

What type of correlation to use (see fa for details)

use

"pairwise" for pairwise complete data, for other options see cor

weight

Weights to apply to cases when finding wt.cov

...

other parameters to pass to fa or to esemDiagram functions.

esem

The object returned from esem and passed to esemDiagram

fit

The object returned by lmCor with canonical variates

labels

Variable labels

cut

Loadings with abs(loading) > cut will be shown

simple

Only the biggest loading per item is shown

errors

include error estimates (as arrows)

e.size

size of ellipses (adjusted by the number of variables)

digits

Round coefficient to digits

adj

loadings are adjusted by factor number mod adj to decrease likelihood of overlap

main

Graphic title, defaults to "Exploratory Structural Model"

lr

draw the graphic left to right (TRUE) or top to bottom (FALSE)

regression

Not yet implemented

Details

Factor analysis as implemented in fa attempts to summarize the covariance (correlational) structure of a set of variables with a small set of latent variables or “factors". This solution may be ‘extended’ into a larger space with more variables without changing the original solution (see fa.extension). Similarly, the factors of a second set of variables (the Y set) may be extended into the original (X) set. Doing so allows two independent measurement models, a measurement model for X and a measurement model for Y. These two sets of latent variables may then be correlated for an Exploratory Structural Equation Model. (This is exploratory because it is based upon exploratory factor analysis (EFA) rather than a confirmatory factor model (CFA) using more traditional Structural Equation Modeling packages such as sem, lavaan, or Mx.)

Although the output seems very similar to that of a normal EFA using fa, it is actually two independent factor analyses (of the X and the Y sets) that are then mutually extended into each other. That is, the loadings and structure matrices from sets X and Y are merely combined, and the correlations between the two sets of factors are found.

Interbattery factor analysis was developed by Tucker (1958) as a way of comparing the factors in common to two batteries of tests. (Currently under development and not yet complete). Using some straightforward linear algebra, it is easy to find the factors of the intercorrelations between the two sets of variables. This does not require estimating communalities and is highly related to the procedures of canonical correlation.

The difference between the esem and the interbattery approach is that the first factors the X set and then relates those factors to factors of the Y set. Interbattery factor analysis, on the other hand, tries to find one set of factors that links both sets but is still distinct from factoring both sets together.

Value

communality

The amount of variance in each of the X and Y variables accounted for by the total model.

sumsq

The amount of variance accounted for by each factor – independent of the other factors.

dof

Degrees of freedom of the model.

esem.dof

Alternatively consider the df1 for the X model and df2 for the Y model. esem.dof = df1 + df2

null.dof

Degrees of freedom of the null model (the correlation matrix)

ENull

chi square of the null model

chi

chi square of the model. This is found by examining the size of the residuals compared to their standard error.

rms

The root mean square of the residuals.

nh

Harmonic sample size if using min.chi for factor extraction.

EPVAL

Probability of the Empirical Chi Square given the hypothesis of an identity matrix.

crms

Adjusted root mean square residual

EBIC

When normal theory fails (e.g., in the case of non-positive definite matrices), it is useful to examine the empirically derived EBIC based upon the empirical chi square - 2 df.

ESABIC

Sample size adjusted empirical BIC

fit

sum of squared residuals versus sum of squared original values

fit.off

fit applied to the off diagonal elements

sd

standard deviation of the residuals

factors

Number of factors extracted

complexity

Item complexity

n.obs

Number of total observations

loadings

The factor pattern matrix for the combined X and Y factors

Structure

The factor structure matrix for the combined X and Y factors

loadsX

Just the X set of loadings (pattern) without the extension variables.

loadsY

Just the Y set of loadings (pattern) without the extension variables.

PhiX

The correlations of the X factors

PhiY

the correlations of the Y factors

Phi

the correlations of the X and Y factors within themselves and across sets.

fm

The factor method used

fx

The complete factor analysis output for the X set

fy

The complete factor analysis output for the Y set

residual

The residual correlation matrix (R - model). May be examined by a call to residual().

Call

Echo back the original call to the function.

model

model code for SEM and for lavaan to do subsequent confirmatory modeling

Note

Developed September, 2016, revised December, 2018 to produce code for lavaan and sem from the esem and esemDiagram functions. Suggestions or comments are most welcome.

This is clearly an exploratory approach and the degrees of freedom for the model are unclear.

In December 2023, I added the cancorDiagram function to help understand canonical correlations.

Author(s)

William Revelle

References

Revelle, William. (in prep) An introduction to psychometric theory with applications in R. Springer. Working draft available at https://personality-project.org/r/book/

Tucker, Ledyard (1958) An inter-battery method of factor analysis, Psychometrika, 23, 111-136.

See Also

principal for principal components analysis (PCA). PCA will give very similar solutions to factor analysis when there are many variables. The differences become more salient as the number of variables decreases. The PCA and FA models are actually very different and should not be confused. One is a model of the observed variables, the other is a model of latent variables.

lmCor which does a canonical correlation between the X and Y sets of variables.

irt.fa for Item Response Theory analyses using factor analysis, using the two parameter IRT equivalent of loadings and difficulties.

VSS will produce the Very Simple Structure (VSS) and MAP criteria for the number of factors, nfactors to compare many different factor criteria.

ICLUST will do a hierarchical cluster analysis alternative to factor analysis or principal components analysis.

predict.psych to find predicted scores based upon new data, fa.extension to extend the factor solution to new variables, omega for hierarchical factor analysis with one general factor. fa.multi for hierarchical factor analysis with an arbitrary number of higher order factors.

faRegression to do multiple regression from factor analysis solutions.

fa.sort will sort the factor loadings into echelon form. fa.organize will reorganize the factor pattern matrix into any arbitrary order of factors and items.

KMO and cortest.bartlett for various tests that some people like.

factor2cluster will prepare unit weighted scoring keys of the factors that can be used with scoreItems.

fa.lookup will print the factor analysis loadings matrix along with the item “content" taken from a dictionary of items. This is useful when examining the meaning of the factors.

anova.psych allows for testing the difference between two (presumably nested) factor models.

Examples

#make up a sem like problem using sim.structure
fx <-matrix(c( .9,.8,.6,rep(0,4),.6,.8,-.7),ncol=2)  
fy <- matrix(c(.6,.5,.4),ncol=1)
rownames(fx) <- c("V","Q","A","nach","Anx")
rownames(fy)<- c("gpa","Pre","MA")
Phi <-matrix( c(1,0,.7,.0,1,.7,.7,.7,1),ncol=3)
gre.gpa <- sim.structural(fx,Phi,fy)
print(gre.gpa)

#now esem it:
example <- esem(gre.gpa$model,varsX=1:5,varsY=6:8,nfX=2,nfY=1,n.obs=1000,plot=FALSE)
example
esemDiagram(example,simple=FALSE)
#compare this to the canonical solution
mod <- lmCor(y=6:8, x =1:5, data=gre.gpa$model, plot=FALSE)
cancorDiagram(mod)  #does not work because of imaginary roots
#compare two alternative solutions to the first 2 factors of the neo.
#solution 1 is the normal 2 factor solution.
#solution 2 is an esem with 1 factor for the first 6 variables, and 1 for the second 6.
f2 <- fa(psychTools::neo[1:12,1:12],2)
es2 <- esem(psychTools::neo,1:6,7:12,1,1)
summary(f2)
summary(es2)
fa.congruence(f2,es2)

interbattery(Thurstone.9,1:4,5:9,2,2)  #compare to the solution of Tucker.  We are not there yet.

Exploratory Factor analysis using MinRes (minimum residual) as well as EFA by Principal Axis, Weighted Least Squares or Maximum Likelihood

Description

Among the many ways to do latent variable exploratory factor analysis (EFA), one of the better approaches is to use Ordinary Least Squares (OLS) to find the minimum residual (minres) solution. This produces solutions very similar to maximum likelihood even for badly behaved matrices. A variation on minres is to do weighted least squares (WLS). Perhaps the most conventional technique is principal axes (PAF). An eigen value decomposition of a correlation matrix is done and then the communalities for each variable are estimated by the first n factors. These communalities are entered onto the diagonal and the procedure is repeated until the sum(diag(r)) does not vary. Yet another estimation procedure is maximum likelihood. For well behaved matrices, maximum likelihood factor analysis (either in the fa or in the factanal function) is probably preferred. Bootstrapped confidence intervals of the loadings and interfactor correlations are found by fa with n.iter > 1.

Usage

fa(r, nfactors=1, n.obs = NA, n.iter=1, rotate="oblimin", scores="regression", 
	residuals=FALSE, SMC=TRUE, covar=FALSE, missing=FALSE,	impute="none",
	min.err = 0.001,  max.iter = 50, symmetric=TRUE, warnings=TRUE, fm="minres",
 	alpha=.1, p=.05, oblique.scores=FALSE, np.obs=NULL, use="pairwise", cor="cor",
 	correct=.5, weight=NULL, n.rotations=1, hyper=.15, smooth=TRUE,...)

fac(r,nfactors=1,n.obs = NA, rotate="oblimin", scores="tenBerge", residuals=FALSE,
 	SMC=TRUE, covar=FALSE, missing=FALSE, impute="median", min.err = 0.001, 
	max.iter=50, symmetric=TRUE, warnings=TRUE, fm="minres", alpha=.1,
	oblique.scores=FALSE, np.obs=NULL, use="pairwise", cor="cor", correct=.5, 
	weight=NULL, n.rotations=1, hyper=.15, smooth=TRUE,...)

fa.pooled(datasets,nfactors=1,rotate="oblimin", scores="regression", 
	residuals=FALSE, SMC=TRUE, covar=FALSE, missing=FALSE,impute="median", min.err = 
	.001, max.iter=50,symmetric=TRUE, warnings=TRUE, fm="minres", alpha=.1, 
	p =.05, oblique.scores=FALSE,np.obs=NULL, use="pairwise", cor="cor",
	correct=.5,	weight=NULL,...) 

fa.sapa(r,nfactors=1, n.obs = NA,n.iter=1,rotate="oblimin",scores="regression", 
	residuals=FALSE, SMC=TRUE, covar=FALSE, missing=FALSE, impute="median", 
	min.err = .001, max.iter=50,symmetric=TRUE,warnings=TRUE, fm="minres", alpha=.1, 
	p =.05, oblique.scores=FALSE, np.obs=NULL, use="pairwise", cor="cor", correct=.5, 
	weight=NULL, frac=.1,...)

Arguments

r

A correlation or covariance matrix or a raw data matrix. If raw data, the correlation matrix will be found using pairwise deletion. If covariances are supplied, they will be converted to correlations unless the covar option is TRUE.

nfactors

Number of factors to extract, default is 1

n.obs

Number of observations used to find the correlation matrix if using a correlation matrix. Used for finding the goodness of fit statistics. Must be specified if using a correlation matrix and finding confidence intervals.

np.obs

The pairwise number of observations. Used if using a correlation matrix and asking for a minchi solution.

rotate

"none", "varimax", "quartimax", "bentlerT", "equamax", "varimin", "geominT" and "bifactor" are orthogonal rotations. "Promax", "promax", "oblimin", "simplimax", "bentlerQ, "geominQ" and "biquartimin" and "cluster" are possible oblique transformations of the solution. The default is to do a oblimin transformation, although versions prior to 2009 defaulted to varimax. SPSS seems to do a Kaiser normalization before doing Promax, this is done here by the call to "promax" which does the normalization before calling Promax in GPArotation.

n.rotations

Number of random starts for the rotations.

n.iter

Number of bootstrap iterations to do in fa or fa.poly

residuals

Should the residual matrix be shown

scores

the default="regression" finds factor scores using regression. Alternatives for estimating factor scores include simple regression ("Thurstone"), correlaton preserving ("tenBerge") as well as "Anderson" and "Bartlett" using the appropriate algorithms ( factor.scores). Although scores="tenBerge" is probably preferred for most solutions, it will lead to problems with some improper correlation matrices.

SMC

Use squared multiple correlations (SMC=TRUE) or use 1 as initial communality estimate. Try using 1 if imaginary eigen values are reported. If SMC is a vector of length the number of variables, then these values are used as starting values in the case of fm='pa'.

covar

if covar is TRUE, factor the covariance matrix, otherwise factor the correlation matrix

missing

if scores are TRUE, and missing=TRUE, then impute missing values using either the median or the mean. Specifying impute="none" will not impute data points and thus will have some missing data in the factor scores.

impute

"median" or "mean" values are used to replace missing values

min.err

Iterate until the change in communalities is less than min.err

max.iter

Maximum number of iterations for convergence

symmetric

symmetric=TRUE forces symmetry by just looking at the lower off diagonal values

hyper

Count number of (absolute) loadings less than hyper.

warnings

warnings=TRUE => warn if number of factors is too many

fm

Factoring method fm="minres" will do a minimum residual as will fm="uls". Both of these use a first derivative. fm="ols" differs very slightly from "minres" in that it minimizes the entire residual matrix using an OLS procedure but uses the empirical first derivative. This will be slower. fm="wls" will do a weighted least squares (WLS) solution, fm="gls" does a generalized weighted least squares (GLS), fm="pa" will do the principal factor solution, fm="ml" will do a maximum likelihood factor analysis. fm="minchi" will minimize the sample size weighted chi square when treating pairwise correlations with different number of subjects per pair. fm ="minrank" will do a minimum rank factor analysis. "old.min" will do minimal residual the way it was done prior to April, 2017 (see discussion below). fm="alpha" will do alpha factor analysis as described in Kaiser and Coffey (1965)

alpha

alpha level for the confidence intervals for RMSEA

p

if doing iterations to find confidence intervals, what probability values should be found for the confidence intervals

oblique.scores

When factor scores are found, should they be based on the structure matrix (default) or the pattern matrix (oblique.scores=TRUE). Now it is always false. If you want oblique factor scores, use tenBerge.

weight

If not NULL, a vector of length n.obs that contains weights for each observation. The NULL case is equivalent to all cases being weighted 1.

use

How to treat missing data, use="pairwise" is the default. See cor for other options.

cor

How to find the correlations: "cor" is Pearson, "cov" is covariance, "tet" is tetrachoric, "poly" is polychoric, "mixed" uses mixed cor for a mixture of tetrachorics, polychorics, Pearsons, biserials, and polyserials, Yuleb is Yulebonett, Yuleq and YuleY are the obvious Yule coefficients as appropriate

correct

When doing tetrachoric, polychoric, or mixed cor, how should we treat empty cells. (See the discussion in the help for tetrachoric.)

frac

The fraction of data to sample n.iter times if showing stability across sample sizes

datasets

A list of replication data sets to analyse in fa.pooled. All need to have the same number of variables.

smooth

By default, improper correlations matrices are smoothed. This is not necessary for factor extraction using fm="pa" or fm="minres", but is needed to give some of the goodness of fit tests. (see notes)

...

additional parameters, specifically, keys may be passed if using the target rotation, or delta if using geominQ, or whether to normalize if using Varimax

Details

Factor analysis is an attempt to approximate a correlation or covariance matrix with one of lesser rank. The basic model is that R ≈ FF' + U^2, where R is the n x n correlation matrix, F is the n x k matrix of factor loadings (with k much less than n), and U^2 is the diagonal matrix of uniquenesses. There are many ways to do factor analysis, and maximum likelihood procedures are probably the most commonly preferred (see factanal). The existence of uniquenesses is what distinguishes factor analysis from principal components analysis (e.g., principal). If variables are thought to represent a “true" or latent part then factor analysis provides an estimate of the correlations with the latent factor(s) representing the data. If variables are thought to be measured without error, then principal components provides the most parsimonious description of the data.
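
As a minimal illustrative sketch (added here, using the Thurstone correlation matrix supplied with psych), the model implied matrix for an unrotated solution may be reconstructed from the loadings and uniquenesses:

f3 <- fa(Thurstone,3,rotate="none",fm="minres")
R.model <- f3$loadings %*% t(f3$loadings) + diag(f3$uniquenesses)
round(Thurstone - R.model,2)    #the residuals should be small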

Factor loadings will be smaller than component loadings because the latter reflect unique error in each variable. The off diagonal residuals for a factor solution will be smaller than those of a component model. Factor loadings can be thought of as the asymptotic component loadings as the number of variables loading on each factor increases.

The fa function will do factor analyses using one of six different algorithms: minimum residual (minres, aka ols, uls), principal axes, alpha factoring, weighted least squares, minimum rank, or maximum likelihood.

Principal axes factor analysis has a long history in exploratory analysis and is a straightforward procedure. Successive eigen value decompositions are done on a correlation matrix with the diagonal replaced with diag(FF') until sum(diag(FF')) does not change (very much). The current limit of max.iter = 50 seems to work for most problems, but the Holzinger-Harmon 24 variable problem needs about 203 iterations to converge for a 5 factor solution.

Not all factor programs that do principal axes do iterative solutions. The example from the SAS manual (Chapter 33) is such a case. To achieve that solution, it is necessary to specify that the max.iter = 1. Comparing that solution to an iterated one (the default) shows that iterations improve the solution.

In addition, fm="mle" produces even better solutions for this example. Both the RMSEA and the root mean square of the residuals are smaller than the fm="pa" solution.

However, simulations of multiple problem sets suggest that fm="pa" tends to produce slightly smaller residuals while having slightly larger RMSEAs than does fm="minres" or fm="mle". That is, the sum of squared residuals for fm="pa" seems to be slightly smaller than that found using fm="minres", but the RMSEAs are slightly worse when using fm="pa". That is to say, the "true" minimal residual is probably found by fm="pa".

Following extensive correspondence with Hao Wu and Mikko Ronkko, in April, 2017 the derivative of the minres (and uls) fitting was modified. This leads to slightly smaller residuals (appropriately enough for a method claiming to minimize them) than the prior procedure. For consistency with prior analyses, "old.min" was added to give these slightly larger residuals. The differences between old.min and the newer "minres" and "ols" solutions are at the third to fourth decimal, but are none the less worth noting. For comparison purposes, fm="ols" uses empirical first derivatives, while uls and minres use equation based first derivatives. The results seem to be identical, but the minres and uls solutions require fewer iterations for larger problems and are faster. Thanks to Hao Wu for some very thoughtful help.
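
To see the size of this difference for a particular problem, both fits may be compared directly (a sketch; the data set is only illustrative):

#compare the current minres fit with the pre-2017 derivative
f.new <- fa(Thurstone, 3, fm="minres")
f.old <- fa(Thurstone, 3, fm="old.min")
round(c(minres=f.new$rms, old.min=f.old$rms), 5)   #slightly smaller residuals for the newer fit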

Although usually these various algorithms produce equivalent results, there are several data sets included that show large differences between the methods. Schutz produces Heywood and super Heywood cases, blant leads to very different solutions. In particular, the minres solution produces smaller residuals than does the mle solution, and the factor.congruence coefficients show very different solutions.

A very strong argument against using MLE is found in the chapter by MacCallum, Brown and Cai (2007) who show that OLS approaches produce equivalent solutions most of the time, and better solutions some of the time. This is particularly true in the case of models with some unmodeled small factors. (See sim.minor to generate such data.)

Principal axes may be used in cases when maximum likelihood solutions fail to converge, although fm="minres" will also do that and tends to produce better (smaller RMSEA) solutions.

The fm="minchi" option is a variation on the "minres" (ols) solution and minimizes the sample size weighted residuals rather than just the residuals. This was developed to handle the problem of data that are Massively Missing Completely at Random (MMCAR), a condition that arises in the SAPA project.

A traditional problem in factor analysis was to find the best estimate of the original communalities in order to speed up convergence. Using the Squared Multiple Correlation (SMC) for each variable will underestimate the original communalities; using 1s will overestimate them. By default, the SMC estimate is used. Note that in the case of non-invertible matrices, the pseudo-inverse is found so SMCs are still estimated. In either case, iterative techniques will tend to converge on a stable solution. If, however, a solution fails to be achieved, it is useful to try again using ones (SMC=FALSE). Alternatively, a vector of starting values for the communalities may be specified by the SMC option.
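
For example, a solution that struggles with the default SMC starting values may be retried as follows (a sketch; the data set and the starting values are only illustrative):

f1 <- fa(Thurstone, 3, fm="pa")                 #default: start from the squared multiple correlations
f2 <- fa(Thurstone, 3, fm="pa", SMC=FALSE)      #start the iterations from 1s instead
f3 <- fa(Thurstone, 3, fm="pa", SMC=rep(.5, 9)) #or supply an explicit vector of starting communalities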

The iterated principal axes algorithm does not attempt to find the best (as defined by a maximum likelihood criterion) solution, but rather one that converges rapidly using successive eigen value decompositions. The maximum likelihood criterion of fit and the associated chi square value are reported, and will be (slightly) worse than that found using maximum likelihood procedures.

The minimum residual (minres) solution is an unweighted least squares solution that takes a slightly different approach. It uses the optim function and adjusts the diagonal elements of the correlation matrix to minimize the squared residual when the factor model is the eigen value decomposition of the reduced matrix. MINRES and PA will both work when ML will not, for they can be used when the matrix is singular. Before the change in the derivative, the MINRES solution was slightly more similar to the ML solution than was the PA solution. With the change in the derivative of the minres fit, the minres, pa and uls solutions are practically identical. To a great extent, the minres and wls solutions follow ideas in the factanal function, with the change in the derivative.

The weighted least squares (wls) solution weights the residual matrix by 1/ diagonal of the inverse of the correlation matrix. This has the effect of weighting items with low communalities more than those with high communalities.

The generalized least squares (gls) solution weights the residual matrix by the inverse of the correlation matrix. This has the effect of weighting those variables with low communalities even more than those with high communalities.

The maximum likelihood solution takes yet another approach and finds those communality values that minimize the chi square goodness of fit test. The fm="ml" option provides a maximum likelihood solution following the procedures used in factanal but does not provide all the extra features of that function. It does, however, produce more expansive output.

The minimum rank factor model (MRFA) roughly follows ideas by Shapiro and Ten Berge (2002) and Ten Berge and Kiers (1991). It makes use of the glb.algebraic procedure contributed by Andreas Moltner. MRFA attempts to extract factors such that the residual matrix is still positive semi-definite. This version is still being tested and feedback is most welcome.

Alpha factor analysis finds solutions based upon a correlation matrix corrected for communalities and then rescales these to the original correlation matrix. This procedure is described by Kaiser and Caffrey (1965).

Test cases comparing the output to SPSS suggest that the PA algorithm matches what SPSS calls uls, and that the wls solutions are equivalent in their fits. The wls and gls solutions have slightly larger eigen values, but slightly worse fits of the off diagonal residuals than do the minres or maximum likelihood solutions. Comparing the results to the examples in Harman (1976), the PA solution with no iterations matches what Harman calls Principal Axes (as does SAS), while the iterated PA solution matches his minres solution. The minres solution found in psych tends to have slightly smaller off diagonal residuals (as it should) than does the iterated PA solution.

Although for items, it is typical to find factor scores by scoring the salient items (using, e.g., scoreItems), factor scores can be estimated by regression as well as several other means. There are multiple approaches that are possible (see Grice, 2001) and the one taken here was developed by tenBerge et al. (see factor.scores). The alternative, which will match factanal, is to find the scores using regression – Thurstone's least squares regression where the weights are found by W = R^{-1}S where R is the correlation matrix of the variables and S is the structure matrix. Then, the factor scores are just Fs = XW.
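
In R terms this is (a sketch, assuming the raw bfi items from psychTools and the default regression scoring; the tenBerge alternative is requested with scores="tenBerge"):

R <- cor(psychTools::bfi[1:25], use="pairwise")   #the item correlation matrix
f <- fa(psychTools::bfi[1:25], 5)                 #default regression (Thurstone) scoring
W <- solve(R, f$Structure)                        #W = R^-1 S
round(max(abs(W - f$weights)), 3)                 #should be essentially zero for regression scoring
#f.tb <- fa(psychTools::bfi[1:25], 5, scores="tenBerge")   #correlation preserving scores instead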

In the oblique case, the factor loadings are referred to as Pattern coefficients and are related to the Structure coefficients by S = P \Phi and thus P = S \Phi^{-1}. When estimating factor scores, fa and factanal differ in that fa finds the factors from the Structure matrix while factanal seems to do it from the Pattern matrix. Thus, although in the orthogonal case fa and factanal agree perfectly in their factor score estimates, they do not agree in the case of oblique factors. Setting oblique.scores = TRUE will produce factor score estimates that match those of factanal.
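
These identities are easy to verify from a fitted object (a sketch assuming an oblique solution):

f <- fa(Thurstone, 3, rotate="oblimin")                          #an oblique solution
round(max(abs(f$loadings %*% f$Phi - f$Structure)), 3)           #S = P Phi
round(max(abs(f$Structure %*% solve(f$Phi) - f$loadings)), 3)    #P = S Phi^-1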

It is sometimes useful to extend the factor solution to variables that were not factored. This may be done using fa.extension. Factor extension is typically done in the case where some variables were not appropriate to factor, but factor loadings on the original factors are still desired. Factor extension is a very powerful procedure in that it allows one to find the factor-criterion correlations without using factor scores.

For dichotomous items or polytomous items, it is recommended to analyze the tetrachoric or polychoric correlations rather than the Pearson correlations. This may be done by specifying cor="poly" or cor="tet" or cor="mixed" if the data have a mixture of dichotomous, polytomous, and continuous variables.

Analysis of dichotomous or polytomous data may also be done by using irt.fa or simply setting the cor="poly" option. In the first case, the factor analysis results are reported in Item Response Theory (IRT) terms, although the original factor solution is returned in the results. In the latter case, a typical factor loadings matrix is returned, but the tetrachoric/polychoric correlation matrix and item statistics are saved for reanalysis by irt.fa. (See also the mixed.cor function to find correlations from a mixture of continuous, dichotomous, and polytomous items.)
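
For example (a sketch using 10 polytomous bfi items from psychTools; not run by default because the polychoric correlations take a while):

#fp <- fa(psychTools::bfi[1:10], 2, cor="poly")   #normal fa output; rho and tau saved for irt.fa
#fi <- irt.fa(psychTools::bfi[1:10], 2)           #the same analysis reported in IRT terms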

Of the various rotation/transformation options, varimax, Varimax, quartimax, bentlerT, geominT, and bifactor do orthogonal rotations. Promax transforms obliquely with a target matrix equal to the varimax solution. oblimin, quartimin, simplimax, bentlerQ, geominQ and biquartimin are oblique transformations. Most of these are just calls to the GPArotation package. The “cluster” option does a targeted rotation to a structure defined by the cluster representation of a varimax solution. With the optional "keys" parameter, the "target" option will rotate to a target supplied as a keys matrix. (See target.rot.)
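
For example, a target rotation toward a hypothesized simple structure may be requested by passing a keys matrix (a sketch; the keys list here is purely hypothetical):

targ <- make.keys(9, list(f1=1:3, f2=4:6, f3=7:9))      #a hypothetical target pattern
f.targ <- fa(Thurstone, 3, rotate="target", keys=targ)  #rotate toward that target (see target.rot)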

oblimin is implemented in GPArotation by a call to quartimin with delta=0. This leads to confusion when people refer to quartimin solutions.

It is important to note a dirty little secret about factor rotation. That is the problem of local minima. Multiple restarts of the rotation algorithms are strongly encouraged (see Nguyen and Waller, 2021).

Two additional target rotation options are available through calls to GPArotation. These are the targetQ (oblique) and targetT (orthogonal) target rotations of Michael Browne. See target.rot for more documentation.

The "bifactor" rotation implements the Jennrich and Bentler (2011) bifactor rotation by calling the GPForth function in the GPArotation package and using two functions adapted from the MatLab code of Jennrich and Bentler. This seems to have a problem with local minima and multiple starting values should be used.

There are two varimax rotation functions. One, Varimax, in the GPArotation package does not by default apply Kaiser normalization. The other, varimax, in the stats package, does. It appears that the two rotation functions produce slightly different results even when normalization is set. For consistency with the other rotation functions, Varimax is probably preferred.

The rotation matrix (rot.mat) is returned from all of these options. This is the inverse of the Th (theta?) object returned by the GPArotation package. The correlations of the factors may be found by \Phi = \theta' \theta.
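
For example (a sketch; the rotation choice is only illustrative):

f <- fa(Thurstone, 3, rotate="oblimin")
f$rot.mat    #the rotation matrix (the inverse of GPArotation's Th)
f$Phi        #the factor intercorrelations implied by the oblique transformation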

There are two ways to handle dichotomous or polytomous responses: fa with the cor="poly" option which will return the tetrachoric or polychoric correlation matrix, as well as the normal factor analysis output, and irt.fa which returns a two parameter irt analysis as well as the normal fa output.

When factor analyzing items with dichotomous or polytomous responses, the irt.fa function provides an Item Response Theory representation of the factor output. The factor analysis results are available, however, as an object in the irt.fa output.

fa.poly is deprecated, for its functioning is matched by setting cor="poly". It will produce normal factor analysis output but also will save the polychoric matrix (rho) and item difficulties (tau) for subsequent irt analyses. fa.poly will, by default, find factor scores if the data are available. The correlations are found using either tetrachoric or polychoric and then this matrix is factored. Weights from the factors are then applied to the original data to estimate factor scores.

The function fa will repeat the analysis n.iter times on a bootstrapped sample of the data (if they exist) or of a simulated data set based upon the observed correlation matrix. The mean estimate and standard deviation of the estimate are returned and will print the original factor analysis as well as the alpha level confidence intervals for the estimated coefficients. The bootstrapped solutions are rotated towards the original solution using target.rot. The factor loadings are z-transformed, averaged and then back transformed. This leads to an error in the case of Heywood cases. The probably better alternative is to just find the mean bootstrapped value and find the confidence intervals based upon the observed range of values. The default is to have n.iter =1 and thus not do bootstrapping.
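
For example (a sketch, not run by default; 20 iterations keeps it quick, more would normally be used):

#fb <- fa(psychTools::bfi[1:25], 5, n.iter=20)   #bootstrap the solution 20 times
#fb                                              #printing shows the confidence intervals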

If using polytomous or dichotomous items, it is perhaps more useful to find the Item Response Theory parameters equivalent to the factor loadings reported in fa.poly by using the irt.fa function.

For those who like SPSS type output, the measure of factoring adequacy known as the Kaiser-Meyer-Olkin (KMO) test may be found from the correlation matrix or data matrix using the KMO function. Similarly, Bartlett's test of sphericity may be found using the cortest.bartlett function.
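
For example (a sketch using the bfi items from psychTools):

KMO(psychTools::bfi[1:25])               #Kaiser-Meyer-Olkin measure of sampling adequacy
cortest.bartlett(psychTools::bfi[1:25])  #Bartlett's test of sphericity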

For those who want an object containing the variance accounted for, this is returned invisibly by the print function (e.g., p <- print(fa(ability))$Vaccounted). It is now returned by the fa function as well (e.g., p <- fa(ability)$Vaccounted). Just as the communalities may be found from the diagonal of Pattern %*% t(Structure), so the variance accounted for may be found from the diagonal of t(Structure) %*% Pattern. Note that this is what is reported as the SS loadings.
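
For example (a sketch):

f <- fa(Thurstone, 3)
f$Vaccounted                                   #the variance accounted for table
round(diag(t(f$Structure) %*% f$loadings), 2)  #the SS loadings found directly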

The output from the print.psych.fa function displays the factor loadings (from the pattern matrix), the h2 (communalities), the u2 (uniquenesses), and com (the complexity of the factor loadings for that variable; see below). In the case of an orthogonal solution, h2 is merely the row sum of the squared factor loadings. But for an oblique solution, it is the row sum of the squared orthogonal factor loadings (remember that rotations or transformations do not change the communality).

In response to a request from Asghar Minaei who wanted to combine several imputation data sets from mice, fa.pooled was added. The separate data sets are combined into a list (datasets) which are then factored separately. Each solution is rotated towards the first set in the list. The results are then reported in terms of pooled loadings and confidence intervals based upon the replications.

fa.sapa simulates the process of doing SAPA (Synthetic Aperture Personality Assessment). It will do iterative solutions for successive random samples of fractions (frac) of the data set. This allows us to find the stability of solutions for various sample sizes and various sampling rates. One needs to specify the number of iterations (n.iter) as well as the fraction of data sampled (frac).
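
For example (a sketch, not run by default; the sampling fraction and the number of iterations are only illustrative):

#fs <- fa.sapa(psychTools::bfi[1:25], 5, frac=.1, n.iter=20)   #stability across random 10% samples
#fs                                                            #mean loadings and their variability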

Value

values

Eigen values of the common factor solution

e.values

Eigen values of the original matrix

communality

Communality estimates for each item. These are merely the sum of squared factor loadings for that item.

communalities

If using minrank factor analysis, these are the communalities reflecting the total amount of common variance. They will exceed the communality (above) which is the model estimated common variance.

rotation

which rotation was requested?

n.obs

number of observations specified or found

loadings

An item by factor (pattern) loading matrix of class “loadings", suitable for use in other programs (e.g., GPArotation or factor2cluster). To show these in sorted order, use print.psych with sort=TRUE.

complexity

Hofmann's index of complexity for each item. This is just \frac{(\Sigma a_i^2)^2}{\Sigma a_i^4} where a_i is the factor loading on the ith factor. From Hofmann (1978), MBR. See also Pettersson and Turkheimer (2010).

Structure

An item by factor structure matrix of class “loadings". This is just the loadings (pattern) matrix times the factor intercorrelation matrix.

fit

How well does the factor model reproduce the correlation matrix? This is just \frac{\Sigma r_{ij}^2 - \Sigma r^{*2}_{ij}}{\Sigma r_{ij}^2}. (See VSS, ICLUST, and principal for this fit statistic.)

fit.off

how well are the off diagonal elements reproduced?

dof

Degrees of Freedom for this model. This is the number of observed correlations minus the number of independent parameters. Let n=Number of items, nf = number of factors then
dof = n * (n-1)/2 - n * nf + nf * (nf-1)/2

objective

Value of the function that is minimized by maximum likelihood procedures. This is reported for comparison purposes and as a way to estimate chi square goodness of fit. The objective function is
f = trace((FF' + U^2)^{-1} R) - log(|(FF' + U^2)^{-1} R|) - n.items. When using MLE, this function is minimized. When using OLS (minres), although we are not minimizing this function directly, we can still calculate it in order to compare the solution to a MLE fit.

STATISTIC

If the number of observations is specified or found, this is a chi square based upon the objective function, f (see above). Using the formula from factanal (which seems to be Bartlett's test):
\chi^2 = (n.obs - 1 - (2 * p + 5)/6 - (2 * factors)/3) * f

PVAL

If n.obs > 0, then what is the probability of observing a chisquare this large or larger?

Phi

If oblique rotations (e.g., using oblimin from the GPArotation package or promax) are requested, what is the interfactor correlation?

communality.iterations

The history of the communality estimates (For principal axis only.) Probably only useful for teaching what happens in the process of iterative fitting.

residual

The matrix of residual correlations after the factor model is applied. To display it conveniently, use the residuals command.

chi

When normal theory fails (e.g., in the case of non-positive definite matrices), it is useful to examine the empirically derived \chi^2 based upon the sum of the squared residuals * N. This will differ slightly from the MLE estimate which is based upon the fitting function rather than the actual residuals.

rms

This is the sum of the squared (off diagonal) residuals divided by the degrees of freedom. Comparable to an RMSEA which, because it is based upon \chi^2, requires the number of observations to be specified. The rms is an empirical value while the RMSEA is based upon normal theory and the non-central \chi^2 distribution. That is to say, if the residuals are particularly non-normal, the rms value and the associated \chi^2 and RMSEA can differ substantially.

crms

rms adjusted for degrees of freedom

RMSEA

The Root Mean Square Error of Approximation is based upon the non-central \chi^2 distribution and the \chi^2 estimate found from the MLE fitting function. With normal theory data, this is fine. But when the residuals are not distributed according to a noncentral \chi^2, this can give very strange values. (And thus the confidence intervals can not be calculated.) The RMSEA is a conventional index of goodness (badness) of fit but it is also useful to examine the actual rms values.

TLI

The Tucker Lewis Index of factoring reliability which is also known as the non-normed fit index.

BIC

Based upon \chi^2 with the assumption of normal theory and using the \chi^2 found using the objective function defined above. This is just \chi^2 - 2 df.

eBIC

When normal theory fails (e.g., in the case of non-positive definite matrices), it is useful to examine the empirically derived eBIC based upon the empirical \chi^2 - 2 df.

R2

The multiple R square between the factors and the factor score estimates, if they were to be found (from Grice, 2001). Derived from R2 is the minimum correlation between any two factor score estimates, which is 2R2 - 1.

r.scores

The correlations of the factor score estimates using the specified model, if they were to be found. Comparing these correlations with those of the scores themselves will show, if an alternative estimate of factor scores is used (e.g., the tenBerge method), the problem of factor indeterminacy, for these correlations will not necessarily be the same.

weights

The beta weights to find the factor score estimates. These are also used by the predict.psych function to find predicted factor scores for new cases. These weights will depend upon the scoring method requested.

scores

The factor scores as requested. Note that these scores reflect the choice of the way scores should be estimated (see scores in the input). That is, simple regression ("Thurstone"), correlation preserving ("tenBerge") as well as "Anderson" and "Bartlett" using the appropriate algorithms (see factor.scores). The correlation between factor score estimates (r.scores) is based upon using the regression/Thurstone approach. The actual correlation between scores will reflect the rotation algorithm chosen and may be found by correlating those scores. Although the scores are found by multiplying the standardized data by the weights matrix, this will not result in standard scores if using regression.

valid

The validity coefficient of coarse coded (unit weighted) factor score estimates (from Grice, 2001).

score.cor

The correlation matrix of coarse coded (unit weighted) factor score estimates, if they were to be found, based upon the structure matrix rather than the weights or loadings matrix.

rot.mat

The rotation matrix as returned from GPArotation.

Note

Thanks to Erich Studerus for some very helpful suggestions about various rotation and factor scoring algorithms, and to Guðmundur Arnkelsson for suggestions about factor scores for singular matrices.

The fac function is the original fa function which is now called by fa repeatedly to get confidence intervals.

SPSS will sometimes use a Kaiser normalization before rotating. This will lead to different solutions than reported here. To get the Kaiser normalized loadings, use kaiser.

The communality for a variable is the amount of variance accounted for by all of the factors. That is to say, for orthogonal factors, it is the sum of the squared factor loadings (rowwise). The communality is insensitive to rotation. However, if an oblique solution is found, then the communality is not the sum of squared pattern coefficients. In both cases (oblique or orthogonal) the communality is the diagonal of the reproduced correlation matrix where _nR_n = _nP_k \Phi_{kk} {}_kP_n' where P is the pattern matrix and \Phi is the factor intercorrelation matrix. This is the same, of course, as multiplying the pattern by the structure: R = P S' where the Structure matrix is S = P \Phi. Similarly, the eigen values are the diagonal of the product _k\Phi_{kk} {}_kP_n' {}_nP_k.

A frequently asked question is why are the factor names of the rotated solution not in ascending order? That is, for example, if factoring the 25 items of the bfi, the factor names are MR2 MR3 MR5 MR1 MR4, rather than the seemingly more logical "MR1" "MR2" "MR3" "MR4" "MR5". This is for pedagogical reasons, in that factors as extracted are orthogonal and are in order of amount of variance accounted for. But when rotated (orthogonally) or transformed (obliquely) the simple structure solution does not preserve that order. The factors are still ordered according to variance accounted for, but because rotation changes how much variance each factor explains, the order may not be the same as the original order. The factor names are, of course, arbitrary, and are kept with the original names to show the effect of rotation/transformation. To give them names associated with their ordinal position, simply paste("F", 1:nf, sep="") where nf is the number of factors. See the last example.

The print function for the fa output will return (invisibly) an object (Vaccounted) that matches the printed output for the variance accounted for by each factor, as well as the cumulative variance, and the percentage of variance accounted for by each factor.

Correction to documentation: as of September, 2014, the oblique.scores option is correctly explained. (It had been backwards.) The default (oblique.scores=FALSE) finds scores based upon the Structure matrix, while oblique.scores=TRUE finds them based upon the pattern matrix. The latter case matches factanal. This error was detected by Mark Seeto.

If the raw data are factored, factors scores are estimated. By default this will be done using 'regression' but alternatives are available. Although the scores are found by multiplying the standardized data by the weights, if using regression, the resulting factor scores will not necessarily have unit variance.

In the case of missing data for some items, the factor scores will return NA for any case where any item is missing. Specifying missing=TRUE will return factor scores for all cases using the imputation option chosen.

The minimum residual solution is done by finding those communalities that will minimize the off diagonal residual. The uls solution finds those communalities that minimize the total residuals.

The minres solution has been modified (April, 2017) following suggestions by Hao Wu. Although the fitting function was the minimal residual, the first derivative of the fitting function was incorrect. This has now been modified so that the results match those of SPSS and CEFA. The prior solutions are still available using fm="old.min".

Alpha factoring was added in August, 2017 to add to the numerous alternative models of factoring.

A few more lines of output were added in August 2017 to show the measures of factor adequacy for different rotations. This had been available in the results from factor.scores but now is added to the fa output.

For a comparison of the fa to factor analysis using SPSS, see Grieder and Steiner (2021).

Some correlation matrices that arise from using pairwise deletion or from tetrachoric or polychoric matrices will not be proper. That is, they will not be positive semi-definite (all eigen values >= 0). The cor.smooth function will adjust correlation matrices (smooth them) by making all negative eigen values slightly greater than 0, rescaling the other eigen values to sum to the number of variables, and then recreating the correlation matrix. See cor.smooth for an example of this problem using the burt data set.

One reason for this problem when using tetrachorics or polychorics seems to be the adjustment for continuity. Setting correct=0 turns this off and seems to produce more proper matrices.

Although both fm="pa" and fm="minres" will work with these “improper" matrices, the goodness of fit tests do not. Thus, by default, calls to fa.stats will pass the smooth parameter as TRUE. This may be prevented by forcing smooth=FALSE. Not smoothing does not affect the pa or minres solution, but will affect the goodness of fit statistics slightly.

Author(s)

William Revelle

References

Gorsuch, Richard, (1983) Factor Analysis. Lawrence Erlbaum Associates.

Grice, James W. (2001), Computing and evaluating factor scores. Psychological Methods, 6, 430-450

Harman, Harry and Jones, Wayne (1966) Factor analysis by minimizing residuals (minres), Psychometrika, 31, 3, 351-368.

Hofmann, R. J. (1978). Complexity and simplicity as objective indices descriptive of factor solutions. Multivariate Behavioral Research, 13, 247-250.

Grieder, Silvia and Steiner, Markus. D. (2021) Algorithmic jingle jungle: A comparison of implementations of principal axis factoring and promax rotation in R and SPSS. Behavior Research Methods. Doi: 10.3758/s13428-021-01581

Kaiser, Henry F. and Caffrey, John (1965) Alpha factor analysis, Psychometrika, (30) 1-14.

Nguyen, H. V. & Waller, N. G. (in press). Local minima and factor rotations in exploratory factor analysis. Psychological Methods.

Pettersson E, Turkheimer E. (2010) Item selection, evaluation, and simple structure in personality data. Journal of research in personality, 44(4), 407-420.

Revelle, William. (in prep) An introduction to psychometric theory with applications in R. Springer. Working draft available at https://personality-project.org/r/book/

Shapiro, A. and ten Berge, Jos M. F. (2002) Statistical inference of minimum rank factor analysis. Psychometrika, (67) 79-84.

ten Berge, Jos M. F. and Kiers, Henk A. L. (1991). A numerical approach to the approximate and the exact minimum rank of a covariance matrix. Psychometrika, (56) 309-315.

See Also

principal for principal components analysis (PCA). PCA will give very similar solutions to factor analysis when there are many variables. The differences become more salient as the number of variables decreases. The PCA and FA models are actually very different and should not be confused. One is a model of the observed variables, the other is a model of latent variables. Although some commercial packages (e.g., SPSS and SAS) refer to both as factor models, they are not. It is incorrect to report doing a factor analysis using principal components.

irt.fa for Item Response Theory analyses using factor analysis, using the two parameter IRT equivalent of loadings and difficulties.

fa.random applies a random intercepts model by removing the mean score for each subject prior to factoring.

VSS will produce the Very Simple Structure (VSS) and MAP criteria for the number of factors, nfactors to compare many different factor criteria.

ICLUST will do a hierarchical cluster analysis alternative to factor analysis or principal components analysis.

factor.scores to find factor scores with alternative options. predict.psych to find predicted scores based upon new data, fa.extension to extend the factor solution to new variables, omega for hierarchical factor analysis with one general factor. fa.multi for hierarchical factor analysis with an arbitrary number of 2nd order factors.

fa.sort will sort the factor loadings into echelon form. fa.organize will reorganize the factor pattern matrix into any arbitrary order of factors and items.

KMO and cortest.bartlett for various tests that some people like.

factor2cluster will prepare unit weighted scoring keys of the factors that can be used with scoreItems.

fa.lookup will print the factor analysis loadings matrix along with the item “content" taken from a dictionary of items. This is useful when examining the meaning of the factors.

anova.psych allows for testing the difference between two (presumably nested) factor models.

bassAckward will perform repeated factorings and organize them in a top-down structure suggested by Goldberg (2006) and Waller (2007).

Examples

#using the Harman 24 mental tests, compare a principal factor with a principal components solution
pc <- principal(Harman74.cor$cov,4,rotate="varimax")   #principal components
pa <- fa(Harman74.cor$cov,4,fm="pa" ,rotate="varimax")  #principal axis 
uls <- fa(Harman74.cor$cov,4,rotate="varimax")          #unweighted least squares is minres
wls <- fa(Harman74.cor$cov,4,fm="wls")       #weighted least squares

#to show the loadings sorted by absolute value
print(uls,sort=TRUE)

#then compare with a maximum likelihood solution using factanal
mle <- factanal(covmat=Harman74.cor$cov,factors=4)
factor.congruence(list(mle,pa,pc,uls,wls))

#note that the order of factors and the sign of some of factors may differ 

#finally, compare the unrotated factor, ml, uls, and  wls solutions
wls <- fa(Harman74.cor$cov,4,rotate="none",fm="wls")
pa <- fa(Harman74.cor$cov,4,rotate="none",fm="pa")
minres <-  factanal(factors=4,covmat=Harman74.cor$cov,rotation="none")
mle <- fa(Harman74.cor$cov,4,rotate="none",fm="mle")
uls <- fa(Harman74.cor$cov,4,rotate="none",fm="uls")
factor.congruence(list(minres,mle,pa,wls,uls))
#in particular, note the similarity of the mle and min res solutions
#note that the order of factors and the sign of some of factors may differ 



#an example of where the ML and PA and MR models differ is found in Thurstone.33.
#compare the first two factors with the 3 factor solution 
Thurstone.33 <- as.matrix(Thurstone.33)
mle2 <- fa(Thurstone.33,2,rotate="none",fm="mle")
mle3 <- fa(Thurstone.33,3 ,rotate="none",fm="mle")
pa2 <- fa(Thurstone.33,2,rotate="none",fm="pa")
pa3 <- fa(Thurstone.33,3,rotate="none",fm="pa")
mr2 <- fa(Thurstone.33,2,rotate="none")
mr3 <- fa(Thurstone.33,3,rotate="none")
factor.congruence(list(mle2,mr2,pa2,mle3,pa3,mr3))

#f5 <- fa(psychTools::bfi[1:25],5)
#f5  #names are not in ascending numerical order (see note)
#colnames(f5$loadings) <- paste("F",1:5,sep="")
#f5

#Get the variance accounted for object from the print function
p <- print(mr3)
round(p$Vaccounted,2)

#pool data and fa the pooled result (not run)
#datasets.list <- list(bfi[1:500,1:25],bfi[501:1000,1:25],
#   bfi[1001:1500,1:25],bfi[1501:2000,1:25],bfi[2001:2500,1:25])  #five different data sets
#temp <- fa.pooled(datasets.list,5)    #do 5 factor analyses, pool the results

Graph factor loading matrices

Description

Factor analysis or principal components analysis results are typically interpreted in terms of the major loadings on each factor. These structures may be represented as a table of loadings or graphically, where all loadings with an absolute value > some cut point are represented as an edge (path). fa.diagram uses the various diagram functions to draw the diagram. fa.graph generates dot code for external plotting. fa.rgraph uses the Rgraphviz package (if available) to draw the graph. het.diagram will draw "heterarchy" diagrams of factor/scale solutions at different levels. All of these as well as some additional functions may be called by diagram which is a wrapper to the specific functions.

Usage

fa.diagram(fa.results,Phi=NULL,fe.results=NULL,sort=TRUE,labels=NULL,cut=.3,
     simple=TRUE, errors=FALSE,g=FALSE,digits=1,e.size=.05,rsize=.15,side=2,
    main,cex=NULL,l.cex=NULL, marg=c(.5,.5,1,.5),adj=1,ic=FALSE, ...) 
het.diagram(r,levels,cut=.3,digits=2,both=TRUE, 
        main="Heterarchy diagram",l.cex,gap.size,...) 
extension.diagram(fa.results,Phi=NULL,fe.results=NULL,sort=TRUE,labels=NULL,cut=.3,
    f.cut, e.cut=.1,simple=TRUE,e.simple=FALSE,errors=FALSE,g=FALSE,
    digits=1,e.size=.05,rsize=.15,side=2,main,cex=NULL, e.cex=NULL,
    marg=c(.5,.5,1,.5),adj=1,ic=FALSE, ...)

fa.graph(fa.results,out.file=NULL,labels=NULL,cut=.3,simple=TRUE,
   size=c(8,6), node.font=c("Helvetica", 14),
    edge.font=c("Helvetica", 10), rank.direction=c("RL","TB","LR","BT"),
     digits=1,main="Factor Analysis", ...)
fa.rgraph(fa.results,out.file=NULL,labels=NULL,cut=.3,simple=TRUE,
   size=c(8,6), node.font=c("Helvetica", 14),
    edge.font=c("Helvetica", 10), rank.direction=c("RL","TB","LR","BT"),
     digits=1,main="Factor Analysis",graphviz=TRUE, ...)

Arguments

fa.results

The output of factor analysis, principal components analysis, or ICLUST analysis. May also be a factor loading matrix from anywhere.

Phi

Normally not specified (it is found in the fa, pc, or ICLUST solution), this may be given if the input is a loadings matrix.

fe.results

the results of a factor extension analysis (if any)

out.file

If it exists, a dot representation of the graph will be stored here (fa.graph)

labels

Variable labels

cut

Loadings with abs(loading) > cut will be shown

f.cut

factor correlations > f.cut will be shown

e.cut

extension loadings with abs(loading) > e.cut will be shown

simple

Only the biggest loading per item is shown

e.simple

Only the biggest loading per extension item is shown

g

Does the factor matrix reflect a g (first) factor. If so, then draw this to the left of the variables, with the remaining factors to the right of the variables. It is useful to turn off the simple parameter in this case.

ic

If drawing a cluster analysis result, should we treat it as latent variable model (ic=FALSE) or as an observed variable model (ic=TRUE)

r

A correlation matrix for the het.diagram function

levels

A list of the elements in each level

both

Should arrows have double heads (in het.diagram)

size

graph size

sort

sort the factor loadings before showing the diagram

errors

include error estimates (as arrows)

e.size

size of ellipses

rsize

size of rectangles

side

on which side should error arrows go?

cex

modify font size

e.cex

Modify the font size for the Dependent variables, defaults to cex

l.cex

modify the font size in arrows, defaults to cex

gap.size

The gap in the arrow for the label. Can be adjusted to compensate for variations in cex or l.cex

marg

sets the margins to be wider than normal, returns them to the normal size upon exit

adj

how many different positions (1-3) should be used for the numeric labels. Useful if they overlap each other.

node.font

what font should be used for nodes in fa.graph

edge.font

what font should be used for edges in fa.graph

rank.direction

parameter passed to Rgraphviz: which way to draw the graph

digits

Number of digits to show as an edge label

main

Graphic title, defaults to "factor analysis" or "factor analysis and extension"

graphviz

Should we try to use Rgraphviz for output?

...

other parameters

Details

Path diagram representations have become standard in confirmatory factor analysis, but are not yet common in exploratory factor analysis. Representing factor structures graphically helps some people understand the structure.

By default the arrows come from the latent variables to the observed variables. This is, of course, the factor model. However, if the class of the object to be drawn is 'principal', then the direction of the arrows is reversed, the 'latent' variables are no longer latent, and they are shown as boxes. For cluster models, the default is to treat them as factors, but if ic=TRUE, then the model is treated as a components model.

fa.diagram does not use Rgraphviz and is the preferred function. fa.graph generates dot code to be used by an external graphics program. It does not have all the bells and whistles of fa.diagram, but these may be done in the external editor.

Hierarchical (bifactor) models may be drawn by specifying the g parameter as TRUE. This allows for graphical displays of various factor transformations with a bifactor structure (e.g., bifactor and biquartimin). See omega for an alternative way to find these structures.

The het.diagram function will show the case of a heterarchical structure at multiple levels. It can also be used to show the patterns of correlations between sets of scales (e.g., EPI, NEO, BFI). The example is for showing the relationship between 3 sets of 4 variables from the Thurstone data set. The parameters l.cex and gap.size are used to adjust the font size of the labels and the gap in the lines.
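
A minimal sketch (not run; the grouping of variables into levels is purely hypothetical and only meant to show the form of the levels argument):

#R <- cor(psychTools::bfi[1:15], use="pairwise")               #any correlation matrix will do
#het.diagram(R, levels=list(set1=1:5, set2=6:10, set3=11:15))  #three hypothetical sets of variables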

extension.diagram will draw a fa.extend result with slightly more control than using fa.diagram or the more generic diagram function.

In fa.rgraph although a nice graph is drawn for the orthogonal factor case, the oblique factor drawing is acceptable, but is better if cleaned up outside of R or done using fa.diagram.

The normal input is taken from the output of either fa or ICLUST. This latter case displays the ICLUST results in terms of the cluster loadings, not in terms of the cluster structure. Actually an interesting option.

It is also possible to just give a factor loading matrix as input. In this case, supplying a Phi matrix of factor correlations is also possible.

It is possible, using fa.graph, to export dot code for an omega solution. fa.graph should be applied to the schmid$sl object with labels specified as the rownames of schmid$sl. The results will need editing to make them fully compatible with dot language plotting.
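
For example (a sketch, not run; the output file name is arbitrary):

#sl <- schmid(Thurstone, 3)$sl                            #a Schmid-Leiman (omega style) solution
#fa.graph(sl, labels=rownames(sl), out.file="omega.dot")  #dot code, may need some hand editing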

To specify the model for a structural equation confirmatory analysis of the results, use structure.diagram instead.

Value

fa.diagram: A path diagram is drawn without using Rgraphviz. This is probably the more useful function.

fa.rgraph: A graph is drawn using rgraphviz. If an output file is specified, the graph instructions are also saved in the dot language.

fa.graph: the graph instructions are saved in the dot language.

Note

fa.rgraph requires Rgraphviz. Because there are occasional difficulties installing Rgraphviz from Bioconductor in that some libraries are misplaced and need to be relinked, it is probably better to use fa.diagram or fa.graph.

Author(s)

William Revelle

See Also

diagram, omega.graph, ICLUST.graph, bassAckward.diagram. structure.diagram will convert the factor diagram to sem modeling code.

Examples

test.simple <- fa(item.sim(16),2,rotate="oblimin")
#if(require(Rgraphviz)) {fa.graph(test.simple) } 
fa.diagram(test.simple)
f3 <- fa(Thurstone,3,rotate="cluster")
fa.diagram(f3)