Title: | Tools for Descriptive Statistics |
---|---|
Description: | A collection of miscellaneous basic statistic functions and convenience wrappers for efficiently describing data. The author's intention was to create a toolbox, which facilitates the (notoriously time consuming) first descriptive tasks in data analysis, consisting of calculating descriptive statistics, drawing graphical summaries and reporting the results. The package contains furthermore functions to produce documents using MS Word (or PowerPoint) and functions to import data from Excel. Many of the included functions can be found scattered in other packages and other sources written partly by Titans of R. The reason for collecting them here, was primarily to have them consolidated in ONE instead of dozens of packages (which themselves might depend on other packages which are not needed at all), and to provide a common and consistent interface as far as function and arguments naming, NA handling, recycling rules etc. are concerned. Google style guides were used as naming rules (in absence of convincing alternatives). The 'BigCamelCase' style was consequently applied to functions borrowed from contributed R packages as well. |
Authors: | Andri Signorell [aut, cre] , Ken Aho [ctb], Andreas Alfons [ctb], Nanina Anderegg [ctb], Tomas Aragon [ctb], Chandima Arachchige [ctb], Antti Arppe [ctb], Adrian Baddeley [ctb], Kamil Barton [ctb], Ben Bolker [ctb], Hans W. Borchers [ctb], Frederico Caeiro [ctb], Stephane Champely [ctb], Daniel Chessel [ctb], Leanne Chhay [ctb], Nicholas Cooper [ctb], Clint Cummins [ctb], Michael Dewey [ctb], Harold C. Doran [ctb], Stephane Dray [ctb], Charles Dupont [ctb], Dirk Eddelbuettel [ctb], Claus Ekstrom [ctb], Martin Elff [ctb], Jeff Enos [ctb], Richard W. Farebrother [ctb], John Fox [ctb], Romain Francois [ctb], Michael Friendly [ctb], Tal Galili [ctb], Matthias Gamer [ctb], Joseph L. Gastwirth [ctb], Vilmantas Gegzna [ctb], Yulia R. Gel [ctb], Sereina Graber [ctb], Juergen Gross [ctb], Gabor Grothendieck [ctb], Frank E. Harrell Jr [ctb], Richard Heiberger [ctb], Michael Hoehle [ctb], Christian W. Hoffmann [ctb], Soeren Hojsgaard [ctb], Torsten Hothorn [ctb], Markus Huerzeler [ctb], Wallace W. Hui [ctb], Pete Hurd [ctb], Rob J. Hyndman [ctb], Christopher Jackson [ctb], Matthias Kohl [ctb], Mikko Korpela [ctb], Max Kuhn [ctb], Detlew Labes [ctb], Friederich Leisch [ctb], Jim Lemon [ctb], Dong Li [ctb], Martin Maechler [ctb], Arni Magnusson [ctb], Ben Mainwaring [ctb], Daniel Malter [ctb], George Marsaglia [ctb], John Marsaglia [ctb], Alina Matei [ctb], David Meyer [ctb], Weiwen Miao [ctb], Giovanni Millo [ctb], Yongyi Min [ctb], David Mitchell [ctb], Cyril Flurin Moser [ctb], Franziska Mueller [ctb], Markus Naepflin [ctb], Danielle Navarro [ctb], Henric Nilsson [ctb], Klaus Nordhausen [ctb], Derek Ogle [ctb], Hong Ooi [ctb], Nick Parsons [ctb], Sandrine Pavoine [ctb], Tony Plate [ctb], Luke Prendergast [ctb], Roland Rapold [ctb], William Revelle [ctb], Tyler Rinker [ctb], Brian D. Ripley [ctb], Caroline Rodriguez [ctb], Nathan Russell [ctb], Nick Sabbe [ctb], Ralph Scherer [ctb], Venkatraman E. Seshan [ctb], Michael Smithson [ctb], Greg Snow [ctb], Karline Soetaert [ctb], Werner A. Stahel [ctb], Alec Stephenson [ctb], Mark Stevenson [ctb], Ralf Stubner [ctb], Matthias Templ [ctb], Duncan Temple Lang [ctb], Terry Therneau [ctb], Yves Tille [ctb], Luis Torgo [ctb], Adrian Trapletti [ctb], Joshua Ulrich [ctb], Kevin Ushey [ctb], Jeremy VanDerWal [ctb], Bill Venables [ctb], John Verzani [ctb], Pablo J. Villacorta Iglesias [ctb], Gregory R. Warnes [ctb], Stefan Wellek [ctb], Hadley Wickham [ctb], Rand R. Wilcox [ctb], Peter Wolf [ctb], Daniel Wollschlaeger [ctb], Joseph Wood [ctb], Ying Wu [ctb], Thomas Yee [ctb], Achim Zeileis [ctb] |
Maintainer: | Andri Signorell <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.99.58 |
Built: | 2024-12-09 07:02:51 UTC |
Source: | CRAN |
DescTools is an extensive collection of miscellaneous basic statistics functions and convenience wrappers not available in base R, for efficiently describing data.
The author's intention was to create a toolbox that facilitates the (notoriously time-consuming) first descriptive tasks in data analysis: calculating descriptive statistics, drawing graphical summaries and reporting the results. Special attention was paid to the integration of various approaches to the calculation of confidence intervals. For most basic statistics functions, variants are included that allow the use of weights. The package furthermore contains functions to produce documents using MS Word (or PowerPoint) and functions to import data from Excel.
A considerable part of the included functions can be found scattered in other packages and other sources, written partly by Titans of R. The reason for collecting them here was primarily to have them consolidated in ONE package instead of dozens (which themselves might depend on other packages that are not needed at all), and to provide a common and consistent interface as far as function and argument naming, NA handling, recycling rules etc. are concerned. Google style guides were used as naming rules (in absence of convincing alternatives). The 'CamelStyle' was consequently applied to functions borrowed from contributed R packages as well.
Feedback, feature requests, bug reports and other suggestions are welcome! Please report problems to Stack Overflow using the tag [desctools] or directly to the maintainer.
A grouped list of the functions:
Operators, calculus, transformations: | |
%()% | Between operators determine if a value lies within a range [a,b] |
%)(% | Outside operators: %)(%, %](%, %)[%, %][% |
%nin% | "not in" operator |
%overlaps% | Do two collections have common elements? |
%like%, %like any% | Simple operator to search for a specified pattern |
%^% | Powers of matrices |
Interval | The number of days of the overlapping part of two date periods |
AUC | Area under the curve |
Primes | Find all primes less than n |
Factorize | Prime factorization of integers |
Divisors | All divisors of an integer |
GCD | Greatest common divisor |
LCM | Least common multiple |
Permn | Determine all possible permutations of a set |
Fibonacci | Generates single Fibonacci numbers or a Fibonacci sequence |
DigitSum | Digit sum of a number |
Frac | Return the fractional part of a numeric value |
Ndec | Count decimal places of a number |
MaxDigits | Maximum used digits for a vector of numbers |
Prec | Precision of a number |
BoxCox, BoxCoxInv | Box Cox transformation and its inverse transformation |
BoxCoxLambda | Return the optimal lambda for a BoxCox transformation |
LogSt, LogStInv | Calculate started logarithmic transformation and its inverse |
Logit, LogitInv | Generalized logit and inverse logit function |
LinScale | Simple linear scaling of a vector x |
Winsorize | Data cleaning by winsorization |
Trim | Trim data by omitting outlying observations |
CutQ | Cut a numeric variable into quartiles or other quantiles |
Recode | Recode a factor with altered levels |
Rename | Change name(s) of a named object |
Sort | Sort extension for matrices and data.frames |
SortMixed, OrderMixed | Mixed sort order |
Rank | Calculate ranks including dense type for ties |
PercentRank | Calculate the percent rank |
RoundTo | Round to a multiple |
Large, Small | Returns the kth largest, resp. smallest values |
HighLow | Combines Large and Small . |
Rev | Reverses the order of rows and/or columns of a matrix or a data.frame |
Untable | Recreates the original list based on an n-dimensional frequency table |
CollapseTable | Collapse some rows/columns in a table. |
Dummy | Generate dummy codes for a factor |
FisherZ, FisherZInv | Fisher's z-transformation and its inverse |
Midx | Calculate sequentially the midpoints of the elements of a vector |
Unwhich | Inverse function to which , create a logical vector/matrix from indices |
Vigenere | Implements a Vigenère cipher, both encryption and decryption |
BinTree, PlotBinTree | Create and plot a binary tree structure with a given length |
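A brief, hedged sketch of a few of the operators and transformations listed above (assuming the DescTools package is attached; results in the comments are indicative only):
library(DescTools)
5 %()% c(1, 10)            # is 5 strictly between 1 and 10?  TRUE
c("a", "B") %nin% letters  # which elements are NOT in letters?  FALSE TRUE
RoundTo(123, 25)           # round 123 to the nearest multiple of 25  ->  125
Winsorize(c(1, 2, 3, 100)) # pull extreme values in towards the default quantile bounds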
Information and manipulation functions: | |
AllDuplicated | Find all values involved in ties |
Closest | Return the value in a vector being closest to a given one |
Coalesce | Return the first value in a vector not being NA |
ZeroIfNA, NAIfZero | Replace NAs by 0, resp. vice versa |
Impute | Replace NAs by the median or another value |
LOCF | Imputation of datapoints following the "last observation carried forward" rule |
CombN | Returns the number of subsets out of a list of elements |
CombSet | Generates all possible subsets out of a list of elements |
CombPairs | Generates all pairs out of one or two sets of elements |
SampleTwins | Create sample using stratifying groups |
RndPairs | Create pairs of correlated random numbers |
RndWord | Produce random combinations of characters |
IsNumeric | Check a vector for being numeric, zero or a whole number |
IsWhole | Is x a whole number? |
IsDichotomous | Check if x contains exactly 2 values |
IsOdd | Is x even or odd? |
IsPrime | Is x a prime number? |
IsZero | Is numeric(x) == 0, say x < machine.eps? |
IsEuclid | Check if a distance matrix is Euclidean |
Label, Unit | Get or set the label , resp. unit , attribute of an object |
Abind | Bind matrices to n-dimensional arrays |
Append | Append elements to several classes of objects |
VecRot, VecShift | Shift the elements of a vector in a circular mode to the right or to the left by n positions |
Clockwise | Transform angles from counter clock into clockwise mode |
split.formula | A formula interface for the base function split |
reorder.factor | Reorder the levels of a factor |
ToLong, ToWide | Simple reshaping of a vector |
SetNames | Set the names, rownames or columnnames in an object and return it |
Some | Return some randomly chosen elements of an object |
SplitAt | Split a vector into several pieces at given positions |
SplitToCol | Splits the columns of a data frame using a split character |
SplitPath | Split a path string in drive, path, filename |
Str | Compactly display the structure of any R object |
TextToTable | Converts a string to a table |
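A small, hedged illustration of some of the manipulation helpers above (a sketch only; see the respective help pages for the exact argument sets; DescTools assumed attached):
ZeroIfNA(c(2, NA, 5))                  # replace NA by 0  ->  2 0 5
Impute(c(2, NA, 5))                    # replace NA by the median of the vector
IsWhole(c(1, 1.5, 2))                  # TRUE FALSE TRUE
VecRot(1:5, 2)                         # rotate the elements circularly by 2 positions
SetNames(1:3, names=c("a", "b", "c"))  # set names and return the object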
String functions: | |
StrCountW | Count the words in a string |
StrTrim | Delete white spaces from a string |
StrTrunc | Truncate a string to a given length and add an ellipsis if it really was truncated |
StrLeft, StrRight | Return the left/right part of a string |
StrAlign | Align strings to the left/right/center or to a given character |
StrAbbr | Abbreviates a string |
StrCap | Capitalize the first letter of a string |
StrPad | Fill a string with defined characters to fit a given length |
StrRev | Reverse a string |
StrChop | Split a string by a fixed number of characters. |
StrExtract | Extract a part of a string, defined as regular expression. |
StrVal | Extract numeric values from a string |
StrIsNumeric | Check whether a string contains only numeric data |
StrPos | Find position of first occurrence of a string in another one |
StrDist | Compute Levenshtein or Hamming distance between strings |
FixToTable | Create table out of a running text, by using columns of spaces as delimiter |
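A hedged sketch of a few of the string helpers listed above (DescTools assumed attached; results in the comments are indicative):
StrCountW("not a lot of words here")   # count words  ->  6
StrRev("DescTools")                    # reverse a string
StrTrim("  some text  ")               # strip surrounding white space
StrPos("abcabc", "b")                  # position of the first "b"  ->  2
StrVal("roughly 12.4 kg or 27 lbs")    # extract the numeric values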
Conversion functions: | |
AscToChar, CharToAsc | Converts ASCII codes to characters and vice versa |
DecToBin, BinToDec | Converts numbers from binmode to decimal and vice versa |
DecToHex, HexToDec | Converts numbers from hexmode to decimal and vice versa |
DecToOct, OctToDec | Converts numbers from octmode to decimal and vice versa |
DegToRad, RadToDeg | Convert degrees to radians and vice versa |
CartToPol, PolToCart | Transform cartesian to polar coordinates and vice versa |
CartToSph, SphToCart | Transform cartesian to spherical coordinates and vice versa |
RomanToInt | Convert roman numerals to integers |
RgbToLong, LongToRgb | Convert a rgb color to a long number and vice versa |
ColToGray, ColToGrey | Convert colors to grey/grayscale |
ColToHex, HexToCol | Convert a color into hex string |
HexToRgb | Convert a hexnumber to an RGB-color |
ColToHsv | R color to HSV conversion |
ColToRgb, RgbToCol | Color to RGB conversion and back |
ConvUnit | Return the most common unit conversions |
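A short, hedged sketch of the conversion helpers above (DescTools assumed attached):
DecToBin(17)          # decimal to binary  ->  "10001"
BinToDec("10001")     # and back           ->  17
RomanToInt("MMXIV")   # ->  2014
DegToRad(180)         # ->  pi
ColToHex("red")       # ->  "#FF0000"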
Colors: | |
SetAlpha | Add transparency (alpha channel) to a color |
ColorLegend | Add a color legend to a plot |
FindColor | Get color on a defined color range |
MixColor | Get the mix of two colors |
TextContrastColor | Choose textcolor depending on background color |
Pal | Some custom color palettes |
Plots (low level): | |
Canvas | Canvas for geometric plotting |
Mar | Set margins more comfortably. |
Asp | Return aspect ratio of the current plot |
LineToUser | Convert line coordinates to user coordinates |
lines.loess | Add a loess smoother and its CIs to an existing plot |
lines.lm | Add the prediction of linear model and its CIs to a plot |
lines.smooth.spline | Add the prediction of a smooth.spline and its CIs to a plot |
BubbleLegend | Add a legend for bubbles to a bubble plot |
TitleRect | Add a main title to a plot surrounded by a rectangular box |
BarText | Add the value labels to a barplot |
ErrBars | Add horizontal or vertical error bars to an existing plot |
DrawArc, DrawRegPolygon | Draw elliptic, circular arc(s) or regular polygon(s) |
DrawCircle, DrawEllipse | Draw a circle, a circle annulus or a sector or an annulus |
DrawBezier | Draw a Bezier curve |
DrawBand | Draw confidence band |
BoxedText | Add text surrounded by a box to a plot |
Rotate | Rotate a geometric structure |
SpreadOut | Spread out a vector of numbers so that there is a minimum interval between any two elements; this can be used to place text labels in a plot so that they do not overlap |
IdentifyA | Helps identify all the points in a specific area |
identify.formula | Formula interface for identify . |
PtInPoly | Identify all the points within a polygon. |
ConnLines | Calculate and insert connecting lines in a barplot |
AxisBreak | Place a break mark on an axis |
Shade | Produce a shaded curve |
Stamp | Stamp the current plot with Date/Time/Directory or any other expression |
Plots (high level): | |
PlotACF, PlotGACF | Create a combined plot of a time series including its autocorrelation and partial autocorrelation |
PlotMonth | Plot seasonal effects of a univariate time series |
PlotArea | Create an area plot |
PlotBag | Create a two-dimensional boxplot |
PlotBagPairs | Produce pairwise 2-dimensional boxplots (bagplot) |
PlotBubble | Draw a bubble plot |
PlotCandlestick | Plot candlestick chart |
PlotCirc | Create a circular plot |
PlotCorr | Plot a correlation matrix |
PlotDot | Plot a dotchart with confidence intervals |
PlotFaces | Produce a plot of Chernoff faces |
PlotFdist | Frequency distribution plot, a combination of histogram, boxplot and ecdf plot |
PlotMarDens | Scatterplot with marginal densities |
PlotMultiDens | Plot multiple density curves |
PlotPolar | Plot values on a circular grid |
PlotFun | Plot mathematical expression or a function |
PolarGrid | Plot a grid in polar coordinates |
PlotPyramid | Pyramid plot (back-to-back histogram) |
PlotTreemap | Plot of a treemap. |
PlotVenn | Plot a Venn diagram |
PlotViolin | Plot violins instead of boxplots |
PlotQQ | QQ plot for an arbitrary distribution |
PlotWeb | Create a web plot |
PlotTernary | Create a triangle or ternary plot |
PlotMiss | Plot missing values |
PlotECDF | Plot empirical cumulative distribution function |
PlotLinesA | Plot the columns of one matrix against the columns of another |
PlotLog | Create a plot with logarithmic axis and log grid |
PlotMosaic | Plots a mosaic describing a contingency table in array form |
Distributions: | |
_Benf | Benford distribution, including qBenf, dBenf, rBenf |
_ExtrVal | Extreme value distribution (dExtrVal) |
_Frechet | Frechet distribution (dFrechet) |
_GenExtrVal | Generalized Extreme Value Distribution (dGenExtrVal) |
_GenPareto | Generalized Pareto Distribution (dGenPareto) |
_Gompertz | Gompertz distribution (dGompertz) |
_Gumbel | Gumbel distribution (dGumbel) |
_NegWeibull | Negative Weibull distribution (dNegWeibull) |
_Order | Distributions of Order Statistics (dOrder) |
_RevGumbel | Reverse Gumbel distribution (dRevGumbel), |
_RevGumbelExp | Exponential reverse Gumbel distribution (quantile only) |
_RevWeibull | Reverse Weibull distribution (dRevWeibull) |
Statistics: | |
Freq | Univariate frequency table |
PercTable | Bivariate percentage table |
Margins | (Extended) margin tables of a table |
ExpFreq | Expected frequencies of an n-dimensional table |
Mode | Mode, the most frequent value (including frequency) |
Gmean, Gsd | Geometric mean and geometric standard deviation |
Hmean | Harmonic Mean |
Median | Extended median function supporting weights and ordered factors |
HuberM, TukeyBiweight | Huber M-estimator of location and Tukey's biweight robust mean |
HodgesLehmann | the Hodges-Lehmann estimator |
HoeffD | Hoeffding's D statistic |
MeanSE | Standard error of mean |
MeanCI, MedianCI | Confidence interval for the mean and median |
MeanDiffCI | Confidence interval for the difference of two means |
MoveAvg | Moving average |
MeanAD | Mean absolute deviation |
VarCI | Confidence interval for the variance |
CoefVar | Coefficient of variation and its confidence interval |
RobScale | Robust data standardization |
Range | (Robust) range |
BinomCI, MultinomCI | Confidence intervals for binomial and multinomial proportions |
BinomDiffCI | Calculate confidence interval for a risk difference |
BinomRatioCI | Calculate confidence interval for the ratio of binomial proportions. |
PoissonCI | Confidence interval for a Poisson lambda |
Skew, Kurt | Skewness and kurtosis |
YuleQ, YuleY | Yule's Q and Yule's Y |
TschuprowT | Tschuprow's T |
Phi, ContCoef, CramerV | Phi, Pearson's Contingency Coefficient and Cramer's V |
GoodmanKruskalGamma | Goodman Kruskal's gamma |
KendallTauA | Kendall's tau-a |
KendallTauB | Kendall's tau-b |
StuartTauC | Stuart's tau-c |
SomersDelta | Somers' delta |
Lambda | Goodman Kruskal's lambda |
GoodmanKruskalTau | Goodman Kruskal's tau |
UncertCoef | Uncertainty coefficient |
Entropy, MutInf | Shannon's entropy, mutual information |
DivCoef, DivCoefMax | Rao's diversity coefficient ("quadratic entropy") |
TheilU | Theil's U1 and U2 coefficient |
Assocs | Combines the association measures above. |
OddsRatio, RelRisk | Odds ratio and relative risk |
ORToRelRisk | Transform odds ratio to relative risk |
CohenKappa, KappaM | Cohen's kappa, weighted kappa and kappa for more than 2 raters |
CronbachAlpha | Cronbach's alpha |
ICC | Intraclass correlations |
KrippAlpha | Return Krippendorff's alpha coefficient |
KendallW | Compute the Kendall coefficient of concordance |
Lc | Calculate and plot Lorenz curve |
Gini, Atkinson | Gini- and Atkinson coefficient |
Herfindahl, Rosenbluth | Herfindahl- and Rosenbluth coefficient |
GiniSimpson | Compute Gini-Simpson Coefficient |
CorCI | Confidence interval for Pearson's correlation coefficient |
CorPart | Find the correlations for a set x of variables with set y removed |
CorPolychor | Polychoric correlation coefficient |
SpearmanRho | Spearman rank correlation and its confidence intervals |
ConDisPairs | Return concordant and discordant pairs of two vectors |
FindCorr | Determine highly correlated variables |
CohenD | Cohen's Effect Size |
EtaSq | Effect size calculations for ANOVAs |
Contrasts | Generate pairwise contrasts for use in a post-hoc test |
Strata | Stratified sampling with equal/unequal probabilities |
Outlier | Outliers following Tukey's boxplot definition |
LOF | Local outlier factor |
BrierScore | Brier score, assessing the quality of predictions of binary events |
Cstat | C statistic, equivalent to the area under the ROC curve |
CCC | Lin's concordance correlation coef for agreement on a continuous measure |
MAE | Mean absolute error |
MAPE, SMAPE | Mean absolute and symmetric mean absolute percentage error |
MSE, RMSE | Mean squared error and root mean squared error |
NMAE, NMSE | Normalized mean absolute and mean squared error |
Conf | Confusion matrix, a cross-tabulation of observed and predicted classes with associated statistics |
Sens, Spec | Sensitivity and specificity |
PseudoR2 | Variants of pseudo R-squared statistics: McFadden, Aldrich-Nelson, Nagelkerke, Cox-Snell, Efron, McKelvey-Zavoina, Tjur |
Mean, SD, Var, IQRw, Quantile, MAD, Cor | Variants of base statistics allowing weights to be defined: mean, standard deviation, variance, quantile, MAD, correlation |
VIF, StdCoef | Variance inflation factors and standardized coefficients for linear models |
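A hedged sketch of a few of the descriptive statistics above (an illustration, not a full specification of the argument lists; DescTools assumed attached):
x <- c(2, 4, 4, 8, 16)
Gmean(x)        # geometric mean
Mode(x)         # most frequent value (here 4)
MeanCI(x)       # mean with a confidence interval
Skew(x)         # skewness
Kurt(x)         # kurtosis
BinomCI(37, 43) # confidence interval for a binomial proportion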
Tests: | |
SignTest | Sign test to test whether two groups are equally sized |
ZTest | Z-test for known population variance |
TTestA | Student's t-test based on sample statistics |
JonckheereTerpstraTest | Jonckheere-Terpstra trend test for medians |
PageTest | Page test for ordered alternatives |
CochranQTest | Cochran's Q-test to find differences in matched sets of three or more frequencies or proportions |
VarTest | Chi-squared test for one variance and F test for two variances |
SiegelTukeyTest | Siegel-Tukey test for equality in variability |
SiegelTukeyRank | Calculate Siegel-Tukey's ranks (auxiliary function) |
LeveneTest | Levene's test for homogeneity of variance |
MosesTest | Moses Test of extreme reactions |
RunsTest | Runs test for detecting non-randomness |
DurbinWatsonTest | Durbin-Watson test for autocorrelation |
BartelsRankTest | Bartels rank test for randomness |
JarqueBeraTest | Jarque-Bera Test for normality |
AndersonDarlingTest | Anderson-Darling test for normality |
CramerVonMisesTest | Cramer-von Mises test for normality |
LillieTest | Lilliefors (Kolmogorov-Smirnov) test for normality |
PearsonTest | Pearson chi-square test for normality |
ShapiroFranciaTest | Shapiro-Francia test for normality |
MHChisqTest | Mantel-Haenszel Chisquare test |
StuartMaxwellTest | Stuart-Maxwell marginal homogeneity test |
LehmacherTest | Lehmacher marginal homogeneity test |
CochranArmitageTest | Cochran-Armitage test for trend in binomial proportions |
BreslowDayTest, WoolfTest | Test for homogeneity on 2x2xk tables over strata |
PostHocTest | Post-hoc tests by Scheffe, LSD, Tukey for an aov object |
ScheffeTest | Multiple comparisons Scheffe test |
DunnTest | Dunn's test of multiple comparisons |
DunnettTest | Dunnett's test of multiple comparisons |
ConoverTest | Conover's test of multiple comparisons (following a Kruskal-Wallis test) |
NemenyiTest | Nemenyi's test of multiple comparisons |
HotellingsT2Test | Hotelling's T2 test for the one and two sample case |
YuenTTest | Yuen's robust t-Test with trimmed means and winsorized variances |
BarnardTest | Barnard's test for 2x2 tables |
BreuschGodfreyTest | Breusch-Godfrey test for higher-order serial correlation. |
GTest | Chi-squared contingency table test and goodness-of-fit test |
HosmerLemeshowTest | Hosmer-Lemeshow goodness of fit tests |
VonNeumannTest | Von Neumann's successive difference test |
Date functions: | |
day.name, day.abb | Defined names of the days |
AddMonths | Add a number of months to a given date |
IsDate | Check whether x is a date object |
IsWeekend | Check whether x falls on a weekend |
IsLeapYear | Check whether x is a leap year |
LastDayOfMonth | Return the last day of the month of the date x |
DiffDays360 | Calculate the difference of two dates using the 360-days system |
Date | Create a date from numeric representation of year, month, day |
Day, Month, Year | Extract part of a date |
Hour, Minute, Second | Extract part of time |
Week, Weekday | Returns ISO week and weekday of a date |
Quarter | Quarter of a date |
Timezone | Timezone of a POSIXct/POSIXlt date |
YearDay, YearMonth | The day in the year of a date |
Now, Today | Get current date or date-time |
HmsToSec, SecToHms | Convert h:m:s times to seconds and vice versa |
Overlap | Determine if and how extensively two date ranges overlap |
Zodiac | The zodiac sign of a date :-) |
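A hedged sketch using some of the date helpers above (DescTools assumed attached; results in the comments are indicative):
d <- as.Date("2024-02-10")
Year(d)            # 2024
Month(d)           # 2
IsLeapYear(2024)   # TRUE
LastDayOfMonth(d)  # "2024-02-29"
AddMonths(d, 3)    # "2024-05-10"
Zodiac(d)          # the zodiac sign, just for fun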
Finance functions: | |
OPR | One period returns (simple and log returns) |
NPV | Net present value |
NPVFixBond | Net present value for fixed-rate bonds |
IRR | Internal rate of return |
YTM | Return yield to maturity for a bond |
SLN, DB, SYD | Several methods of depreciation of an asset |
GUI-Helpers: | |
PasswordDlg | Display a dialog containing an edit field, showing only ***. |
Reporting, InOut: | |
CatTable | Print a table with the option to have controlled linebreaks |
Format, Fmt | Easy format for numbers and dates |
Desc | Produce a rich description of an object |
Abstract | Display compact overview of the structure of a data frame |
TMod | Create comparison table for (general) linear models |
TOne | Create "Table One" describing baseline characteristics |
GetNewWrd, GetNewXL, GetNewPP | Create a new Word, Excel or PowerPoint Instance |
GetCurrWrd, GetCurrXL, GetCurrPP | Get a handle to a running Word, Excel or PowerPoint instance |
WrdKill, XLKill | Ends a (possibly hidden) Word/Excel process |
IsValidHwnd | Check if the handle to a MS Office application is valid or outdated |
WrdCaption | Insert a title in Word |
WrdFont | Get and set the font for the current selection in Word |
WrdParagraphFormat | Get and set the paragraph format |
WrdTable | Create a table in Word |
WrdCellRange | Select a cell range of a table in Word |
WrdMergeCells | Merge cells of a table in Word |
WrdFormatCells | Format selected cells of a table in word |
WrdTableBorders | Set or edit table border style of a table in Word |
ToWrd, ToXL | More flexible wrappers to send diverse objects to Word, resp. Excel |
WrdPlot | Insert the active plot to Word |
WrdInsertBookmark | Insert a new bookmark in a Word document |
WrdDeleteBookmark | Delete an existing bookmark in a Word document |
WrdGoto | Place cursor to a specific bookmark, or another text position. |
WrdUpdateBookmark | Update the text of a bookmark's range |
WrdSaveAs | Saves documents in Word |
WrdStyle | Get and set the style of a paragraph in Word |
XLDateToPOSIXct | Convert XL-Date format to POSIXct format |
XLGetRange | Get the values of one or several cell range(s) in Excel |
XLGetWorkbook | Get the values of all sheets of an Excel workbook |
XLView | Use Excel as viewer for a data.frame |
PpPlot | Insert active plot to PowerPoint |
PpAddSlide | Adds a slide to a PowerPoint presentation |
PpText | Adds a textbox with text to a PP-presentation |
ParseSASDatalines | Parse a SAS "datalines" statement to read data |
Tools: | |
PairApply | Helper for calculating functions pairwise |
LsFct, LsObj | List the functions (or the data, all objects) of a package |
FctArgs | Retrieve the arguments of a function |
InDots | Check if an argument is contained in the ... argument and return its value |
ParseFormula | Parse a formula and return its split parts |
Recycle | Recycle a list of elements to the maximal found dimension |
Keywords | Get the keywords of a man page |
SysInfo | Get some more information about system and environment |
DescToolsOptions | Get the DescTools specific options |
PDFManual | Get the pdf-manual of any package on CRAN and open it |
Data: | |
d.pizza | Synthetic dataset created for testing the description routines |
d.whisky | Dataset describing Scotch single malts |
Reference Data: | |
d.units, d.prefix | Unit conversion factors and metric prefixes |
d.periodic | Periodic table of elements |
d.countries | ISO 3166-1 country codes |
roulette, cards, tarot | Datasets for probabilistic simulation |
This package is still under development. Although the code seems meanwhile quite stable, until the release of version 1.0 (which is expected in, hmm, the near future?) you should be aware that everything in the package might be subject to change. Backward compatibility is not yet guaranteed. Functions may be deleted or renamed and new syntax may be inconsistent with earlier versions. With the release of version 1.0 the usual "deprecated-defunct" process will be installed.
To make use of MS Office features you must have Office in one of its variants installed, as well as the package RDCOMClient. This package uses the COM interface to control the Office applications. There is no direct equivalent to the COM interface on Mac or Linux, hence the use of these functions is restricted to Windows systems. All Wrd*, XL* and Pp* functions require this basis to run.
RDCOMClient can be installed with:
install.packages("RDCOMClient", repos="http://www.omegahat.net/R")
The omegahat repository does not benefit from the same update service as CRAN. So you may be forced to install a package compiled with an earlier version, which usually is no problem. For R 4.2 you can use:
url <- "http://www.omegahat.net/R/bin/windows/contrib/4.2/RDCOMClient_0.96-1.zip"
install.packages(url, repos=NULL, type="binary")
RDCOMClient does not exist for Mac or Linux, sorry.
Andri Signorell
Helsana Versicherungen AG, Health Sciences, Zurich
HWZ University of Applied Sciences in Business Administration Zurich.
R is a community project. This can also be seen in this package, which contains R source code and/or documentation previously published elsewhere by (in alphabetical order):
Ken Aho, Andreas Alfons, Nanina Anderegg, Tomas Aragon, Antti Arppe, Adrian Baddeley, Kamil Barton, Ben Bolker, Hans W. Borchers, Frederico Caeiro, Stephane Champely, Daniel Chessel, Leanne Chhay, Clint Cummins, Michael Dewey, Harold C. Doran, Stephane Dray, Charles Dupont, Dirk Eddelbuettel, Jeff Enos, Claus Ekstrom, Martin Elff, Kamil Erguler, Richard W. Farebrother, John Fox, Romain Francois, Michael Friendly, Tal Galili, Matthias Gamer, Joseph L. Gastwirth, Yulia R. Gel, Juergen Gross, Gabor Grothendieck, Frank E. Harrell Jr, Richard Heiberger, Michael Hoehle, Christian W. Hoffmann, Soeren Hojsgaard, Torsten Hothorn, Markus Huerzeler, Wallace W. Hui, Pete Hurd, Rob J. Hyndman, Pablo J. Villacorta Iglesias, Christopher Jackson, Matthias Kohl, Mikko Korpela, Max Kuhn, Detlew Labes, Duncan Temple Lang, Friederich Leisch, Jim Lemon, Dong Li, Martin Maechler, Arni Magnusson, Daniel Malter, George Marsaglia, John Marsaglia, Alina Matei, David Meyer, Weiwen Miao, Giovanni Millo, Yongyi Min, David Mitchell, Franziska Mueller, Markus Naepflin, Danielle Navarro, Henric Nilsson, Klaus Nordhausen, Derek Ogle, Hong Ooi, Nick Parsons, Sandrine Pavoine, Tony Plate, Roland Rapold, William Revelle, Tyler Rinker, Brian D. Ripley, Caroline Rodriguez, Nathan Russell, Nick Sabbe, Venkatraman E. Seshan, Greg Snow, Michael Smithson, Karline Soetaert, Werner A. Stahel, Alec Stephenson, Mark Stevenson, Matthias Templ, Terry Therneau, Yves Tille, Adrian Trapletti, Joshua Ulrich, Kevin Ushey, Jeremy VanDerWal, Bill Venables, John Verzani, Gregory R. Warnes, Stefan Wellek, Hadley Wickham, Rand R. Wilcox, Peter Wolf, Daniel Wollschlaeger, Thomas Yee, Achim Zeileis
Special thanks go to Beat Bruengger, Mathias Frueh, Daniel Wollschlaeger, Vilmantas Gegzna for their valuable contributions and testing.
The good things come from all these guys, any problems are likely due to my tweaking.
Thank you all!
Maintainer: Andri Signorell <[email protected]>
# ******************************************************
# There are no examples defined here. But see the demos:
#
# demo(describe)
# demo(plots)
#
# ******************************************************
%c% is just a short operator implementation for paste(x, y, sep="").
x %c% y
x | first string |
y | second string, which will be pasted behind the first one. |
R-Core does not consider it a good idea to use + for an operation that is not commutative, so c is used here.
See the discussion:
https://stat.ethz.ch/pipermail/r-devel/2006-August/039013.html
and https://stackoverflow.com/questions/1319698/why-doesnt-operate-on-characters-in-r?lq=1
Still the paste syntax seems sometimes clumsy in daily life and so %c% might spare some keys.
Returns the concatenation as a string.
Andri Signorell <[email protected]>
"foo" %c% "bar" # works with numerics as well 345 %c% 457
"foo" %c% "bar" # works with numerics as well 345 %c% 457
The like operator is a simple wrapper for grep(..., value=TRUE), whose complexity is hard to crack for R newbies.
x %like% pattern
x %like any% pattern
x | a vector, typically of character or factor type |
pattern | simple character string to be matched in the given character vector. |
Follows the logic of simple SQL or basic commands.
a vector (numeric, character, factor), matching the mode of x
Andri Signorell <[email protected]>
match, pmatch, grep, %[]%, %overlaps%
# find names ending on "or"
names(d.pizza) %like% "%or"

# find names starting with "d"
names(d.pizza) %like% "d%"

# ... containing er?
names(d.pizza) %like% "%er%"

# and combined, search for a name containing "un", ending on "or"
# or beginning with "F"
levels(d.pizza$driver) %like any% c("%un%", "%or", "F%")

# the positions on the vector
match(names(d.pizza) %like% "%er%", names(d.pizza))
%nin% is a binary operator, which returns a logical vector indicating if there is a match or not for its left operand. A TRUE vector element indicates no match in the left operand, FALSE indicates a match.
x %nin% table
x | a vector (numeric, character, factor) |
table | a vector (numeric, character, factor), matching the mode of x |
A vector of logical values with length equal to the length of x.
Frank E Harrell Jr <[email protected]>
c('a','b','c') %nin% c('a','b')
%overlaps% determines if two date ranges overlap at all and returns a logical value. Interval returns the number of days of the overlapping part of the two date periods. Inspired by the eponymous SQL-functions.
x %overlaps% y
Overlap(x, y)
Interval(x, y)
x | range 1, vector of 2 numeric values or matrix with 2 columns, the first defining the left point, the second the right point of the range. |
y | range 2, vector of 2 numeric values or matrix with 2 columns, the first defining the left point, the second the right point of the range. |
%overlaps% returns TRUE or FALSE depending on whether the two ranges overlap. The function Overlap returns the range of the overlapping region as a numeric value. This will be 0 if the ranges do not overlap. Interval returns the width of the empty space between two ranges. Again this will be 0 if the ranges overlap.
To handle overlapping ranges there are 4 cases to consider:
range a: |--------------|
range b: |-----|
range c: |--------|
range d: |-----|
1 2 3 4 5 6 7 8
Ranges a and b overlap; the function Overlap will return the absolute value of the overlapping region (which will be 3 - 2 = 1 in this case). The result will be the same for Overlap(a, b) and Overlap(b, a). Interval, however, has a direction. Ranges b and c do not overlap: Overlap will return 0 and %overlaps% FALSE. Interval will return 2 for the case Interval(a, b) and -2 for Interval(b, a).
This function can be of value if one has to decide whether confidence intervals overlap or not.
returns a logical vector (match or not for each element of x).
Interval and Overlap return a numeric vector.
Andri Signorell <[email protected]>
Similar operators: Between, %like%. For calculating the overlapping time: difftime.
as.Date(c("2012-01-03", "2012-02-03")) %overlaps% as.Date(c("2012-03-01", "2012-03-03"))
as.Date(c("2012-01-03", "2012-02-03")) %overlaps% as.Date(c("2012-01-15", "2012-01-21"))

Interval(as.Date(c("2012-01-03", "2012-02-03")),
         as.Date(c("2012-03-01", "2012-03-03")))

# both ranges are recycled if necessary
as.Date("2012-01-03") %overlaps% as.Date(c("2012-03-01", "2012-03-03"))

# works with numerics as well
c(1, 18) %overlaps% c(10, 45)
Return the xy.coordinates for the literal positions "bottomright", etc. as used to place legends.
ABCCoords(x = "topleft", region = "figure", cex = NULL, linset = 0, ...)
x | one out of "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "topright", "right" or "center", as used to position legends. |
region | one out of "figure" or "plot". |
cex | the character extension for the text. |
linset | line inset in lines of text. |
... | the dots are passed on. |
The same logic as for the legend can be useful for placing texts, too. This function returns the coordinates for the text, which can be used in the specific text functions.
nothing returned
Andri Signorell <[email protected]>
plot(x = rnorm(10), type="n", xlab="", ylab="")
# note that plot.new() has to be called before we can grab the geometry
ABCCoords("bottomleft")
lapply(c("bottomleft", "left"), ABCCoords)

plot(x = rnorm(10), type="n", xlab="", ylab="")
text(x=(xy <- ABCCoords("bottomleft", region = "plot"))$xy,
     labels = "My Maybe Long Text", adj = xy$adj, xpd=NA)
text(x=(xy <- ABCCoords("topleft", region = "figure"))$xy,
     labels = "My Maybe Long Text", adj = xy$adj, xpd=NA)

plot(x = rnorm(10), type="n", xlab="", ylab="")
sapply(c("topleft", "top", "topright", "left", "center", "right",
         "bottomleft", "bottom", "bottomright"),
       function(x)
         text(x=(xy <- ABCCoords(x, region = "plot", linset=1))$xy,
              labels = "MyMarginText", adj = xy$adj, xpd=NA))

plot(x = rnorm(100), type="n", xlab="", ylab="",
     panel.first={Bg(c("red", "lightyellow")); grid()})
xy <- ABCCoords("topleft", region = "plot")
par(xpd=NA)
BoxedText(x=xy$xy$x, y=xy$xy$y, xpad = 1, ypad = 1,
          labels = "My Maybe Long Text", adj = xy$adj, col=SetAlpha("green", 0.8))
Base R functions cbind and rbind bind columns and rows, but there is no built-in function for binding higher-dimensional data structures such as arrays. Abind takes a sequence of vectors, matrices, or arrays and produces a single array of the same or higher dimension.
Abind(..., along = N, rev.along = NULL, new.names = NULL, force.array = TRUE, make.names = FALSE, use.first.dimnames = FALSE, hier.names = FALSE, use.dnns = FALSE)
... | Any number of vectors, matrices, arrays, or data frames. The dimensions of all the arrays must match, except on one dimension (specified by along). Alternatively, there can be one (and only one) list argument supplied, whose components are the objects to be bound together. Names of the list components are treated in the same way as argument names. |
along | The dimension along which to bind the arrays. The default is the last dimension, i.e., the maximum length of the dim attribute of the supplied arrays. |
rev.along | Alternate way to specify the dimension along which to bind the arrays. |
new.names |
If new.names is a list, it is the first choice for the
dimnames attribute of the result. It should have the same
structure as a dimnames attribute. If the names for a
particular dimension are If |
force.array |
If |
make.names |
If |
use.first.dimnames | When dimension names are present on more than one argument, should dimension names for the result be taken from the first available? The default is to take them from the last available, which is the same behavior as rbind and cbind. |
hier.names |
If |
use.dnns |
(default |
The dimensions of the supplied vectors or arrays do not need to be identical, e.g., arguments can be a mixture of vectors and matrices. Abind coerces arguments by the addition of one dimension in order to make them consistent with the other arguments and along=. The extra dimension is added in the place specified by along=.

The default action of Abind is to concatenate on the last dimension, rather than increase the number of dimensions. For example, the result of calling Abind with vectors is a longer vector (see first example below). This differs from the action of rbind and cbind, which is to return a matrix when called with vectors. Abind can be made to behave like cbind on vectors by specifying along=2, and like rbind by specifying along=0.
The dimnames of the returned object are pieced together from the dimnames of the arguments, and the names of the arguments. Names for each dimension are searched for in the following order: new.names, argument name, dimnames (or names) attribute of the last argument, dimnames (or names) attribute of the second last argument, etc. (Supplying the argument use.first.dimnames=TRUE changes this, causing Abind to use dimnames or names from the first argument first. The default behavior is the same as for rbind and cbind: use dimnames from later arguments.) If some names are supplied for the along dimension (either as argument names or dimnames in arguments), names are constructed for anonymous arguments unless make.names=FALSE.
An array with a dim attribute calculated as follows.

Let rMin=min(sapply(list(...), function(x) length(dim(x)))) and rMax=max(sapply(list(...), function(x) length(dim(x)))) (where the length of the dimensions of a vector is taken to be 1). Then rMax should be equal to or one greater than rMin.

If along refers to an existing dimension, then the length of the dim attribute of the result is rMax. If along does not refer to an existing dimension, then rMax should equal rMin and the length of the dim attribute of the result will be rMax+1.
rbind or cbind are called to compute the result if (a) force.array=FALSE, and (b) the result will be a two-dimensional object.
It would be nice to make Abind() an S3 generic, but S3 generics cannot dispatch off anonymous arguments.

The ability of Abind() to accept a single list argument removes much of the need for constructs like do.call("Abind", list.of.arrays). Instead, just do Abind(list.of.arrays). The direct construct is preferred because the do.call() construct can sometimes consume more memory during evaluation.
Tony Plate <[email protected]> and Richard Heiberger
# Five different ways of binding together two matrices
x <- matrix(1:12, 3, 4)
y <- x + 100
dim(Abind(x, y, along=0))     # binds on new dimension before first
dim(Abind(x, y, along=1))     # binds on first dimension
dim(Abind(x, y, along=1.5))
dim(Abind(x, y, along=2))
dim(Abind(x, y, along=3))
dim(Abind(x, y, rev.along=1)) # binds on last dimension
dim(Abind(x, y, rev.along=0)) # binds on new dimension after last

# Unlike cbind or rbind in that the default is to bind
# along the last dimension of the inputs, which for vectors
# means the result is a vector (because a vector is
# treated as an array with length(dim(x))==1).
Abind(x=1:4, y=5:8)

# Like cbind
Abind(x=1:4, y=5:8, along=2)
Abind(x=1:4, matrix(5:20, nrow=4), along=2)
Abind(1:4, matrix(5:20, nrow=4), along=2)

# Like rbind
Abind(x=1:4, matrix(5:20, nrow=4), along=1)
Abind(1:4, matrix(5:20, nrow=4), along=1)

# Create a 3-d array out of two matrices
Abind(x=matrix(1:16, nrow=4), y=matrix(17:32, nrow=4), along=3)

# Use of hier.names
Abind(x=cbind(a=1:3, b=4:6), y=cbind(a=7:9, b=10:12), hier.names=TRUE)

# Use a list argument
Abind(list(x=x, y=x), along=3)

# Use lapply(..., get) to get the objects
an <- c('x', 'y')
names(an) <- an
Abind(lapply(an, get), along=3)
Compactly display the content and structure of a data.frame, including variable labels. str() is optimized for lists and its output is relatively technical when it comes to e.g. attributes. summary() on the other hand already calculates some basic statistics.
Abstract(x, sep = ", ", zero.form = ".", maxlevels = 5,
         trunc = TRUE, list.len = 999)

## S3 method for class 'abstract'
print(x, sep = NULL, width = NULL, trunc = NULL, print.gap = 2, ...)
x | a data.frame |
sep | the separator for concatenating the levels of a factor |
zero.form | a symbol to be used when a variable has zero NAs. |
maxlevels | (integer, default 5) maximal number of factor levels to be displayed. |
trunc | logical, defining if level names exceeding the column width should be truncated. Default is TRUE. |
list.len | numeric; maximum number of list elements to display. |
width | Console width. |
print.gap | (integer) Number of spaces between columns. |
... | Further arguments. |
The levels of a factor and describing variable labels (as created by Label()) will be wrapped within the columns. The first 4 columns are printed with the needed fixed width, the last 2 (Levels and Labels) are wrapped within the column. The width is calculated depending on the width of the screen as given by getOption("width").
ToWrd has an interface for the class abstract.
An object of class abstract, essentially a character matrix with 5 or 6 columns containing:
a column number (Nr),
the name of the column (ColName),
the column class (Class),
the number of NAs (NAs),
the levels if the variable is a factor (Levels),
(if there are any) descriptive labels for the column (Labels).
Andri Signorell [email protected]
utils::str(), base::summary(), ColumnWrap(), Desc()

Other statistical summary functions: Desc()
d.mydata <- d.pizza

# let's use some labels
Label(d.mydata) <- "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam."
Label(d.mydata$temperature) <- "Amet, consetetur sadipscing elitr, sed diam nonumy "

Abstract(d.mydata)
Clueless adding of a number of months to a date will in some cases lead to invalid dates; think of e.g. 2012-01-30 + 1 month. AddMonths ensures that the result is always a valid date, e.g. as.Date("2013-01-31") + 1 month will be "2013-02-28". If the number n is negative, the months will be subtracted.
AddMonths(x, n, ...)
x | a Date object (or something which can be coerced by as.Date() to a date) |
n | the number of months to be added. If n is negative the months will be subtracted. |
... | the dots are passed to as.Date(). |
All parameters will be recycled if necessary.
A vector of class Date with the same dimension as x, containing the transformed dates.
Andri Signorell <[email protected]>, based on code by Roland Rapold and Antonio
Thanks to Antonio: https://stackoverflow.com/questions/14169620/add-a-month-to-a-date
as.ym; Date functions: Year, Month, etc.
# characters will be coerced to Date
AddMonths("2013-01-31", 1)

# negative n
AddMonths(as.Date("2013-03-31"), -1)

# Arguments will be recycled
# (with warning if the longer is not a multiple of length of shorter)
AddMonths(c("2013-01-31", "2013-03-31", "2013-10-31", "2013-12-31"), c(1, -1))

x <- as.POSIXct(c("2015-01-31", "2015-08-31"))
n <- c(1, 3)
AddMonths(x, n)

# mind the origin if x supplied as numeric ...
x <- as.numeric(as.Date(x))
AddMonths(x, n, origin=as.Date("1970-01-01"))
Computes raw simple and extended percentage agreement among raters.
Agree(x, tolerance = 0, na.rm = FALSE)
x | a data.frame or a matrix of ratings |
tolerance | number of successive rating categories that should be regarded as rater agreement (see details). |
na.rm | logical, indicating whether NA values should be stripped before the computation proceeds. |
Using extended percentage agreement (tolerance != 0) is only possible for numerical values. If tolerance equals 1, for example, raters differing by one scale degree are interpreted as agreeing.
numeric value of coefficient of interrater reliability
The number of finally (potentially after omitting missing values) used subjects and raters are returned as attributes:
subjects | the number of subjects examined. |
raters | the number of raters. |
Matthias Gamer <[email protected]>,
some editorial amendments Andri Signorell <[email protected]>
categ <- c("V", "N", "P")
lvls  <- factor(categ, levels=categ)
rtr1  <- rep(lvls, c(60, 30, 10))
rtr2  <- rep(rep(lvls, nlevels(lvls)), c(53,5,2, 11,14,5, 1,6,3))
rtr3  <- rep(rep(lvls, nlevels(lvls)), c(48,8,3, 15,10,7, 3,4,2))

# Simple percentage agreement
Agree(cbind(rtr1, rtr2))

# can be a data.frame
Agree(data.frame(rtr1, rtr2))

# Simple percentage agreement
Agree(cbind(rtr1, rtr2, rtr3))

# Extended percentage agreement
Agree(cbind(rtr1, rtr2), 1)
The function duplicated returns a logical vector indicating which elements of x are duplicates, but it will not include the very first appearance of subsequently duplicated elements. AllDuplicated returns an index vector of ALL the values in x which are involved in ties. So !AllDuplicated can be used to determine all elements of x which appear exactly once (thus with frequency 1).
AllDuplicated(x)
x | vector of any type. |
logical vector of the same dimension as x.
Andri Signorell <[email protected]>
unique returns a unique list of all values in x
duplicated returns an index vector flagging all elements which appeared more than once (leaving out the first appearance!)
union(A, B) returns a list with the unique values from A and B
intersect(A, B) returns all elements which appear in A and in B
setdiff(A, B) returns all elements appearing in A but not in B
setequal(A, B) returns TRUE if A contains exactly the same elements as B
split(A, A) returns a list with all the tied values in A (see examples)
x <- c(1:10, 4:6)
AllDuplicated(x)

# compare to:
duplicated(x)

x[!AllDuplicated(x)]

# union, intersect and friends...
A <- c(sort(sample(1:20, 9)), NA)
B <- c(sort(sample(3:23, 7)), NA)

# all elements from A and B (no duplicates)
union(A, B)

# all elements appearing in A and in B
intersect(A, B)

# elements in A, but not in B
setdiff(A, B)

# elements in B, but not in A
setdiff(B, A)

# Does A contain the same elements as B?
setequal(A, B)

# Find ties in a vector x
x <- sample(letters[1:10], 20, replace=TRUE)
ties <- split(x, x)

# count tied groups
sum(sapply(ties, length) > 1)

# length of tied groups
(x <- sapply(ties, length))[x > 1]

# by means of table
tab <- table(x)
tab[tab > 1]

# count elements involved in ties
sum(tab > 1)

# count tied groups
sum(tab[tab > 1])
The function identical() is the safe and reliable way to test two objects for being exactly equal. But it is restricted to the comparison of two objects. AllIdentical() allows the input of multiple objects and returns TRUE in the case that all of them are exactly equal, and FALSE in every other case.
AllIdentical(...)
... |
any |
The function checks the first object against all others, so if the first object is identical to the second and to the third, then the second and the third are also identical. (If A=B and A=C, then B=C.)
Andri Signorell <[email protected]>
A <- LETTERS[1:5]
B <- LETTERS[1:5]
C <- LETTERS[1:5]
D <- LETTERS[1:5]
E <- factor(LETTERS[1:5])

# all ok
AllIdentical(A, B, C, D)

# at least one odd man
AllIdentical(A, B, C, D, E)
Performs the Anderson-Darling test of goodness-of-fit to a specified continuous univariate probability distribution.
AndersonDarlingTest(x, null = "punif", ..., nullname)
x | numeric vector of data values. |
null | a function, or a character string giving the name of a function, to compute the cumulative distribution function for the null distribution. |
... | additional arguments for the cumulative distribution function. |
nullname | optional character string describing the null distribution. |
This command performs the Anderson-Darling test of goodness-of-fit to the distribution specified by the argument null. It is assumed that the values in x are independent and identically distributed random values, with some cumulative distribution function F. The null hypothesis is that F is the function specified by the argument null, while the alternative hypothesis is that F is some other function.
The procedures currently implemented are for the case of a SIMPLE null hypothesis, that is, where all the parameters of the distribution are known. Note that other packages such as 'normtest' support the test of a COMPOSITE null hypothesis where some or all of the parameters are unknown leading to different results concerning the test statistic and the p-value. Thus in 'normtest' you can test whether the data come from a normal distribution with some mean and variance (which will be estimated from the same data).
The discrepancies can be large if you don't have a lot of data (say less than 1000 observations).
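The following is a minimal sketch (not part of the original examples) illustrating this point: with a SIMPLE null hypothesis the parameters must be fixed in advance. Plugging in estimates obtained from the same data, as in the second call, is not a proper composite-null test, and its p-value should be interpreted with caution, especially for small samples.

set.seed(1)
x <- rnorm(50, mean=2, sd=1)

# parameters known in advance (simple null hypothesis)
AndersonDarlingTest(x, "pnorm", mean=2, sd=1)

# parameters estimated from the same data (not a proper composite-null test)
AndersonDarlingTest(x, "pnorm", mean=mean(x), sd=sd(x))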
An object of class "htest"
representing the result of
the hypothesis test.
Original C code by George Marsaglia and John Marsaglia. R interface by Adrian Baddeley.
Anderson, T.W. and Darling, D.A. (1952) Asymptotic theory of certain 'goodness-of-fit' criteria based on stochastic processes. Annals of Mathematical Statistics 23, 193–212.
Anderson, T.W. and Darling, D.A. (1954) A test of goodness of fit. Journal of the American Statistical Association 49, 765–769.
Marsaglia, G. and Marsaglia, J. (2004) Evaluating the Anderson-Darling Distribution. Journal of Statistical Software 9 (2), 1–5. February 2004. https://www.jstatsoft.org/v09/i02
shapiro.test
and all other tests for normality.
x <- rnorm(10, mean=2, sd=1)
AndersonDarlingTest(x, "pnorm", mean=2, sd=1)
Append elements to a number of different objects such as vectors, matrices, data.frames and lists. In a matrix either rows or columns can be inserted at any position. In data frames any vectors can be inserted. values
will be recycled to the necessary length.
Append(x, values, after = NULL, ...)

## S3 method for class 'matrix'
Append(x, values, after = NULL, rows = FALSE, names = NULL, ...)

## S3 method for class 'data.frame'
Append(x, values, after = NULL, rows = FALSE, names = NULL, ...)

## Default S3 method:
Append(x, values, after = NULL, ...)
x |
object for the elements to be inserted |
values |
the elements to be inserted |
after |
a subscript, after which the values are to be appended. If it's missing the values will be appended after the last element (or column/row). |
rows |
logical, defining if vector should be added as row or as column. Default is column ( |
names |
the dimension names for the inserted elements(s) |
... |
further arguments (not used here) |
The vector x
will be recycled to a length of the next multiple of the number of rows (or columns) of the matrix m
and will be inserted such that the first inserted row (column) has the index i
. If dimnames are given, they will be used regardless of whether the matrix m already has dimnames defined.
An object containing the values in x with the elements of values appended after the specified element of x.
Andri Signorell <[email protected]>
Append(1:5, 0:1, after = 3)   # the same as append

# Insert columns and rows
x <- matrix(runif(25), 5)

Append(x, values=1:10, after=2, names = c("X","Y"))
Append(x, values=1:10, after=2)

Append(x, values=1:10, after=2, names = c("X","Y"))
Append(x, values=1:10, after=2)

# append to a data.frame
d.frm <- data.frame("id"   = c(1,2,3),
                    "code" = c("AAA", "BBB", "CCC"),
                    "val"  = c(111, 222, 333))
z <- c(10, 20, 30)

Append(d.frm, z, after=2, names="ZZZ")
Append rownames to a data.frame as first column.
AppendRowNames(x, names = "rownames", after = 0, remove_rownames = TRUE)
x |
a data.frame |
names |
the name of the new inserted column containing the rownames. |
after |
a subscript, after which the values are to be appended. If missing the rownames will be inserted as first column. |
remove_rownames |
logical defining if the existing rownames should be removed.
Default is |
the object x with appended rownames
Andri Signorell <[email protected]>
(dd <- data.frame(x=1:5, y=6:10, z=LETTERS[1:5], row.names = letters[1:5]))
AppendRowNames(dd)
This function converts an xtabs
object to a matrix
.
## S3 method for class 'xtabs'
as.matrix(x, ...)
x |
an object of class |
... |
additional arguments to be passed to or from methods. |
An xtabs
object is indeed already a matrix, but it won't be converted to a pure matrix by the as.matrix.default function, as its class definition will remain unchanged. Some functions expecting a pure matrix may fail when fed with an xtabs object.
as.matrix.xtabs
will drop the classes and the call attribute.
Note that unclass
would as well discard the classes xtabs
and table
, but retain the "call"
attribute.
Andri Signorell <[email protected]>
tab <- xtabs( ~ driver + operator, data=d.pizza)

str(tab)
class(tab)

str(as.matrix(tab))
class(as.matrix(tab))
The representation of year and month information as an integer in YYYYMM format is often handy and an efficient data structure. Adding a number of months to such a date is not straightforward, however, since the date structure must be retained. For example, 201201 - 2 [months] is expected to result in 201111 instead of 201199. The ym method of AddMonths does this job.
as.ym(x)

## S3 method for class 'ym'
as.Date(x, d = 1, ...)

## S3 method for class 'ym'
AddMonths(x, n, ...)
x |
a vector of integers, representing the dates in the format YYYYMM, to which a number of months has to be added. YYYY must lie in the range of 1000-3000, MM in 1-12. |
d |
the day to be used for converting a yearmonth to a date. Default is 1. |
n |
the number of months to be added. If n is negative the months will be subtracted. |
... |
further arguments (not used here). |
All parameters will be recycled if necessary. The internally used function mapply
will display a warning if the length of the longer argument is not a multiple of the length of the shorter one.
a vector of class integer
with the same dimension as x, containing the transformed dates.
Andri Signorell <[email protected]>, originally based on code by Roland Rapold
AddMonths
; Date functions, like Year
, Month
, etc.
Month(as.ym(202408))
Year(as.ym(202408))

Year(as.Date("2024-12-05"))
Year(as.ym(202412))

Month(as.Date("2024-12-05"), fmt = "mm")
Month(as.ym(202412), fmt="mm")

AddMonths(201511, 5)
AddMonths(c(201511, 201302), c(5, 15))
AddMonths(c(201511, 201302), c(5, -4))
AscToChar returns a character for each ASCII code (integer) supplied.
CharToAsc returns integer codes in 0:255
for each (one byte) character in all strings in x
.
AscToChar(i)
CharToAsc(x)
i |
numeric (integer) vector of values in |
x |
vector of strings. |
Only codes in 1:127 make up the ASCII encoding, which should be identical for all R versions, whereas the 'upper' half is often determined from the ISO-8859-1 (aka "ISO-Latin 1") encoding, but may well differ depending on the locale setting, see also Sys.setlocale.
Note that 0
is no longer allowed, since R does not allow
\0
aka nul
characters in a string anymore.
AscToChar
returns a vector of the same length as i.
CharToAsc
returns a list of numeric vectors of character length of each string in x.
unknown guy out there, help text partly taken from M. Maechler's sfsmisc.
(x <- CharToAsc("Silvia"))

# will be pasted together
AscToChar(x)

# use strsplit if the single characters are needed
strsplit(AscToChar(x), split=NULL)

# this would be an alternative, but the latter would be of class raw
DecToHex(CharToAsc("Silvia"))
charToRaw("Silvia")
Returns the aspect ratio of the current plot in user coordinates.
Asp()
The aspect ratio of the plot is calculated as
w <- par("pin")[1] / diff(par("usr")[1:2])
h <- par("pin")[2] / diff(par("usr")[3:4])
asp <- w/h
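As a hedged illustration (not part of the original examples), the returned ratio can be used to draw a vertical segment that appears on the device just as long as a horizontal segment of one user unit:

plot(1:10, type="n")
segments(2, 5, 3, 5)          # 1 user unit in x
segments(2, 5, 2, 5 + Asp())  # visually the same length in y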
Andri Signorell <[email protected]>
Asp()
Calculate Cramer's V, Pearson's contingency coefficient and phi, Yule's Q and Y and Tschuprow's T of x, if x is a table. If both x and y are given, then the corresponding table will be built first.
Phi(x, y = NULL, ...)
ContCoef(x, y = NULL, correct = FALSE, ...)
CramerV(x, y = NULL, conf.level = NA,
        method = c("ncchisq", "ncchisqadj", "fisher", "fisheradj"),
        correct = FALSE, ...)
YuleQ(x, y = NULL, ...)
YuleY(x, y = NULL, ...)
TschuprowT(x, y = NULL, correct = FALSE, ...)
x |
can be a numeric vector, a matrix or a table. |
y |
NULL (default) or a vector with compatible dimensions to |
conf.level |
confidence level of the interval. This is only implemented for Cramer's V. If set to |
method |
string defining the method to calculate confidence intervals for Cramer's V. One out of |
correct |
logical. Applying to |
... |
further arguments are passed to the function |
For x either a matrix or two vectors x
and y
are expected. In latter case table(x, y, ...)
is calculated.
The function handles NAs
the same way the table
function does, so tables are by default calculated with NAs
omitted.
A provided matrix is interpreted as a contingency table, which in the case of frequency data seems to be the natural interpretation
(this is e.g. also what chisq.test
expects).
Use the function PairApply
(pairwise apply) if the measure should be calculated pairwise for all columns.
This allows matrices of association measures to be calculated the same way cor
does. NAs
are by default omitted pairwise,
which corresponds to the pairwise.complete
option of cor
.
Use complete.cases
, if only the complete cases of a data.frame
are to be used. (see examples)
The maximum value for Phi is \sqrt{\min(r, c) - 1}. The contingency coefficient goes from 0 to \sqrt{\frac{\min(r, c) - 1}{\min(r, c)}}, where r and c are the number of rows and columns of the table. For the corrected contingency coefficient and for Cramer's V the range is 0 to 1.
A Cramer's V in the range of [0, 0.3] is considered as weak, [0.3,0.7] as medium and > 0.7 as strong.
The minimum value for all is 0 under statistical independence.
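A quick numerical check of these bounds (a sketch, not part of the original examples) for a 2x2 table with perfect association, where min(r, c) - 1 = 1:

tab <- as.table(matrix(c(10, 0, 0, 10), nrow=2))
Phi(tab)       # 1, i.e. sqrt(min(r, c) - 1)
ContCoef(tab)  # about 0.707, i.e. sqrt((min(r, c) - 1) / min(r, c))
CramerV(tab)   # 1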
a single numeric value if no confidence intervals are requested,
and otherwise a numeric vector with 3 elements for the estimate, the lower and the upper confidence interval
Andri Signorell <[email protected]>,
Michael Smithson <[email protected]> (confidence intervals for Cramer V)
Yule, G. Uday (1912) On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, LXXV, 579-652
Tschuprow, A. A. (1939) Principles of the Mathematical Theory of Correlation, translated by M. Kantorowitsch. W. Hodge & Co.
Cramer, H. (1946) Mathematical Methods of Statistics. Princeton University Press
Agresti, Alan (1996) Introduction to categorical data analysis. NY: John Wiley and Sons
Sakoda, J.M. (1977) Measures of Association for Multivariate Contingency Tables, Proceedings of the Social Statistics Section of the American Statistical Association (Part III), 777-780.
Smithson, M.J. (2003) Confidence Intervals, Quantitative Applications in the Social Sciences Series, No. 140. Thousand Oaks, CA: Sage. pp. 39-41
Bergsma, W. (2013) A bias-correction for Cramer's V and Tschuprow's T Journal of the Korean Statistical Society 42(3) DOI: 10.1016/j.jkss.2012.10.002
table
, PlotCorr
, PairApply
, Assocs
tab <- table(d.pizza$driver, d.pizza$wine_delivered)

Phi(tab)
ContCoef(tab)
CramerV(tab)
TschuprowT(tab)

# just x and y
CramerV(d.pizza$driver, d.pizza$wine_delivered)

# data.frame
PairApply(d.pizza[,c("driver","operator","area")], CramerV, symmetric = TRUE)

# useNA is passed to table
PairApply(d.pizza[,c("driver","operator","area")], CramerV,
          useNA="ifany", symmetric = TRUE)

d.frm <- d.pizza[,c("driver","operator","area")]
PairApply(d.frm[complete.cases(d.frm),], CramerV, symmetric = TRUE)

m <- as.table(matrix(c(2,4,1,7), nrow=2))
YuleQ(m)
YuleY(m)

# Bootstrap confidence intervals for Cramer's V
# http://support.sas.com/documentation/cdl/en/statugfreq/63124/PDF/default/statugfreq.pdf, p. 1821
tab <- as.table(rbind(
  c(26, 26, 23, 18,  9),
  c( 6,  7,  9, 14, 23)))

d.frm <- Untable(tab)

n <- 1000
idx <- matrix(sample(nrow(d.frm), size=nrow(d.frm) * n, replace=TRUE),
              ncol=n, byrow=FALSE)
v <- apply(idx, 2, function(x) CramerV(d.frm[x,1], d.frm[x,2]))
quantile(v, probs=c(0.025, 0.975))

# compare this to the analytical ones
CramerV(tab, conf.level=0.95)
Collects a number of association measures for nominal and ordinal data.
Assocs(x, conf.level = 0.95, verbose = NULL)

## S3 method for class 'Assocs'
print(x, digits = 4, ...)
x |
a 2 dimensional contingency table or a matrix. |
conf.level |
confidence level of the interval. If set to |
verbose |
integer out of |
digits |
integer which determines the number of digits used in formatting the measures of association. |
... |
further arguments to be passed to or from methods. |
This function wraps the association measures phi, contingency coefficient, Cramer's V, Goodman Kruskal's Gamma, Kendall's Tau-b, Stuart's Tau-c, Somers' Delta, Pearson and Spearman correlation, Guttman's Lambda, Theil's Uncertainty Coefficient and the mutual information.
numeric matrix, dimension [1:17, 1:3]
the first column contains the estimate, the second the lower confidence interval, the third the upper one.
Andri Signorell <[email protected]>
Phi
, ContCoef
, CramerV
, GoodmanKruskalGamma
, KendallTauB
, StuartTauC
,
SomersDelta
, SpearmanRho
, Lambda
, UncertCoef
, MutInf
options(scipen=8)

# Example taken from: SAS/STAT(R) 9.2 User's Guide, Second Edition, The FREQ Procedure
# http://support.sas.com/documentation/cdl/en/statugfreq/63124/PDF/default/statugfreq.pdf
# Hair-Eye-Color pp. 1816

tob <- as.table(matrix(c(
  69, 28, 68, 51,  6,
  69, 38, 55, 37,  0,
  90, 47, 94, 94, 16
), nrow=3, byrow=TRUE,
   dimnames=list(eye=c("blue","green","brown"),
                 hair=c("fair","red","medium","dark","black")) ))
Desc(tob)
Assocs(tob)

# Example taken from: http://www.math.wpi.edu/saspdf/stat/chap28.pdf
# pp. 1349

pain <- as.table(matrix(c(
  26,  6,
  26,  7,
  23,  9,
  18, 14,
   9, 23
), ncol=2, byrow=TRUE))

Desc(pain)
Assocs(pain)
The Atkinson index is an inequality measure and is useful in determining which end of the distribution contributed most to the observed inequality.
Atkinson(x, n = rep(1, length(x)), parameter = 0.5, na.rm = FALSE)
x |
a vector containing at least non-negative elements. |
n |
a vector of frequencies, must be same length as x. |
parameter |
parameter of the inequality measure (if set to |
na.rm |
logical. Should missing values be removed? Defaults to FALSE. |
the value of the Atkinson index.
This function was previously published as ineq()
in the ineq package and has been
integrated here without logical changes, but with some extensions for NA
-handling and the use of weights.
Achim Zeileis <[email protected]>
Cowell, F. A. (2000) Measurement of Inequality in Atkinson, A. B. / Bourguignon, F. (Eds): Handbook of Income Distribution. Amsterdam.
Cowell, F. A. (1995) Measuring Inequality Harvester Wheatshef: Prentice Hall.
Marshall, Olkin (1979) Inequalities: Theory of Majorization and Its Applications. New York: Academic Press.
Atkinson, A. B. (1970): On the Measurement of Inequality, Journal of Economic Theory, Vol. 2(3), pp. 244-263.
See Herfindahl
, Rosenbluth
for concentration measures and
ineq()
in the package ineq for additional inequality measures
# generate vector (of incomes)
x <- c(541, 1463, 2445, 3438, 4437, 5401, 6392, 8304, 11904, 22261)

# compute Atkinson coefficient with parameter=1
Atkinson(x, parameter=1)
Calculate the area under the curve with a naive algorithm and with a more elaborate spline approach. The curve must be given by vectors of xy-coordinates.
AUC(x, y, from = min(x, na.rm = TRUE), to = max(x, na.rm = TRUE),
    method = c("trapezoid", "step", "spline", "linear"),
    absolutearea = FALSE, subdivisions = 100, na.rm = FALSE, ...)
x , y
|
the xy-points of the curve |
method |
The type of interpolation. Can be |
from |
The value from where to start calculating the area under the curve. Defaults to the smallest x value. |
to |
The value from where to end the calculation of the area under the curve. Defaults to the greatest x value. |
absolutearea |
A logical value that determines if negative
areas should be added to the total area under the curve. By
default the auc function subtracts areas that have negative y
values. Set |
subdivisions |
an integer telling how many subdivisions should be used for integrate (for non-linear approximations). Ignored if |
na.rm |
logical, indicating whether |
... |
additional arguments passed on to approx (for linear approximations). In particular rule can be set to determine how values outside the range of x is handled. |
If method is set to "trapezoid"
then the curve is formed by connecting all points by a direct line (composite trapezoid rule). If "step"
is chosen then a stepwise connection of two points is used.
For linear interpolation the AUC()
function computes the area under the curve
using the composite trapezoid rule. For area under a spline interpolation,
AUC()
uses the splinefun function in combination with the integrate to
calculate a numerical integral.
The AUC()
function can handle unsorted time
values (by sorting x), missing observations, ties for the x values (by ignoring duplicates), and integrating over
part of the area or even outside the area.
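The following short sketch (not part of the original examples) illustrates the sorting behaviour stated above: the same three points supplied in a different order yield the same area.

AUC(x=c(3, 1, 2), y=c(4, 1, 2), method="trapezoid")
AUC(x=c(1, 2, 3), y=c(1, 2, 4), method="trapezoid")  # same points, sorted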
Numeric value of the area under the curve.
Andri Signorell <[email protected]>, spline part by Claus Ekstrom <[email protected]>
AUC(x=c(1,3), y=c(1,1))

AUC(x=c(1,2,3), y=c(1,2,4), method="trapezoid")
AUC(x=c(1,2,3), y=c(1,2,4), method="step")

plot(x=c(1,2,2.5), y=c(1,2,4), type="l", col="blue", ylim=c(0,4))
lines(x=c(1,2,2.5), y=c(1,2,4), type="s", col="red")

x <- seq(0, pi, length.out=200)
AUC(x=x, y=sin(x))
AUC(x=x, y=sin(x), method="spline")
Places a break mark on an axis on an existing plot.
AxisBreak(axis = 1, breakpos = NULL, pos = NA, bgcol = "white",
          breakcol = "black", style = "slash", brw = 0.02)
axis |
which axis to break. |
breakpos |
where to place the break in user units. |
pos |
position of the axis (see axis). |
bgcol |
the color of the plot background. |
breakcol |
the color of the "break" marker. |
style |
Either ‘gap’, ‘slash’ or ‘zigzag’. |
brw |
break width relative to plot width. |
The ‘pos’ argument is not needed unless the user has specified a different position from the default for the axis to be broken.
There is some controversy about the propriety of using discontinuous
coordinates for plotting, and thus axis breaks. Discontinuous coordinates
allow widely separated groups of values or outliers to appear without
devoting too much of the plot to empty space.
The major objection seems
to be that the reader will be misled by assuming continuous coordinates.
The ‘gap’ style that clearly separates the two sections of the plot
is probably best for avoiding this.
Jim Lemon and Ben Bolker
plot(3:10, main="Axis break test")

# put a break at the default axis and position
AxisBreak()
AxisBreak(2, 2.9, style="zigzag")
Compute pretty tickmark locations, the same way as R does internally. By default, gives the at values which axis.POSIXct(side, x) would use.
axTicks.POSIXct(side, x, at, format, labels = TRUE, ...)

axTicks.Date(side = 1, x, ...)
side |
See axis. |
x , at
|
A date-time or date object. |
format |
See strptime. |
labels |
Either a logical value specifying whether annotations are to be made at the tickmarks, or a vector of character strings to be placed at the tickpoints. |
... |
Further arguments to be passed from or to other methods. |
axTicks has no implementation for a POSIXct axis. This function fills the gap.
numeric vector of coordinate values at which axis tickmarks can be drawn.
Andri Signorell <[email protected]> simply copying R-Core code
with(beaver1, {
  time <- strptime(paste(1990, day, time %/% 100, time %% 100),
                   "%Y %j %H %M")
  plot(time, temp, type = "l")   # axis at 4-hour intervals

  # now label every hour on the time axis
  plot(time, temp, type = "l", xaxt = "n")
  r <- as.POSIXct(round(range(time), "hours"))
  axis.POSIXct(1, at = seq(r[1], r[2], by = "hour"), format = "%H")

  # place the grid
  abline(v=axTicks.POSIXct(1, at = seq(r[1], r[2], by = "hour"), format = "%H"),
         col="grey", lty="dotted")
})
Barnard's unconditional test for superiority applied to contingency tables
using Score or Wald statistics for the difference between two binomial proportions.
BarnardTest(x, y = NULL, alternative = c("two.sided", "less", "greater"),
            method = c("csm", "csm approximate", "z-pooled", "z-unpooled",
                       "boschloo", "santner and snell"),
            fixed = 1, useStoredCSM = FALSE, ...)
x |
a numeric vector or a two-dimensional contingency table in matrix form. |
y |
a factor object; ignored if |
alternative |
a character string specifying the alternative
hypothesis, must be one of |
method |
Indicates the method for finding the more extreme tables: must be either |
fixed |
indicates which margins are fixed. |
useStoredCSM |
logical, use a stored ordering matrix for the CSM test to greatly reduce the computation time (default is |
... |
the dots are passed on to the |
There are two fundamentally different exact tests for comparing the equality of two binomial probabilities - Fisher's exact test (Fisher, 1925), and Barnard's exact test (Barnard, 1945). Fisher's exact test (Fisher, 1925) is the more popular of the two. In fact, Fisher was bitterly critical of Barnard's proposal for esoteric reasons that we will not go into here. For 2 x 2 tables, Barnard's test is more powerful than Fisher's, as Barnard noted in his 1945 paper, much to Fisher's chagrin. Anyway, perhaps due to its computational difficulty Barnard's test is not widely used. (Mehta et al., 2003)
Unconditional exact tests can be performed for binomial or multinomial models. The binomial model assumes the row or column margins (but not both) are known in advance, while the multinomial model assumes only the total sample size is known beforehand. For the binomial model, the user needs to specify which margin is fixed (default is rows). Conditional tests (e.g., Fisher's exact test) have both row and column margins fixed, but this is a very uncommon design. (See Calhoun (2019) for more details.)
If x
is a matrix, it is taken as a two-dimensional contingency
table, and hence its entries should be nonnegative integers.
Otherwise, both x
and y
must be vectors of the same
length. Incomplete cases are removed, the vectors are coerced into
factor objects, and the contingency table is computed from these.
For a 2x2 contingency table, such as

X = \begin{pmatrix} n_1 & n_2 \\ n_3 & n_4 \end{pmatrix},

the normalized difference in proportions between the two categories, given in each column, can be written with pooled variance (Score statistic) as

T(X) = \frac{\hat{p}_2 - \hat{p}_1}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{c_1}+\frac{1}{c_2}\right)}},

where \hat{p} = (n_1 + n_3)/(c_1 + c_2), \hat{p}_1 = n_1/c_1, \hat{p}_2 = n_2/c_2, c_1 = n_1 + n_3 and c_2 = n_2 + n_4. Alternatively, with unpooled variance (Wald statistic), the difference in proportions can be written as

T(X) = \frac{\hat{p}_2 - \hat{p}_1}{\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{c_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{c_2}}}.

The probability of observing X is

P(X) = \frac{c_1!\, c_2!}{n_1!\, n_2!\, n_3!\, n_4!}\; p^{n_1+n_2} (1-p)^{n_3+n_4},

where p is the unknown nuisance parameter.

Barnard's test considers all tables with category sizes c_1 and c_2 for a given p. The p-value is the sum of probabilities of the tables having a score in the rejection region, e.g. having significantly large difference in proportions for a two-sided test. The p-value of the test is the maximum p-value calculated over all p between 0 and 1.
If useStoredCSM
is set to TRUE
a companion data package called ExactData must be installed from GitHub.
The author states: "The CSM test is computationally intensive due to iteratively maximizing the p-value calculation to order the tables. The CSM ordering matrix has been stored for all possible sample sizes less than or equal to 100 (i.e., max(n1,n2)<=100). Thus, using the useStoredCSM = TRUE can greatly improve computation time. However, the stored ordering matrix was computed with npNumbers=100 and it is possible that the ordering matrix was not optimal for larger npNumbers. Increasing npNumbers and setting useStoredCSM = FALSE ensures the p-value is correctly calculated at the expense of significantly greater computation time. The stored ordering matrix is not used in the calculation of confidence intervals or non-inferiority tests, so CSM can still be very computationally intensive."
A list with class "htest"
containing the following components:
p.value |
the p-value of the test. |
estimate |
an estimate of the nuisance parameter where the p-value is maximized. |
alternative |
a character string describing the alternative hypothesis. |
method |
the character string
|
data.name |
a character string giving the names of the data. |
statistic.table |
The contingency tables considered in the analysis represented by |
nuisance.matrix |
Nuisance parameters, |
Peter Calhoun <[email protected]>, Andri Signorell <[email protected]> (interface)
Barnard, G.A. (1945) A new test for 2x2 tables. Nature, 156:177.
Barnard, G.A. (1947) Significance tests for 2x2 tables. Biometrika, 34:123-138.
Suissa, S. and Shuster, J. J. (1985), Exact Unconditional Sample Sizes for the 2x2 Binomial Trial, Journal of the Royal Statistical Society, Ser. A, 148, 317-327.
Cardillo G. (2009) MyBarnard: a very compact routine for Barnard's exact test on 2x2 matrix. https://ch.mathworks.com/matlabcentral/fileexchange/25760-mybarnard
Galili T. (2010) https://www.r-statistics.com/2010/02/barnards-exact-test-a-powerful-alternative-for-fishers-exact-test-implemented-in-r/
Lin C.Y., Yang M.C. (2009) Improved p-value tests for comparing two independent binomial proportions. Communications in Statistics-Simulation and Computation, 38(1):78-91.
Trujillo-Ortiz, A., R. Hernandez-Walls, A. Castro-Perez, L. Rodriguez-Cardozo N.A. Ramos-Delgado and R. Garcia-Sanchez. (2004). Barnardextest:Barnard's Exact Probability Test. A MATLAB file. [WWW document]. https://www.mathworks.com/
Mehta, C.R., Senchaudhuri, P. (2003) Conditional versus unconditional exact tests for comparing two binomials. https://www.researchgate.net/publication/242179503_Conditional_versus_Unconditional_Exact_Tests_for_Comparing_Two_Binomials
Calhoun, P. (2019) Exact: Unconditional Exact Test. R package version
2.0.
https://CRAN.R-project.org/package=Exact
tab <- as.table(matrix(c(8, 14, 1, 3), nrow=2,
                dimnames=list(treat=c("I","II"), out=c("I","II"))))
BarnardTest(tab)

# Plotting the search for the nuisance parameter for a one-sided test
bt <- BarnardTest(tab)

# Plotting the tables included in the p-value
ttab <- as.table(matrix(c(40, 14, 10, 30), nrow=2,
                 dimnames=list(treat=c("I","II"), out=c("I","II"))))

bt <- BarnardTest(ttab)
bts <- bt$statistic.table

# Mehta et. al (2003)
tab <- as.table(matrix(c(7, 12, 8, 3), nrow=2,
                dimnames=list(treat=c("vaccine","placebo"),
                              infection=c("yes","no"))))
BarnardTest(tab, alternative="less")
Performs the Bartels rank test of randomness, which tests if a sample is sampled randomly from an underlying population. Data must at least be measured on an ordinal scale.
BartelsRankTest(x, alternative = c("two.sided", "trend", "oscillation"),
                method = c("normal", "beta", "auto"))
x |
a numeric vector containing the observations |
alternative |
a character string specifying the alternative hypothesis, must be one of " |
method |
a character string specifying the method used to compute the p-value. Must be one of |
The RVN test statistic is

RVN = \frac{\sum_{i=1}^{n-1} (R_i - R_{i+1})^2}{\sum_{i=1}^{n} \left(R_i - \frac{n+1}{2}\right)^2},

where R_i = \mathrm{rank}(X_i), i = 1, \dots, n. It is known that (RVN - 2)/\sigma is asymptotically standard normal, where

\sigma^2 = \frac{4 (n-2) (5n^2 - 2n - 9)}{5 n (n+1) (n-1)^2}.
By using the alternative "trend
" the null hypothesis of randomness is tested against a trend. By using the alternative "oscillation
" the null hypothesis of randomness is tested against a systematic oscillation.
Missing values are silently removed.
Bartels test is a rank version of von Neumann's test.
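As a hedged cross-check (not part of the original examples), the RVN statistic defined above can be computed directly from the ranks and compared with the rvn component documented below:

x <- c(528, 348, 264, -20, -167, 575, 410, -4, 430, -122)
r <- rank(x)
n <- length(r)
sum(diff(r)^2) / sum((r - (n + 1)/2)^2)  # RVN computed by hand
BartelsRankTest(x)$rvn                   # should match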
A list with class "htest" containing the components:
statistic |
the value of the normalized statistic test. |
parameter , n
|
the size of the data, after removal of consecutive duplicate values. |
p.value |
the p-value of the test. |
alternative |
a character string describing the alternative hypothesis. |
method |
a character string indicating the test performed. |
data.name |
a character string giving the name of the data. |
rvn |
the value of the RVN statistic (not shown on screen). |
nm |
the value of the NM statistic, the numerator of RVN (not shown on screen). |
mu |
the mean value of the RVN statistic (not shown on screen). |
var |
the variance of the RVN statistic (not shown on screen). |
Frederico Caeiro <[email protected]>
Bartels, R. (1982) The Rank Version of von Neumann's Ratio Test for Randomness, Journal of the American Statistical Association, 77 (377), 40-46.
Gibbons, J.D. and Chakraborti, S. (2003) Nonparametric Statistical Inference, 4th ed. (pp. 97-98). URL: http://books.google.pt/books?id=dPhtioXwI9cC&lpg=PA97&ots=ZGaQCmuEUq
von Neumann, J. (1941) Distribution of the ratio of the mean square successive difference to the variance. Annals of Mathematical Statistics 12, 367-395.
## Example 5.1 in Gibbons and Chakraborti (2003), p.98.
## Annual data on total number of tourists to the United States for 1970-1982.

years <- 1970:1982
tourists <- c(12362, 12739, 13057, 13955, 14123, 15698, 17523, 18610,
              19842, 20310, 22500, 23080, 21916)
plot(years, tourists, pch=20)

BartelsRankTest(tourists, alternative="trend", method="beta")

#  Bartels Ratio Test
#
#  data:  tourists
#  statistic = -3.6453, n = 13, p-value = 1.21e-08
#  alternative hypothesis: trend

## Example in Bartels (1982).
## Changes in stock levels for 1968-1969 to 1977-1978 (in $A million), deflated by the
## Australian gross domestic product (GDP) price index (base 1966-1967).

x <- c(528, 348, 264, -20, -167, 575, 410, -4, 430, -122)

BartelsRankTest(x, method="beta")
It can sometimes make sense to display data values directly on the bars in a barplot. There are a handful of obvious alternatives for placing the labels, either on top of the bars, right below the upper end, in the middle or at the bottom. Determining the required geometry - although not difficult - is cumbersome and the code is distractingly long within analysis code. The present function offers a short way to solve the task. It can place text either in the middle of the stacked bars, on top or at the bottom of a barplot (side by side or stacked).
BarText(height, b, labels = height, beside = FALSE, horiz = FALSE,
        cex = par("cex"), adj = NULL,
        pos = c("topout", "topin", "mid", "bottomin", "bottomout"),
        offset = 0, ...)
height |
either a vector or matrix of values describing the bars which make up the plot exactly as used for creating the barplot. |
b |
the returned mid points as returned by |
labels |
the labels to be placed on the bars. |
beside |
a logical value. If |
horiz |
a logical value. If |
cex |
numeric character expansion factor; multiplied by |
adj |
one or two values in [0, 1] which specify the x (and optionally y) adjustment of the labels. On most devices values outside that interval will also work. |
pos |
one of |
offset |
a vector indicating how much the bars should be shifted relative to the x axis. |
... |
the dots are passed to the |
The x coordinates of the labels can be found by using the barplot()
result, if they are to be centered at the top of each bar. BarText()
calculates the rest.
Notice that when the labels are placed on top of the bars, they may be clipped. This can be avoided by setting xpd=TRUE
.
returns the geometry of the labels invisibly
Andri Signorell <[email protected]>
# simple vector
x <- c(353, 44, 56, 34)
b <- barplot(x)
BarText(x, b, x)

# more complicated
b <- barplot(VADeaths, horiz = FALSE, col=DescTools::hblue, beside = TRUE)
BarText(VADeaths, b=b, horiz = FALSE, beside = TRUE, cex=0.8)
BarText(VADeaths, b=b, horiz = FALSE, beside = TRUE, cex=0.8,
        pos="bottomin", col="white", font=2)

b <- barplot(VADeaths, horiz = TRUE, col=DescTools::hblue, beside = TRUE)
BarText(VADeaths, b=b, horiz = TRUE, beside = TRUE, cex=0.8)

b <- barplot(VADeaths)
BarText(VADeaths, b=b)

b <- barplot(VADeaths, horiz = TRUE)
BarText(VADeaths, b=b, horiz = TRUE, col="red", cex=1.5)

# position of the text
old <- par(mfrow=c(3,2), xpd=NA)
off <- c(10, 4, 1, 20, -15)
for(pos in eval(formals(BarText)$pos)) {
  b <- barplot(x, offset=off, main=gettextf("Textposition pos = '%s'", pos), horiz=TRUE)
  abline(h=0)
  BarText(x, b, x, offset = off, pos=pos, cex=1.5, horiz=TRUE)
}
par(old)
These functions convert numbers from one base to another. There are several solutions for this problem out there, but the naming is quite heterogeneous and so consistent function names might be helpful.
BinToDec(x)
DecToBin(x)

OctToDec(x)
DecToOct(x)

HexToDec(x)
DecToHex(x)
x |
a vector of numbers, resp. alphanumerical representation of numbers (hex), to be converted. |
BinToDec converts numbers from binary mode into decimal values. DecToBin does it the other way round.
Oct means octal system and hex hexadecimal.
A numeric or character vector of the same length as x containing the converted values.
Binary, octal and decimal values are numeric, hex-values are returned as class hexmode
.
Andri Signorell <[email protected]>
DecToBin(c(17, 25))
BinToDec(c(101, 11101))

DecToOct(c(17, 25))
OctToDec(c(11, 77))

DecToHex(c(17, 25))
HexToDec(c("FF", "AA", "ABC"))
Density, distribution function, quantile function, and random generation for Benford's distribution.
dBenf(x, ndigits = 1, log = FALSE)
pBenf(q, ndigits = 1, log.p = FALSE)
qBenf(p, ndigits = 1)
rBenf(n, ndigits = 1)
x , q
|
Vector of quantiles.
See |
p |
vector of probabilities. |
n |
number of observations. A single positive integer.
Else if |
ndigits |
Number of leading digits, either 1 or 2. If 1 then the support of the distribution is {1,...,9}, else {10,...,99}. |
log , log.p
|
Logical.
If |
Benford's Law (aka the significant-digit law) is the empirical observation that in many naturally occurring tables of numerical data, the leading significant (nonzero) digit is not uniformly distributed in \{1, 2, \dots, 9\}. Instead, the leading significant digit (D, say) obeys the law

P(D = d) = \log_{10}\left(1 + \frac{1}{d}\right)

for d = 1, \dots, 9. This means the probability the first significant digit is 1 is approximately \log_{10}(2) = 0.301, etc.
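A one-line check of the stated law (a sketch, not part of the original examples):

d <- 1:9
cbind(dBenf(d), log10(1 + 1/d))   # the two columns agree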
Benford's Law was apparently first discovered in 1881 by astronomer/mathematician S. Newcomb. It started with the observation that the pages of a book of logarithms were dirtiest at the beginning and progressively cleaner throughout. In 1938, a General Electric physicist called F. Benford rediscovered the law based on this same observation. Over several years he collected data from sources as different as atomic weights, baseball statistics, numerical data from Reader's Digest, and drainage areas of rivers.
Applications of Benford's Law have been as diverse as fraud detection in accounting and the design of computers.
dBenf
gives the density,
pBenf
gives the distribution function, and
qBenf
gives the quantile function, and
rBenf
generates random deviates.
T. W. Yee
These functions were previously published as dbenf()
etc. in the VGAM package and have been
integrated here without logical changes.
Benford, F. (1938) The Law of Anomalous Numbers. Proceedings of the American Philosophical Society, 78, 551–572.
Newcomb, S. (1881) Note on the Frequency of Use of the Different Digits in Natural Numbers. American Journal of Mathematics, 4, 39–40.
dBenf(x <- c(0:10, NA, NaN, -Inf, Inf))
pBenf(x)

## Not run:
xx <- 1:9
barplot(dBenf(xx), col = "lightblue", las = 1,
        xlab = "Leading digit", ylab = "Probability",
        names.arg = as.character(xx),
        main = paste("Benford's distribution", sep = ""))

hist(rBenf(n = 1000), border = "blue", prob = TRUE,
     main = "1000 random variates from Benford's distribution",
     xlab = "Leading digit", sub="Red is the true probability",
     breaks = 0:9 + 0.5, ylim = c(0, 0.35), xlim = c(0, 10.0))
lines(xx, dBenf(xx), col = "red", type = "h")
points(xx, dBenf(xx), col = "red")

## End(Not run)
The between and outside operators are used to check whether the values of a vector x lie within a defined range (or outside of it, respectively). The values can be numbers, text or dates. Ordered factors are supported.
x %()% rng
x %(]% rng
x %[)% rng
x %[]% rng

x %][% rng
x %](% rng
x %)[% rng
x %)(% rng

x %:% rng
x %::% rng
x |
is a variable with at least ordinal scale, usually a numeric value, but can be an ordered factor or a text as well. Texts would be treated alphabetically. |
rng |
a vector of two values or a matrix with 2 columns, defining the minimum and maximum of the range for x. |
The "BETWEEN" operators basically combine two conditional statements into one and simplify the query process.
They are merely a wrapper for: x >= rng[1] & x <= rng[2]
,
where the round bracket (
means strictly greater (>) and the square bracket [
means greater or equal (>=). Numerical values of x will be handled by C-code, which is significantly faster than two comparisons in R (especially when x is huge).
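A minimal sketch of the stated equivalence (not part of the original examples):

x <- 1:9
all((x %[]% c(3, 5)) == (x >= 3 & x <= 5))   # TRUE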
%][%
is the negation of %()%
, meaning all values lying outside the given range. Elements on the limits will return TRUE
.
Both arguments, x
and rng
, will be recycled to the highest dimension, which is either the length of the vector (x
) or the number of rows of the matrix (rng
).
See also the routines used to check, whether two ranges overlap (Overlap
, Interval
).
%:%
returns all the elements of a vector between the (first found) element rng[1]
and rng[2]
. If no match is found it returns NA
. If rng[2]
occurs before rng[1]
in the vector the elements will be returned in reverse order (which is the same behaviour as the :
operator).
%::%
does the same in greedy mood. It uses the first match for from
and the last match for to
.
A logical vector of the same length as x.
Andri Signorell <[email protected]> based on C-code by Kevin Ushey <[email protected]>
if
, ifelse
, Comparison
,
Overlap
, Interval
x <- 1:9
x %[]% c(3,5)

# outside
x <- 1:9
x %][% c(3,5)

c(x, NA) %[]% c(3,5)

x %(]% c(3,5)

# no result when from > to:
x %[]% c(5,3)
x %(]% c(5,5)

# no problem:
ordered(x) %[]% c(3,5)

# not meaningful:
factor(x) %[]% c(3,5)

# characters
letters[letters %(]% c("d","h")]

data(d.pizza)
x <- levels(d.pizza$driver)
x %[]% c("C","G")

# select diamonds with a price between 2400 and 2510
data(d.diamonds)
d.diamonds[d.diamonds$price %[]% c(2400,2510),]

# use it with an ordered factor and select all diamonds with
# symmetry between G (included) and X (excluded).
mean(d.diamonds[d.diamonds$symmetry %[)% c("G","X"), "price"])

# use multiple ranges
2 %[]% cbind(1:4, 2:5)

# both arguments are recycled
c(2,3) %[]% cbind(1:4, 2:5)

# between operator for vector positions
set.seed(4)
(x <- sample(LETTERS, size=10, replace=TRUE))
# [1] "X" "K" "S" "C" "G" "L" "S" "V" "U" "Z"

# return all elements between "S" and "L"
x %:% c("S","L")
# [1] "S" "C" "G" "L"

x %:% c("S","A")
# [1] "S" "C" "G" "L" "S" "V" "U" "Z"
x %:% c("A","S")
# [1] "X" "K" "S"

# reverted matches return the elements in reverse order
x %:% c("G","X")
# [1] "G" "C" "S" "K" "X"

# no match results in NA
x %:% c("Y","B")

(x <- c("B", "A", "X", "K", "S", "K", "G", "L", "K", "V", "K", "Z"))

# lazy
x %:% c("A", "K")
# greedy
x %::% c("A", "K")
Paints the background of the plot, using either the figure region, the plot region or both. It can sometimes be cumbersome to elaborate the coordinates and base R does not provide a simple function for that.
Bg(col = "grey", region = c("plot", "figure"), border = NA)
col |
the color of the background, if two colors are provided, the first is used for the plot region and the second for the figure region. |
region |
either |
border |
color for rectangle border(s). Default is |
Andri Signorell <[email protected]>
# use two different colors for the figure region and the plot region
plot(x = rnorm(100), col="blue", cex=1.2, pch=16,
     panel.first={Bg(c("red", "lightyellow"))
                  grid()})
Bhapkar (1966) tested marginal homogeneity by exploiting the asymptotic normality of marginal proportions, and so this test is also called Bhapkar's test. The idea behind the construction of the test statistic is similar to that of the generalized McNemar's test statistic used in StuartMaxwellTest, and the major difference lies in the calculation of the elements of the variance-covariance matrix.
BhapkarTest(x, y = NULL)
x |
either a 2-way |
y |
a factor with the same levels as |
The Bhapkar and Stuart-Maxwell tests are asymptotically equivalent (Keefe, 1982). Generally, the Bhapkar (1966) test is a more powerful alternative to the Stuart-Maxwell test. With a large N, both will produce the same Chi-square value. As the Bhapkar test is more powerful, it is preferred.
Andri Signorell <[email protected]>
Bhapkar V.P. (1966) A note on the equivalence of two test criteria for hypotheses in categorical data. Journal of the American Statistical Association, 61: 228-235.
Ireland C.T., Ku H.H., and Kullback S. (1969) Symmetry and marginal homogeneity of an r x r contingency table. Journal of the American Statistical Association, 64: 1323-1341.
Keefe T.J. (1982) On the relationship between two tests for homogeneity of the marginal distributions in a two-way classification. Biometrika, 69: 683-684.
Sun X., Yang Z. (2008) Generalized McNemar's Test for Homogeneity of the Marginal Distributions. SAS Global Forum 2008: Statistics and Data Analysis, Paper 382-208.
StuartMaxwellTest
, mcnemar.test
, chisq.test
, MHChisqTest
,
BreslowDayTest
# Source: http://www.john-uebersax.com/stat/mcnemar.htm#stuart
mc <- as.table(matrix(c(20,3,0,10,30,5,5,15,40), nrow=3))

BhapkarTest(mc)
Compute confidence intervals for binomial proportions according to a number of the most common proposed methods.
BinomCI(x, n, conf.level = 0.95, sides = c("two.sided", "left", "right"),
        method = c("wilson", "wald", "waldcc", "agresti-coull", "jeffreys",
                   "modified wilson", "wilsoncc", "modified jeffreys",
                   "clopper-pearson", "arcsine", "logit", "witting", "pratt",
                   "midp", "lik", "blaker"),
        rand = 123, tol = 1e-05, std_est = TRUE)
x |
number of successes. |
n |
number of trials. |
conf.level |
confidence level, defaults to 0.95. |
sides |
a character string specifying the side of the confidence interval, must be one of |
method |
character string specifying which method to use; this can be one out of:
|
rand |
seed for random number generator; see details. |
tol |
tolerance for method |
std_est |
logical, specifying if the standard point estimator for the proportion value |
All arguments are being recycled.
The Wald interval is obtained by inverting the acceptance region of the Wald large-sample normal test.
The Wald with continuity correction interval is obtained by adding the term 1/(2*n) to the Wald interval.
The Wilson interval, which here is the default method, was introduced by Wilson (1927) and is the inversion of the CLT approximation to the family of equal tail tests of p = p0.
The Wilson interval is recommended by Agresti and Coull (1998) as well as by
Brown et al (2001). It is also returned as conf.int
from the function prop.test
with the correct
option set to FALSE
.
The Wilson cc interval is a modification of the Wilson interval adding a continuity correction term. This is returned as conf.int
from the function prop.test
with the correct
option set to TRUE
.
The modified Wilson interval is a modification of the Wilson interval for x close to 0 or n as proposed by Brown et al (2001).
The Agresti-Coull interval was proposed by Agresti and Coull (1998) and is a slight modification of the Wilson interval. The Agresti-Coull intervals are never shorter than the Wilson intervals; cf. Brown et al (2001). The internally used point estimator p-tilde is returned as attribute.
The Jeffreys interval is an implementation of the equal-tailed Jeffreys prior interval as given in Brown et al (2001).
The modified Jeffreys interval is a modification of the Jeffreys interval for
x == 0 | x == 1
and x == n-1 | x == n
as proposed by
Brown et al (2001).
The Clopper-Pearson interval is based on quantiles of corresponding beta distributions. This is sometimes also called exact interval.
The arcsine interval is based on the variance stabilizing distribution for the binomial distribution.
The logit interval is obtained by inverting the Wald type interval for the log odds.
The Witting interval (cf. Beispiel 2.106 in Witting (1985)) uses randomization to obtain uniformly optimal lower and upper confidence bounds (cf. Satz 2.105 in Witting (1985)) for binomial proportions.
The Pratt interval is obtained by extremely accurate normal approximation. (Pratt 1968)
The Mid-p approach is used to reduce the conservatism of the Clopper-Pearson, which is known to be very pronounced. The method midp accumulates the tail areas.
The lower bound p_L is found as the solution to the equation

\frac{1}{2} f(x; n, p_L) + \left(1 - F(x; n, p_L)\right) = \frac{\alpha}{2},

where f denotes the probability mass function (pmf) and F the (cumulative) distribution function of the binomial distribution with size n and proportion p, evaluated at x. The upper bound p_U is found as the solution to the equation

\frac{1}{2} f(x; n, p_U) + F(x - 1; n, p_U) = \frac{\alpha}{2}.

In case x=0 then the lower bound is zero and in case x=n then the upper bound is 1.
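As a hedged numerical check (not part of the original examples), the lower bound returned by method "midp" should solve the first equation above; the column name "lwr.ci" is assumed from the usual BinomCI output:

x <- 15; n <- 148; alpha <- 0.05
p_l <- BinomCI(x, n, conf.level=1-alpha, method="midp")[, "lwr.ci"]
0.5 * dbinom(x, n, p_l) + (1 - pbinom(x, n, p_l))   # approximately alpha/2 = 0.025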
The Likelihood-based approach is said to be theoretically appealing. Confidence intervals are based on profiling the binomial deviance in the neighbourhood of the MLE.
For the Blaker method refer to Blaker (2000).
For more details we refer to Brown et al (2001) as well as Witting (1985).
Some approaches for the confidence intervals are capable of violating the [0, 1] boundaries and could potentially yield negative results or values beyond 1. These are truncated so as not to exceed the valid range of [0, 1].
So now, which interval should we use? The Wald interval often has inadequate coverage, particularly for small n and values of p close to 0 or 1. Conversely, the Clopper-Pearson exact method is very conservative and tends to produce wider intervals than necessary. Brown et al. recommend the Wilson or Jeffreys methods for small n and Agresti-Coull, Wilson, or Jeffreys for larger n, as these provide more reliable coverage than the alternatives.
For the methods "wilson"
, "wilsoncc"
, "modified wilson"
, "agresti-coull"
and "arcsine"
the internally used alternative point estimator for the proportion value can be returned (by setting std_est = FALSE
). The point estimate is typically slightly shifted towards 0.5 compared to the standard estimator. See the literature for more details.
A vector with 3 elements for estimate, lower confidence intervall and upper for the upper one.
For more than one argument each, a 3-column matrix is returned.
The base of this function once was binomCI()
from the SLmisc package. In the meantime, the code has been updated on several occasions and it has undergone numerous extensions and bug fixes.
Matthias Kohl <[email protected]>, Rand R. Wilcox (Pratt's method), Michael Hoehle <[email protected]> (Mid-p), Ralph Scherer <[email protected]> (Blaker), Andri Signorell <[email protected]> (interface issues and all the rest)
Agresti A. and Coull B.A. (1998) Approximate is better than "exact" for interval estimation of binomial proportions. American Statistician, 52, pp. 119-126.
Brown L.D., Cai T.T. and Dasgupta A. (2001) Interval estimation for a binomial proportion Statistical Science, 16(2), pp. 101-133.
Witting H. (1985) Mathematische Statistik I. Stuttgart: Teubner.
Pratt J. W. (1968) A normal approximation for binomial, F, Beta, and other common, related tail probabilities. Journal of the American Statistical Association, 63, 1457-1483.
Wilcox, R. R. (2005) Introduction to robust estimation and hypothesis testing. Elsevier Academic Press
Newcombe, R. G. (1998) Two-sided confidence intervals for the single proportion: comparison of seven methods, Statistics in Medicine, 17:857-872 https://pubmed.ncbi.nlm.nih.gov/16206245/
Blaker, H. (2000) Confidence curves and improved exact confidence intervals for discrete distributions, Canadian Journal of Statistics 28 (4), 783-798
binom.test
, binconf
, MultinomCI
, BinomDiffCI
, BinomRatioCI
BinomCI(x=37, n=43, method=eval(formals(BinomCI)$method)) # return all methods prop.test(x=37, n=43, correct=FALSE) # same as method wilson prop.test(x=37, n=43, correct=TRUE) # same as method wilsoncc # the confidence interval computed by binom.test # corresponds to the Clopper-Pearson interval BinomCI(x=42, n=43, method="clopper-pearson") binom.test(x=42, n=43)$conf.int # all arguments are being recycled: BinomCI(x=c(42, 35, 23, 22), n=43, method="wilson") BinomCI(x=c(42, 35, 23, 22), n=c(50, 60, 70, 80), method="jeffreys") # example Table I in Newcombe (1998) meths <- c("wald", "waldcc", "wilson", "wilsoncc", "clopper-pearson","midp", "lik") round(cbind( BinomCI(81, 263, m=meths)[, -1], BinomCI(15, 148, m=meths)[, -1], BinomCI(0, 20, m=meths)[, -1], BinomCI(1, 29, m=meths)[, -1]), 4) # returning p.tilde for agresti-coull ci BinomCI(x=81, n=263, meth="agresti-coull", std_est = c(TRUE, FALSE))
Returns the necessary sample size to achieve a given width of a binomial confidence interval, as calculated by BinomCI()
. The function uses uniroot()
to find a numeric solution.
BinomCIn(p = 0.5, width, interval = c(1, 100000), conf.level = 0.95, sides = "two.sided", method = "wilson")
p |
probability for success, defaults to |
width |
the width of the confidence interval |
interval |
a vector containing the end-points of the interval to be searched for the root. The defaults are set to |
conf.level |
confidence level, defaults to |
sides |
a character string specifying the side of the confidence interval, must be one of |
method |
character string specifying which method to use; this can be one out of:
|
The required sample size for a specific width of the confidence interval depends on the proportion in the population. This value might be unknown at the time a study is planned. In such cases, the sample size needed for a given level of accuracy can be estimated using the worst case percentage, which is p=50%. When a better estimate is available you can use it to get a smaller interval.
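As a quick illustration (a sketch added here, not taken from the original text), the worst-case sample size at p = 0.5 can be compared with the size needed for a more informative estimate, and the result can be checked by computing the achieved interval width:

n50 <- BinomCIn(p = 0.5, width = 0.10)   # worst case
n10 <- BinomCIn(p = 0.1, width = 0.10)   # smaller n suffices for p away from 0.5
c(n50 = n50, n10 = n10)
# verify: the Wilson interval at x = 0.5 * n50 has (roughly) the requested width
ci <- BinomCI(x = round(0.5 * n50), n = round(n50), method = "wilson")
diff(ci[, c("lwr.ci", "upr.ci")])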
a numeric value
Andri Signorell <[email protected]>
BinomCI()
BinomCIn(p=0.1, width=0.05, method="pratt")
Several confidence intervals for the difference between proportions are available, but they can produce markedly different results. Traditional approaches, such as the Wald interval, do not perform well unless the sample size is large. Better intervals are available. These include the Agresti/Caffo method (2000), the Newcombe score method (1998) and more computationally intensive ones such as those by Miettinen and Nurminen (1985) or Mee (1984). The latter are favoured by Newcombe (when forced to choose between a rock and a hard place).
BinomDiffCI(x1, n1, x2, n2, conf.level = 0.95, sides = c("two.sided","left","right"), method = c("ac", "wald", "waldcc", "score", "scorecc", "mn", "mee", "blj", "ha", "hal", "jp"))
x1 |
number of successes for the first group. |
n1 |
number of trials for the first group. |
x2 |
number of successes for the second group. |
n2 |
number of trials for the second group. |
conf.level |
confidence level, defaults to 0.95. |
sides |
a character string specifying the side of the confidence interval, must be one of |
method |
one of |
All arguments are being recycled.
We estimate the difference between proportions using the sample proportions: delta-hat = p1-hat - p2-hat = x1/n1 - x2/n2.
The traditional Wald confidence interval for the difference of two proportions is based on the asymptotic normal distribution of delta-hat.
The Corrected Wald interval uses a continuity correction included in the test statistic. The continuity correction is subtracted from the numerator of the test statistic if the numerator is greater than zero; otherwise, the continuity correction is added to the numerator. The value of the continuity correction is (1/n1 + 1/n2)/2.
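For illustration, here is a minimal sketch (added here, not package code) of the plain Wald interval computed directly from the sample proportions and compared with BinomDiffCI():

x1 <- 56; n1 <- 70; x2 <- 48; n2 <- 80
p1 <- x1/n1; p2 <- x2/n2
z  <- qnorm(0.975)
se <- sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
c(est = p1 - p2, lwr.ci = p1 - p2 - z*se, upr.ci = p1 - p2 + z*se)
BinomDiffCI(x1, n1, x2, n2, method = "wald")   # should agree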
The Agresti-Caffo interval (code "ac") is the Wald interval with the adjustment according to Agresti and Caffo (2000) for the difference in proportions and independent samples. It adds 1 to x1 and x2, adds 2 to n1 and n2, and performs surprisingly well.
Newcombe (code "score") proposed a confidence interval for the difference based on the Wilson score confidence interval for a single proportion. A variant uses a continuity correction for the Wilson interval (code "scorecc").
Miettinen and Nurminen showed that the restricted maximum likelihood estimates for p1 and p2 can
be obtained by solving a cubic equation and gave unique closed-form expressions for them. The Miettinen-Nurminen confidence interval is returned with code "mn"
.
The Mee interval (code "mee"), proposed by Mee (1984) and Farrington-Manning (1990), uses the same maximum likelihood estimators as Miettinen-Nurminen but with a different correction factor.
The Brown-Li Jeffreys interval (code "blj") was proposed by Brown and Li (2005).
The Hauck-Anderson interval (code "ha") was proposed by Hauck and Anderson (1986).
The Haldane (code "hal"
) interval is described in Newcombe (1998) and so is
the Jeffreys-Perks (code "jp"
).
Some of the approaches can yield confidence limits beyond [-1, 1]. Such limits are reset so that they do not exceed the range of [-1, 1].
Which of the methods to use is currently still the subject of lively discussion and has not yet been conclusively clarified. See e.g. Fagerland (2011).
The general consensus is that the most widely taught method, method="wald", is inappropriate in many situations and should not be used. Recommendations seem to converge around the Miettinen-Nurminen based method (method="mn").
A matrix with 3 columns containing the estimate, the lower and the upper confidence limit.
Andri Signorell <[email protected]>
Agresti, A, Caffo, B (2000) Simple and effective confidence intervals for proportions and difference of proportions result from adding two successes and two failures. The American Statistician 54 (4), 280-288.
Beal, S L (1987) Asymptotic Confidence Intervals for the Difference Between Two Binomial Parameters for Use with Small Samples; Biometrics, 43, 941-950.
Brown L, Li X (2005) Confidence intervals for two sample binomial distribution, Journal of Statistical Planning and Inference, 130(1), 359-375.
Hauck WW, Anderson S. (1986) A comparison of large-sample confidence interval methods for the difference of two binomial probabilities The American Statistician 40(4): 318-322.
Farrington, C. P. and Manning, G. (1990) Test Statistics and Sample Size Formulae for Comparative Binomial Trials with Null Hypothesis of Non-zero Risk Difference or Non-unity Relative Risk Statistics in Medicine, 9, 1447-1454.
Mee RW (1984) Confidence bounds for the difference between two probabilities, Biometrics 40:1175-1176 .
Miettinen OS, Nurminen M. (1985) Comparative analysis of two rates. Statistics in Medicine 4, 213-226.
Newcombe, R G (1998). Interval Estimation for the Difference Between Independent Proportions: Comparison of Eleven Methods. Statistics in Medicine, 17, 873–890.
Fagerland M W, Lydersen S and Laake P (2011) Recommended confidence intervals for two independent binomial proportions, Statistical Methods in Medical Research 0(0) 1-31
BinomCI
, MultinomCI
, binom.test
,
prop.test
, BinomRatioCI
x1 <- 56; n1 <- 70; x2 <- 48; n2 <- 80 xci <- BinomDiffCI(x1, n1, x2, n2, method=c("wald", "waldcc", "ac", "score", "scorecc", "mn", "mee", "blj", "ha")) Format(xci[,-1], digits=4) x1 <- 9; n1 <- 10; x2 <- 3; n2 <- 10 yci <- BinomDiffCI(x1, n1, x2, n2, method=c("wald", "waldcc", "ac", "score", "scorecc", "mn", "mee", "blj", "ha")) Format(yci[, -1], digits=4) # https://www.lexjansen.com/wuss/2016/127_Final_Paper_PDF.pdf, page 9 SetNames(round( BinomDiffCI(56, 70, 48, 80, method=c("wald", "waldcc", "hal", "jp", "mee", "mn", "score", "scorecc", "ha", "ac", "blj"))[,-1], 4), rownames=c("1. Wald, no CC", "2. Wald, CC", "3. Haldane", "4. Jeffreys-Perks", "5. Mee", "6. Miettinen-Nurminen", "10. Score, no CC", "11. Score, CC", "12. Hauck-Andersen", "13. Agresti-Caffo", "16. Brown-Li"))
A number of methods have been developed for obtaining confidence intervals for the ratio of two binomial proportions. These include the Wald/Katz-log method (Katz et al. 1978), the adjusted-log method (Walter 1975, Pettigrew et al. 1986), the Koopman asymptotic score method (Koopman 1984), the inverse hyperbolic sine transformation (Newcombe 2001), the Bailey method (Bailey 1987), and the Noether (1957) procedure. Limits for the Koopman method are found iteratively using root finding.
BinomRatioCI(x1, n1, x2, n2, conf.level = 0.95, sides = c("two.sided", "left", "right"), method = c("katz.log", "adj.log", "bailey", "koopman", "noether", "sinh-1", "boot"), tol = .Machine$double.eps^0.25, R = 1000)
x1 |
number of successes for the ratio numerator. |
n1 |
number of trials for the ratio numerator. |
x2 |
number of successes for the ratio denominator. |
n2 |
number of trials for the ratio denominator. |
conf.level |
confidence level, defaults to 0.95. |
sides |
a character string specifying the side of the confidence interval, must be one of |
method |
confidence interval method, one of |
tol |
The desired accuracy (convergence tolerance) for the iterative root finding procedure when finding Koopman intervals. The default is taken to be the smallest positive floating-point number of the workstation implementing the function, raised to the 0.25 power, and will normally be approximately 0.0001. |
R |
If method |
All arguments are being recycled.
Let Y1 and Y2 be multinomial random variables with parameters n1, pi_1i and n2, pi_2i, respectively, where i = 1, 2, ..., r. This encompasses the binomial case in which r = 2. We define the true selection ratio for the i-th of r total resources to be theta_i = pi_1i / pi_2i, where pi_1i and pi_2i represent the proportional use and availability of the i-th resource, respectively. Note that if r = 2 the selection ratio becomes the relative risk. The maximum likelihood estimators for pi_1i and pi_2i are the sample proportions p_1i = y_1i / n1 and p_2i = y_2i / n2, where y_1i and y_2i are the observed counts for use and availability for the i-th resource. The estimator for theta_i is the ratio of the sample proportions, theta_i-hat = p_1i / p_2i.
The individual methods differ in how they approximate the sampling distribution of theta_i-hat: the Katz-log and adjusted-log methods construct Wald-type limits for the log of the ratio and back-transform them, the inverse hyperbolic sine method works on the sinh^-1 transformed ratio, the Koopman method inverts an asymptotic score statistic (its limits are found iteratively by root finding), and the Bailey and Noether methods use closed-form approximations of their own. The exact expressions for each method are given in the respective original references.
Exception handling strategies are generally necessary in the cases x1 = 0, x1 = n1, x2 = 0, and x2 = n2 (see Aho and Bowyer, in review).
The bootstrap method currently employs percentile confidence intervals.
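As an illustration of the Wald/Katz-log approach, the interval can be computed directly on the log scale and back-transformed. This is a sketch added here; the package implementation may differ in details such as boundary handling:

x1 <- 36; n1 <- 40; x2 <- 16; n2 <- 80
rr <- (x1/n1) / (x2/n2)
se <- sqrt(1/x1 - 1/n1 + 1/x2 - 1/n2)    # SE of log(rr), Katz et al. (1978)
z  <- qnorm(0.975)
exp(log(rr) + c(-1, 1) * z * se)
BinomRatioCI(x1, n1, x2, n2, method = "katz.log")   # for comparison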
A matrix with 3 columns containing the estimate, the lower and the upper confidence limit.
Ken Aho <[email protected]>, some tweaks Andri Signorell <[email protected]>
Agresti, A., Min, Y. (2001) On small-sample confidence intervals for parameters in discrete distributions. Biometrics 57: 963-97.
Aho, K., and Bowyer, T. (In review) Confidence intervals for ratios of multinomial proportions: implications for selection ratios. Methods in Ecology and Evolution.
Bailey, B.J.R. (1987) Confidence limits to the risk ratio. Biometrics 43(1): 201-205.
Katz, D., Baptista, J., Azen, S. P., and Pike, M. C. (1978) Obtaining confidence intervals for the risk ratio in cohort studies. Biometrics 34: 469-474
Koopman, P. A. R. (1984) Confidence intervals for the ratio of two binomial proportions. Biometrics 40:513-517.
Manly, B. F., McDonald, L. L., Thomas, D. L., McDonald, T. L. and Erickson, W.P. (2002) Resource Selection by Animals: Statistical Design and Analysis for Field Studies. 2nd edn. Kluwer, New York, NY
Newcombe, R. G. (2001) Logit confidence intervals and the inverse sinh transformation. The American Statistician 55: 200-202.
Pettigrew H. M., Gart, J. J., Thomas, D. G. (1986) The bias and higher cumulants of the logarithm of a binomial variate. Biometrika 73(2): 425-435.
Walter, S. D. (1975) The distribution of Levins measure of attributable risk. Biometrika 62(2): 371-374.
# From Koopman (1984) BinomRatioCI(x1 = 36, n1 = 40, x2 = 16, n2 = 80, method = "katz") BinomRatioCI(x1 = 36, n1 = 40, x2 = 16, n2 = 80, method = "koop")
Create a binary tree of a given number of nodes n
. Can be used to organize a sorted numeric vector as a binary tree.
BinTree(n) PlotBinTree(x, main="Binary tree", horiz=FALSE, cex=1.0, col=1, ...)
n |
integer, size of the tree |
x |
numeric vector to be organized as binary tree. |
main |
main title of the plot. |
horiz |
logical, should the plot be oriented horizontally or vertically. The latter is default. |
cex |
character extension factor for the labels. |
col |
color of the linesegments of the plot. |
... |
the dots are sent to |
If we index the nodes of the tree as 1 for the top, 2–3 for the next horizontal row, 4–7 for the next, ... then the parent-child traversal becomes particularly easy. The basic idea is that the rows of the tree start at indices 1, 2, 4, ....
BinTree(13) yields the vector
c(8, 4, 9, 2, 10, 5, 11, 1, 12, 6, 13, 3, 7)
meaning that the smallest element
will be in position 8 of the tree, the next smallest in position 4, etc.
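The indexing is the usual 'heap' numbering, in which node i has its children at positions 2*i and 2*i + 1. A small sketch (added here for illustration) shows how a sorted vector is placed into tree positions and that the binary search tree property holds:

set.seed(1)
idx <- BinTree(13)          # tree position of each rank
x <- sort(rnorm(13))
tree <- numeric(13)
tree[idx] <- x              # tree[i] is now the value stored at tree node i
which(idx == 1)             # 8: the 8th smallest value (the median) sits at the root
i <- 2                      # inspect node 2: left child smaller, right child larger
c(parent = tree[i], left = tree[2*i], right = tree[2*i + 1])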
an integer vector of length n
Terry Therneau <[email protected]>
Andri Signorell <[email protected]> (plot)
BinTree(12) x <- sort(sample(100, 24)) z <- PlotBinTree(x, cex=0.8) # Plot example - Titanic data, for once from a somwhat different perspective tab <- apply(Titanic, c(2,3,4), sum) cprob <- c(1, prop.table(apply(tab, 1, sum)) , as.vector(aperm(prop.table(apply(tab, c(1,2), sum), 1), c(2, 1))) , as.vector(aperm(prop.table(tab, c(1,2)), c(3,2,1))) ) PlotBinTree(round(cprob[BinTree(length(cprob))],2), horiz=TRUE, cex=0.8, main="Titanic") text(c("sex","age","survived"), y=0, x=c(1,2,3)+1)
Convenience wrapper for calculating bootstrap confidence intervals for univariate and bivariate statistics.
BootCI(x, y = NULL, FUN, ..., bci.method = c("norm", "basic", "stud", "perc", "bca"), conf.level = 0.95, sides = c("two.sided", "left", "right"), R = 999)
x |
a (non-empty) numeric vector of data values. |
y |
NULL (default) or a vector with compatible dimensions to |
FUN |
the function to be used |
bci.method |
A vector of character strings representing the type of intervals required. The value should be any subset of the values |
conf.level |
confidence level of the interval. |
sides |
a character string specifying the side of the confidence interval, must be one of |
... |
further arguments are passed to the function |
R |
The number of bootstrap replicates. Usually this will be a single positive integer. For importance resampling,
some resamples may use one set of weights and others use a different set of weights. In this case |
a named numeric vector with 3 elements:
est |
the specific estimate, as calculated by |
lwr.ci |
lower bound of the confidence interval |
upr.ci |
upper bound of the confidence interval |
Andri Signorell <[email protected]>
set.seed(1984) BootCI(d.pizza$temperature, FUN=mean, na.rm=TRUE, bci.method="basic") BootCI(d.pizza$temperature, FUN=mean, trim=0.1, na.rm=TRUE, bci.method="basic") BootCI(d.pizza$temperature, FUN=Skew, na.rm=TRUE, bci.method="basic") BootCI(d.pizza$operator, d.pizza$area, FUN=CramerV) spearman <- function(x,y) cor(x, y, method="spearman", use="p") BootCI(d.pizza$temperature, d.pizza$delivery_min, FUN=spearman)
BoxCox() returns a transformation of the input variable using a Box-Cox transformation. BoxCoxInv() reverses the transformation.
BoxCox(x, lambda) BoxCoxInv(x, lambda)
x |
a numeric vector |
lambda |
transformation parameter |
The Box-Cox transformation, in its standard form, is given by f(x; lambda) = (x^lambda - 1) / lambda for lambda != 0, and f(x; lambda) = log(x) for lambda = 0.
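A minimal sketch of the transformation pair, written here for illustration (the package functions should be preferred since they also handle edge cases):

box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
box_cox_inv <- function(y, lambda) {
  if (lambda == 0) exp(y) else (lambda * y + 1)^(1/lambda)
}
x <- c(0.5, 1, 2, 4)
all.equal(box_cox_inv(box_cox(x, 0.3), 0.3), x)   # round trip returns TRUE
all.equal(BoxCoxInv(BoxCox(x, 0.3), 0.3), x)      # same with the package functions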
a numeric vector of the same length as x.
These two functions are borrowed from library(forecast)
.
Rob J Hyndman <[email protected]>
Box, G. E. P. and Cox, D. R. (1964) An analysis of transformations. JRSS B 26 211–246.
Use BoxCoxLambda
or boxcox
in library(MASS)
to find optimal lambda values.
# example by Greg Snow x <- rlnorm(500, 3, 2) par(mfrow=c(2,2)) qqnorm(x, main="Lognormal") qqnorm(BoxCox(x, 1/2), main="BoxCox(lambda=0.5)") qqnorm(BoxCox(x, 0), main="BoxCox(lambda=0)") PlotFdist(BoxCox(x, 0)) bx <- BoxCox(x, lambda = BoxCoxLambda(x) )
The Box-Cox transformation parameter can be selected automatically using one of two methods.
Guerrero's (1993) method yields a lambda which minimizes the coefficient of variation for subseries of x
.
For method "loglik"
, the value of lambda is chosen to maximize the profile log likelihood of a linear model fitted to x
.
For non-seasonal data, a linear time trend is fitted while for seasonal data, a linear time trend with seasonal dummy variables is used.
BoxCoxLambda(x, method = c("guerrero", "loglik"), lower = -1, upper = 2)
x |
a numeric vector or time series |
method |
method to be used in calculating lambda. Can be either "guerrero" (default) or "loglik". |
lower |
lower limit for possible lambda values, default is -1. |
upper |
upper limit for possible lambda values, default is 2. |
a number indicating the Box-Cox transformation parameter.
This function was previously published as BoxCox.lambda()
in the forecast package and has been integrated here without logical changes.
Leanne Chhay and Rob J Hyndman
Box, G. E. P. and Cox, D. R. (1964) An analysis of transformations. JRSS B 26 211–246.
Guerrero, V.M. (1993) Time-series analysis supported by power transformations. Journal of Forecasting, 12, 37–48.
lambda <- BoxCoxLambda(AirPassengers, lower=0)
BoxedText draws the strings given in the vector labels at the coordinates given by x and y, surrounded by a rectangle.
BoxedText(x, ...) ## Default S3 method: BoxedText(x, y = NULL, labels = seq_along(x), adj = NULL, pos = NULL, offset = 0.5, vfont = NULL, cex = 1, col = NULL, font = NULL, srt = 0, xpad = 0.2, ypad = 0.2, density = NULL, angle = 45, bg = NA, border = par("fg"), lty = par("lty"), lwd = par("lwd"), ...)
x , y
|
numeric vectors of coordinates where the text labels should be written. If the length of x and y differs, the shorter one is recycled. |
labels |
a character vector or expression specifying the text to be written. An attempt is made to coerce other language objects (names and calls) to expressions, and vectors and other classed objects to character vectors by as.character. If labels is longer than x and y, the coordinates are recycled to the length of labels. |
adj |
The value of adj determines the way in which text strings are justified. A value of 0 produces left-justified text, 0.5 (the default) centered text and 1 right-justified text. (Any value in [0, 1] is allowed, and on most devices values outside that interval will also work.) Note that the adj argument of text also allows adj = c(x, y) for different adjustment in x- and y- directions. |
pos |
a position specifier for the text. If specified this overrides any adj value given. Values of 1, 2, 3 and 4, respectively indicate positions below, to the left of, above and to the right of the specified coordinates. |
offset |
when pos is specified, this value gives the offset of the label from the specified coordinate in fractions of a character width. |
vfont |
|
cex |
numeric character expansion factor; multiplied by |
col , font
|
the color and (if vfont = NULL) font to be used, possibly vectors. These default to the values of the global graphical parameters in |
srt |
The string rotation in degrees. |
xpad , ypad
|
The proportion of the rectangles to the extent of the text within. |
density |
the density of shading lines, in lines per inch. The default value of |
angle |
angle (in degrees) of the shading lines. |
bg |
color(s) to fill or shade the rectangle(s) with. The default |
border |
color for rectangle border(s). The default is |
lty |
line type for borders and shading; defaults to |
lwd |
line width for borders and shading. Note that the use of |
... |
additional arguments are passed to the text function. |
Andri Signorell <[email protected]>
SpreadOut
, similar function in package plotrix boxed.labels
(lacking rotation option)
Canvas(xpd=TRUE) BoxedText(0, 0, adj=0, label="This is boxed text", srt=seq(0,360,20), xpad=.3, ypad=.3) points(0,0, pch=15)
Calculates the Breslow-Day test of homogeneity for a 2 x 2 x k table, in order to investigate if all k strata have the same odds ratio (OR). If OR is not given, the Mantel-Haenszel estimate is used.
BreslowDayTest(x, OR = NA, correct = FALSE)
x |
a |
OR |
the odds ratio to be tested against. If left undefined (default) the Mantel-Haenszel estimate will be used. |
correct |
If TRUE, the Breslow-Day test with Tarone's adjustment is computed, which subtracts an adjustment factor to make the resulting statistic asymptotically chi-square. |
For the Breslow-Day test to be valid, the sample size should be relatively large in each stratum, and at least 80% of the expected cell counts should be greater than 5. Note that this is a stricter sample size requirement than the requirement for the Cochran-Mantel-Haenszel test for 2 x 2 x k tables, in that each stratum sample size (not just the overall sample size) must be relatively large. Even when the Breslow-Day test is valid, it might not be very powerful against certain alternatives, as discussed in Breslow and Day (1980).
Alternatively, it might be better to cast the entire inference problem
into the setting of a logistic regression model. Here, the underlying
question of the Breslow-Day test can be answered by investigating whether an
interaction term with the strata variable is necessary (e.g. using a
likelihood ratio test using the anova
function).
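A minimal sketch of this logistic-regression alternative (added here for illustration, reusing the migraine data constructed in the examples below): the interaction with the stratum variable is tested with a likelihood ratio test.

migraine <- xtabs(freq ~ .,
  cbind(expand.grid(treatment = c("active", "placebo"),
                    response  = c("better", "same"),
                    gender    = c("female", "male")),
        freq = c(16, 5, 11, 20, 12, 7, 16, 19)))
d.frm <- Untable(migraine)
m0 <- glm(response ~ treatment + gender, data = d.frm, family = binomial)
m1 <- glm(response ~ treatment * gender, data = d.frm, family = binomial)
anova(m0, m1, test = "LRT")   # comparable in spirit to BreslowDayTest(migraine)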
Michael Hoehle <[email protected]>
Breslow, N. E., N. E. Day (1980) The Analysis of Case-Control Studies Statistical Methods in Cancer Research: Vol. 1. Lyon, France, IARC Scientific Publications.
Tarone, R.E. (1985) On heterogeneity tests based on efficient scores, Biometrika, 72, pp. 91-95.
Jones, M. P., O'Gorman, T. W., Lemka, J. H., and Woolson, R. F. (1989) A Monte Carlo Investigation of Homogeneity Tests of the Odds Ratio Under Various Sample Size Configurations Biometrics, 45, 171-181
Breslow, N. E. (1996) Statistics in Epidemiology: The Case-Control Study Journal of the American Statistical Association, 91, 14-26.
migraine <- xtabs(freq ~ ., cbind(expand.grid(treatment=c("active", "placebo"), response =c("better", "same"), gender =c("female", "male")), freq=c(16, 5, 11, 20, 12, 7, 16, 19)) ) # get rid of gender tab <- xtabs(Freq ~ treatment + response, migraine) Desc(tab) # only the women female <- migraine[,, 1] Desc(female) # .. and the men male <- migraine[,, 2] Desc(male) BreslowDayTest(migraine) BreslowDayTest(migraine, correct = TRUE) salary <- array( c(38, 12, 102, 141, 12, 9, 136, 383), dim=c(2, 2, 2), dimnames=list(exposure=c("exposed", "not"), disease =c("case", "control"), salary =c("<1000", ">=1000")) ) # common odds ratio = 4.028269 BreslowDayTest(salary, OR = 4.02)
BreuschGodfreyTest
performs the Breusch-Godfrey test for higher-order
serial correlation.
BreuschGodfreyTest( formula, order = 1, order.by = NULL, type = c("Chisq", "F"), data = list(), fill = 0 )
formula |
a symbolic description for the model to be tested (or a
fitted |
order |
integer. maximal order of serial correlation to be tested. |
order.by |
Either a vector |
type |
the type of test statistic to be returned. Either |
data |
an optional data frame containing the variables in the model. By
default the variables are taken from the environment which
|
fill |
starting values for the lagged residuals in the auxiliary
regression. By default |
Under the null hypothesis the test statistic is asymptotically chi-squared with degrees of freedom as given in parameter. If type is set to "F" the function returns a finite sample version of the test statistic, employing an F distribution with degrees of freedom as given in parameter.
By default, the starting values for the lagged residuals in the auxiliary
regression are chosen to be 0 (as in Godfrey 1978) but could also be set to
NA
to omit them.
BreuschGodfreyTest
also returns the coefficients and estimated
covariance matrix from the auxiliary regression that includes the lagged
residuals. Hence, CoefTest
(package: RegClassTools) can be used to
inspect the results. (Note, however, that standard theory does not always
apply to the standard errors and t-statistics in this regression.)
A list with class "BreuschGodfreyTest"
inheriting from
"htest"
containing the following components:
statistic |
the value of the test statistic. |
p.value |
the p-value of the test. |
parameter |
degrees of freedom. |
method |
a character string indicating what type of test was performed. |
data.name |
a character string giving the name(s) of the data. |
coefficients |
coefficient estimates from the auxiliary regression. |
vcov |
corresponding covariance matrix estimate. |
This function was previously published as bgtest
in the
lmtest package and has been integrated here without logical changes.
David Mitchell [email protected], Achim Zeileis
Johnston, J. (1984): Econometric Methods, Third Edition, McGraw Hill Inc.
Godfrey, L.G. (1978): 'Testing Against General Autoregressive and Moving Average Error Models when the Regressors Include Lagged Dependent Variables', Econometrica, 46, 1293-1302.
Breusch, T.S. (1979): 'Testing for Autocorrelation in Dynamic Linear Models', Australian Economic Papers, 17, 334-355.
## Generate a stationary and an AR(1) series x <- rep(c(1, -1), 50) y1 <- 1 + x + rnorm(100) ## Perform Breusch-Godfrey test for first-order serial correlation: BreuschGodfreyTest(y1 ~ x) ## or for fourth-order serial correlation BreuschGodfreyTest(y1 ~ x, order = 4) ## Compare with Durbin-Watson test results: DurbinWatsonTest(y1 ~ x) y2 <- stats::filter(y1, 0.5, method = "recursive") BreuschGodfreyTest(y2 ~ x)
Calculate Brier score for assessing the quality of the probabilistic predictions of binary events.
BrierScore(x, pred = NULL, scaled = FALSE, ...)
x |
either a model object if pred is not supplied or the response variable if it is. |
pred |
the predicted values |
scaled |
logical, defining if scaled or not. Default is FALSE. |
... |
further arguments to be passed to other functions. |
The Brier score is a proper score function that measures the accuracy of probabilistic predictions. It is applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete outcomes. The set of possible outcomes can be either binary or categorical in nature, and the probabilities assigned to this set of outcomes must sum to one (where each individual probability is in the range of 0 to 1).
It is calculated as BS = 1/n * sum_i (p_i - o_i)^2, where p_i is the predicted probability and o_i the observed binary outcome (0 or 1) for observation i, and the sum runs over all n observations.
The lower the Brier score is for a set of predictions, the better the predictions are calibrated. Note that the Brier score, in its most common formulation, takes on a value between zero and one, since this is the largest possible difference between a predicted probability (which must be between zero and one) and the actual outcome (which can take on values of only 0 and 1). (In the original (1950) formulation of the Brier score, the range is double, from zero to two.)
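To make the definition concrete, here is a small sketch (added for illustration) computing the score directly from predicted probabilities and comparing it with BrierScore():

r.glm <- glm(Survived ~ ., data = Untable(Titanic), family = binomial)
pred  <- predict(r.glm, type = "response")
obs   <- as.numeric(Untable(Titanic)$Survived == "Yes")
mean((pred - obs)^2)     # Brier score computed by hand
BrierScore(r.glm)        # should agree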
a numeric value
Andri Signorell <[email protected]>
Brier, G. W. (1950) Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78, 1-3.
r.glm <- glm(Survived ~ ., data=Untable(Titanic), family=binomial) BrierScore(r.glm)
Calculate bootstrap intervals for the Brier score, based on a glm
.
BrierScoreCI( object, conf.level = 0.95, sides = c("two.sided", "left", "right"), ... )
object |
the model object as returned by glm. |
conf.level |
confidence level of the interval. |
sides |
a character string specifying the side of the confidence
interval, must be one of |
... |
further arguments are passed to the |
a numeric vector with 3 elements:
mean |
mean |
lwr.ci |
lower bound of the confidence interval |
upr.ci |
upper bound of the confidence interval |
Andri Signorell [email protected]
utils::data(Pima.te, package = "MASS") r.logit <- glm(type ~ ., data=Pima.te, family="binomial") # calculate Brier score with confidence intervals BrierScore(r.logit) BrierScoreCI(r.logit, R=99) # use higher R in real life!
Add a legend for bubbles to a bubble plot.
BubbleLegend(x, y = NULL, area, cols, labels = NULL, cols.lbl = "black", width = NULL, xjust = 0, yjust = 1, inset = 0, border = "black", frame = TRUE, adj = c(0.5, 0.5), cex = 1, cex.names = 1, bg = NULL, ...)
x |
the left x-coordinate to be used to position the legend. See 'Details'. |
y |
the top y-coordinate to be used to position the legend. See 'Details'. |
area |
the area(s) for the bubbles in bubble legend. |
cols |
the color appearing in the legend. |
labels |
a vector of labels to be placed at the right side of the legend. |
cols.lbl |
the textcolor for the labels of the bubbles. |
width |
the width of the legend. |
xjust |
how the legend is to be justified relative to the legend x location. A value of 0 means left justified, 0.5 means centered and 1 means right justified. |
yjust |
the same as |
inset |
inset distance(s) from the margins as a fraction of the plot region when legend is placed by keyword. |
border |
defines the border color of each rectangle. Default is none ( |
frame |
defines the border color of the frame around the whole legend. Default is none ( |
adj |
text alignment, horizontal and vertical. |
cex |
extension factor for the area, default 1.0. |
cex.names |
character extension for the labels, default 1.0. |
bg |
the background color for the bubble legend. |
... |
further arguments are passed to the function |
The labels are placed in the middle of the legend.
The location of the legend may be specified by setting x to a single keyword from the list "bottomright"
, "bottom"
, "bottomleft"
, "left"
, "topleft"
, "top"
, "topright"
, "right"
and "center"
. This places the legend on the inside of the plot frame at the given location. Partial argument matching is used. The optional inset argument specifies how far the legend is inset from the plot margins. If a single value is given, it is used for both margins; if two values are given, the first is used for x- distance, the second for y-distance. This is the same behaviour as it's implemented in legend
.
Andri Signorell <[email protected]>
PlotBubble(x=d.pizza$delivery_min, y=d.pizza$temperature, area=d.pizza$price, xlab="delivery time", ylab="temperature", col=SetAlpha(as.numeric(d.pizza$area)+2, .5), border="darkgrey", na.rm=TRUE, main="Price-Bubbles", panel.first=grid()) BubbleLegend("bottomleft", area=c(1500, 1000, 500), frame=TRUE, cols=SetAlpha("steelblue",0.5), bg="green", labels=c(1500, 1000, 500), cex=0.8, cols.lbl=c("yellow", "red","blue"))
This is just a wrapper for creating an empty plot with suitable defaults for plotting geometric shapes.
Canvas(xlim = NULL, ylim = xlim, main = NULL, xpd = par("xpd"), mar=c(5.1,5.1,5.1,5.1), asp = 1, bg = par("bg"), usrbg = "white", ...)
xlim , ylim
|
the xlims and ylims for the plot. Default is c(-1, 1). |
xpd |
expand drawing area, defaults to |
main |
the main title on top of the plot. |
mar |
set margins. Defaults to c(5.1,5.1,5.1,5.1). |
asp |
numeric, giving the aspect ratio y/x. (See |
bg |
the background color of the plot, defaults to par("bg"), which usually will be "white". |
usrbg |
the color of the user space of the plot, defaults to "white". |
... |
additional arguments are passed to the |
The plot is created with these settings: asp = 1, xaxt = "n", yaxt = "n", xlab = "", ylab = "", frame.plot = FALSE.
a list of all the previous values of the parameters changed (returned invisibly)
Andri Signorell <[email protected]>
Canvas(7) text(0, 0, "Hello world!", cex=5)
Transform Cartesian coordinates into polar coordinates, respectively into spherical coordinates, and vice versa.
CartToPol(x, y) PolToCart(r, theta) CartToSph(x, y, z, up = TRUE) SphToCart(r, theta, phi, up = TRUE)
x , y , z
|
vectors with the x-, y- (and z-) coordinates to be transformed. |
r |
a vector with the radius of the points. |
theta |
a vector with the angle(s) of the points. |
phi |
a vector with the angle(s) of the points. |
up |
logical. If set to |
Angles are in radians, not degrees (i.e., a right angle is pi/2). Use DegToRad to convert, if you don't want to do it yourself.
All parameters are recycled if necessary.
PolToCart returns a list of x and y coordinates of the points.
CartToPol returns a list of r for the radius and theta for the angles of the given points.
Andri Signorell <[email protected]>, Christian W. Hoffmann <[email protected]>
CartToPol(x=1, y=1) CartToPol(x=c(1,2,3), y=c(1,1,1)) CartToPol(x=c(1,2,3), y=1) PolToCart(r=1, theta=pi/2) PolToCart(r=c(1,2,3), theta=pi/2) CartToSph(x=1, y=2, z=3) # r=3.741657, theta=0.930274, phi=1.107149
CatTable helps print a table that has to be broken into multiple rows. Row labels will be repeated after every break.
CatTable(tab, wcol, nrepchars, width = getOption("width"))
tab |
the rows of a table to be printed, pasted together in one string with constant columnwidth. |
wcol |
integer, the width of the columns. All columns must have the same width. |
nrepchars |
integer, the number of characters to be repeated with every break. This is typically the maximum width of the rowlabels. |
width |
integer, the width of the whole table. Default is the width of the current command window
( |
Andri Signorell <[email protected]>
options(scipen=8) # used in bivariate description functions Desc(temperature ~ cut(delivery_min, breaks=40), data=d.pizza) txt <- c( paste(sample(letters, 500, replace=TRUE), collapse="") , paste(sample(letters, 500, replace=TRUE), collapse="") , paste(sample(letters, 500, replace=TRUE), collapse="") ) txt <- paste(c("aaa","bbb","ccc"), txt, sep="") CatTable(txt, nrepchars=3, wcol=5)
Calculates Lin's concordance correlation coefficient for agreement on a continuous measure.
CCC(x, y, ci = "z-transform", conf.level = 0.95, na.rm = FALSE)
x |
a vector, representing the first set of measurements. |
y |
a vector, representing the second set of measurements. |
ci |
a character string, indicating the method to be used. Options are |
conf.level |
magnitude of the returned confidence interval. Must be a single number between 0 and 1. |
na.rm |
logical, indicating whether |
Computes Lin's (1989, 2000) concordance correlation coefficient for agreement on a continuous measure obtained by two methods. The concordance correlation coefficient combines measures of both precision and accuracy to determine how far the observed data deviate from the line of perfect concordance (that is, the line at 45 degrees on a square scatter plot). Lin's coefficient increases in value as a function of the nearness of the data's reduced major axis to the line of perfect concordance (the accuracy of the data) and of the tightness of the data about its reduced major axis (the precision of the data).
Both x
and y
values need to be present for a measurement pair to be included in the analysis. If either or both values are missing (i.e. coded NA
) then the measurement pair is deleted before analysis.
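In Lin's (1989) formulation, using the n-divisor versions of the sample variances and covariance, the point estimate is rho_c = 2*s_xy / (s_x^2 + s_y^2 + (mean(x) - mean(y))^2). The following sketch (added here for illustration) computes this directly and compares it with the output of CCC():

set.seed(1234)
m1 <- rnorm(100)
m2 <- m1 + runif(100, 0, 1)
n  <- length(m1)
sxy <- cov(m1, m2) * (n - 1) / n          # covariance with divisor n
sx2 <- var(m1) * (n - 1) / n
sy2 <- var(m2) * (n - 1) / n
2 * sxy / (sx2 + sy2 + (mean(m1) - mean(m2))^2)   # rho_c by hand
CCC(m1, m2)$rho.c                                 # estimate with confidence limits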
A list containing the following:
rho.c |
the concordance correlation coefficient. |
s.shift |
the scale shift. |
l.shift |
the location shift. |
C.b |
a bias correction factor that measures how far the best-fit line deviates from a line at 45 degrees. No deviation from the 45 degree line occurs when C.b = 1. See Lin (1989, page 258). |
blalt |
a data frame with two columns: |
Mark Stevenson <[email protected]>
Bland J, Altman D (1986). Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet 327: 307 - 310.
Bradley E, Blackwood L (1989). Comparing paired data: a simultaneous test for means and variances. American Statistician 43: 234 - 235.
Dunn G (2004). Statistical Evaluation of Measurement Errors: Design and Analysis of Reliability Studies. London: Arnold.
Hsu C (1940). On samples from a normal bivariate population. Annals of Mathematical Statistics 11: 410 - 426.
Krippendorff K (1970). Bivariate agreement coefficients for reliability of data. In: Borgatta E, Bohrnstedt G (eds) Sociological Methodology. San Francisco: Jossey-Bass, pp. 139 - 150.
Lin L (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics 45: 255 - 268.
Lin L (2000). A note on the concordance correlation coefficient. Biometrics 56: 324 - 325.
Pitman E (1939). A note on normal correlation. Biometrika 31: 9 - 12.
Reynolds M, Gregoire T (1991). Comment on Bradley and Blackwood. American Statistician 45: 163 - 164.
Snedecor G, Cochran W (1989). Statistical Methods. Ames: Iowa State University Press.
## Concordance correlation plot: set.seed(seed = 1234) method1 <- rnorm(n = 100, mean = 0, sd = 1) method2 <- method1 + runif(n = 100, min = 0, max = 1) ## Introduce some missing values: method1[50] <- NA method2[75] <- NA tmp.ccc <- CCC(method1, method2, ci = "z-transform", conf.level = 0.95) lab <- paste("CCC: ", round(tmp.ccc$rho.c[,1], digits = 2), " (95% CI ", round(tmp.ccc$rho.c[,2], digits = 2), " - ", round(tmp.ccc$rho.c[,3], digits = 2), ")", sep = "") z <- lm(method2 ~ method1) par(pty = "s") plot(method1, method2, xlim = c(0, 5), ylim = c(0,5), xlab = "Method 1", ylab = "Method 2", pch = 16) abline(a = 0, b = 1, lty = 2) abline(z, lty = 1) legend(x = "topleft", legend = c("Line of perfect concordance", "Reduced major axis"), lty = c(2,1), lwd = c(1,1), bty = "n") text(x = 1.55, y = 3.8, labels = lab) ## Bland and Altman plot (Figure 2 from Bland and Altman 1986): x <- c(494,395,516,434,476,557,413,442,650,433,417,656,267, 478,178,423,427) y <- c(512,430,520,428,500,600,364,380,658,445,432,626,260, 477,259,350,451) tmp.ccc <- CCC(x, y, ci = "z-transform", conf.level = 0.95) tmp.mean <- mean(tmp.ccc$blalt$delta) tmp.sd <- sqrt(var(tmp.ccc$blalt$delta)) plot(tmp.ccc$blalt$mean, tmp.ccc$blalt$delta, pch = 16, xlab = "Average PEFR by two meters (L/min)", ylab = "Difference in PEFR (L/min)", xlim = c(0,800), ylim = c(-140,140)) abline(h = tmp.mean, lty = 1, col = "gray") abline(h = tmp.mean - (2 * tmp.sd), lty = 2, col = "gray") abline(h = tmp.mean + (2 * tmp.sd), lty = 2, col = "gray") legend(x = "topleft", legend = c("Mean difference", "Mean difference +/ 2SD"), lty = c(1,2), bty = "n") legend(x = 0, y = 125, legend = c("Difference"), pch = 16, bty = "n")
Transforms angles given in counterclockwise mode into clockwise angles.
Clockwise(x, start = 0)
x |
a vector of angles |
start |
the starting angle for the transformation. Defaults to 0. |
Sometimes angles need to be defined the other way round.
a data.frame with two columns, containing the start and end angles.
Andri Signorell <[email protected]>
Clockwise( c(0, pi/4, pi/2, pi))
Find the closest value(s) to a number a in a vector x. Multiple values will be reported if the differences are tied or if there are duplicates of the closest value.
Closest(x, a, which = FALSE, na.rm = FALSE)
x |
the vector to be searched in |
a |
the reference value |
which |
a logical value defining if the index position or the value should be returned. By default the value is returned. |
na.rm |
a logical value indicating whether |
the value or index in x which is closest to a
Andri Signorell <[email protected]>
# basic set.seed(8) x <- runif(10) * 10 Closest(x, 3.1) sort(x) y <- sample(10, size=10, replace=TRUE) # multiple observations of the same closest value Closest(y, a=6) # get the relevant positions Closest(y, a=6, which=TRUE) # two different values having the same distance Closest(c(2, 3, 4, 5), a=3.5) # vectorize "a" Closest(c(2, 3, 4, 5), a=c(3.1, 3.9)) # vectorize "which" Closest(c(2, 3, 4, 5), a=3.1, which=c(FALSE, TRUE)) # vectorize both Closest(c(2, 3, 4, 5), a=c(3.1, 3.9), which=c(FALSE, TRUE))
Return the first element of a vector that is not NA.
Coalesce(..., method = c("is.na", "is.null", "is.finite"), flatten = TRUE)
... |
the elements to be evaluated. This can either be a single vector, several vectors of same length, a matrix, a data.frame or a list of vectors (of same length). See examples. |
method |
one out of |
flatten |
logical, defines whether lists are going to be flattened (default |
If several vectors are supplied, the evaluation will be elementwise, resp. rowwise if x is a data.frame or a matrix. The first element of the result
is the first non NA
element of the first elements of all the arguments, the second element of
the result is the one of the second elements of all the arguments and so on.
Shorter inputs (of non-zero length) are NOT recycled. The function will bark if multiple vectors do not all have the same dimension.
The idea is borrowed from SQL. Might sometimes be useful when preparing data in R instead of in SQL.
return a single vector of the first non NA
element(s) of the given data structure.
Andri Signorell <[email protected]>
Coalesce(c(NA, NA, NA, 5, 3)) Coalesce(c(NA, NULL, "a")) Coalesce(NULL, 5, 3) d.frm <- data.frame(matrix(c( 1, 2, NA, 4, NA, NA, 3, 1, NaN, 2, 3, 1, NA, Inf, 1, 1), nrow=4, byrow=TRUE) ) Coalesce(d.frm) Coalesce(as.matrix(d.frm)) Coalesce(d.frm$X1, d.frm$X2, d.frm$X3, d.frm$X4) Coalesce(d.frm$X1, d.frm$X2, d.frm$X3, d.frm$X4, method="is.finite") Coalesce(list(d.frm[,1], d.frm[,2])) # returns the first finite element Coalesce(d.frm, method="is.finite") # with characters (take care, factors won't work!) # is.finite does not make sense here... d.frm <- data.frame(matrix(c( "a", "b", NA, "4", NA, NA, "g", "m", NA_character_,"hfdg", "rr", "m", NA, Inf, 1, 1), nrow=4, byrow=TRUE) , stringsAsFactors = FALSE) Coalesce(d.frm$X1, d.frm$X2, d.frm$X3, d.frm$X4) Coalesce(d.frm) Coalesce(as.list(d.frm))
Perform a Cochran Armitage test for trend in binomial proportions across the levels of a single variable. This test is appropriate only when one variable has two levels and the other variable is ordinal. The two-level variable represents the response, and the other represents an explanatory variable with ordered levels. The null hypothesis is the hypothesis of no trend, which means that the binomial proportion is the same for all levels of the explanatory variable.
CochranArmitageTest(x, alternative = c("two.sided", "one.sided"))
x |
a frequency table or a matrix. |
alternative |
a character string specifying the alternative hypothesis, must be one of |
A list of class htest
, containing the following components:
statistic |
the z-statistic of the test. |
parameter |
the dimension of the table. |
p.value |
the p-value for the test. |
alternative |
a character string describing the alternative hypothesis. |
method |
the character string “Cochran-Armitage test for trend”. |
data.name |
a character string giving the names of the data. |
Andri Signorell <[email protected]> strongly based on code from
Eric Lecoutre <[email protected]>
https://stat.ethz.ch/pipermail/r-help/2005-July/076371.html
Agresti, A. (2002) Categorical Data Analysis. John Wiley & Sons
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/procstat/procstat_freq_details76.htm
# http://www.lexjansen.com/pharmasug/2007/sp/sp05.pdf, pp. 4 dose <- matrix(c(10,9,10,7, 0,1,0,3), byrow=TRUE, nrow=2, dimnames=list(resp=0:1, dose=0:3)) Desc(dose) CochranArmitageTest(dose) CochranArmitageTest(dose, alternative="one.sided") # not exactly the same as in package coin: # independence_test(tumor ~ dose, data = lungtumor, teststat = "quad") lungtumor <- data.frame(dose = rep(c(0, 1, 2), c(40, 50, 48)), tumor = c(rep(c(0, 1), c(38, 2)), rep(c(0, 1), c(43, 7)), rep(c(0, 1), c(33, 15)))) tab <- table(lungtumor$dose, lungtumor$tumor) CochranArmitageTest(tab) # but similar to prop.trend.test(tab[,1], apply(tab,1, sum))
Perform the Cochran's Q test for unreplicated randomized block design experiments with a binary response variable and paired data.
CochranQTest(y, ...) ## Default S3 method: CochranQTest(y, groups, blocks, ...) ## S3 method for class 'formula' CochranQTest(formula, data, subset, na.action, ...)
y |
either a numeric vector of data values, or a data matrix. |
groups |
a vector giving the group for the corresponding elements of y if this is a vector; ignored if y is a matrix. If not a factor object, it is coerced to one. |
blocks |
a vector giving the block for the corresponding elements of y if this is a vector; ignored if y is a matrix. If not a factor object, it is coerced to one. |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain |
... |
further arguments to be passed to or from methods. |
CochranQTest()
can be used for analyzing unreplicated complete block designs (i.e., there is exactly one
binary observation in y for each combination of levels of groups and blocks) where the normality assumption may be violated.
The null hypothesis is that apart from an effect of blocks, the location parameter of y is the same in each of the groups.
If y is a matrix, groups and blocks are obtained from the column and row indices, respectively.
NA
's are not allowed in groups or blocks; if y contains NA
's, corresponding blocks are removed.
Note that Cochran's Q Test is analogue to the Friedman test with 0, 1 coded response. This is used here for a simple implementation.
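As a quick illustration of this equivalence, the two calls below should yield the same test statistic (a minimal sketch with simulated 0/1 data, not taken from the package documentation):

set.seed(1)
m <- matrix(rbinom(30, size=1, prob=0.5), nrow=10,
            dimnames=list(block=1:10, treatment=c("A","B","C")))
CochranQTest(m)     # Cochran's Q on the 0/1 response matrix
friedman.test(m)    # Friedman test on the same matrix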
A list with class htest
containing the following components:
statistic |
the value of Cochran's chi-squared statistic. |
parameter |
the degrees of freedom of the approximate chi-squared distribution of the test statistic. |
p.value |
the p-value of the test. |
method |
the character string "Cochran's Q-Test". |
data.name |
a character string giving the names of the data. |
Andri Signorell <[email protected]>
Cochran, W.G. (1950) The Comparison of Percentages in Matched Samples. Biometrika. 37 (3/4): 256-266. doi:10.1093/biomet/37.3-4.256. JSTOR 2332378.
# example in:
# http://support.sas.com/documentation/cdl/en/statugfreq/63124/PDF/default/statugfreq.pdf
# pp. S. 1824

# use expand.grid, xtabs and Untable to create the dataset
d.frm <- Untable(xtabs(c(6,2,2,6,16,4,4,6) ~ .,
                       expand.grid(rep(list(c("F","U")), times=3))),
                 colnames = LETTERS[1:3])

# rearrange to long shape
d.long <- reshape(d.frm, varying=1:3, times=names(d.frm)[c(1:3)],
                  v.names="resp", direction="long")

# after having done the hard work of data organisation,
# performing the test is a piece of cake....
CochranQTest(resp ~ time | id, data=d.long)

# and let's perform a post hoc analysis using mcnemar's test
z <- split(d.long, f=d.long$time)
pairwise.table(function(i, j) {
    mcnemar.test(z[[i]]$resp, z[[j]]$resp, correct=FALSE)$p.value
  },
  level.names = names(z),
  p.adjust.method = "fdr")
Calculates the coefficient of variation and its confidence limits using various methods.
CoefVar(x, ...)

## S3 method for class 'lm'
CoefVar(x, unbiased = FALSE, na.rm = FALSE, ...)

## S3 method for class 'aov'
CoefVar(x, unbiased = FALSE, na.rm = FALSE, ...)

## Default S3 method:
CoefVar(x, weights = NULL, unbiased = FALSE, na.rm = FALSE, ...)

CoefVarCI(K, n, conf.level = 0.95,
          sides = c("two.sided", "left", "right"),
          method = c("nct", "vangel", "mckay", "verrill", "naive"))
x |
a (non-empty) numeric vector of data values. |
weights |
a numerical vector of weights the same length as x. |
unbiased |
logical value determining, if a bias correction should be used (see. details). Default is FALSE. |
K |
the coefficient of variation as calculated by CoefVar(). |
n |
the number of observations used for calculating the coefficient of variation. |
conf.level |
confidence level of the interval. Defaults to 0.95. |
sides |
a character string specifying the side of the confidence interval, must be one of "two.sided" (default), "left" or "right". |
method |
character string specifying the method to use for calculating the confidence intervals, can be one out of: "nct", "vangel", "mckay", "verrill", "naive". |
na.rm |
logical. Should missing values be removed? Defaults to FALSE. |
... |
further arguments (not used here). |
In order for the coefficient of variation to be an unbiased estimate of the true population value, the coefficient of variation is corrected for small-sample bias when unbiased = TRUE (the exact correction formula is given in the references).
For determining
the confidence intervals
for the coefficient of variation a number of methods have been proposed. CoefVarCI()
currently supports five different methods.
The details for the methods are given in the specific references.
The "naive" method
is based on dividing the standard confidence limit for the standard deviation by the sample mean.
McKay's
approximation is asymptotically exact as n goes to infinity. McKay recommends this approximation only if the coefficient of variation is less than 0.33. Note that if the coefficient of variation is greater than 0.33, either the normality of the data is suspect or the probability of negative values in the data is non-negligible. In this case, McKay's approximation may not be valid. Also, it is generally recommended that the sample size should be at least 10 before using McKay's approximation.
Vangel's modified McKay method
is more accurate than the McKay in most cases, particularly for small samples. According to Vangel, the unmodified McKay is only more accurate when both the coefficient of variation and alpha are large. However, if the coefficient of variation is large, then this implies either that the data contain negative values or that the data do not follow a normal distribution. In this case, neither the McKay nor the modified McKay should be used.
In general, Vangel's modified McKay method is recommended over the McKay method. It generally provides good approximations as long as the data are approximately normal and the coefficient of variation is less than 0.33. This is the default method.
See also: https://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/coefvacl.htm
nct
uses the noncentral t-distribution to calculate the confidence intervals. See Smithson (2003).
if no confidence intervals are requested:
the estimate as numeric value (without any name)
else a named numeric vector with 3 elements
est |
estimate |
lwr.ci |
lower confidence interval |
upr.ci |
upper confidence interval |
Andri Signorell <[email protected]>,
Michael Smithson <[email protected]> (noncentral-t)
McKay, A. T. (1932). Distribution of the coefficient of variation and the extended t distribution, Journal of the Royal Statistical Society, 95, 695–698.
Johnson, B. L., Welch, B. L. (1940). Applications of the non-central t-distribution. Biometrika, 31, 362–389.
Mark Vangel (1996) Confidence Intervals for a Normal Coefficient of Variation, American Statistician, Vol. 15, No. 1, pp. 21-26.
Kelley, K. (2007). Sample size planning for the coefficient of variation from the accuracy in parameter estimation approach. Behavior Research Methods, 39 (4), 755-766.
Kelley, K. (2007). Constructing confidence intervals for standardized effect sizes: Theory, application, and implementation. Journal of Statistical Software, 20 (8), 1-24
Smithson, M.J. (2003) Confidence Intervals, Quantitative Applications in the Social Sciences Series, No. 140. Thousand Oaks, CA: Sage. pp. 39-41
Steve Verrill (2003) Confidence Bounds for Normal and Lognormal Distribution Coefficients of Variation, Research Paper 609, USDA Forest Products Laboratory, Madison, Wisconsin.
Verrill, S. and Johnson, R.A. (2007) Confidence Bounds and Hypothesis Tests for Normal Distribution Coefficients of Variation, Communications in Statistics Theory and Methods, Volume 36, No. 12, pp 2187-2206.
Mean
, SD
, (both supporting weights)
set.seed(15)
x <- runif(100)
CoefVar(x, conf.level=0.95)
#       est    low.ci    upr.ci
# 0.5092566 0.4351644 0.6151409

# Coefficient of variation for a linear model
r.lm <- lm(Fertility ~ ., swiss)
CoefVar(r.lm)

# the function is vectorized, so arguments are recycled...
# https://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/coefvacl.htm
CoefVarCI(K = 0.00246, n = 195, method="vangel",
          sides="two.sided", conf.level = c(.5,.8,.9,.95,.99,.999))
Computes the Cohen's d and Hedges'g effect size statistics.
CohenD(x, y = NULL, pooled = TRUE, correct = FALSE, conf.level = NA, na.rm = FALSE)
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
pooled |
logical, indicating whether to compute the pooled standard deviation or the whole-sample standard deviation. Default is TRUE. |
correct |
logical, indicating whether to apply the Hedges correction. (Default: FALSE) |
conf.level |
confidence level of the interval. Set this to NA, if no confidence intervals should be calculated. (This is the default) |
na.rm |
logical. Should missing values be removed? Defaults to FALSE. |
a numeric vector with 3 elements:
d |
the effect size d |
lwr.ci |
lower bound of the confidence interval |
upr.ci |
upper bound of the confidence interval |
Andri Signorell <[email protected]>, William Revelle <[email protected]> (CI)
Cohen, J. (1988) Statistical power analysis for the behavioral sciences (2nd ed.) Academic Press, New York.
Hedges, L. V. & Olkin, I. (1985) Statistical methods for meta-analysis Academic Press, Orlando, FL
Smithson, M.J. (2003) Confidence Intervals, Quantitative Applications in the Social Sciences Series, No. 140. Thousand Oaks, CA: Sage. pp. 39-41
x <- d.pizza$price[d.pizza$driver=="Carter"]
y <- d.pizza$price[d.pizza$driver=="Miller"]

CohenD(x, y, conf.level=0.95, na.rm=TRUE)
Computes the agreement rates Cohen's kappa and weighted kappa and their confidence intervals.
CohenKappa(x, y = NULL, weights = c("Unweighted", "Equal-Spacing", "Fleiss-Cohen"), conf.level = NA, ...)
x |
can either be a numeric vector or a confusion matrix. In the latter case x must be a square matrix. |
y |
NULL (default) or a vector with compatible dimensions to x. |
weights |
either one out of "Unweighted" (default), "Equal-Spacing", "Fleiss-Cohen", or a user-defined square weight matrix with the same dimensions as the confusion matrix. |
conf.level |
confidence level of the interval. If set to NA (the default) no confidence intervals will be calculated. |
... |
further arguments are passed to the function table(), e.g. useNA. |
Cohen's kappa is the diagonal sum of the (possibly weighted) relative frequencies, corrected for expected values and standardized by its maximum value.
The equal-spacing weights (see Cicchetti and Allison 1971) are defined by 1 - |i - j| / (r - 1), r being the number of columns/rows, and the Fleiss-Cohen weights by 1 - (i - j)^2 / (r - 1)^2. The latter attaches greater importance to closer disagreements.
Data can be passed to the function either as matrix or data.frame in x
, or as two numeric vectors x
and y
. In the latter case table(x, y, ...)
is calculated. Thus NA
s are handled the same way as table
does. Note that tables are by default calculated without NAs. The specific argument useNA
can be passed via the ... argument.
The vector interface (x, y)
is only supported for the calculation of unweighted kappa. This is because we cannot ensure a safe construction of a confusion table for two factors with different levels, which is independent of the order of the levels in x
and y
. So weights might lead to inconsistent results. The function will raise an error in this case.
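For weighted kappa one can therefore build the confusion table explicitly, with the levels of both raters aligned, and pass the table (a minimal sketch with hypothetical rating vectors r1 and r2):

lev <- c("low", "medium", "high")                        # common, ordered set of categories
r1  <- factor(c("low","high","medium","low"),  levels=lev)
r2  <- factor(c("low","medium","medium","low"), levels=lev)
CohenKappa(table(r1, r2), weights="Equal-Spacing")       # square table, levels aligned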
if no confidence intervals are requested:
the estimate as numeric value
else a named numeric vector with 3 elements
kappa |
estimate |
lwr.ci |
lower confidence interval |
upr.ci |
upper confidence interval |
David Meyer <[email protected]>, some changes and tweaks Andri Signorell <[email protected]>
Cohen, J. (1960) A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Everitt, B.S. (1968), Moments of statistics kappa and weighted kappa. The British Journal of Mathematical and Statistical Psychology, 21, 97-103.
Fleiss, J.L., Cohen, J., and Everitt, B.S. (1969), Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 72, 332-327.
Cicchetti, D.V., Allison, T. (1971) A New Procedure for Assessing Reliability of Scoring EEG Sleep Recordings American Journal of EEG Technology, 11, 101-109.
CronbachAlpha
, KappaM
, KrippAlpha
# from Bortz et. al (1990) Verteilungsfreie Methoden in der Biostatistik, Springer, pp. 459
m <- matrix(c(53,  5,  2,
              11, 14,  5,
               1,  6,  3), nrow=3, byrow=TRUE,
            dimnames = list(rater1 = c("V","N","P"), rater2 = c("V","N","P")))

# confusion matrix interface
CohenKappa(m, weight="Unweighted")

# vector interface
x <- Untable(m)
CohenKappa(x$rater1, x$rater2, weight="Unweighted")

# pairwise Kappa
rating <- data.frame(
  rtr1 = c(4,2,2,5,2, 1,3,1,1,5, 1,1,2,1,2, 3,1,1,2,1, 5,2,2,1,1, 2,1,2,1,5),
  rtr2 = c(4,2,3,5,2, 1,3,1,1,5, 4,2,2,4,2, 3,1,1,2,3, 5,4,2,1,4, 2,1,2,3,5),
  rtr3 = c(4,2,3,5,2, 3,3,3,4,5, 4,4,2,4,4, 3,1,1,4,3, 5,4,4,4,4, 2,1,4,3,5),
  rtr4 = c(4,5,3,5,4, 3,3,3,4,5, 4,4,3,4,4, 3,4,1,4,5, 5,4,5,4,4, 2,1,4,3,5),
  rtr5 = c(4,5,3,5,4, 3,5,3,4,5, 4,4,3,4,4, 3,5,1,4,5, 5,4,5,4,4, 2,5,4,3,5),
  rtr6 = c(4,5,5,5,4, 3,5,4,4,5, 4,4,3,4,5, 5,5,2,4,5, 5,4,5,4,5, 4,5,4,3,5)
)
PairApply(rating, FUN=CohenKappa, symmetric=TRUE)

# Weighted Kappa
cats <- c("<10%", "11-20%", "21-30%", "31-40%", "41-50%", ">50%")
m <- matrix(c(5,8,1,2,4,2,  3,5,3,5,5,0,  1,2,6,11,2,1,
              0,1,5,4,3,3,  0,0,1,2,5,2,  0,0,1,2,1,4),
            nrow=6, byrow=TRUE, dimnames = list(rater1 = cats, rater2 = cats))

CohenKappa(m, weight="Equal-Spacing")

# supply an explicit weight matrix
ncol(m)
(wm <- outer(1:ncol(m), 1:ncol(m), function(x, y) { 1 - ((abs(x-y)) / (ncol(m)-1)) }))
CohenKappa(m, weight=wm, conf.level=0.95)

# however, Fleiss, Cohen and Everitt weight similarities
fleiss <- matrix(c(106, 10,  4,
                    22, 28, 10,
                     2, 12,  6), ncol=3, byrow=TRUE)

# Fleiss weights the similarities
weights <- matrix(c(1.0000, 0.0000, 0.4444,
                    0.0000, 1.0000, 0.6666,
                    0.4444, 0.6666, 1.0000), ncol=3)

CohenKappa(fleiss, weights)
Collapse (or re-label) variables in a
a contingency table or ftable
object by re-assigning levels of the table variables.
CollapseTable(x, ...)
x |
A table or ftable object. |
... |
A collection of one or more assignments of factors of the table to a list of levels |
Each of the ...
arguments must be of the form
variable = levels
, where variable
is the name of one of the table
dimensions, and levels
is a character or numeric vector of length equal
to the corresponding dimension of the table. Missing argument names are allowed and will be interpreted in the order of the dimensions of the table.
A table
object (even if the input was an ftable), representing the original table with
one or more of its factors collapsed or rearranged into other levels.
Michael Friendly <[email protected]>, Andri Signorell <[email protected]>
margin.table
"collapses" a table in a different way, by
summing over table dimensions.
# create some sample data in table form sex <- c("Male", "Female") age <- letters[1:6] education <- c("low", 'med', 'high') data <- expand.grid(sex=sex, age=age, education=education) counts <- rpois(36, 100) data <- cbind(data, counts) t1 <- xtabs(counts ~ sex + age + education, data=data) Desc(t1) ## age a b c d e f ## sex education ## Male low 119 101 109 85 99 93 ## med 94 98 103 108 84 84 ## high 81 88 96 110 100 92 ## Female low 107 104 95 86 103 96 ## med 104 98 94 95 110 106 ## high 93 85 90 109 99 86 # collapse age to 3 levels t2 <- CollapseTable(t1, age=c("A", "A", "B", "B", "C", "C")) Desc(t2) ## age A B C ## sex education ## Male low 220 194 192 ## med 192 211 168 ## high 169 206 192 ## Female low 211 181 199 ## med 202 189 216 ## high 178 199 185 # collapse age to 3 levels and pool education: "low" and "med" to "low" t3 <- CollapseTable(t1, age=c("A", "A", "B", "B", "C", "C"), education=c("low", "low", "high")) Desc(t3) ## age A B C ## sex education ## Male low 412 405 360 ## high 169 206 192 ## Female low 413 370 415 ## high 178 199 185 # change labels for levels of education to 1:3 t4 <- CollapseTable(t1, education=1:3) Desc(t4) ## age a b c d e f ## sex education ## Male 1 119 101 109 85 99 93 ## 2 94 98 103 108 84 84 ## 3 81 88 96 110 100 92 ## Female 1 107 104 95 86 103 96 ## 2 104 98 94 95 110 106 ## 3 93 85 90 109 99 86
Add a color legend, an image of a sequence of colors, to a plot.
ColorLegend(x, y = NULL, cols = rev(heat.colors(100)), labels = NULL,
            width = NULL, height = NULL, horiz = FALSE,
            xjust = 0, yjust = 1, inset = 0, border = NA, frame = NA,
            cntrlbl = FALSE, adj = ifelse(horiz, c(0.5, 1), c(1, 0.5)),
            cex = 1, title = NULL, title.adj = 0.5, ...)
x |
the left x-coordinate to be used to position the colorlegend. See 'Details'. |
y |
the top y-coordinate to be used to position the colorlegend. See 'Details'. |
cols |
the colors appearing in the colorlegend. |
labels |
a vector of labels to be placed at the right side of the colorlegend. |
width |
the width of the colorlegend. |
height |
the height of the colorlegend. |
horiz |
logical indicating if the colorlegend should be horizontal; default is FALSE. |
xjust |
how the colorlegend is to be justified relative to the colorlegend x location. A value of 0 means left justified, 0.5 means centered and 1 means right justified. |
yjust |
the same as xjust, but for the y location of the colorlegend. |
inset |
inset distance(s) from the margins as a fraction of the plot region when colorlegend is placed by keyword. |
border |
defines the border color of each rectangle. Default is none (NA). |
frame |
defines the border color of the frame around the whole colorlegend. Default is none (NA). |
cntrlbl |
defines whether the labels should be printed in the middle of the color blocks or start at the edges of the colorlegend. Default is FALSE. |
adj |
text alignment, horizontal and vertical. |
cex |
character expansion factor (cex) for the labels, default 1.0. |
title |
a character string or length-one expression giving a title to be placed at the top of the legend. |
title.adj |
horizontal adjustment for the title: see the help for par("adj"). |
... |
further arguments are passed to the function |
The labels are placed at the right side of the colorlegend and are distributed uniformly between y and y - height.
The location may also be specified by setting x to a single keyword from the list "bottomright"
, "bottom"
, "bottomleft"
, "left"
, "topleft"
, "top"
, "topright"
, "right"
and "center"
. This places the colorlegend on the inside of the plot frame at the given location. Partial argument matching is used. The optional inset argument specifies how far the colorlegend is inset from the plot margins. If a single value is given, it is used for both margins; if two values are given, the first is used for x-distance, the second for y-distance.
Andri Signorell <[email protected]>
legend
, FindColor
, BubbleLegend
plot(1:15,, xlim=c(0,10), type="n", xlab="", ylab="", main="Colorstrips") # A ColorLegend(x="right", inset=0.1, labels=c(1:10)) # B: Center the labels ColorLegend(x=1, y=9, height=6, col=colorRampPalette(c("blue", "white", "red"), space = "rgb")(5), labels=1:5, cntrlbl = TRUE) # C: Outer frame ColorLegend(x=3, y=9, height=6, col=colorRampPalette(c("blue", "white", "red"), space = "rgb")(5), labels=1:4, frame="grey") # D ColorLegend(x=5, y=9, height=6, col=colorRampPalette(c("blue", "white", "red"), space = "rgb")(10), labels=sprintf("%.1f",seq(0,1,0.1)), cex=0.8) # E: horizontal shape ColorLegend(x=1, y=2, width=6, height=0.2, col=rainbow(500), labels=1:5,horiz=TRUE) # F ColorLegend(x=1, y=14, width=6, height=0.5, col=colorRampPalette( c("black","blue","green","yellow","red"), space = "rgb")(100), horiz=TRUE) # G ColorLegend(x=1, y=12, width=6, height=1, col=colorRampPalette(c("black","blue", "green","yellow","red"), space = "rgb")(10), horiz=TRUE, border="black", title="From black to red", title.adj=0) text(x = c(8,0.5,2.5,4.5,0.5,0.5,0.5)+.2, y=c(14,9,9,9,2,14,12), LETTERS[1:7], cex=2)
Convert colors to grey/grayscale so that you can see how your plot will look after photocopying or printing to a non-color printer.
ColToGrey(col)
ColToGray(col)
col |
vector of any of the three kind of R colors, i.e., either a color name (an element of colors()), a hexadecimal string of the form "#rrggbb" or "#rrggbbaa" (see rgb), or an integer i meaning palette()[i]. Non-string values are coerced to integer. |
Converts colors to greyscale using the formula grey = 0.3*red + 0.59*green + 0.11*blue. This allows you to see how your color plot will approximately look when printed on a non-color printer or photocopied.
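The stated formula can be reproduced directly with base grDevices functions (a minimal sketch, shown only to illustrate the conversion, not the package's internal code):

rgbval <- col2rgb(c("red", "green", "blue"))
g <- 0.3*rgbval["red", ] + 0.59*rgbval["green", ] + 0.11*rgbval["blue", ]
rgb(g, g, g, maxColorValue=255)    # grey tones corresponding to the input colors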
A vector of colors (greys) corresponding to the input colors.
This function was previously published as Col2Grey()
in the TeachingDemos package and has been integrated here without logical changes.
Greg Snow <[email protected]>
grey
, ColToRgb
, dichromat package
par(mfcol=c(2,2))

tmp <- 1:3
names(tmp) <- c('red','green','blue')

barplot(tmp, col=c('red','green','blue'))
barplot(tmp, col=ColToGrey(c('red','green','blue')))

barplot(tmp, col=c('red','#008100','#3636ff'))
barplot(tmp, col=ColToGrey(c('red','#008100','#3636ff')))
Convert a color given by name, by its palette index or by rgb-values into a string of the form "#rrggbb" or "#rrggbbaa".
ColToHex(col, alpha = 1)
col |
vector of any of either a color name (an element of colors()), or an integer i meaning palette()[i]. Non-string values are coerced to integer. |
alpha |
the alpha value to be used. This can be any value from 0 (fully transparent) to 1 (opaque). Default is 1. |
Returns the color value in "#rrggbb" or "#rrggbbaa" format (character).
Andri Signorell <[email protected]>
ColToHex(c("lightblue", "salmon")) x <- ColToRgb("darkmagenta") x[2,] <- x[2,] + 155 RgbToCol(x) RgbToHex(c(255,0,0))
ColToHex(c("lightblue", "salmon")) x <- ColToRgb("darkmagenta") x[2,] <- x[2,] + 155 RgbToCol(x) RgbToHex(c(255,0,0))
ColToHsv transforms colors from R color into HSV space (hue/saturation/value).
ColToHsv(col, alpha = FALSE)
col |
vector of any of the three kinds of R colors, i.e., either a color name (an element of colors()), a hexadecimal string of the form "#rrggbb" or "#rrggbbaa", or an integer i meaning palette()[i]. |
alpha |
logical value indicating whether alpha channel (opacity) values should be returned. |
Converts a color first into RGB and from there into HSV space by means of the functions rgb2hsv
and col2rgb
.
Value (brightness) gives the amount of light in the color. Hue describes the dominant wavelength. Saturation is the amount of Hue mixed into the color.
An HSV colorspace is relative to an RGB colorspace, which in R is sRGB, which has an implicit gamma correction.
A matrix with a column for each color. The three rows of the matrix indicate hue, saturation and value and are named "h", "s", and "v" accordingly.
Andri Signorell <[email protected]>
ColToHsv("peachpuff") ColToHsv(c(blu = "royalblue", reddish = "tomato")) # names kept ColToHsv(1:8)
ColToHsv("peachpuff") ColToHsv(c(blu = "royalblue", reddish = "tomato")) # names kept ColToHsv(1:8)
Determine the equivalent opaque RGB color for a given partially transparent RGB color against a background of any color.
ColToOpaque(col, alpha = NULL, bg = NULL)
col |
the color as hex value (use converters below if it's not available). |
alpha |
the alpha channel; if left as NULL, the alpha channels of the colors are used. |
bg |
the background color to be used to calculate against (default is "white") |
Reducing the opacity against a white background is a good way to find usable lighter and less saturated tints of a base color. For doing so, we sometimes need to get the equivalent opaque color for the transparent color.
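The underlying idea corresponds to simple linear alpha blending of each RGB channel with the background (a rough sketch under this assumption, not necessarily the exact implementation):

col <- col2rgb("limegreen"); bg <- col2rgb("white"); alpha <- 0.4
blended <- alpha * col + (1 - alpha) * bg
rgb(blended[1], blended[2], blended[3], maxColorValue=255)   # equivalent opaque color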
A named vector with the hex codes of the opaque colors.
Andri Signorell <[email protected]>
cols <- c(SetAlpha("limegreen", 0.4),
          ColToOpaque(ColToHex("limegreen"), 0.4),
          "limegreen")

barplot(c(1, 1.2, 1.3), col=cols,
        panel.first=abline(h=0.4, lwd=10, col="grey35"))
R color to RGB (red/green/blue) conversion.
ColToRgb(col, alpha = FALSE)
col |
vector of any of the three kinds of R colors, i.e., either a color name (an element of colors()), a hexadecimal string of the form "#rrggbb" or "#rrggbbaa", or an integer i meaning palette()[i]. |
alpha |
logical value indicating whether alpha channel (opacity) values should be returned. |
This is merely a wrapper to col2rgb
, defined in order to follow this package's naming conventions.
A matrix with a column for each color. The three rows of the matrix indicate red, green and blue value and are named "red", "green", and "blue" accordingly. The matrix might have a 4th row if an alpha channel is requested.
Andri Signorell <[email protected]>
ColToRgb("peachpuff") ColToRgb(c(blu = "royalblue", reddish = "tomato")) # names kept ColToRgb(1:8)
ColToRgb("peachpuff") ColToRgb(c(blu = "royalblue", reddish = "tomato")) # names kept ColToRgb(1:8)
Wraps text in a character matrix so that it is displayed over more than one line.
ColumnWrap(x, width = NULL)
x |
the matrix with one row |
width |
integer, the width of the columns in characters |
A data.frame containing character columns with long texts is often wrapped
by columns. This can lead to a loss of overview. ColumnWrap()
wraps the
lines within the columns.
a character matrix
Andri Signorell [email protected]
Abstract(d.pizza)
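A direct call might look like the following (a hypothetical sketch; the exact wrapping layout may differ):

m <- cbind(txt = "A rather long text that would otherwise be displayed on one single wide line",
           n   = "17")
ColumnWrap(m, width = 20)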
Returns all combinations of 2 out of the elements in x or x and y (if defined). Combinations of the same elements will be dropped (no replacing).
CombPairs(x, y = NULL)
x |
a vector of elements |
y |
a vector of elements, need not have the same dimension as x. If y is not NULL, all combinations of x and y are calculated. |
If y = NULL
then all combination of 2 out of x are returned.
If y is defined then all combinations of x and y are calculated.
CombPairs
returns a data.frame with 2 columns X1 and X2.
Andri Signorell <[email protected]>
combn
, expand.grid
, outer
, lower.tri
CombPairs(letters[1:4])
CombPairs(x = letters[1:4], y = LETTERS[1:2])

# get all pairs of combinations between factors and numerics out of a data.frame
CombPairs(which(sapply(d.pizza, IsNumeric)), which(sapply(d.pizza, is.factor)))
Returns either the column names or a logical vector indicating which columns are complete, i.e., have no missing values.
CompleteColumns(x, which = TRUE)
x |
a data.frame containing the data |
which |
logical, determining if the names of the complete variables should be returned or if a logical vector indicating which columns are complete should be returned. |
Depending on which, either a character vector with the names of the columns that contain no missing values, or a logical vector indicating for each column whether it is complete.
Andri Signorell <[email protected]>
is.na
, na.omit
, complete.cases
CompleteColumns(d.pizza)
CompleteColumns(d.pizza, which=FALSE)
This function counts concordant and discordant pairs for two variables x, y with at least ordinal scale, aggregated in a 2way table. This is the base for many association measures like Goodman Kruskal's gamma, but also all tau measures.
ConDisPairs(x)
x |
a 2-dimensional table. The column and the row order must be the logical one. |
The code is so far implemented in R (O(n^2)) and therefore slow for large sample sizes (>5000).
An O(n log(n)) implementation is available as (so far) undocumented function DescTools:::.DoCount(x, y, wts)
returning only concordant and discordant pairs (not including standard errors to be used for calculating confidence intervals).
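The counting logic can be sketched with a naive O(n^2) double loop over the table cells (for illustration only; this is not the package's implementation):

CountCD <- function(tab) {
  nr <- nrow(tab); nc <- ncol(tab)
  C <- D <- 0
  for (i in 1:nr) for (j in 1:nc) {
    # cells strictly below and to the right are concordant with cell (i, j)
    C <- C + tab[i, j] * sum(tab[seq_len(nr) > i, seq_len(nc) > j])
    # cells strictly below and to the left are discordant with cell (i, j)
    D <- D + tab[i, j] * sum(tab[seq_len(nr) > i, seq_len(nc) < j])
  }
  list(C = C, D = D)
}
tab <- as.table(rbind(c(26,26,23,18,9), c(6,7,9,14,23)))
CountCD(tab)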
a list with the number of concordant pairs, the number of discordant pairs and the matrix
Andri Signorell <[email protected]>
Agresti, A. (2002) Categorical Data Analysis. John Wiley & Sons, pp. 57-59.
Goodman, L. A., & Kruskal, W. H. (1954) Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732-764.
Goodman, L. A., & Kruskal, W. H. (1963) Measures of association for cross classifications III: Approximate sampling theory. Journal of the American Statistical Association, 58, 310-364.
Association measures: KendallTauA
(tau-a), cor
(method="kendall") for tau-b, StuartTauC
(tau-c), SomersDelta
Lambda
, GoodmanKruskalTau
(tau), UncertCoef
, MutInf
tab <- as.table(rbind(c(26,26,23,18,9), c(6,7,9,14,23)))
ConDisPairs(tab)
Calculates a cross-tabulation of observed and predicted classes with associated statistics.
Conf(x, ...)

## S3 method for class 'table'
Conf(x, pos = NULL, ...)

## S3 method for class 'matrix'
Conf(x, pos = NULL, ...)

## Default S3 method:
Conf(x, ref, pos = NULL, na.rm = TRUE, ...)

## S3 method for class 'rpart'
Conf(x, ...)

## S3 method for class 'multinom'
Conf(x, ...)

## S3 method for class 'glm'
Conf(x, cutoff = 0.5, pos = NULL, ...)

## S3 method for class 'randomForest'
Conf(x, ...)

## S3 method for class 'svm'
Conf(x, ...)

## S3 method for class 'regr'
Conf(x, ...)

## S3 method for class 'Conf'
plot(x, main = "Confusion Matrix", ...)

## S3 method for class 'Conf'
print(x, digits = max(3, getOption("digits") - 3), ...)

Sens(x, ...)
Spec(x, ...)
x |
a vector, normally a factor, of predicted classes or an object of following classes |
ref |
a vector, normally a factor, of classes to be used as the reference. This is ignored if x is a table or a matrix. |
pos |
a character string that defines the factor level corresponding to the "positive" results. Will be ignored for a |
cutoff |
used in logit models. The cutoff for changing classes. |
main |
overall title for the plot. |
digits |
controls the number of digits to print. |
na.rm |
a logical value indicating whether or not missing values should be removed. Defaults to TRUE. |
... |
further arguments to be passed to or from methods. |
The functions require the factors to have the same levels.
For two class problems, the sensitivity, specificity, positive
predictive value and negative predictive value is calculated using the
positive
argument. Also, the prevalence of the "event" is computed from the
data (unless passed in as an argument), the detection rate (the rate of true events also
predicted to be events) and the detection prevalence (the prevalence of predicted events).
Suppose a table with notation
Reference | ||
Predicted | Event | No Event |
Event | A | B |
No Event | C | D |
The formulas used here are:

Sensitivity = A / (A + C)
Specificity = D / (B + D)
Prevalence = (A + C) / (A + B + C + D)
PPV = (Sensitivity * Prevalence) / ((Sensitivity * Prevalence) + ((1 - Specificity) * (1 - Prevalence)))
NPV = (Specificity * (1 - Prevalence)) / (((1 - Sensitivity) * Prevalence) + (Specificity * (1 - Prevalence)))
Detection Rate = A / (A + B + C + D)
Detection Prevalence = (A + B) / (A + B + C + D)
Balanced Accuracy = (Sensitivity + Specificity) / 2

See the references for discussions of the first five formulas.
For more than two classes, these results are calculated comparing each factor level to the remaining levels (i.e. a "one versus all" approach).
The overall accuracy and unweighted Kappa statistic are calculated. A p-value from McNemar's test is also computed using mcnemar.test
(which can produce NA
values with sparse tables).
The overall accuracy rate is computed along with a 95 percent confidence interval for this rate (using BinomCI
) and a one-sided test to see if the accuracy is better than the "no information rate," which is taken to be the largest class percentage in the data.
The sensitivity is defined as the proportion of positive results out of the number of
samples which were actually positive. When there are no positive results, sensitivity is
not defined and a value of NA
is returned. Similarly, when there are no negative
results, specificity is not defined and a value of NA
is returned. Similar
statements are true for predictive values.
Confidence intervals for sensitivity, specificity etc. could be calculated as binomial confidence intervals (see BinomCI
). BinomCI(A, A+C)
yields the ci for sensitivity.
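For example, with the cell notation from the table above, a binomial confidence interval for the sensitivity could be obtained like this (hypothetical counts, only to illustrate the call):

A <- 40; C <- 10      # hypothetical cell counts: true positives and false negatives
BinomCI(A, A + C)     # confidence interval for the sensitivity A / (A + C)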
a list with elements
table |
the results of |
positive |
the positive result level |
overall |
a numeric vector with overall accuracy and Kappa statistic values |
byClass |
the sensitivity, specificity, positive predictive value, negative predictive value, prevalence, detection rate and detection prevalence for each class. For two class systems, this is calculated once using the positive argument. |
Andri Signorell <[email protected]>
rewritten based on the ideas of confusionMatrix
by Max Kuhn <[email protected]>
Kuhn, M. (2008) Building predictive models in R using the caret package Journal of Statistical Software, (https://www.jstatsoft.org/v28/i05/).
Powers, David M W (2011) Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation (PDF). Journal of Machine Learning Technologies 2 (1): 37-63.
Collett D (1999) Modelling Binary Data. Chapman & Hall/CRC, Boca Raton Florida, pp. 24.
Matthews, B. W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405 (2): 442-451. doi:10.1016/0005-2795(75)90109-9. PMID 1180967.
# let tab be a confusion table
tab <- TextToTable("
   lo hi
lo 23 13
hi 10 18
", dimnames=c("pred", "obs"))

Conf(tab, pos="hi")

pred <- Untable(tab)[,"pred"]
obs <- Untable(tab)[,"obs"]

Conf(x = pred, ref = obs)
Conf(x = pred, ref = obs, pos="hi")

Sens(tab)   # Sensitivity
Spec(tab)   # Specificity

tab <- TextToTable("
         terrible poor marginal clear
terrible       10    4        1     0
poor            5   10       12     2
marginal        2    4       12     5
clear           0    2        6    13
", dimnames=c("pred", "obs"))

Conf(tab)
Add connection lines to a stacked barplot (beside = TRUE
is not supported). The function expects exactly the same arguments that were
used to create the barplot.
ConnLines(..., col = 1, lwd = 1, lty = "solid", xalign = c("mar","mid"))
... |
the arguments used to create the barplot. (The dots are sent directly to barplot). |
col |
the line color of the connection lines. Defaults to black. |
lwd |
the line width for the connection lines. Default is 1. |
lty |
the line type for the connection lines. Line types can either be specified as an integer
(0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the
character strings "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". |
xalign |
defines where the lines should be aligned to on the x-axis. Can be set either to the margins of the bars ("mar" which is the default) or to "mid". The latter will lead the connecting lines to the middle of the bars. |
Andri Signorell <[email protected]>
tab <- with( subset(d.pizza, driver %in% c("Carpenter","Miller","Farmer","Butcher")), table(factor(driver), Weekday(date, "dd", stringsAsFactor=TRUE)) ) tab barplot(tab, beside=FALSE, space=1.2) ConnLines(tab, beside=FALSE, space=1.2, lcol="grey50", lwd=1, lty=2) barplot(tab, beside=FALSE, space=1.2, horiz=TRUE) ConnLines(tab, beside=FALSE, space=1.2, horiz=TRUE, lcol="grey50", lwd=1, lty=2) cols <- Pal("Helsana")[1:4] b <- barplot(tab, beside=FALSE, horiz=FALSE, col=cols) ConnLines(tab, beside=FALSE, horiz=FALSE, lcol="grey50", lwd=1, lty=2) # set some labels txt <- tab txt[] <- gsub(pattern="^0", "", t(tab)) # do not print 0s text(x=b, y=t(apply(apply(rbind(0,tab), 2, Midx), 2, cumsum)), labels=txt, col=(matrix(rep(TextContrastColor(cols), each=ncol(tab)), nrow=nrow(tab), byrow=FALSE ))) # align to the middle of the bars barplot(tab, beside=FALSE, space=1.2) ConnLines(tab, beside=FALSE, space=1.2, lcol="grey50", lwd=1, lty=2, method="mid")
Perform Conover's test of multiple comparisons using rank sums as post hoc test following a significant kruskal.test
.
ConoverTest(x, ...)

## Default S3 method:
ConoverTest(x, g,
            method = c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"),
            alternative = c("two.sided", "less", "greater"),
            out.list = TRUE, ...)

## S3 method for class 'formula'
ConoverTest(formula, data, subset, na.action, ...)
x |
a numeric vector of data values, or a list of numeric data vectors. |
g |
a vector or factor object giving the group for the
corresponding elements of x. Ignored with a warning if x is a list. |
method |
the method for adjusting p-values for multiple comparisons. The function is calling p.adjust(). |
alternative |
a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "less" or "greater". |
out.list |
logical, indicating if the results should be printed in list mode or as a square matrix. Default is list (TRUE). |
formula |
a formula of the form response ~ group, where response gives the data values and group the corresponding groups. |
data |
an optional matrix or data frame (or similar: see
model.frame) containing the variables in the formula. |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when
the data contain NAs. Defaults to getOption("na.action"). |
... |
further arguments to be passed to or from methods. |
ConoverTest
performs the post hoc pairwise multiple comparisons procedure appropriate to follow the rejection of a Kruskal-Wallis test.
Conover's test is more powerful than Dunn's post hoc multiple comparisons test (DunnTest
). The interpretation of stochastic dominance requires an assumption that the CDF of one group does not cross the CDF of the other.
ConoverTest makes m = k(k-1)/2 multiple pairwise comparisons based on the Conover-Iman t-test statistic for the rank-sum differences:

|mean(R_i) - mean(R_j)| / sqrt(s^2 * ((N - 1 - H) / (N - k)) * (1/n_i + 1/n_j))

with H being the (tie corrected) statistic of the Kruskal-Wallis test and the variance estimate being

s^2 = (1 / (N - 1)) * (sum of all squared ranks - N * (N + 1)^2 / 4)
If x
is a list, its elements are taken as the samples to be
compared, and hence have to be numeric data vectors. In this case,
g
is ignored, and one can simply use ConoverTest(x)
to perform the test. If the samples are not yet contained in a
list, use ConoverTest(list(x, ...))
.
Otherwise, x
must be a numeric data vector, and g
must
be a vector or factor object of the same length as x
giving
the group for the corresponding elements of x
.
A list with class "DunnTest"
containing the following components:
res |
an array containing the mean rank differences and the corresponding p-values |
Andri Signorell <[email protected]>, the interface is based on R-Core code
Conover W. J., Iman R. L. (1979) On multiple-comparisons procedures, Tech. Rep. LA-7677-MS, Los Alamos Scientific Laboratory.
Conover, W. J. (1999) Practical Nonparametric Statistics Wiley, Hoboken, NJ. 3rd edition.
DunnTest
, NemenyiTest
, kruskal.test
, wilcox.test
, p.adjust
## Hollander & Wolfe (1973), 116. ## Mucociliary efficiency from the rate of removal of dust in normal ## subjects, subjects with obstructive airway disease, and subjects ## with asbestosis. x <- c(2.9, 3.0, 2.5, 2.6, 3.2) # normal subjects y <- c(3.8, 2.7, 4.0, 2.4) # with obstructive airway disease z <- c(2.8, 3.4, 3.7, 2.2, 2.0) # with asbestosis ConoverTest(list(x, y, z)) ## Equivalently, x <- c(x, y, z) g <- factor(rep(1:3, c(5, 4, 5)), labels = c("Normal subjects", "Subjects with obstructive airway disease", "Subjects with asbestosis")) # do the kruskal.test first kruskal.test(x, g) # ...and the pairwise test afterwards ConoverTest(x, g) ## Formula interface. boxplot(Ozone ~ Month, data = airquality) ConoverTest(Ozone ~ Month, data = airquality)
## Hollander & Wolfe (1973), 116. ## Mucociliary efficiency from the rate of removal of dust in normal ## subjects, subjects with obstructive airway disease, and subjects ## with asbestosis. x <- c(2.9, 3.0, 2.5, 2.6, 3.2) # normal subjects y <- c(3.8, 2.7, 4.0, 2.4) # with obstructive airway disease z <- c(2.8, 3.4, 3.7, 2.2, 2.0) # with asbestosis ConoverTest(list(x, y, z)) ## Equivalently, x <- c(x, y, z) g <- factor(rep(1:3, c(5, 4, 5)), labels = c("Normal subjects", "Subjects with obstructive airway disease", "Subjects with asbestosis")) # do the kruskal.test first kruskal.test(x, g) # ...and the pairwise test afterwards ConoverTest(x, g) ## Formula interface. boxplot(Ozone ~ Month, data = airquality) ConoverTest(Ozone ~ Month, data = airquality)
Generate all pairwise contrasts for use in a post-hoc test, e.g. ScheffeTest.
Contrasts(levs)
levs |
the levels to be used |
A matrix with all possible pairwise contrasts that can be built with the given levels.
Andri Signorell <[email protected]>
Contrasts(LETTERS[1:5])

#   B-A C-A D-A E-A C-B D-B E-B D-C E-C E-D
# A  -1  -1  -1  -1   0   0   0   0   0   0
# B   1   0   0   0  -1  -1  -1   0   0   0
# C   0   1   0   0   1   0   0  -1  -1   0
# D   0   0   1   0   0   1   0   1   0  -1
# E   0   0   0   1   0   0   1   0   1   1
Converts a numerical vector from one measurement system to another. Metric prefixes (as unit prefixes that precede a basic unit of measure to indicate a multiple or fraction of the unit) are respected.
CmToPts(x)
PtsToCm(x)

ConvUnit(x, from, to)

data(d.units)
data(d.prefix)
x |
the numeric to be converted. |
from |
a character defining the original unit. |
to |
a character defining the target unit. |
The two functions CmToPts()
and PtsToCm()
convert centimeters to points and vice versa. 1 cm corresponds to 28.35 points.
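For instance, using the conversion factor given above:

CmToPts(1)        # ~ 28.35 points
PtsToCm(28.35)    # ~ 1 cm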
The units as defined by the International System of Units (SI) (m, g, s, A, K, mol, cd, Hz, rad, sr, N, Pa, J, W, C, V, F, Ohm, S, Wb, T, H, lm, lx, Bq, Gy, Sv, kat, l) can be used to convert between different prefixes. The following non SI-units can be chosen for conversion between different systems. NA
will be returned if a conversion can't be found.
The function is using the conversion factors stored in the dataset d.units
.
Weight and mass | ||
Gram | g | metric |
Pound mass (avoirdupois) | lb | |
Ounce mass (avoirdupois) | oz | |
Metric ton | ton (or tn) | |
Distance | ||
Meter | m | metric |
Statute mile | mi | |
Nautical mile | nmi | |
Inch | in | |
Foot | ft | |
Yard | yd | |
Angstrom | AA | (accepted) metric |
Astronomical unit | au | |
Time | ||
Year | a | |
Day | d | |
Hour | h | |
Minute | min | |
Second | s | |
Pressure | ||
Pascal | Pa | |
Atmosphere | atm | |
mm of Mercury | mmHg | |
bar | bar | |
Pound-force per quare inch | psi | |
Energy | ||
Joule | J | metric |
IT calorie | cal | (accepted) metric |
Electron volt | eV (or ev) | |
Power | ||
Horsepower (mechanical) | hp | |
Horsepower (metric) | HP | |
Watt | W (or w) | metric |
Temperature | ||
Degree Celsius | C | |
Degree Fahrenheit | F | |
Kelvin | K | metric |
Liquid measure | ||
Fluid ounce | fl oz | |
Gallon | gal | |
Liter | l (or lt) | (accepted) metric |
Additional details can be found in the d.units
data.frame.
Andri Signorell <[email protected]>
ConvUnit(c(1.2, 5.4, 6.7), "in", "m")

# from kilometers to pico meters
ConvUnit(1, from="km", to="pm")

# from miles to kilometers
ConvUnit(1, from="mi", to="km")

# nautical miles
ConvUnit(1, from="nmi", to="km")

# from kilo Kelvin to Fahrenheit
ConvUnit(10, from="kK", to="F")

# from metric to more quirky units
ConvUnit(c(10, 1), from="hl", to="gal")
ConvUnit(500, from="ml", to="fl oz")

# conversion between non-SI units
ConvUnit(1000, "yd", "mi")
# ... must be the same as
ConvUnit(ConvUnit(1000, "yd", "m"), "m", "mi")
Cov
and Cor
compute the covariance or correlation of x
and y
if these
are vectors. If x
and y
are matrices then the
covariances (or correlations) between the columns of x
and the
columns of y
are computed.
Cov(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))

Cor(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))
x |
a numeric vector, matrix or data frame. |
y |
NULL (default) or a vector, matrix or data frame with compatible dimensions to x. The default is equivalent to y = x (but more efficient). |
use |
an optional character string giving a
method for computing covariances in the presence
of missing values. This must be (an abbreviation of) one of the strings
"everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs". |
method |
a character string indicating which correlation
coefficient (or covariance) is to be computed. One of
"pearson" (default), "kendall", or "spearman": can be abbreviated. |
For Cov
and Cor
one must either give a matrix or
data frame for x
or give both x
and y
.
The inputs must be numeric (as determined by is.numeric
:
logical values are also allowed for historical compatibility): the
"kendall"
and "spearman"
methods make sense for ordered
inputs but xtfrm
can be used to find a suitable prior
transformation to numbers.
If use
is "everything"
, NA
s will
propagate conceptually, i.e., a resulting value will be NA
whenever one of its contributing observations is NA
.
If use
is "all.obs"
, then the presence of missing
observations will produce an error. If use
is
"complete.obs"
then missing values are handled by casewise
deletion (and if there are no complete cases, that gives an error).
"na.or.complete"
is the same unless there are no complete
cases, that gives NA
.
Finally, if use
has the value "pairwise.complete.obs"
then the correlation or covariance between each pair of variables is
computed using all complete pairs of observations on those variables.
This can result in covariance or correlation matrices which are not positive
semi-definite, as well as NA
entries if there are no complete
pairs for that pair of variables. For Cov
and Var
,
"pairwise.complete.obs"
only works with the "pearson"
method.
Note that (the equivalent of) Var(double(0), use = *)
gives
NA
for use = "everything"
and "na.or.complete"
,
and gives an error in the other cases.
The denominator n - 1 is used which gives an unbiased estimator
of the (co)variance for i.i.d. observations.
These functions return
NA
when there is only one
observation (whereas S-PLUS has been returning NaN
), and
fail if x
has length zero.
For Cor()
, if method
is "kendall"
or
"spearman"
, Kendall's tau or Spearman's rho
statistic is used to estimate a rank-based measure of
association. These are more robust and have been recommended if the
data do not necessarily come from a bivariate normal distribution.
For Cov()
, a non-Pearson method is unusual but available for
the sake of completeness. Note that "spearman"
basically
computes Cor(R(x), R(y))
(or Cov(., .)
) where R(u)
:= rank(u, na.last = "keep")
. In the case of missing values, the
ranks are calculated depending on the value of use
, either
based on complete observations, or based on pairwise completeness with
reranking for each pair.
Scaling a covariance matrix into a correlation one can be achieved in
many ways, mathematically most appealing by multiplication with a
diagonal matrix from left and right, or more efficiently by using
sweep(.., FUN = "/")
twice.
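A small sketch of that scaling (using only base R), which should agree with Cor() up to numerical precision:

V <- Cov(longley)                              # covariance matrix
s <- sqrt(diag(V))                             # standard deviations
R <- sweep(sweep(V, 1, s, "/"), 2, s, "/")     # divide rows, then columns, by s
all.equal(R, Cor(longley))                     # should be TRUE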
For r <- Cor(*, use = "all.obs")
, it is now guaranteed that
all(abs(r) <= 1)
.
Some people have noted that the code for Kendall's tau is slow for
very large datasets (many more than 1000 cases). It rarely makes
sense to do such a computation, but see function
cor.fk
in package pcaPP.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
cor.test
for confidence intervals (and tests).
cov.wt
for weighted covariance computation.
Var
, SD
for variance and standard deviation (vectors).
## Two simple vectors Cor(1:10, 2:11) # == 1 ## Correlation Matrix of Multivariate sample: (Cl <- Cor(longley)) ## Graphical Correlation Matrix: symnum(Cl) # highly correlated ## Spearman's rho and Kendall's tau symnum(clS <- Cor(longley, method = "spearman")) symnum(clK <- Cor(longley, method = "kendall")) ## How much do they differ? i <- lower.tri(Cl) Cor(cbind(P = Cl[i], S = clS[i], K = clK[i])) ##--- Missing value treatment: C1 <- Cov(swiss) range(eigen(C1, only.values = TRUE)$values) # 6.19 1921 ## swM := "swiss" with 3 "missing"s : swM <- swiss colnames(swM) <- abbreviate(colnames(swiss), min=6) swM[1,2] <- swM[7,3] <- swM[25,5] <- NA # create 3 "missing" ## Consider all 5 "use" cases : (C. <- Cov(swM)) # use="everything" quite a few NA's in cov.matrix try(Cov(swM, use = "all")) # Error: missing obs... C2 <- Cov(swM, use = "complete") stopifnot(identical(C2, Cov(swM, use = "na.or.complete"))) range(eigen(C2, only.values = TRUE)$values) # 6.46 1930 C3 <- Cov(swM, use = "pairwise") range(eigen(C3, only.values = TRUE)$values) # 6.19 1938 ## Kendall's tau doesn't change much: symnum(Rc <- Cor(swM, method = "kendall", use = "complete")) symnum(Rp <- Cor(swM, method = "kendall", use = "pairwise")) symnum(R. <- Cor(swiss, method = "kendall")) ## "pairwise" is closer componentwise, summary(abs(c(1 - Rp/R.))) summary(abs(c(1 - Rc/R.))) ## but "complete" is closer in Eigen space: EV <- function(m) eigen(m, only.values=TRUE)$values summary(abs(1 - EV(Rp)/EV(R.)) / abs(1 - EV(Rc)/EV(R.)))
A straightforward application of matrix algebra to remove the effect of the variables in the y set from the x set. Input may be either a data matrix or a correlation matrix. Variables in x and y are specified by location.
CorPart(m, x, y)
m |
a data or correlation matrix. |
x |
the variable numbers associated with the X set. |
y |
the variable numbers associated with the Y set. |
It is sometimes convenient to partial out the effect of a number of variables (e.g., sex, age, education) from the correlations of another set of variables. This could be done laboriously by finding the residuals of various multiple regressions and then correlating these residuals. The matrix algebra alternative does it directly.
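As an illustration of the matrix-algebra route (a hedged sketch, not the internal code of CorPart(); the helper name PartialCorSketch is made up): the partial correlations of the x set given the y set can be read off the blocks of the full correlation matrix and rescaled with cov2cor().

# sketch: partial correlations of the x set, given the y set, from the
# blocks of a full correlation matrix m (illustrative only)
PartialCorSketch <- function(m, x, y) {
  Rxx <- m[x, x, drop = FALSE]
  Rxy <- m[x, y, drop = FALSE]
  Ryy <- m[y, y, drop = FALSE]
  cov2cor(Rxx - Rxy %*% solve(Ryy) %*% t(Rxy))
}
m <- cor(cbind(abstr = c(9,11,13,13,14,9,10,11,10,8,13,7,9,13,14),
               coord = c(8,12,14,13,14,8,9,12,8,9,14,7,10,12,12),
               age   = c(6,8,9,9,10,7,8,9,8,7,10,6,10,10,9)))
PartialCorSketch(m, 1:2, 3)   # should agree with CorPart(m, 1:2, 3)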
The matrix of partial correlations.
William Revelle
Revelle, W. An introduction to psychometric theory with applications in R Springer.
(working draft available at http://personality-project.org/r/book/)
# example from Bortz, J. (1993) Statistik fuer Sozialwissenschaftler, Springer, pp. 413
abstr <- c(9,11,13,13,14,9,10,11,10,8,13,7,9,13,14)
coord <- c(8,12,14,13,14,8,9,12,8,9,14,7,10,12,12)
age   <- c(6,8,9,9,10,7,8,9,8,7,10,6,10,10,9)

# calculate the correlation of abstr and coord, removing the effect of age
CorPart(cbind(abstr, coord, age), 1:2, 3)

# by correlation matrix m
m <- cor(cbind(abstr, coord, age))
CorPart(m, 1:2, 3)

# ... which would be the same as:
lm1 <- lm(abstr ~ age)
lm2 <- lm(coord ~ age)
cor(resid(lm1), resid(lm2))
Computes the polychoric correlation (and its standard error) between two ordinal variables or from their contingency table, under the assumption that the ordinal variables dissect continuous latent variables that are bivariate normal. Either the maximum-likelihood estimator or a (possibly much) quicker “two-step” approximation is available. For the ML estimator, the estimates of the thresholds and the covariance matrix of the estimates are also available.
CorPolychor(x, y, ML = FALSE, control = list(), std.err = FALSE, maxcor=.9999) ## S3 method for class 'CorPolychor' print(x, digits = max(3, getOption("digits") - 3), ...)
x |
a contingency table of counts or an ordered categorical variable; the latter can be numeric, logical, a factor, or an ordered factor, but if a factor, its levels should be in proper order. |
y |
if |
ML |
if |
control |
optional arguments to be passed to the |
std.err |
if |
maxcor |
maximum absolute correlation (to ensure numerical stability). |
digits |
integer, determining the number of digits used to format the printed result |
... |
not used |
If std.err
is TRUE
,
returns an object of class "polycor"
with the following components:
type |
set to |
rho |
the polychoric correlation. |
var |
the estimated variance of the correlation, or, for the ML estimate, the estimated covariance matrix of the correlation and thresholds. |
n |
the number of observations on which the correlation is based. |
chisq |
chi-square test for bivariate normality. |
df |
degrees of freedom for the test of bivariate normality. |
ML |
|
Otherwise, returns the polychoric correlation.
This is a verbatim copy of the polychor function from the package polycor.
John Fox [email protected]
Drasgow, F. (1986) Polychoric and polyserial correlations. Pp. 68–74 in S. Kotz and N. Johnson, eds., The Encyclopedia of Statistics, Volume 7. Wiley.
Olsson, U. (1979) Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika 44, 443-460.
hetcor
, polyserial
, print.CorPolychor
, optim
set.seed(12345)
z <- RndPairs(1000, 0.6)
x <- z[,1]
y <- z[,2]
cor(x, y)     # sample correlation

x <- cut(x, c(-Inf, .75, Inf))
y <- cut(y, c(-Inf, -1, .5, 1.5, Inf))
CorPolychor(x, y)                          # 2-step estimate
CorPolychor(x, y, ML=TRUE, std.err=TRUE)   # ML estimate
Return for each variable of a data frame the number of missing values and the number of complete cases to be expected if this variable were omitted.
CountCompCases(x) ## S3 method for class 'CountCompCases' print(x, digits=1, ...)
x |
a data.frame containing the data. |
digits |
the number of digits to be used when printing the results. |
... |
the dots are not further used. |
A list with three elements. The first gives the number of rows, the second the number of complete cases for the whole data frame.
The third element tab
contains the data for the single variables.
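The idea can be sketched in a few lines of base R (hypothetical helper, not the function's actual implementation): count the NAs per variable and the complete cases that would remain if that variable were left out.

# sketch of the idea behind CountCompCases() (illustrative only)
ccc_sketch <- function(d) {
  data.frame(
    var     = names(d),
    nas     = sapply(d, function(col) sum(is.na(col))),
    if_omit = sapply(seq_along(d),
                     function(i) sum(complete.cases(d[, -i, drop = FALSE])))
  )
}
ccc_sketch(d.pizza[, c("price", "temperature", "driver")])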
Andri Signorell <[email protected]>
PlotMiss
, CompleteColumns
, complete.cases
, is.na
, na.omit
CountCompCases(d.pizza)
Returns the number of work days between two dates taking into account the provided holiday dates.
CountWorkDays(from, to, holiday = NULL, nonworkdays = c("Sat", "Sun"))
from |
the initial dates |
to |
the final dates |
holiday |
a vector with dates to be excluded. |
nonworkdays |
a character vector containing the abbreviations of the weekdays (as in |
The function is vectorised so that multiple initial and final dates can be supplied. The dates are recycled if their numbers differ.
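For a single pair of dates the count can be reproduced by brute force, as sketched below; the sketch assumes the default nonworkdays c("Sat", "Sun"), an English locale for the weekday abbreviations, and counts both endpoints (the function's own endpoint convention applies).

# sketch: brute-force count of work days between two dates (illustrative only)
from <- as.Date("2019-01-01")
to   <- as.Date("2019-01-15")
holiday <- as.Date(c("2019-01-06", "2019-01-07"))
days <- seq(from, to, by = "day")
sum(!format(days, "%a") %in% c("Sat", "Sun") & !days %in% holiday)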
an integer vector
Andri Signorell <[email protected]>
weekdays
, Date Functions
from <- as.Date("2019-01-01") + rep(0, 10) to <- as.Date("2020-01-15") + seq(0, 9) CountWorkDays(from, to) x <- seq(from[1], from[1]+11, "days") data.frame( date = x, day = Format(x, fmt="ddd")) CountWorkDays(from = min(x), to = max(x), holiday = c("2019-01-06", "2019-01-07"))
from <- as.Date("2019-01-01") + rep(0, 10) to <- as.Date("2020-01-15") + seq(0, 9) CountWorkDays(from, to) x <- seq(from[1], from[1]+11, "days") data.frame( date = x, day = Format(x, fmt="ddd")) CountWorkDays(from = min(x), to = max(x), holiday = c("2019-01-06", "2019-01-07"))
Wrapper for didactic datasets used in statistics courses.
CourseData(name, url = NULL, header = TRUE, sep = ";", ...)
name |
the name of the file, usually without extension. |
url |
a url where the data reside, should have the form |
header |
a logical value indicating whether the file contains the names of the variables as its first line. If missing, the value is determined from the file format: header is set to |
sep |
the field separator character. Values on each line of the file are separated by this character. Default is - unlike in |
... |
the dots are sent to |
A data.frame
containing a representation of the data in the file.
Andri Signorell <[email protected]>
## Not run: 
d.farm <- CourseData("farmer")
## End(Not run)
Performs the Cramer-von Mises test for the composite hypothesis of normality, see e.g. Thode (2002, Sec. 5.1.3).
CramerVonMisesTest(x)
x |
a numeric vector of data values, the number of which must be greater than 7. Missing values are allowed. |
The Cramer-von Mises test is an EDF omnibus test for the composite hypothesis of normality. The test statistic is
W = \frac{1}{12n} + \sum_{i=1}^{n} \left( p_{(i)} - \frac{2i-1}{2n} \right)^2,
where p_{(i)} = \Phi\left( \frac{x_{(i)} - \bar{x}}{s} \right). Here, \Phi is the cumulative distribution function of the standard normal distribution, and \bar{x} and s are mean and standard deviation of the data values.
The p-value is computed from the modified statistic Z = W (1 + 0.5/n) according to Table 4.9 in
Stephens (1986).
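The statistic can be written down directly from the formula above; the following sketch is only meant to illustrate the definition and may differ from the package's implementation in details.

# sketch: Cramer-von Mises statistic computed from its definition (illustrative only)
x <- rnorm(50)
n <- length(x)
p <- pnorm((sort(x) - mean(x)) / sd(x))
W <- 1/(12 * n) + sum((p - (2 * seq_len(n) - 1) / (2 * n))^2)
W * (1 + 0.5/n)   # modified statistic used for the p-value lookup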
A list of class htest
, containing the following components:
statistic |
the value of the Cramer-von Mises statistic. |
p.value |
the p-value for the test. |
method |
the character string “Cramer-von Mises normality test”. |
data.name |
a character string giving the name(s) of the data. |
Juergen Gross <[email protected]>
Stephens, M.A. (1986) Tests based on EDF statistics. In: D'Agostino, R.B. and Stephens, M.A., eds.: Goodness-of-Fit Techniques. Marcel Dekker, New York.
Thode Jr., H.C. (2002) Testing for Normality. Marcel Dekker, New York.
shapiro.test
for performing the Shapiro-Wilk test for normality.
AndersonDarlingTest
, LillieTest
,
PearsonTest
, ShapiroFranciaTest
for performing further tests for normality.
qqnorm
for producing a normal quantile-quantile plot.
CramerVonMisesTest(rnorm(100, mean = 5, sd = 3))
CramerVonMisesTest(runif(100, min = 2, max = 4))
Cronbach's alpha is a measure of internal consistency and often used for validating psychometric tests. It determines the internal consistency or average correlation of items in a survey instrument to gauge its reliability. This reduces to Kuder-Richardson formula 20 (KR-20) when the columns of the data matrix are dichotomous.
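A minimal sketch of the textbook definition, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score); the helper alpha_sketch is made up for illustration and should agree with CronbachAlpha() up to implementation details.

# sketch: Cronbach's alpha from its textbook definition (illustrative only)
alpha_sketch <- function(m) {
  k <- ncol(m)
  k / (k - 1) * (1 - sum(apply(m, 2, var)) / var(rowSums(m)))
}
set.seed(1)
items <- matrix(sample(0:1, 80, replace = TRUE), ncol = 4)
alpha_sketch(items)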
CronbachAlpha(x, conf.level = NA, cond = FALSE, na.rm = FALSE)
x |
|
conf.level |
confidence level of the interval. If set to |
cond |
logical. If set to |
na.rm |
logical, indicating whether |
Either a numeric value or
a named vector of 3 columns if confidence levels are required (estimate, lower and upper ci) or
a list containing the following components, if the argument cond
is set to TRUE
:
unconditional |
Cronbach's Alpha, either the single value only or with confidence intervals |
condCronbachAlpha |
The alpha that would be realized if the item were excluded |
Andri Signorell <[email protected]>, based on code of Harold C. Doran
Cohen, J. (1960), A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
set.seed(1234)
tmp <- data.frame(
  item1=sample(c(0,1), 20, replace=TRUE),
  item2=sample(c(0,1), 20, replace=TRUE),
  item3=sample(c(0,1), 20, replace=TRUE),
  item4=sample(c(0,1), 20, replace=TRUE),
  item5=sample(c(0,1), 20, replace=TRUE)
)
CronbachAlpha(tmp[,1:4], cond=FALSE, conf.level=0.95)
CronbachAlpha(tmp[,1:4], cond=TRUE, conf.level=0.95)
CronbachAlpha(tmp[,1:4], cond=FALSE)
CronbachAlpha(tmp[,1:2], cond=TRUE, conf.level=0.95)

## Not run: 
# Calculate bootstrap confidence intervals for CronbachAlpha
library(boot)
cronbach.boot <- function(data, x) CronbachAlpha(data[x, ])   # the statistic to bootstrap
res <- boot(tmp, cronbach.boot, 1000)
quantile(res$t, c(0.025, 0.975))   # two-sided bootstrapped confidence interval of Cronbach's alpha
boot.ci(res, type="bca")           # adjusted bootstrap percentile (BCa) confidence interval (better)
## End(Not run)
Vector or cross product
Cross(x, y)
x |
numeric vector or matrix |
y |
numeric vector or matrix |
Computes the cross (or: vector) product of vectors in 3 dimensions. In case of matrices it takes the first dimension of length 3 and computes the cross product between corresponding columns or rows.
The more general cross product of n-1
vectors in n-dimensional
space is realized as CrossN
.
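For two 3-dimensional vectors the componentwise definition is simple; the following lines just spell it out and should reproduce the result of Cross().

# sketch: componentwise cross product of two 3-dim vectors (illustrative only)
x <- c(1, 2, 3); y <- c(4, 5, 6)
c(x[2]*y[3] - x[3]*y[2],
  x[3]*y[1] - x[1]*y[3],
  x[1]*y[2] - x[2]*y[1])   # -3  6 -3, same as Cross(x, y)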
3-dim. vector if x and y are vectors, a matrix of 3-dim. vectors if x and y are matrices themselves.
Hans W. Borchers <[email protected]>
Cross(c(1, 2, 3), c(4, 5, 6))   # -3 6 -3

# Triple product can be calculated as:
va <- c(1, 2, 3)
vb <- c(4, 3, 0)
vc <- c(5, 1, 1)
Dot(va, Cross(vb, vc))
Vector cross product of n-1
vectors in n-dimensional space
CrossN(A)
A |
matrix of size |
The rows of the matrix A are taken as (n-1) vectors in n-dimensional space. The cross product generates a vector in this space that is orthogonal to all these rows in A, and its length is the volume of the geometric hypercube spanned by the vectors.
a vector of length n
The ‘scalar triple product’ in R^3 can be defined as
spatproduct <- function(a, b, c) Dot(a, CrossN(b, c))
It represents the volume of the parallelepiped spanned by the three vectors.
Hans W. Borchers <[email protected]>
A <- matrix(c(1,0,0, 0,1,0), nrow=2, ncol=3, byrow=TRUE)
CrossN(A)   #=> 0 0 1

x <- c(1.0, 0.0, 0.0)
y <- c(1.0, 0.5, 0.0)
z <- c(0.0, 0.0, 1.0)
identical(Dot(x, CrossN(rbind(y, z))), det(rbind(x, y, z)))
Calculate the C statistic, a measure of goodness of fit for binary outcomes in a logistic regression or any other classification model. The C statistic is equivalent to the area under the ROC-curve (Receiver Operating Characteristic).
Cstat(x, ...) ## S3 method for class 'glm' Cstat(x, ...) ## Default S3 method: Cstat(x, resp, ...)
x |
the logistic model for the glm interface or the predicted probabilities of the model for the default. |
resp |
the response variable (coded as c(0, 1)) |
... |
further arguments to be passed to other functions. |
Values for this measure range from 0.5 to 1.0, with higher values indicating better predictive models. A value of 0.5 indicates that the model is no better than chance at making a prediction of membership in a group and a value of 1.0 indicates that the model perfectly identifies those within a group and those not. Models are typically considered reasonable when the C-statistic is higher than 0.7 and strong when C exceeds 0.8.
Confidence intervals for this measure can be calculated by bootstrap.
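Since the C statistic is the concordance probability (AUC), it can be sketched via the rank-sum (Wilcoxon) formulation; auc_sketch below is a made-up helper that should essentially agree with Cstat() (ties counted with weight 1/2).

# sketch: C statistic as AUC via the rank-sum formulation (illustrative only)
auc_sketch <- function(pred, resp) {
  r  <- rank(pred)
  n1 <- sum(resp == 1); n0 <- sum(resp == 0)
  (sum(r[resp == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
set.seed(7)
resp <- rbinom(200, 1, 0.4)
pred <- plogis(-0.4 + resp + rnorm(200))   # some informative scores
auc_sketch(pred, resp)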
numeric value
Andri Signorell <[email protected]>
Hosmer D.W., Lemeshow S. (2000) Applied Logistic Regression (2nd Edition). New York, NY: John Wiley & Sons
d.titanic <- Untable(Titanic)

r.glm <- glm(Survived ~ ., data=d.titanic, family=binomial)
Cstat(r.glm)

# default interface
Cstat(x = predict(r.glm, type="response"),
      resp = model.response(model.frame(r.glm)))

# calculating bootstrap confidence intervals
FUN <- function(d.set, i) {
  r.glm <- glm(Survived ~ ., data=d.set[i,], family=binomial)
  Cstat(r.glm)
}

## Not run: 
library(boot)
boot.res <- boot(d.titanic, FUN, R=999)

# the percentile confidence intervals
boot.ci(boot.res, type="perc")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 999 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = res, type = "perc")
##
## Intervals :
## Level     Percentile
## 95%   ( 0.7308,  0.7808 )
## Calculations and Intervals on Original Scale
## End(Not run)
Calculate bootstrap intervals for the C statistic (Area under the curve AUC), based on a glm.
CstatCI( object, conf.level = 0.95, sides = c("two.sided", "left", "right"), ... )
object |
the model object as returned by glm. |
conf.level |
confidence level of the interval. |
sides |
a character string specifying the side of the confidence
interval, must be one of |
... |
further arguments are passed to the |
a numeric vector with 3 elements:
mean |
mean |
lwr.ci |
lower bound of the confidence interval |
upr.ci |
upper bound of the confidence interval |
Andri Signorell [email protected]
utils::data(Pima.te, package = "MASS")
r.logit <- glm(type ~ ., data=Pima.te, family="binomial")

# calculate the C statistic with confidence intervals
Cstat(r.logit)
CstatCI(r.logit, R=99)   # use higher R in real life!
Dividing the range of an age variable x
into intervals is a frequent task. The commonly used function cut
has unfavourable default values for this.
CutAge()
is a convenient wrapper for cutting age variables in groups
of e.g. 10 years with more suitable defaults.
CutAge(x, from = 0, to = 90, by = 10, right = FALSE, ordered_result = TRUE, ...)
x |
continuous variable. |
from , to
|
the starting and (maximal) end values of the sequence. |
by |
number: increment of the sequence. Default is 10, alternatives could be 5 or 20. |
right |
logical, indicating if the intervals should be closed on the right (and open on the left) or vice versa. Default is |
ordered_result |
logical: should the result be an ordered factor? Default is |
... |
the dots are passed on to the underlying function |
A factor is returned, unless labels = FALSE which results in an integer vector of level codes.
Values which fall outside the range of breaks are coded as NA
, as are NaN
and NA
values.
Andri Signorell <[email protected]>
Desc(CutAge(sample(100, 100)))
Create a factor variable using the quantiles of a continuous variable.
CutQ(x, breaks = quantile(x, seq(0, 1, by = 0.25), na.rm = TRUE), labels = NULL, na.rm = FALSE, ...)
x |
continuous variable. |
breaks |
the breaks for creating groups. By default the quartiles will be used, say |
labels |
labels for the levels of the resulting category. By default, labels are defined as |
na.rm |
Boolean indicating whether missing values should be removed when computing quantiles. Defaults to FALSE. |
... |
Optional arguments passed to |
This function uses quantile
to obtain the specified
quantiles of x
, then calls cut
to create a factor
variable using the intervals specified by these quantiles.
It properly handles cases where more than one quantile obtains the same value, as in the second example below. Note that in this case, there will be fewer generated factor levels than the specified number of quantile intervals.
Factor variable with one level for each quantile interval given by breaks.
Gregory R. Warnes <[email protected]>, some slight modifications Andri Signorell <[email protected]>
# create example data
x <- rnorm(1000)

# cut into quartiles
quartiles <- CutQ(x)
table(quartiles)

# cut into deciles
deciles <- CutQ(x, breaks=10, labels=NULL)
table(deciles)

# show handling of 'tied' quantiles
x <- round(x)   # discretize to create ties
stem(x)         # display the ties

deciles <- CutQ(x, breaks=10)
table(deciles)  # note that there are only 5 groups (not 10) due to duplicates
Country codes published by the International Organization for Standardization (ISO) define codes for the names of countries, dependent territories, and special areas of geographical interest.
data("d.countries")
data("d.countries")
A data frame with 249 observations on the following 9 variables.
name
a character vector, the name of the country.
a2
a character vector, two-letter country codes (aka alpha-2) which are the most widely used of the three, and used most prominently for the Internet's country code top-level domains (with a few exceptions).
a3
a character vector, three-letter country codes (aka alpha-3) which allow a better visual association between the codes and the country names than the alpha-2 codes.
code
a numeric vector, three-digit country codes which are identical to those developed and maintained by the United Nations Statistics Division, with the advantage of script (writing system) independence, and hence useful for people or systems using non-Latin scripts.
region
the region of the country. One of "East Asia & Pacific" (35), "Europe & Central Asia" (52), "Latin America & Caribbean" (41), "Middle East & North Africa" (20), "North America" (3), "South Asia" (8), "Sub-Saharan Africa" (47)
pop2012
the population in 2012
gcpi2012
the gross national income (per capita) in dollars per country in 2012.
latitude
geographic coordinate that specifies the north–south position of a point on the Earth's surface. Latitude is an angle (defined below) which ranges from 0° at the Equator to 90° (North or South) at the poles.
longitude
geographic coordinate that specifies the east–west position of a point on the Earth's surface, or the surface of a celestial body
https://en.wikipedia.org/wiki/ISO_3166-1
https://datacatalog.worldbank.org/search/dataset/0037652
head(d.countries)
Presumably an artificial dataset.
data(d.diamonds)
A data frame with 440 observations on the following 10 variables.
index
a numeric vector
carat
a numeric vector
colour
a factor with levels D
E
F
G
H
I
J
K
L
clarity
an ordered factor with levels I2
< I1
< SI3
< SI2
< SI1
< VS2
< VS1
< VVS2
< VVS1
cut
an ordered factor with levels F
< G
< V
< X
< I
certification
a factor with levels AGS
DOW
EGL
GIA
IGI
polish
an ordered factor with levels F
< G
< V
< X
< I
symmetry
an ordered factor with levels F
< G
< V
< X
< I
price
a numeric vector
wholesaler
a factor with levels A
B
C
Level codes used for cut, polish and symmetry: P = Poor, F = Fair, G = Good, V = Very good, X = Excellent, I = Ideal
somewhere from the net...
data(d.diamonds)
str(d.diamonds)
This data.frame contains the most important properties of the periodic table of the elements.
data(d.periodic)
A data frame with 110 observations on the following 24 variables.
symbol
symbol of an element.
nr
atomic number of an atomic symbol.
name
name of an element.
group
group of an element. Possible results are: Alkali Earth, Alkali Met., Halogen, Metal, Noble Gas, Non-Metal, Rare Earth and Trans. Met.
weight
atomic weight of an element. The values are based upon carbon-12. Values in parentheses indicate the most stable or best known isotope.
meltpt
melting point of an element in [K].
boilpt
boiling point of an element in Kelvin [K].
dens
density of an element in [g/cm3] at 300K and 1 atm.
elconf
electron configuration of an element.
oxstat
oxidation states of an element. The most stable is indicated by a "!".
struct
crystal structure of an element. Possible results are: Cubic, Cubic body centered, Cubic face centered, Hexagonal, Monoclinic, Orthorhombic, Rhombohedral, Tetragonal
covrad
covalent radius of an element in Angstroem [A].
arad
atomic radius of an element in Angstroem.
avol
atomic volume of an element in [cm3/mol].
spheat
specific heat of an element in [J/(g K)].
eneg
electronegativity (Pauling's) of an element.
fusheat
heat of fusion of an element in [kJ/mol].
vapheat
heat of vaporization of an element in [kJ/mol].
elcond
electrical conductivity of an element in [1/(Ohm cm)].
thermcond
thermal conductivity of an element in [W/(cm K)].
ionpot1
first ionization potential of an element in [V].
ionpot2
second ionization potential of an element in [V].
ionpot3
third ionization potential of an element in [V].
discyear
year of discovery of the element
https://en.wikipedia.org/wiki/Periodic_table
An artificial dataset inspired by a similar dataset pizza.sav in Arbeitsbuch zur deskriptiven und induktiven Statistik by Toutenburg et al.
The dataset contains data of a pizza delivery service in London, delivering pizzas to three areas. Every record defines one order/delivery and its corresponding properties. A pizza is supposed to taste good if its temperature is high enough, say 45 degrees Celsius. So it might be interesting for the pizza delivery service to minimize the delivery time.
The dataset is designed to be as evil as possible. As far as the description is concerned, it should pose the same difficulties that we have to deal with in everyday life. It contains the most commonly used data types, such as numerics, factors, ordered factors, integers, logicals and a date. NAs are scattered everywhere, partly systematically, partly randomly (except in the index).
data(d.pizza)
A data frame with 1209 observations on the following 17 variables.
index
a numeric vector, indexing the records (no missings here).
date
Date, the delivery date
week
integer, the weeknumber
weekday
integer, the weekday
area
factor, the three London districts: Brent
, Camden
, Westminster
count
integer, the number of pizzas delivered
rabate
logical, TRUE
if a rabate has been given
price
numeric, the total price of delivered pizza(s)
operator
a factor with levels Allanah
Maria
Rhonda
driver
a factor with levels Carpenter
Carter
Taylor
Butcher
Hunter
Miller
Farmer
delivery_min
numeric, the delivery time in minutes (decimal)
temperature
numeric, the temperature of the pizza in degrees Celsius when delivered to the customer
wine_ordered
integer, 1 if wine was ordered, 0 if not
wine_delivered
integer, 1 if wine was delivered, 0 if not
wrongpizza
logical, TRUE
if a wrong pizza was delivered
quality
ordered factor with levels low
< medium
< high
, defining the quality of the pizza when delivered
The dataset contains NAs randomly scattered.
Toutenburg H, Schomaker M, Wissmann M, Heumann C (2009): Arbeitsbuch zur deskriptiven und induktiven Statistik Springer, Berlin Heidelberg
str(d.pizza)
head(d.pizza)
Desc(d.pizza)
86 malt whiskies are scored between 0-4 for 12 different taste categories including sweetness, smoky, nutty etc. Additionally, coordinates of distilleries allow us to obtain pairwise distance information. Using a combination of these variables it is possible to look for correlations between particular attributes of taste and physical location, for example does a shared local resource have a significant effect on nearby whiskies.
By using correlation analysis it may be possible to provide whisky recommendations based upon an individual's particular preferences.
By computing the Pearson correlation coefficient and specifying a threshold value between 0 and 1, we can establish an adjacency matrix where each node is a malt whisky and an edge represents a level of similarity above the threshold.
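Such an adjacency matrix can be sketched directly from the taste columns; the threshold 0.7 below is arbitrary and only chosen for illustration.

# sketch: adjacency matrix from pairwise Pearson correlations of taste profiles
tastes <- t(d.whisky[, 4:15])            # columns = distilleries
colnames(tastes) <- d.whisky$distillery
r <- cor(tastes, method = "pearson")
adj <- (r > 0.7) & upper.tri(r)          # edges above the (arbitrary) threshold
head(which(adj, arr.ind = TRUE))         # a few connected pairs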
data("d.whisky")
data("d.whisky")
A data frame with 86 observations on the following 16 variables.
distillery
a character Aberfeldy
, Aberlour
, AnCnoc
, Ardbeg
, ...
brand
a grouping factor to separate the better known distilleries (A
) from the lesser known ones (B
).
region
a factor with levels campbeltown
, highland
, islands
, islay
, lowland
, speyside
.
body
a numeric vector
sweetness
a numeric vector
smoky
a numeric vector
medicinal
a numeric vector
tobacco
a numeric vector
honey
a numeric vector
spicy
a numeric vector
winey
a numeric vector
nutty
a numeric vector
malty
a numeric vector
fruity
a numeric vector
floral
a numeric vector
postcode
a character AB30 1YE
, AB35 5TB
, ...
latitude
a numeric vector, coordinate pairs of distilleries.
longitude
a numeric vector, coordinate pairs of distilleries.
http://www.mathstat.strath.ac.uk/outreach/nessie/nessie_whisky.html
http://www.mathstat.strath.ac.uk/outreach/nessie/index.html
head(d.whisky) opar <- par(mfrow=c(3,3), cex.main=1.8) for(i in 1:9) PlotPolar(d.whisky[i, 4:15], rlim=4, type="l", col=DescTools::hecru, lwd=2, fill=SetAlpha(DescTools::hecru, 0.4), panel.first=PolarGrid( ntheta=ncol(d.whisky[i, 2:13]), nr = NA, col="grey", lty="dotted", las=1, cex=1.4, alabels=StrCap(colnames(d.whisky[i, 3:14])), lblradians=TRUE), main=d.whisky[i, "distillery"]) par(mfrow=c(3,3), cex.main=1.8, xpd=NA) id <- d.whisky$distillery %in% c("Ardbeg","Caol Ila","Cragganmore","Lagavulin","Laphroig", "Macallan","Mortlach","Talisker","Tobermory") PlotFaces(d.whisky[id, 4:15], nr=3, nc=3, col=hecru, scale=TRUE, fill=TRUE, labels=d.whisky$distillery[id]) par(opar)
For performing elementary probability calculations in introductory statistic courses, we might want to simulate random games.
The dataset roulette
contains the standard sample space for one spin on a roulette wheel.
cards
contains the standard set of 52 playing cards in four colours (without Jokers). tarot
does the same with a classic tarot deck.
cards tarot roulette
cards is a data.frame with three columns named card, rank and suit.
tarot is a data.frame with four columns named card, rank, suit and desc.
roulette is a data.frame with seven columns named num, col, parity, highlow, dozens, column and pocketrange.
head(cards)
head(tarot)
head(roulette)

# drawing 5 cards
sample(cards$card, 5)

# drawing 5 cards with jokers
sample(c(cards$card, rep("Joker", 3)), 5)

# spin the wheel by using DescTools::Sample() for sampling rows from a data frame
Sample(roulette, size=1)

# simulate the evening in Las Vegas with 10 games
Sample(roulette, 10, replace=TRUE)
Some more date functions for making daily life a bit easier. The first ones extract a specific part of a given date, others check some conditions.
Year(x) Quarter(x) Month(x, fmt = c("m", "mm", "mmm"), lang = DescToolsOptions("lang"), stringsAsFactors = TRUE) Week(x, method = c("iso", "us")) Day(x) Weekday(x, fmt = c("d", "dd", "ddd"), lang = DescToolsOptions("lang"), stringsAsFactors = TRUE) YearDay(x) YearMonth(x) Day(x) <- value IsWeekend(x) IsLeapYear(x) Hour(x) Minute(x) Second(x) Timezone(x) HmsToMinute(x) Now() Today() DiffDays360(start_d, end_d, method = c("eu", "us")) LastDayOfMonth(x) YearDays(x) MonthDays(x)
x |
the date to be evaluated. |
fmt |
format string, defines how the month or the weekday are to be formatted. Defaults to |
value |
new value |
lang |
optional value setting the language for the months and daynames. Can be either |
stringsAsFactors |
logical. Defines if the result should be coerced to a factor, using the local definitions as levels. The result would be an ordered factor. Default is TRUE. |
start_d , end_d
|
the start, resp. end date for |
method |
one out of |
These functions are mainly convenience wrappers for the painful format()
and its strange codes...
Based on the requested time component, the output is as follows:
Year returns the year of the input date in yyyy format, or of a yearmonth yyyymm.
Quarter returns the quarter of the year (1 to 4) for the input date.
Month returns the month of the year (1 to 12) for the input date or for a yearmonth yyyymm.
Week returns the week of the year for the input date (0 to 53), as defined in ISO 8601.
Weekday returns the week day of the input date (1 - Monday, 2 - Tuesday, ... 7 - Sunday). (Names and abbreviations are either English or in the current locale!)
YearDay returns the day of the year numbering (1 to 366).
Day returns the day of the month (1 to 31).
YearMonth returns the yearmonth representation (yyyymm) of a date as long integer.
Hour, Minute, Second, Timezone return the hour, minute, second or timezone from a POSIXlt object.
HmsToMinute converts the time parts of a POSIXlt object to minutes.
Today, Now return the current date, resp. the current date and time.
IsWeekend returns TRUE, if the date x falls on a weekend.
IsLeapYear returns TRUE, if the year of the date x is a leap year.
The day can not only be extracted, but also be set. See examples.
DiffDays360 calculates the difference between 2 dates using the 360-day convention.
LastDayOfMonth returns the last day of the month of the given date(s).
YearDays returns the number of days in the year of the given date(s).
MonthDays returns the number of days of the month of the given date(s).
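The European 30/360 day-count rule can be sketched as follows (day numbers above 30 are capped at 30, and every month counts as 30 days); DiffDays360() may treat edge cases such as month ends differently, so this is only an illustration.

# sketch: European 30/360 day count (illustrative only)
d360_eu <- function(start_d, end_d) {
  s <- as.POSIXlt(start_d); e <- as.POSIXlt(end_d)
  (e$year - s$year) * 360 + (e$mon - s$mon) * 30 +
    pmin(e$mday, 30) - pmin(s$mday, 30)
}
d360_eu(as.Date("2014-01-31"), as.Date("2014-03-31"))   # 60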
The language in Weekday and Month can be set with an option as well. The functions will check for an existing option named "lang" and take this value if it exists. So simply set options(lang="engl") if the results should always be reported in English.
a vector of the same dimension as x, consisting of either numeric values or characters depending on the function used.
Andri Signorell <[email protected]>
strptime
, DateTimeClasses
, as.POSIXlt
x <- Today() # the same as Sys.Date() but maybe easier to remember.. Year(x) Quarter(x) Month(x) Month(x, fmt = "mm", lang="engl") Month(x, fmt = "mm", lang="local") Month(x, fmt = "mmm", lang="engl") Month(x, fmt = "mmm", lang="local") Week(x) Day(x) Day(x) <- 20 x Weekday(x) Weekday(x, fmt = "dd", lang="engl") Weekday(x, fmt = "dd", lang="local") Weekday(x, fmt = "ddd", lang="engl") Weekday(x, fmt = "ddd", lang="local") YearDay(x) IsWeekend(x) IsLeapYear(x) # let's generate a time sequence by weeks Month(seq(from=as.Date(Sys.Date()), to=Sys.Date()+150, by="weeks"), fmt="mm") LastDayOfMonth(as.Date(c("2014-10-12","2013-01-31","2011-12-05")))
There's a small number of built-in constants in R
. We have month.name
and month.abb
but nothing similar for weekdays. Here it is.
day.name day.abb
The following constants are available in DescTools:
day.name
: the English names for the day of the week (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday);
day.abb
: the three-letter abbreviations for the
English day names (Mon, Tue, Wed, Thu, Fri, Sat, Sun);
Convert degrees to radians (and back again).
DegToRad(deg) RadToDeg(rad)
deg |
a vector of angles in degrees. |
rad |
a vector of angles in radians. |
DegToRad returns a vector of the same length as deg
with the angles in radians.
RadToDeg returns a vector of the same length as rad
with the angles in degrees.
Andri Signorell <[email protected]>
DegToRad(c(90, 180, 270))
RadToDeg(c(0.5, 1, 2) * pi)
Return the depreciation of an asset for a specified period using different methods. SLN returns the straight-line depreciation, DB uses the fixed-declining balance method, and SYD returns the sum-of-years' digits depreciation.
SLN(cost, salvage, life) DB(cost, salvage, life, period = 1:life) SYD(cost, salvage, life, period = 1:life)
cost |
initial cost of the asset. |
salvage |
value at the end of the depreciation (sometimes called the salvage value of the asset). |
life |
number of periods over which the asset is depreciated (sometimes called the useful life of the asset). |
period |
period for which you want to calculate the depreciation. Period must use the same units as life. |
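The textbook formulas behind the straight-line and sum-of-years'-digits methods are short enough to spell out; the lines below are only an illustration and should agree with SLN() and SYD() (DB() is based on a fixed rate of roughly 1 - (salvage/cost)^(1/life)).

# sketch: straight-line and sum-of-years'-digits depreciation (illustrative only)
cost <- 50000; salvage <- 10000; life <- 5; period <- 1:5
(cost - salvage) / life                                            # cf. SLN()
(cost - salvage) * (life - period + 1) / (life * (life + 1) / 2)   # cf. SYD()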
val
Andri Signorell <[email protected]>
NPV()
# depreciation allowance for each year
SLN(cost = 50000, salvage = 10000, life = 5)
DB(cost = 50000, salvage = 10000, life = 5)

50000 - c(0, cumsum(SYD(cost = 50000, salvage = 10000, life = 5)))
Produce summaries of various types of variables. Calculate descriptive statistics for x and use Word as reporting tool for the numeric results and for descriptive plots. The appropriate statistics are chosen depending on the class of x. The general intention is to simplify the description process for lazy typers and return a quick, but rich summary.
Desc(x, ..., main = NULL, plotit = NULL, wrd = NULL) ## S3 method for class 'numeric' Desc( x, main = NULL, maxrows = NULL, plotit = NULL, sep = NULL, digits = NULL, ... ) ## S3 method for class 'integer' Desc( x, main = NULL, maxrows = NULL, plotit = NULL, sep = NULL, digits = NULL, ... ) ## S3 method for class 'factor' Desc( x, main = NULL, maxrows = NULL, ord = NULL, plotit = NULL, sep = NULL, digits = NULL, ... ) ## S3 method for class 'labelled' Desc( x, main = NULL, maxrows = NULL, ord = NULL, plotit = NULL, sep = NULL, digits = NULL, ... ) ## S3 method for class 'ordered' Desc( x, main = NULL, maxrows = NULL, ord = NULL, plotit = NULL, sep = NULL, digits = NULL, ... ) ## S3 method for class 'character' Desc( x, main = NULL, maxrows = NULL, ord = NULL, plotit = NULL, sep = NULL, digits = NULL, ... ) ## S3 method for class 'ts' Desc(x, main = NULL, plotit = NULL, sep = NULL, digits = NULL, ...) ## S3 method for class 'logical' Desc( x, main = NULL, ord = NULL, conf.level = 0.95, plotit = NULL, sep = NULL, digits = NULL, ... ) ## S3 method for class 'Date' Desc( x, main = NULL, dprobs = NULL, mprobs = NULL, plotit = NULL, sep = NULL, digits = NULL, ... ) ## S3 method for class 'table' Desc( x, main = NULL, conf.level = 0.95, verbose = 2, rfrq = "111", margins = c(1, 2), plotit = NULL, sep = NULL, digits = NULL, ... ) ## Default S3 method: Desc( x, main = NULL, maxrows = NULL, ord = NULL, conf.level = 0.95, verbose = 2, rfrq = "111", margins = c(1, 2), dprobs = NULL, mprobs = NULL, plotit = NULL, sep = NULL, digits = NULL, ... ) ## S3 method for class 'data.frame' Desc(x, main = NULL, plotit = NULL, enum = TRUE, sep = NULL, ...) ## S3 method for class 'list' Desc(x, main = NULL, plotit = NULL, enum = TRUE, sep = NULL, ...) ## S3 method for class 'formula' Desc( formula, data = parent.frame(), subset, main = NULL, plotit = NULL, digits = NULL, ... ) ## S3 method for class 'Desc' print( x, digits = NULL, plotit = NULL, nolabel = FALSE, sep = NULL, nomain = FALSE, ... ) ## S3 method for class 'Desc' plot(x, main = NULL, ...) ## S3 method for class 'palette' Desc(x, ...)
x |
the object to be described. This can be a data.frame, a list, a table or a vector of the classes: numeric, integer, factor, ordered factor, logical. |
... |
further arguments to be passed to or from other methods. For the internal default method these can include:
|
main |
(character|
|
plotit |
logical. Should a plot be created? The plot type will be
chosen according to the classes of variables (roughly following a
numeric-numeric, numeric-categorical, categorical-categorical logic).
Default can be defined by option |
wrd |
the pointer to a running MS Word instance, as created by
|
maxrows |
numeric; defines the maximum number of rows in a frequency
table to be reported. For factors with many levels it is often not
interesting to see all of them. Default is set to 12 most frequent ones
(resp. the first ones if For a numeric argument x If Setting |
sep |
character. The separator for the title. By default a line of
|
digits |
integer. With how many digits should the relative frequencies be formatted? Default can be set by DescToolsOptions(digits=x). |
ord |
character out of |
conf.level |
confidence level of the interval. If set to |
dprobs , mprobs
|
a vector with the probabilities for the Chi-Square test
for days, resp. months, when describing a |
verbose |
integer out of |
rfrq |
a string with 3 characters, each of them being |
margins |
a vector, consisting out of 1 and/or 2. Defines the margin
sums to be included. Row margins are reported if margins is set to 1. Set it
to 2 for column margins and c(1,2) for both. |
enum |
logical, determining if in data.frames and lists a sequential number should be included in the main title. Default is TRUE. The reason for this option is, that if a Word report with enumerated headings is created, the numbers may be redundant or inconsistent. |
formula |
a formula of the form |
data |
an optional matrix or data frame containing the variables in the
formula |
subset |
an optional vector specifying a subset of observations to be used. |
nolabel |
logical, defining if labels (defined as attribute with the
name |
nomain |
logical, determines if the main title of the output is printed
or not, default is |
A 2-dimensional table will be described with its relative frequencies, a short summary containing the total cases, the dimensions of the table, chi-square tests and some association measures such as the phi coefficient, contingency coefficient and Cramer's V.
Tables with higher dimensions will simply be printed as flat table,
with marginal sums for the first and for the last dimension.
Desc
is a generic function. It dispatches to one of the methods above
depending on the class of its first argument. Typing ?Desc
+ TAB at the
prompt should present a choice of links: the help pages for each of these
Desc
methods (at least if you're using RStudio, which anyway is
recommended). You don't need to use the full name of the method although you
may if you wish; i.e., Desc(x)
is idiomatic R but you can bypass method
dispatch by going direct if you wish: Desc.numeric(x)
.
This function produces a rich description of a factor, containing length,
number of NAs, number of levels and detailed frequencies of all levels. The
order of the frequency table can be chosen between descending/ascending
frequency, labels or levels. For ordered factors the order default is
"level"
. Character vectors are treated as unordered factors: Desc.char converts x to a factor and processes it as such.
Desc.ordered does nothing more than changing the standard order for the frequencies to its intrinsic order, which means order "level" instead of "desc" in the factor case.
Description interface for dates. We do here what seems reasonable for describing dates. We start with a short summary about length, number of NAs and extreme values, before we describe the frequencies of the weekdays and months, complemented by a chi-square test.
A 2-dimensional table will be described with its relative frequencies, a short summary containing the total cases, the dimensions of the table, chi-square tests and some association measures such as the phi coefficient, contingency coefficient and Cramer's V.
Tables with higher dimensions will simply be printed as flat table,
with marginal sums for the first and for the last dimension.
Note that NA
s cannot be handled by this interface, as tables in general come
in "as.is", say basically as a matrix without any further information about
potentially previously cleared NAs.
Description of a dichotomous variable. This can either be a logical vector,
a factor with two levels or a numeric variable with only two unique values.
The confidence levels for the relative frequencies are calculated by
BinomCI()
, method "Wilson"
on a confidence level defined
by conf.level
. Dichotomous variables can easily be condensed in one
graphical representation. Desc for a set of flags (=dichotomous variables)
calculates the frequencies, a binomial confidence interval and produces a
kind of dotplot with error bars. The motivation for this function is that dichotomous variables in general do not contain much information.
Therefore it makes sense to condense the description of sets of dichotomous
variables.
The formula interface accepts the formula operators +
, :
,
*
, I()
, 1
and evaluates any function. The left hand
side and right hand side of the formula are evaluated the same way. The
variable pairs are processed in dependency of their classes.
Word
This function is not meant to be run directly by the end user. It will normally be called automatically when a pointer to a Word instance is passed to the function Desc().
However DescWrd
takes
some more specific arguments concerning the Word output (like font
or
fontsize
), which can make it necessary to call the function directly.
A list containing the following components:
length |
the length of the vector (n + NAs). |
n |
the valid entries (NAs are excluded) |
NAs |
number of NAs |
unique |
number of unique values. |
0s |
number of zeros |
mean |
arithmetic mean |
MeanSE |
standard error of the mean, as calculated by |
quant |
a table of quantiles, as calculated by quantile(x, probs = c(.05,.10,.25,.5,.75,.9,.95), na.rm = TRUE). |
sd |
standard deviation |
vcoef |
coefficient of variation: |
mad |
median absolute deviation ( |
IQR |
interquartile range |
skew |
skewness, as calculated by |
kurt |
kurtosis, as calculated by |
highlow |
the lowest and the highest values, reported with their frequencies in brackets, if > 1. |
frq |
a data.frame of absolute and relative frequencies given by
|
Andri Signorell [email protected]
Other Statistical summary functions:
Abstract()
opt <- DescToolsOptions() # implemented classes: Desc(d.pizza$wrongpizza) # logical Desc(d.pizza$driver) # factor Desc(d.pizza$quality) # ordered factor Desc(as.character(d.pizza$driver)) # character Desc(d.pizza$week) # integer Desc(d.pizza$delivery_min) # numeric Desc(d.pizza$date) # Date Desc(d.pizza) Desc(d.pizza$wrongpizza, main="The wrong pizza delivered", digits=5) Desc(table(d.pizza$area)) # 1-dim table Desc(table(d.pizza$area, d.pizza$operator)) # 2-dim table Desc(table(d.pizza$area, d.pizza$operator, d.pizza$driver)) # n-dim table # expressions Desc(log(d.pizza$temperature)) Desc(d.pizza$temperature > 45) # supported labels Label(d.pizza$temperature) <- "This is the temperature in degrees Celsius measured at the time when the pizza is delivered to the client." Desc(d.pizza$temperature) # try as well: Desc(d.pizza$temperature, wrd=GetNewWrd()) z <- Desc(d.pizza$temperature) print(z, digits=1, plotit=FALSE) # plot (additional arguments are passed on to the underlying plot function) plot(z, main="The pizza's temperature in Celsius", args.hist=list(breaks=50)) # formula interface for single variables Desc(~ uptake + Type, data = CO2, plotit = FALSE) # bivariate Desc(price ~ operator, data=d.pizza) # numeric ~ factor Desc(driver ~ operator, data=d.pizza) # factor ~ factor Desc(driver ~ area + operator, data=d.pizza) # factor ~ several factors Desc(driver + area ~ operator, data=d.pizza) # several factors ~ factor Desc(driver ~ week, data=d.pizza) # factor ~ integer Desc(driver ~ operator, data=d.pizza, rfrq="111") # alle rel. frequencies Desc(driver ~ operator, data=d.pizza, rfrq="000", verbose=3) # no rel. frequencies Desc(price ~ delivery_min, data=d.pizza) # numeric ~ numeric Desc(price + delivery_min ~ operator + driver + wrongpizza, data=d.pizza, digits=c(2,2,2,2,0,3,0,0) ) Desc(week ~ driver, data=d.pizza, digits=c(2,2,2,2,0,3,0,0)) # define digits Desc(delivery_min + weekday ~ driver, data=d.pizza) # without defining data-parameter Desc(d.pizza$delivery_min ~ d.pizza$driver) # with functions and interactions Desc(sqrt(price) ~ operator : factor(wrongpizza), data=d.pizza) Desc(log(price+1) ~ cut(delivery_min, breaks=seq(10,90,10)), data=d.pizza, digits=c(2,2,2,2,0,3,0,0)) # response versus all the rest Desc(driver ~ ., data=d.pizza[, c("temperature","wine_delivered","area","driver")]) # all the rest versus response Desc(. 
~ driver, data=d.pizza[, c("temperature","wine_delivered","area","driver")]) # pairwise Descriptions p <- CombPairs(c("area","count","operator","driver","temperature","wrongpizza","quality"), ) for(i in 1:nrow(p)) print(Desc(formula(gettextf("%s ~ %s", p$X1[i], p$X2[i])), data=d.pizza)) # get more flexibility, create the table first tab <- as.table(apply(HairEyeColor, c(1,2), sum)) tab <- tab[,c("Brown","Hazel","Green","Blue")] # display only absolute values, row and columnwise percentages Desc(tab, row.vars=c(3, 1), rfrq="011", plotit=FALSE) # do the plot by hand, while setting the colours for the mosaics cols1 <- SetAlpha(c("sienna4", "burlywood", "chartreuse3", "slategray1"), 0.6) cols2 <- SetAlpha(c("moccasin", "salmon1", "wheat3", "gray32"), 0.8) plot(Desc(tab), col1=cols1, col2=cols2) # choose alternative flavours for graphing numeric ~ factor using pipe # (colors are recyled) Desc(temperature ~ driver, data = d.pizza) |> plot(type="dens", col=Pal("Tibco")) # use global format options for presentation Fmt(abs=as.fmt(digits=0, big.mark="")) Fmt(per=as.fmt(digits=2, fmt="%")) Desc(area ~ driver, d.pizza, plotit=FALSE) Fmt(abs=as.fmt(digits=0, big.mark="'")) Fmt(per=as.fmt(digits=3, ldigits=0)) Desc(area ~ driver, d.pizza, plotit=FALSE) # plot arguments can be fixed in detail z <- Desc(BoxCox(d.pizza$temperature, lambda = 1.5)) plot(z, mar=c(0, 2.1, 4.1, 2.1), args.rug=TRUE, args.hist=list(breaks=50), args.dens=list(from=0)) # The default description for count variables can be inappropriate, # the density curve does not represent the variable well. set.seed(1972) x <- rpois(n = 500, lambda = 5) Desc(x) # but setting maxrows to Inf gives a better plot Desc(x, maxrows = Inf) # Output into word document (Windows-specific example) ----------------------- # by simply setting wrd=GetNewWrd() ## Not run: # create a new word instance and insert title and contents wrd <- GetNewWrd(header=TRUE) # let's have a subset d.sub <- d.pizza[,c("driver", "date", "operator", "price", "wrongpizza")] # do just the univariate analysis Desc(d.sub, wrd=wrd) ## End(Not run) DescToolsOptions(opt)
Some aliases are defined, either to provide shorter names or to follow the Google naming convention.
N()
is the same as as.numeric()
.D()
is the same as as.Date()
Andri Signorell <[email protected]>
head(N(d.pizza$driver))
Some more custom palettes.
Pal(pal, n = 100, alpha = 1)

## S3 method for class 'palette'
plot(x, cex = 3, ...)

hred
horange
hyellow
hecru
hblue
hgreen
pal |
name or number of the palette.
|
n |
integer, number of colors for the palette. |
alpha |
the alpha value to be added. This can be any value from 0 (fully transparent) to 1 (opaque). |
x |
a palette to be plotted. |
cex |
character expansion for the color squares. Defaults to 3. |
... |
further arguments passed to the function. |
hred, horange, hyellow, hecru, hblue and hgreen are constants, pointing to the corresponding colors of the palette Pal("Helsana").
a vector of colors
Andri Signorell <[email protected]>
Canvas(c(0,1))
ColorLegend(x=0,    y=1, width=0.1, col=Pal(1, n=50))
ColorLegend(x=0.15, y=1, width=0.1, col=Pal(2, n=50))
ColorLegend(x=0.3,  y=1, width=0.1, col=Pal(3, n=50))
ColorLegend(x=0.45, y=1, width=0.1, col=Pal(4, n=50))
ColorLegend(x=0.6,  y=1, width=0.1, col=Pal(5, n=50))
ColorLegend(x=0.75, y=1, width=0.1, col=Pal(6, n=50))
ColorLegend(x=0.9,  y=1, width=0.1, col=Pal(7))
ColorLegend(x=1.05, y=1, width=0.1, col=Pal(8))
text(1:8, y=1.05, x=seq(0, 1.05, .15) + .05)
title(main="DescTools palettes")

par(mfrow=c(4,2), mar=c(1,1,2,1))
barplot(1:9, col=Pal("Tibco"), axes=FALSE, main="Palette 'Tibco'")
barplot(1:7, col=Pal("Helsana"), axes=FALSE, main="Palette 'Helsana'")
barplot(1:7, col=SetAlpha(Pal("Helsana")[c("ecru","hellgruen","hellblau")], 0.6),
        axes=FALSE, main="Palette 'Helsana' (Alpha)")
barplot(1:10, col=Pal("RedToBlack", 10), axes=FALSE, main="Palette 'RedToBlack'")
barplot(1:10, col=Pal("RedBlackGreen", 10), axes=FALSE, main="Palette 'RedBlackGreen'")
barplot(1:10, col=Pal("SteeblueWhite", 10), axes=FALSE, main="Palette 'SteeblueWhite'")
barplot(1:10, col=Pal("RedWhiteGreen", 10), axes=FALSE, main="Palette 'RedWhiteGreen'")
Get and set a variety of options which affect the way in which DescTools functions display results.
DescToolsOptions(..., default = NULL, reset = FALSE)
... |
any options can be defined, using name = value pairs. |
default |
if the specified option is not set in the options list, this value is returned. This facilitates retrieving an option and checking whether it is set and setting it separately if not. |
reset |
logical. If this is set to TRUE, all options are reset to their default values. |
Invoking DescToolsOptions()
with no arguments returns a list with the current values of the options. Note that not all options listed below are set initially. To access the value of a single option, one can simply use DescToolsOptions("plotit")
.
To set a new value use the same rationale as with the R options: DescToolsOptions(plotit=FALSE)
Options used by DescTools
col
:a vector of colours, defined as names or as RGB-longs ("#RRGGBB"
). Currently, three colors are used as defaults in several plots. By default they're set to hblue
, hred
and horange
. Change the values by defining DescToolsOptions(col=c("pink", "blue", "yellow"))
. Any color definition can be used here.
digits
:the number of FIXED digits, used throughout the print functions.
fixedfont
:this font will be used by default, when Desc
writes
to a Word document. Must be defined as a font object, say enumerating name
, face
and size
of the font and setting the class font
, e.g. structure(list(name="Courier New", size=7), class="font")
.
fmt
:Three number format definitions are currently used in the Desc
routines. The format used for integer values is named "abs"
, for percentages "perc"
and for floating point numeric values "num"
.
The format definitions must be of class "fmt"
and may contain any argument used in the function Format
.
Use Fmt
to access and update formats (as they are organised in a nested list).
footnote
:a character vector, containing characters to be used as footnote signs.
Any character can be defined here. This is currently used by TOne
.
lang
:either "engl"
or "local"
, defining the language to be used for the names of weekdays and months when using Format
.
plotit
:logical, defining whether the Desc
-procedures should produce plots by default. This is usually a good thing, but it may clutter up your desktop, if you're not using RStudio. Therefore it can be turned off.
stamp
:text or expression to be placed in the right bottom corner of the DescTools
plots. This can be useful, if some author or date information should automatically be inserted by default. Any text can be set as option, but also dynamic expressions can be used. The default would use an expression as <username>/<date>, which will use the username from the system and the current date. See defaults below.
Calling DescToolsOptions(reset=TRUE)
will reset the options to these defaults:
options(DescTools = list(
  col       = c(hblue="#8296C4", hred="#9A0941", horange="#F08100"),
  digits    = 3,
  fixedfont = structure(list(name = "Consolas", size = 7), class = "font"),
  fmt       = list(abs = structure(list(digits = 0, big.mark = "'"),
                                   name = "abs", label = "Number format for counts",
                                   default = TRUE, class = "fmt"),
                   per = structure(list(digits = 1, fmt = "%"),
                                   name = "per", label = "Percentage number format",
                                   default = TRUE, class = "fmt"),
                   num = structure(list(digits = 3, big.mark = "'"),
                                   name = "num", label = "Number format for floats",
                                   default = TRUE, class = "fmt")),
  footnote  = c("'", "\"", "\"\""),
  lang      = "engl",
  plotit    = TRUE,
  stamp     = expression(gettextf("%s/%s", Sys.getenv("USERNAME"),
                                  Format(Today(), fmt = "yyyy-mm-dd")))
))
This code can also be copied and pasted into the user's RProfile
file, in order to have the options permanently available.
For a given vector of option names, the current value set for option x
, or NULL
if the option is unset.
If called with no arguments, returns all option settings in a list. Otherwise, it changes the named settings and invisibly returns their previous values.
Andri Signorell <[email protected]>
DescToolsOptions("plotit") ## Not run: # Get all options, defaults are attributed as such DescToolsOptions() # get some options DescToolsOptions("plotit", "lang") # get some potentially undefined option, while taking a user default and # overriding system defaults DescToolsOptions("stamp", default="Condor, 2016") # get an undefined option, should return default DescToolsOptions("stampede", default="Condor, 2016") # set options, while getting the old values opt <- DescToolsOptions(plotit=789, lang="portugues") DescToolsOptions() # output the old values opt # just a single argument DescToolsOptions(digits=2) # reset the old values DescToolsOptions(opt) DescToolsOptions() # reset factory defaults DescToolsOptions(reset=TRUE) ## End(Not run)
DescToolsOptions("plotit") ## Not run: # Get all options, defaults are attributed as such DescToolsOptions() # get some options DescToolsOptions("plotit", "lang") # get some potentially undefined option, while taking a user default and # overriding system defaults DescToolsOptions("stamp", default="Condor, 2016") # get an undefined option, should return default DescToolsOptions("stampede", default="Condor, 2016") # set options, while getting the old values opt <- DescToolsOptions(plotit=789, lang="portugues") DescToolsOptions() # output the old values opt # just a single argument DescToolsOptions(digits=2) # reset the old values DescToolsOptions(opt) DescToolsOptions() # reset factory defaults DescToolsOptions(reset=TRUE) ## End(Not run)
Calculate the digit sum of a number x.
DigitSum(x)
x |
an integer number |
the digit sum
Andri Signorell <[email protected]>, based on code by Julius, benchmarked by Uwe
URL: https://stackoverflow.com/questions/18675285/digit-sum-function-in-r
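For illustration, the digit sum of a single number can also be computed by hand with integer arithmetic. The following is a minimal sketch (the helper name digit_sum_scalar is made up here and is not the package's implementation):

# digit sum of a single non-negative integer via repeated division by 10
digit_sum_scalar <- function(n) {
  s <- 0
  while (n > 0) {
    s <- s + n %% 10   # take the last digit
    n <- n %/% 10      # drop the last digit
  }
  s
}
digit_sum_scalar(268)  # 16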
DigitSum(c(124, 45, 268)) # [1] 7 9 16
Calculates Rao's diversity coefficient (also known as "Quadratic Entropy") within samples.
DivCoef(df, dis, scale)
df |
a data frame with elements as rows, samples as columns, and abundance, presence-absence or frequencies as entries |
dis |
an object of class |
scale |
a logical value indicating whether or not the diversity coefficient should be scaled by its maximal value over all frequency distributions. |
Returns a data frame with samples as rows and the diversity coefficient within samples as columns
This function was previously published as divc()
in the ade4 package and has been integrated here without logical changes.
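Conceptually, Rao's quadratic entropy for a single sample is the expected dissimilarity between two elements drawn at random according to the sample frequencies. The following rough sketch illustrates the idea for one sample; the helper name RaoQ is made up, the squared-dissimilarity/2 form assumed here is the one commonly used with Euclidean metrics, and this is not the exact divc()/DivCoef() code (which also handles several samples and optional scaling):

# rough sketch: Rao's quadratic entropy for ONE sample
# freq: abundances of the elements, dis: 'dist' object of dissimilarities
RaoQ <- function(freq, dis) {
  p <- freq / sum(freq)           # normalise abundances to relative frequencies
  d <- as.matrix(dis)             # full dissimilarity matrix
  drop(t(p) %*% (d^2 / 2) %*% p)  # quadratic form: sum_ij p_i p_j d_ij^2 / 2
}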
Sandrine Pavoine [email protected]
Rao, C.R. (1982) Diversity and dissimilarity coefficients: a unified approach. Theoretical Population Biology, 21, 24–43.
Gini, C. (1912) Variabilita e mutabilita. Universite di Cagliari III, Parte II.
Simpson, E.H. (1949) Measurement of diversity. Nature, 163, 688.
Champely, S. and Chessel, D. (2002) Measuring biological diversity using Euclidean metrics. Environmental and Ecological Statistics, 9, 167–177.
# data(ecomor) # dtaxo <- dist.taxo(ecomor$taxo) # DivCoef(ecomor$habitat, dtaxo) # data(humDNAm) # DivCoef(humDNAm$samples, sqrt(humDNAm$distances))
For a given dissimilarity matrix, this function calculates the maximal value of Rao's diversity coefficient over all frequency distribution. It uses an optimization technique based on Rosen's projection gradient algorithm and is verified using the Kuhn-Tucker conditions.
DivCoefMax(dis, epsilon, comment)
dis |
an object of class |
epsilon |
a tolerance threshold : a frequency is non null if it is higher than epsilon. |
comment |
a logical value indicating whether or not comments on the optimization technique should be printed. |
Returns a list
value |
the maximal value of Rao's diversity coefficient. |
vectors |
a data frame containing four frequency
distributions : sim is a simple distribution which is equal
to |
Stéphane Champely <[email protected]>
Sandrine Pavoine <[email protected]>
Rao, C.R. (1982) Diversity and dissimilarity coefficients: a unified approach. Theoretical Population Biology, 21, 24–43.
Gini, C. (1912) Variabilita e mutabilita. Universite di Cagliari III, Parte II.
Simpson, E.H. (1949) Measurement of diversity. Nature, 163, 688.
Champely, S. and Chessel, D. (2002) Measuring biological diversity using Euclidean metrics. Environmental and Ecological Statistics, 9, 167–177.
Pavoine, S., Ollier, S. and Pontier, D. (2005) Measuring diversity from dissimilarities with Rao's quadratic entropy: are any dissimilarities suitable? Theoretical Population Biology, 67, 231–239.
## Not run: par.safe <- par()$mar data(elec88) par(mar = c(0.1, 0.1, 0.1, 0.1)) # Departments of France. area.plot(elec88$area) # Dissimilarity matrix. d0 <- dist(elec88$xy) # Frequency distribution maximizing spatial diversity in France # according to Rao's quadratic entropy. France.m <- DivCoefMax(d0) w0 <- France.m$vectors$num v0 <- France.m$value (1:94) [w0 > 0] # Smallest circle including all the 94 departments. # The squared radius of that circle is the maximal value of the # spatial diversity. w1 = elec88$xy[c(6, 28, 66), ] w.c = apply(w1 * w0[c(6, 28, 66)], 2, sum) symbols(w.c[1], w.c[2], circles = sqrt(v0), inc = FALSE, add = TRUE) s.value(elec88$xy, w0, add.plot = TRUE) par(mar = par.safe) # Maximisation of Rao's diversity coefficient # with ultrametric dissimilarities. data(microsatt) mic.genet <- count2genet(microsatt$tab) mic.dist <- dist.genet(mic.genet, 1) mic.phylog <- hclust2phylog(hclust(mic.dist)) plot.phylog(mic.phylog) mic.maxpond <- DivCoefMax(mic.phylog$Wdist)$vectors$num dotchart.phylog(mic.phylog, mic.maxpond) ## End(Not run)
Calculate divisors of positive natural numbers.
Divisors(x)
x |
integer number for which the divisors are to be returned |
Divisibility is a mathematical relationship between two integers. An integer is divisible by another integer if there is no remainder in the division. The number 11 has only two divisors: 1 and the number 11 itself, whereas the number 12 has many divisors: 1, 2, 3, 4, 6 and 12.
In elementary number theory, the concept of divisibility is limited to natural numbers.
The number of its divisors can be determined with the function length()
.
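As an illustration of the concept, the divisors of a single positive integer can be found with a plain modulo check. This minimal sketch (helper name made up) is not necessarily how Divisors() is implemented:

# all divisors of one positive integer n: keep the candidates that leave no remainder
divisors_simple <- function(n) (1:n)[n %% (1:n) == 0]
divisors_simple(12)          # 1 2 3 4 6 12
length(divisors_simple(12))  # 6 divisors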
an integer vector containing the divisors
Andri Signorell <[email protected]>
Divisors(c(145, 786))
Split the vector x into partitions and apply the function to each partition separately. Computation restarts for each partition.
The logic is the same as the OLAP functions in SQL, e.g. SUM(x) OVER (PARTITION BY group)
.
DoBy(x, ...) ## S3 method for class 'formula' DoBy(formula, data = parent.frame(), subset, na.action, vnames = NULL, ...) ## Default S3 method: DoBy(x, by, FUN, vnames = NULL, collapse = FALSE, ...)
x |
a vector that should be operated. |
by |
list of one or more factors, each of same length as |
FUN |
Function to apply for each factor level combination. |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain |
vnames |
name for the new variables. |
collapse |
logical, determining if the results should be collapsed to groups. Default is |
... |
optional arguments to |
This is more or less the same as the function ave
, with the arguments organized a bit differently and offering more flexibility.
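For a single grouping factor the correspondence with ave can be sketched as follows (a minimal illustration with made-up data):

# groupwise sums attached to every row, once with ave() and once with DoBy()
d <- data.frame(x = c(1, 2, 3, 4), g = factor(c("a", "a", "b", "b")))
ave(d$x, d$g, FUN = sum)    # base R: 3 3 7 7
DoBy(d$x, d$g, FUN = sum)   # same groupwise sums, returned together with the group factor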
a data.frame with the same number of rows as the length of x
containing the groupwise results of FUN
and the used group factors.
The attribute response
denotes the name of the response variable in case the formula interface was used.
Optional arguments to FUN
supplied by the ... argument are not divided into cells. It is therefore inappropriate for FUN
to expect additional arguments with the same length as x
.
Andri Signorell <[email protected]>
d.frm <- data.frame(x=rep(1:4,3), v=sample(x=1:3, size=12, replace=TRUE), g=gl(4,3,labels=letters[1:4]), m=gl(3,4,labels=LETTERS[1:3])) # SQL-OLAP: sum() over (partition by g) DoBy(d.frm$x, d.frm$g, FUN=sum) # DoBy(d.frm$x, FUN=sum) # more than 1 grouping variables are organized as list as in tapply: DoBy(d.frm$x, list(d.frm$g, d.frm$m), mean) # count d.frm$count <- DoBy(d.frm$x, d.frm$g, length) # rank d.frm$rank <- DoBy(d.frm$v, d.frm$g, rank) d.frm$dense_rank <- DoBy(d.frm$v, d.frm$g, Rank, ties.method="dense") d.frm$rank_desc <- DoBy(d.frm$x, d.frm$g, function(x) rank(-x)) # row_number d.frm$row_number <- DoBy(d.frm$v, d.frm$g, function(x) order(x)) d.frm
do.call
The do.call
can be somewhat slow,
especially when working with large objects. This function
is based upon the suggestions from Hadley Wickham on the R
mailing list (reference not available anymore).
Also thanks to Tommy at StackOverflow for
suggesting
how to handle double and triple colon operators, ::
,
further enhancing the function.
DoCall(what, args, quote = FALSE, envir = parent.frame())
what |
either a function or a non-empty character string naming the function to be called. |
args |
a list of arguments to the function
call. The |
quote |
a logical value indicating whether to quote the arguments. |
envir |
an environment within which to evaluate the
call. This will be most useful if |
While the function attempts to do most of what
do.call
can it has limitations. It
can currently not parse the example code from the
original function: do.call(paste,
list(as.name("A"), as.name("B")), quote = TRUE)
and the
functionality of quote
has not been thoroughly
tested.
This is a verbatim copy from Gmisc::fastDoCall.
Max Gordon <[email protected]>
DoCall("complex", list(imaginary = 1:3)) ## if we already have a list (e.g. a data frame) ## we need c() to add further arguments tmp <- expand.grid(letters[1:2], 1:3, c("+", "-")) DoCall("paste", c(tmp, sep = "")) ## examples of where objects will be found. A <- 2 f <- function(x) print(x^2) env <- new.env() assign("A", 10, envir = env) assign("f", f, envir = env) f <- function(x) print(x) f(A) # 2 DoCall("f", list(A)) # 2 DoCall("f", list(A), envir = env) # 4 DoCall(f, list(A), envir = env) # 2 DoCall("f", list(quote(A)), envir = env) # 100 DoCall(f, list(quote(A)), envir = env) # 10 DoCall("f", list(as.name("A")), envir = env) # 100 eval(call("f", A)) # 2 eval(call("f", quote(A))) # 2 eval(call("f", A), envir = env) # 4 eval(call("f", quote(A)), envir = env) # 100
DoCall("complex", list(imaginary = 1:3)) ## if we already have a list (e.g. a data frame) ## we need c() to add further arguments tmp <- expand.grid(letters[1:2], 1:3, c("+", "-")) DoCall("paste", c(tmp, sep = "")) ## examples of where objects will be found. A <- 2 f <- function(x) print(x^2) env <- new.env() assign("A", 10, envir = env) assign("f", f, envir = env) f <- function(x) print(x) f(A) # 2 DoCall("f", list(A)) # 2 DoCall("f", list(A), envir = env) # 4 DoCall(f, list(A), envir = env) # 2 DoCall("f", list(quote(A)), envir = env) # 100 DoCall(f, list(quote(A)), envir = env) # 10 DoCall("f", list(as.name("A")), envir = env) # 100 eval(call("f", A)) # 2 eval(call("f", quote(A))) # 2 eval(call("f", A), envir = env) # 4 eval(call("f", quote(A)), envir = env) # 100
'dot' or 'scalar' product of vectors or pairwise columns of matrices.
Dot(x, y)
x |
numeric vector or matrix |
y |
numeric vector or matrix |
Returns the 'dot' or 'scalar' product of vectors or columns of matrices.
Two vectors must be of same length, two matrices must be of
the same size.
If x
and y
are column or row vectors, their dot product
will be computed as if they were simple vectors.
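In other words, for plain vectors the result corresponds to sum(x * y), and for matrices to the column-wise sums of the elementwise products, for example:

x <- 1:5; y <- 5:1
sum(x * y)        # 35, the same value as Dot(x, y)
m <- cbind(a = 1:3, b = 4:6)
colSums(m * m)    # column-wise dot products (what Dot(m, m) is documented to return)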
A scalar, or a vector whose length equals the number of columns of x
and
y
.
Hans W. Borchers <[email protected]>
Dot(1:5, 1:5)                    #=> 55

# Length of the space diagonal in a 3-dim cube:
sqrt(Dot(c(1,1,1), c(1,1,1)))    #=> 1.732051
Draw one or more elliptic (or circular) arcs from theta.1
to theta.2
on an existing plot using classic graphics.
DrawArc(x = 0, y = x, rx = 1, ry = rx, theta.1 = 0, theta.2 = 2*pi, nv = 100, col = par("col"), lty = par("lty"), lwd = par("lwd"), plot = TRUE)
x , y
|
a vector (or scalar) of xy-coordinates of the center(s) of the arc(s). |
rx |
a scalar or a vector giving the semi-major axis of the ellipse for the arc(s) |
ry |
a scalar or a vector giving the semi-minor axis of the ellipse for the arc(s). Default is radius.x which will result in a circle arc with radius.x. |
theta.1 |
a scalar or a vector of starting angles in radians. |
theta.2 |
a scalar or a vector of ending angles in radians. |
nv |
number of vertices used to plot the arc. Scalar or vector. |
col |
color for the arc(s). Scalar or vector. |
lty |
line type used for drawing. |
lwd |
line width used for drawing. |
plot |
logical. If |
All parameters are recycled if necessary.
Be sure to use an aspect ratio of 1 as shown in the example to avoid distortion.
DrawArc
invisibly returns a list of the calculated coordinates for all shapes.
Andri Signorell <[email protected]>
curve(sin(x), 0, pi, col="blue", asp=1)
DrawArc(x = pi/2, y = 0, rx = 1, theta.1 = pi/4, theta.2 = 3*pi/4, col="red")
Draw a band using a simple syntax. Just a wrapper for the function polygon()
typically used to draw confidence bands.
DrawBand(x, y, col = SetAlpha("grey", 0.5), border = NA)
x |
a vector or a matrix with x coordinates for the band. If x is given as matrix it must be a |
y |
a vector or a matrix with y coordinates for the band. If y is given as matrix it must be a |
col |
the color of the band. |
border |
the border color of the band. |
Andri Signorell <[email protected]>
set.seed(18) x <- rnorm(15) y <- x + rnorm(15) new <- seq(-3, 3, 0.5) pred.w.plim <- predict(lm(y ~ x), newdata=data.frame(x=new), interval="prediction") pred.w.clim <- predict(lm(y ~ x), newdata=data.frame(x=new), interval="confidence") plot(y ~ x) DrawBand(y = c(pred.w.plim[,2], rev(pred.w.plim[,3])), x=c(new, rev(new)), col= SetAlpha("grey90", 0.5)) # passing y as matrix interface allows more intuitive arguments DrawBand(y = pred.w.clim[, 2:3], x = new, col= SetAlpha("grey80", 0.5)) abline(lm(y~x), col="brown")
Draw a Bezier curve.
DrawBezier(x = 0, y = x, nv = 100, col = par("col"), lty = par("lty"), lwd = par("lwd"), plot = TRUE)
x , y
|
a vector of xy-coordinates to define the Bezier curve. Should at least contain 3 points. |
nv |
number of vertices to draw the curve. |
col |
color(s) for the curve. Default is |
lty |
line type for borders and shading; defaults to |
lwd |
line width for borders and shading. |
plot |
logical. If |
Bezier curves appear in such areas as mechanical computer aided design (CAD). They are named after P. Bezier, who used a closely related representation in Renault's UNISURF CAD system in the early 1960s (similar, unpublished, work was done by P. de Casteljau at Citroen in the late 1950s and early 1960s). The 1970s and 1980s saw a flowering of interest in Bezier curves, with many CAD systems using them, and many important developments in their theory. The usefulness of Bezier curves resides in their many geometric and analytical properties. There are elegant and efficient algorithms for evaluation, differentiation, subdivision of the curves, and conversion to other useful representations. (See: Farin, 1993)
DrawBezier
invisibly returns a list of the calculated coordinates for all shapes.
Frank E Harrell Jr <[email protected]>
G. Farin (1993) Curves and surfaces for computer aided geometric design. A practical guide, Acad. Press
polygon
, DrawRegPolygon
, DrawCircle
, DrawArc
Canvas(xlim=c(0,1)) grid() DrawBezier( x=c(0,0.5,1), y=c(0,0.5,0), col="blue", lwd=2) DrawBezier( x=c(0,0.5,1), y=c(0,1,0), col="red", lwd=2) DrawBezier( x=c(0,0.25,0.5,0.75,1), y=c(0,1,1,1,0), col="darkgreen", lwd=2)
Draw one or several circle on an existing plot.
DrawCircle(x = 0, y = x, r.out = 1, r.in = 0, theta.1 = 0, theta.2 = 2*pi, border = par("fg"), col = NA, lty = par("lty"), lwd = par("lwd"), nv = 100, plot = TRUE)
x , y
|
a vector (or scalar) of xy-coordinates for the center(s) of the circle(s). |
r.out |
a vector (or scalar) of the outer radius of the circle. |
r.in |
a vector (or scalar) of a potential inner radius of an annulus. |
theta.1 |
a vector (or scalar) of the starting angle(s). The sectors are built counterclockwise. |
theta.2 |
a vector (or scalar) of the ending angle(s). |
nv |
number of vertices to draw the circle. |
border |
color for circle borders. The default is par("fg"). Use border = |
col |
color(s) to fill or shade the circle(s) with. The default |
lty |
line type for borders and shading; defaults to |
lwd |
line width for borders and shading. |
plot |
logical. If |
All geometric arguments will be recycled.
The function invisibly returns a list of the calculated coordinates for all shapes.
Andri Signorell <[email protected]>
polygon
, DrawRegPolygon
, DrawEllipse
, DrawArc
Canvas(xlim = c(-5,5), xpd=TRUE) cols <- Pal("Helsana")[1:4] # Draw ring DrawCircle (r.in = 1, r.out = 5, border="darkgrey", col=SetAlpha(DescTools::hyellow, 0.2), lwd=2) # Draw circle DrawCircle (r.in = 6, border=DescTools::hgreen, lwd=3) # Draw sectors geom <- rbind(c(-pi, 0, .25, .5), c(0, pi, 1, 2), c(-pi/2, pi/2, 2, 2.5), c(pi/2, 3 * pi/2, 3, 4), c(pi - pi/8, pi + pi/8, 1.5, 2.5)) DrawCircle (r.in = geom[,3], r.out = geom[,4], theta.1 = geom[,1], theta.2 = geom[,2], col = SetAlpha(cols, 0.6), border = cols, lwd=1) # clipping Canvas(bg="lightgrey", main="Yin ~ Yang") DrawCircle (r.out = 1, col="white") clip(0, 2, 2, -2) DrawCircle(col="black") clip(-2, 2, 2, -2) DrawCircle (y = c(-0.5,0.5), r.out = 0.5, col=c("black", "white"), border=NA) DrawCircle (y = c(-0.5,0.5), r.out = 0.1, col=c("white", "black"), border=NA) DrawCircle () # overplotting circles Canvas(xlim=c(-5,5)) DrawCircle (r.out=4:1, col=c("white", "steelblue2", "white", "red"), lwd=3, nv=300) # rotation x <- seq(-3, 3, length.out=10) y <- rep(0, length.out=length(x)) Canvas(xlim=c(-5,5), bg="black") sapply( (0:11) * pi/6, function(theta) { xy <- Rotate(x, y=y, theta=theta) DrawCircle (x=xy$x, y=xy$y, r.in=2.4, border=SetAlpha("white", 0.2)) } )
Draw one or several ellipses on an existing plot.
DrawEllipse(x = 0, y = x, radius.x = 1, radius.y = 0.5, rot = 0, nv = 100, border = par("fg"), col = par("bg"), lty = par("lty"), lwd = par("lwd"), plot = TRUE)
x , y
|
the x and y co-ordinates for the centre(s) of the ellipse(s). |
radius.x |
a scalar or a vector giving the semi-major axis of the ellipse. |
radius.y |
a scalar or a vector giving the semi-minor axis of the ellipse. |
rot |
angle of rotation in radians. |
nv |
number of vertices to draw the ellipses. |
border |
color for borders. The default is |
col |
color(s) to fill or shade the annulus sector with. The default |
lty |
line type for borders and shading; defaults to |
lwd |
line width for borders and shading. |
plot |
logical. If |
Use DegToRad
if you want to define rotation angle in degrees.
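For example, a rotation of 45 degrees could be specified like this (a minimal sketch):

# rotate an ellipse by 45 degrees, converting to radians with DegToRad()
Canvas()
DrawEllipse(rot = DegToRad(45), col = SetAlpha("steelblue", 0.4))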
The function invisibly returns a list of the calculated coordinates for all shapes.
Andri Signorell <[email protected]>
polygon
, DrawRegPolygon
, DrawCircle
, DrawArc
par(mfrow=c(1,2)) Canvas() DrawEllipse(rot = c(1:3) * pi/3, col=SetAlpha(c("blue","red","green"), 0.5) ) plot(cars) m <- var(cars) eig <- eigen(m) eig.val <- sqrt(eig$values) eig.vec <- eig$vectors DrawEllipse(x=mean(cars$speed), y=mean(cars$dist), radius.x=eig.val[1] , radius.y=eig.val[2] , rot=acos(eig.vec[1,1]), border="blue", lwd=3)
Draw a regular polygon with n corners. This is the workhorse function for drawing regular polygons. Drawing a circle can be done by setting the vertices to a value of say 100.
DrawRegPolygon(x = 0, y = x, radius.x = 1, radius.y = radius.x, rot = 0, nv = 3, border = par("fg"), col = par("bg"), lty = par("lty"), lwd = par("lwd"), plot = TRUE)
x , y
|
a vector (or scalar) of xy-coordinates of the center(s) of the regular polygon(s). |
radius.x |
a scalar or a vector giving the semi-major axis of the ellipse for the polygon(s). |
radius.y |
a scalar or a vector giving the semi-minor axis of the ellipse for the polygon(s). Default is radius.x which will result in a polygon with radius.x. |
rot |
angle of rotation in radians. |
nv |
number of vertices to draw the polygon(s). |
border |
color for borders. The default is |
col |
color(s) to fill or shade the shape with. The default |
lty |
line type for borders and shading; defaults to |
lwd |
line width for borders and shading. |
plot |
logical. If |
All geometric arguments will be recycled.
The function invisibly returns a list of the calculated coordinates for all shapes.
Andri Signorell <[email protected]>
# Draw 4 triangles (nv = 3) with different rotation angles plot(c(0,1),c(0,1), asp=1, type="n", xaxt="n", yaxt="n", xlab="", ylab="") DrawRegPolygon(x = 0.5, y = 0.5, rot = (1:4)*pi/6, radius.x = 0.5, nv = 3, col = SetAlpha("yellow",0.5)) # Draw several polygons plot(c(0,1),c(0,1), asp=1, type="n", xaxt="n", yaxt="n", xlab="", ylab="") DrawRegPolygon(x = 0.5, y = 0.5, radius.x=seq(50, 5, -10) * 1 /100, rot=0, nv = c(50, 10, 7, 4, 3), col=SetAlpha("blue",seq(0.2,0.7,0.1))) # Combine several polygons by sorting the coordinates # Calculate the xy-points for two concentric pentagons d.pts <- do.call("rbind", lapply(DrawRegPolygon(radius.x=c(1,0.38), nv=5, rot=c(pi/2, pi/2+pi/5), plot=FALSE ), data.frame)) # prepare plot plot(c(-1,1),c(-1,1), asp=1, type="n", xaxt="n", yaxt="n", xlab="", ylab="") # .. and draw the polygon with reordered points polygon( d.pts[order(rep(1:6, times=2), rep(1:2, each=6)), c("x","y")], col="yellow") # Move the center plot(c(0,1),c(0,1), asp=1, type="n", xaxt="n", yaxt="n", xlab="", ylab="") theta <- seq(0, pi/6, length.out=5) xy <- PolToCart( exp(theta) /2, theta) DrawRegPolygon(x=xy$x, y=xy$y + 0.5, radius.x=seq(0.5, 0.1, -0.1), nv=4, rot=seq(0, pi/2, length.out=5), col=rainbow(5) ) # Plot a polygon with a "hole" plot(c(-1,1),c(-1,1), asp=1, type="n", xaxt="n", yaxt="n", xlab="", ylab="") DrawRegPolygon(nv = 4, rot=pi/4, col="red" ) text(x=0,y=0, "Polygon", cex=6, srt=45) # Calculate circle and hexagon, but do not plot pts <- DrawRegPolygon(radius.x=c(0.7, 0.5), nv = c(100, 6), plot=FALSE ) # combine the 2 shapes and plot the new structure polygon(x = unlist(lapply(pts, "[", "x")), y=unlist(lapply(pts, "[", "y")), col="green", border=FALSE)
Generate a matrix of dummy codes (class indicators) for a given factor.
Dummy(x, method = c("treatment", "sum", "helmert", "poly", "full"), base = 1, levels = NULL)
x |
factor or vector of classes for cases. |
method |
defines the method of the contrasts being formed. Can be one
out of |
base |
an integer specifying which group is considered the baseline group. |
levels |
an optional vector of the values (as character strings) that x might have taken. |
For reverting dummy codes see the approach in the examples below.
a matrix with the dummy codes.
The number of rows correspond to the number of elements in x
and the number of columns to the number of its levels minus 1 (or, if levels is given, to the number of supplied levels minus 1).
When method = "full"
is chosen the number of columns will correspond to the number of levels.
Andri Signorell <[email protected]>
Venables, W N and Ripley, B D (2002): Modern Applied Statistics with S. Fourth edition. Springer.
model.frame
, contrasts
, class.ind
in the package nnet
x <- c("red","blue","green","blue","green","red","red","blue") Dummy(x) Dummy(x, base=2) Dummy(x, method="sum") y <- c("Max","Max","Max","Max","Max","Bill","Bill","Bill") Dummy(y) Dummy(y, base="Max") Dummy(y, base="Max", method="full") # "Undummy" (revert the dummy coding) m <- Dummy(y, method="full") m z <- apply(m, 1, function(x) colnames(m)[x==1]) z identical(y, as.vector(z)) m <- Dummy(y) m z <- apply(m, 1, function(x) ifelse(sum(x)==0, attr(m,"base"), colnames(m)[x==1])) z
x <- c("red","blue","green","blue","green","red","red","blue") Dummy(x) Dummy(x, base=2) Dummy(x, method="sum") y <- c("Max","Max","Max","Max","Max","Bill","Bill","Bill") Dummy(y) Dummy(y, base="Max") Dummy(y, base="Max", method="full") # "Undummy" (revert the dummy coding) m <- Dummy(y, method="full") m z <- apply(m, 1, function(x) colnames(m)[x==1]) z identical(y, as.vector(z)) m <- Dummy(y) m z <- apply(m, 1, function(x) ifelse(sum(x)==0, attr(m,"base"), colnames(m)[x==1])) z
Performs Dunnett's test for comparing several treatments with a control.
DunnettTest(x, ...) ## Default S3 method: DunnettTest(x, g, control = NULL, conf.level = 0.95, ...) ## S3 method for class 'formula' DunnettTest(formula, data, subset, na.action, ...)
x |
a numeric vector of data values, or a list of numeric data vectors. |
g |
a vector or factor object giving the group for the
corresponding elements of |
control |
the level of the control group against which the others should be tested. If there are multiple levels the calculation will be performed for every one. |
conf.level |
confidence level of the interval. |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see
|
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when
the data contain |
... |
further arguments to be passed to or from methods. |
DunnettTest
does the post hoc pairwise multiple comparisons procedure.
If x
is a list, its elements are taken as the samples to be
compared, and hence have to be numeric data vectors. In this case,
g
is ignored, and one can simply use DunnettTest(x)
to perform the test. If the samples are not yet contained in a
list, use DunnettTest(list(x, ...))
.
Otherwise, x
must be a numeric data vector, and g
must
be a vector or factor object of the same length as x
giving
the group for the corresponding elements of x
.
A list of class c("PostHocTest")
, containing one matrix named after the control with columns diff
giving the difference in the observed means, lwr.ci
giving the lower end point of the interval, upr.ci
giving the upper end point and pval
giving the p-value after adjustment for the multiple comparisons.
There are print and plot methods for class "PostHocTest"
. The plot method does not accept xlab
, ylab
or main
arguments and creates its own values for each plot.
Andri Signorell <[email protected]>, the interface is based on R-Core code
Dunnett C. W. (1955) A multiple comparison procedure for comparing several treatments with a control, Journal of the American Statistical Association, 50:1096-1121.
## Hollander & Wolfe (1973), 116. ## Mucociliary efficiency from the rate of removal of dust in normal ## subjects, subjects with obstructive airway disease, and subjects ## with asbestosis. x <- c(2.9, 3.0, 2.5, 2.6, 3.2) # normal subjects y <- c(3.8, 2.7, 4.0, 2.4) # with obstructive airway disease z <- c(2.8, 3.4, 3.7, 2.2, 2.0) # with asbestosis DunnettTest(list(x, y, z)) ## Equivalently, x <- c(x, y, z) g <- factor(rep(1:3, c(5, 4, 5)), labels = c("Normal subjects", "Subjects with obstructive airway disease", "Subjects with asbestosis")) DunnettTest(x, g) ## Formula interface boxplot(Ozone ~ Month, data = airquality) DunnettTest(Ozone ~ Month, data = airquality) DunnettTest(Ozone ~ Month, data = airquality, control="8", conf.level=0.9)
Performs Dunn's test of multiple comparisons using rank sums.
DunnTest(x, ...) ## Default S3 method: DunnTest(x, g, method = c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"), alternative = c("two.sided", "less", "greater"), out.list = TRUE, ...) ## S3 method for class 'formula' DunnTest(formula, data, subset, na.action, ...) ## S3 method for class 'DunnTest' print(x, digits = getOption("digits", 3), ...)
x |
a numeric vector of data values, or a list of numeric data vectors. |
g |
a vector or factor object giving the group for the
corresponding elements of |
method |
the method for adjusting p-values for multiple comparisons. The function is calling |
alternative |
a character string specifying the alternative hypothesis, must be one of |
out.list |
logical, indicating if the results should be printed in list mode or as a square matrix. Default is list (TRUE). |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see
|
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when
the data contain |
digits |
controls the number of fixed digits to print. |
... |
further arguments to be passed to or from methods. |
DunnTest
performs the post hoc pairwise multiple comparisons procedure appropriate to follow the rejection of a Kruskal-Wallis test. The Kruskal-Wallis test, being a non-parametric analog of the one-way ANOVA, is an omnibus test of the null hypothesis that none of k groups stochastically dominate one another.
Dunn's test is constructed in part by summing jointly ranked data. The rank sum test, itself a non-parametric analog of the unpaired t-test, is possibly intuitive, but inappropriate as a post hoc pairwise test, because (1) it fails to retain the dependent ranking that produced the Kruskal-Wallis test statistic, and (2) it does not incorporate the pooled variance estimate implied by the null hypothesis of the Kruskal-Wallis test.
If x
is a list, its elements are taken as the samples to be
compared, and hence have to be numeric data vectors. In this case,
g
is ignored, and one can simply use DunnTest(x)
to perform the test. If the samples are not yet contained in a
list, use DunnTest(list(x, ...))
.
Otherwise, x
must be a numeric data vector, and g
must
be a vector or factor object of the same length as x
giving
the group for the corresponding elements of x
.
A list with class "DunnTest"
containing the following components:
res |
an array containing the mean rank differences and the corresponding p-values |
Andri Signorell <[email protected]>, the interface is based on R-Core code
Dunn, O. J. (1961) Multiple comparisons among means Journal of the American Statistical Association, 56(293):52-64.
Dunn, O. J. (1964) Multiple comparisons using rank sums Technometrics, 6(3):241-252.
kruskal.test
, wilcox.test
, p.adjust
## Hollander & Wolfe (1973), 116. ## Mucociliary efficiency from the rate of removal of dust in normal ## subjects, subjects with obstructive airway disease, and subjects ## with asbestosis. x <- c(2.9, 3.0, 2.5, 2.6, 3.2) # normal subjects y <- c(3.8, 2.7, 4.0, 2.4) # with obstructive airway disease z <- c(2.8, 3.4, 3.7, 2.2, 2.0) # with asbestosis DunnTest(list(x, y, z)) ## Equivalently, x <- c(x, y, z) g <- factor(rep(1:3, c(5, 4, 5)), labels = c("Normal subjects", "Subjects with obstructive airway disease", "Subjects with asbestosis")) # do the kruskal.test first kruskal.test(x, g) # ...and the pairwise test afterwards DunnTest(x, g) ## Formula interface. boxplot(Ozone ~ Month, data = airquality) DunnTest(Ozone ~ Month, data = airquality)
Performs the Durbin-Watson test for autocorrelation of disturbances.
DurbinWatsonTest(formula, order.by = NULL, alternative = c("greater", "two.sided", "less"), iterations = 15, exact = NULL, tol = 1e-10, data = list())
formula |
a symbolic description for the model to be tested
(or a fitted |
order.by |
Either a vector |
alternative |
a character string specifying the alternative hypothesis. |
iterations |
an integer specifying the number of iterations when calculating the p-value with the "pan" algorithm. |
exact |
logical. If set to |
tol |
tolerance. Eigenvalues computed have to be greater than
|
data |
an optional data frame containing the variables in the model.
By default the variables are taken from the environment which |
The Durbin-Watson test has the null hypothesis that the autocorrelation
of the disturbances is 0. It is possible to test against the alternative that it is
greater than, not equal to, or less than 0, respectively. This can be specified
by the alternative
argument.
Under the assumption of normally distributed disturbances, the null distribution of the Durbin-Watson statistic is the distribution of a linear combination of chi-squared variables. The p-value is computed using the Fortran version of Applied Statistics Algorithm AS 153 by Farebrother (1980, 1984). This algorithm is called "pan" or "gradsol". For large sample sizes the algorithm might fail to compute the p value; in that case a warning is printed and an approximate p value will be given; this p value is computed using a normal approximation with mean and variance of the Durbin-Watson test statistic.
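For reference, the test statistic itself is computed from the ordinary least squares residuals e_t, t = 1, ..., T, as

d = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2}

Values near 2 indicate no first-order autocorrelation, values towards 0 point to positive and values towards 4 to negative autocorrelation.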
Examples can not only be found on this page, but also on the help pages of the
data sets bondyield
, currencysubstitution
,
growthofmoney
, moneydemand
,
unemployment
, wages
.
For an overview on R and econometrics see Racine & Hyndman (2002).
An object of class "htest"
containing:
statistic |
the test statistic. |
p.value |
the corresponding p-value. |
method |
a character string with the method used. |
data.name |
a character string with the data name. |
This function was previously published as dwtest
in the lmtest package and has been integrated here without logical changes.
Torsten Hothorn, Achim Zeileis, Richard W. Farebrother (pan.f), Clint Cummins (pan.f), Giovanni Millo, David Mitchell
J. Durbin & G.S. Watson (1950), Testing for Serial Correlation in Least Squares Regression I. Biometrika 37, 409–428.
J. Durbin & G.S. Watson (1951), Testing for Serial Correlation in Least Squares Regression II. Biometrika 38, 159–178.
J. Durbin & G.S. Watson (1971), Testing for Serial Correlation in Least Squares Regression III. Biometrika 58, 1–19.
R.W. Farebrother (1980), Pan's Procedure for the Tail Probabilities of the Durbin-Watson Statistic (Corr: 81V30 p189; AS R52: 84V33 p363- 366; AS R53: 84V33 p366- 369). Applied Statistics 29, 224–227.
R. W. Farebrother (1984), [AS R53] A Remark on Algorithms AS 106 (77V26 p92-98), AS 153 (80V29 p224-227) and AS 155: The Distribution of a Linear Combination of Random Variables (80V29 p323-333). Applied Statistics 33, 366–369.
W. Krämer & H. Sonnberger (1986), The Linear Regression Model under Test. Heidelberg: Physica.
J. Racine & R. Hyndman (2002), Using R To Teach Econometrics. Journal of Applied Econometrics 17, 175–189.
## generate two AR(1) error terms with parameter ## rho = 0 (white noise) and rho = 0.9 respectively err1 <- rnorm(100) ## generate regressor and dependent variable x <- rep(c(-1,1), 50) y1 <- 1 + x + err1 ## perform Durbin-Watson test DurbinWatsonTest(y1 ~ x) err2 <- stats::filter(err1, 0.9, method="recursive") y2 <- 1 + x + err2 DurbinWatsonTest(y2 ~ x) ## for a simple vector use: e_t <- c(-32.33, -26.603, 2.215, -16.967, -1.148, -2.512, -1.967, 11.669, -0.513, 27.032, -4.422, 40.032, 23.577, 33.94, -2.787, -8.606, 0.575, 6.848, -18.971, -29.063) DurbinWatsonTest(e_t ~ 1)
Computes Shannon entropy and the mutual information of two variables. The entropy quantifies the expected value of the information contained in a vector. The mutual information is a quantity that measures the mutual dependence of the two random variables.
Entropy(x, y = NULL, base = 2, ...) MutInf(x, y, base = 2, ...)
x |
a vector or a matrix of numerical or categorical type. If only x is supplied it will be interpreted as contingency table. |
y |
a vector with the same type and dimension as x. If y is not NULL, the joint entropy of x and y is calculated. |
base |
base of the logarithm to be used, defaults to 2. |
... |
further arguments are passed to the function |
The Shannon entropy equation provides a way to estimate the average minimum number of bits needed to encode a string of symbols, based on the frequency of the symbols.
It is given by the formula H = - sum_i p_i * log_b(p_i), where p_i is the probability of character number i showing up in a stream of characters of the given "script" and b is the base of the logarithm.
The entropy ranges from 0 to Inf.
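As a quick numerical check of the formula, the entropy of eight equally likely symbols works out to 3 bits:

p <- rep(1/8, 8)
-sum(p * log(p, base = 2))   # 3, matching Entropy(as.matrix(rep(1/8, 8)))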
a numeric value.
Andri Signorell <[email protected]>
Shannon, Claude E. (July/October 1948). A Mathematical Theory of Communication, Bell System Technical Journal 27 (3): 379-423.
Ihara, Shunsuke (1993) Information theory for continuous systems, World Scientific. p. 2. ISBN 978-981-02-0985-8.
package entropy which implements various estimators of entropy
Entropy(as.matrix(rep(1/8, 8))) # http://r.789695.n4.nabble.com/entropy-package-how-to-compute-mutual-information-td4385339.html x <- as.factor(c("a","b","a","c","b","c")) y <- as.factor(c("b","a","a","c","c","b")) Entropy(table(x), base=exp(1)) Entropy(table(y), base=exp(1)) Entropy(x, y, base=exp(1)) # Mutual information is Entropy(table(x), base=exp(1)) + Entropy(table(y), base=exp(1)) - Entropy(x, y, base=exp(1)) MutInf(x, y, base=exp(1)) Entropy(table(x)) + Entropy(table(y)) - Entropy(x, y) MutInf(x, y, base=2) # http://en.wikipedia.org/wiki/Cluster_labeling tab <- matrix(c(60,10000,200,500000), nrow=2, byrow=TRUE) MutInf(tab, base=2) d.frm <- Untable(as.table(tab)) str(d.frm) MutInf(d.frm[,1], d.frm[,2]) table(d.frm[,1], d.frm[,2]) MutInf(table(d.frm[,1], d.frm[,2])) # Ranking mutual information can help to describe clusters # # r.mi <- MutInf(x, grp) # attributes(r.mi)$dimnames <- attributes(tab)$dimnames # # # calculating ranks of mutual information # r.mi_r <- apply( -r.mi, 2, rank, na.last=TRUE ) # # show only first 6 ranks # r.mi_r6 <- ifelse( r.mi_r < 7, r.mi_r, NA) # attributes(r.mi_r6)$dimnames <- attributes(tab)$dimnames # r.mi_r6
Calculate Greenhouse-Geisser and Huynh-Feldt epsilons.
Eps(S, p, g, n)
S |
pxp covariance matrix |
p |
dimension of observation vectors |
g |
number of groups |
n |
number of subjects |
a numeric value
Hans Rudolf Roth <[email protected]>
Vonesh, E.F., Chinchilli, V.M. (1997) Linear and Nonlinear Models for the Analysis of Repeated Measurements Marcel Dekker, New York, p.84-86
Crowder, M.J., Hand, D.J. (1990) Analysis of Repeated Measures. Chapman & Hall, London, p.54-55
## find!
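Since the original example is only a placeholder, here is a hedged sketch; the data matrix X and the parameter choices are made up for illustration and assume a single-group repeated-measures design:
set.seed(1)
X <- matrix(rnorm(12 * 4), nrow=12)   # 12 subjects, 4 repeated measures
S <- cov(X)                           # 4 x 4 covariance matrix
Eps(S, p=4, g=1, n=12)                # Greenhouse-Geisser and Huynh-Feldt epsilons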
Add error bars to an existing plot.
ErrBars(from, to = NULL, pos = NULL, mid = NULL, horiz = FALSE, col = par("fg"), lty = par("lty"), lwd = par("lwd"), code = 3, length = 0.05, pch = NA, cex.pch = par("cex"), col.pch = par("fg"), bg.pch = par("bg"), ...)
from |
coordinates of points from which to draw (the lower end of the error bars). If |
to |
coordinates of points to which to draw (the upper end of the error bars). |
pos |
numeric, position of the error bars. This will either be the x-coordinate in case of vertical error bars and the y-coordinate in case of horizontal error bars. |
mid |
numeric, position of midpoints. Defaults to the mean of |
horiz |
logical, determining whether horizontal error bars are needed (default is FALSE). |
col |
the line color. |
lty |
the line type. |
lwd |
line width. |
code |
integer code, determining where end lines are to be drawn. |
length |
the length of the end lines. |
pch |
plotting character for the midpoints. The position of the points is given by |
cex.pch |
the character extension for the plotting characters. Default is |
col.pch |
the color of the plotting characters. Default is |
bg.pch |
the background color of the plotting characters (if pch is set to 21:25). Default is |
... |
the dots are passed to the |
A short wrapper for plotting error bars by means of arrows.
Andri Signorell <[email protected]>
par(mfrow=c(2,2)) b <- barplot(1:5, ylim=c(0,6)) ErrBars(from=1:5-rep(0.5,5), to=1:5+rep(0.5,5), pos=b, length=0.2) # just on one side b <- barplot(1:5, ylim=c(0,6)) ErrBars(from=1:5, to=1:5+rep(0.5,5), pos=b, length=0.2, col="red", code=2, lwd=2) b <- barplot(1:5, xlim=c(0,6), horiz=TRUE) ErrBars(from=1:5, to=1:5+rep(0.2,5), pos=b, horiz=TRUE, length=0.2, col="red", code=2, lwd=2) par(xpd=FALSE) dotchart(1:5, xlim=c(0,6)) ErrBars(from=1:5-rep(0.2,5), to=1:5+rep(0.2,5), horiz=TRUE, length=0.1)
Calculates eta-squared, partial eta-squared and generalized eta-squared
EtaSq(x, type = 2, anova = FALSE) ## S3 method for class 'lm' EtaSq(x, type = 2, anova = FALSE) ## S3 method for class 'aovlist' EtaSq(x, type = 2, anova = FALSE)
x |
An analysis of variance ( |
type |
What type of sum of squares to calculate? |
anova |
Should the full ANOVA table be printed out in addition to the effect sizes? |
Calculates the eta-squared, partial eta-squared, and generalized eta-squared measures of effect size that are commonly used in analysis of variance. The input x
should be the analysis of variance object itself. For between-subjects designs, generalized eta-squared equals partial eta-squared. The reported generalized eta-squared for repeated-measures designs assumes that all factors are manipulated, i.e., that there are no measured factors like gender (see references).
For unbalanced designs, the default in EtaSq
is to compute Type II sums of squares (type=2
), in keeping with the Anova
function in the car
package. It is possible to revert to the Type I SS values (type=1
) to be consistent with anova
, but this rarely tests hypotheses of interest. Type III SS values (type=3
) can also be computed. EtaSq.aovlist
requires type=1
.
If anova=FALSE
, the output for EtaSq.lm
is an M x 2 matrix, for EtaSq.aovlist
it is an M x 3 matrix. Each of the M rows corresponds to one of the terms in the ANOVA (e.g., main effect 1, main effect 2, interaction, etc), and each of the columns corresponds to a different measure of effect size. Column 1 contains the eta-squared values, and column 2 contains partial eta-squared values. Column 3 contains the generalized eta-squared values. If anova=TRUE
, the output contains additional columns containing the sums of squares, mean squares, degrees of freedom, F-statistics and p-values. For EtaSq.aovlist
, additional columns contain the error sum of squares and error degrees of freedom corresponding to an effect term.
Danielle Navarro <[email protected]>, Daniel Wollschlaeger <[email protected]>
Bakeman, R. (2005). Recommended effect size statistics for repeated measures designs. Behavior Research Methods 37(3), 379-384.
Olejnik, S. and Algina, J. (2003). Generalized Eta and Omega Squared Statistics: Measures of Effect Size for Some Common Research Designs. Psychological Methods 8(4), 434-447.
#### Example 1: one-way ANOVA #### outcome <- c(1.4,2.1,3.0,2.1,3.2,4.7,3.5,4.5,5.4) # data treatment1 <- factor(c(1,1,1,2,2,2,3,3,3)) # grouping variable anova1 <- aov(outcome ~ treatment1) # run the ANOVA summary(anova1) # print the ANOVA table EtaSq(anova1) # effect size #### Example 2: two-way ANOVA #### treatment2 <- factor(c(1,2,3,1,2,3,1,2,3)) # second grouping variable anova2 <- aov(outcome ~ treatment1 + treatment2) # run the ANOVA summary(anova2) # print the ANOVA table EtaSq(anova2) # effect size #### Example 3: two-way ANOVA unbalanced cell sizes #### #### data from Maxwell & Delaney, 2004 #### #### Designing experiments and analyzing data #### dfMD <- data.frame(IV1=factor(rep(1:3, c(3+5+7, 5+6+4, 5+4+6))), IV2=factor(rep(rep(1:3, 3), c(3,5,7, 5,6,4, 5,4,6))), DV=c(c(41, 43, 50), c(51, 43, 53, 54, 46), c(45, 55, 56, 60, 58, 62, 62), c(56, 47, 45, 46, 49), c(58, 54, 49, 61, 52, 62), c(59, 55, 68, 63), c(43, 56, 48, 46, 47), c(59, 46, 58, 54), c(55, 69, 63, 56, 62, 67))) # use contr.sum for correct sum of squares type 3 dfMD$IV1s <- C(dfMD$IV1, "contr.sum") dfMD$IV2s <- C(dfMD$IV2, "contr.sum") dfMD$IV1t <- C(dfMD$IV1, "contr.treatment") dfMD$IV2t <- C(dfMD$IV2, "contr.treatment") EtaSq(aov(DV ~ IV1s*IV2s, data=dfMD), type=3) EtaSq(aov(DV ~ IV1t*IV2t, data=dfMD), type=1) #### Example 4: two-way split-plot ANOVA -> EtaSq.aovlist #### DV_t1 <- round(rnorm(3*10, -0.5, 1), 2) DV_t2 <- round(rnorm(3*10, 0, 1), 2) DV_t3 <- round(rnorm(3*10, 0.5, 1), 2) dfSPF <- data.frame(id=factor(rep(1:(3*10), times=3)), IVbtw=factor(rep(LETTERS[1:3], times=3*10)), IVwth=factor(rep(1:3, each=3*10)), DV=c(DV_t1, DV_t2, DV_t3)) spf <- aov(DV ~ IVbtw*IVwth + Error(id/IVwth), data=dfSPF) EtaSq(spf, type=1, anova=TRUE)
Expected Value and Variance for the distribution of a discrete random variable. (For didactical purposes.)
EX(x, p) VarX(x, p)
x |
the values of the random variable |
p |
the probabilities of the values |
numeric value
Andri Signorell <[email protected]>
EX(x=c(1:3), p=c(0.2, 0.5, 0.3)) VarX(x=c(1:3), p=c(0.2, 0.5, 0.3))
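For comparison, a small sketch computing the same quantities directly from the definitions E[X] = sum(x_i * p_i) and Var[X] = sum((x_i - E[X])^2 * p_i):
x <- 1:3; p <- c(0.2, 0.5, 0.3)
mu <- sum(x * p)          # expected value, same as EX(x, p)
sum((x - mu)^2 * p)       # variance, same as VarX(x, p)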
Calculate the expected frequencies of an n-way table assuming independence.
ExpFreq(x, freq = c("abs", "rel"))
x |
a table. |
freq |
indicates whether absolute or relative frequencies should be computed. Can either
be |
A table with either the absolute or the relative expected frequencies.
This is a copy of the function independence_table
in vcd.
David Meyer <[email protected]>
ExpFreq(Titanic) ExpFreq(UCBAdmissions, freq="r")
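For a two-way table the idea can be reproduced by hand: the expected frequencies under independence are the outer product of the margins divided by the grand total (a small sketch for illustration):
tab <- margin.table(Titanic, c(1, 4))          # Class x Survived
outer(rowSums(tab), colSums(tab)) / sum(tab)   # expected counts under independence
ExpFreq(tab)                                   # should agree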
Find the kth smallest, resp. largest values from a vector x
and return the values and their frequencies.
Small(x, k = 5, unique = FALSE, na.last = NA) Large(x, k = 5, unique = FALSE, na.last = NA) HighLow(x, nlow = 5, nhigh = nlow, na.last = NA)
x |
a |
k |
an integer >0 defining how many extreme values should be returned. Default is |
unique |
logical, defining if unique values should be considered or not. If this is set to |
na.last |
for controlling the treatment of |
nlow |
a single integer. The number of the smallest elements of a vector to be printed. Defaults to 5. |
nhigh |
a single integer. The number of the greatest elements of a vector to be printed. Defaults to the number of |
This does not seem to be a difficult problem at first sight. We could simply tabulate and sort the vector and finally take the first or last k values. However, sorting and tabulating the whole vector when we're just interested in the few smallest values is a considerable waste of resources. This approach already becomes impracticable for medium vector lengths (~10^5). Several aspects of this problem and possible solutions are discussed elsewhere. The present implementation is based on highly efficient C++ code and has proved to be very fast.
HighLow combines the two functions above and reports the k extreme values on both sides together with their frequencies in parentheses. It is used for describing univariate variables and is useful for checking the ends of the vector, where erroneous values often accumulate in real data. This is in essence a printing routine for the highest and the lowest values of x.
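A rough sketch of the efficiency argument (timings are machine-dependent and purely illustrative):
x <- runif(1e7)
system.time(Small(x, 5))     # partial approach
system.time(sort(x)[1:5])    # full sort, for comparison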
if unique
is set to FALSE
: a vector with the k most extreme values,
else: a list, containing the k most extreme values and their frequencies.
Andri Signorell <[email protected]>
C++ parts by Nathan Russell and Romain Francois
https://gallery.rcpp.org/articles/top-elements-from-vectors-using-priority-queue/
x <- sample(1:10, 1000, rep=TRUE) Large(x, 3) Large(x, k=3, unique=TRUE) # works fine up to x ~ 1e6 x <- runif(1000000) Small(x, 3, unique=TRUE) Small(x, 3, unique=FALSE) # Both ends cat(HighLow(d.pizza$temperature, na.last=NA))
Density function, distribution function, quantile function and random generation for the maximum/minimum of a given number of independent variables from a specified distribution.
dExtrVal(x, densfun, distnfun, ..., distn, mlen = 1, largest = TRUE, log = FALSE) pExtrVal(q, distnfun, ..., distn, mlen = 1, largest = TRUE, lower.tail = TRUE) qExtrVal(p, quantfun, ..., distn, mlen = 1, largest = TRUE, lower.tail = TRUE) rExtrVal(n, quantfun, ..., distn, mlen = 1, largest = TRUE)
x , q
|
Vector of quantiles. |
p |
Vector of probabilities. |
n |
Number of observations. |
densfun , distnfun , quantfun
|
Density, distribution and
quantile function of the specified distribution. The density
function must have a |
... |
Parameters of the specified distribution. |
distn |
A character string, optionally given as an
alternative to |
mlen |
The number of independent variables. |
largest |
Logical; if |
log |
Logical; if |
lower.tail |
Logical; if |
dExtrVal
gives the density function, pExtrVal
gives the
distribution function and qExtrVal
gives the quantile function
of the maximum/minimum of mlen
independent variables from
a specified distibution. rExtrVal
generates random deviates.
Alec Stephenson <[email protected]>
dExtrVal(2:4, dnorm, pnorm, mean = 0.5, sd = 1.2, mlen = 5) dExtrVal(2:4, distn = "norm", mean = 0.5, sd = 1.2, mlen = 5) dExtrVal(2:4, distn = "exp", mlen = 2, largest = FALSE) pExtrVal(2:4, distn = "exp", rate = 1.2, mlen = 2) qExtrVal(seq(0.9, 0.6, -0.1), distn = "exp", rate = 1.2, mlen = 2) rExtrVal(5, qgamma, shape = 1, mlen = 10) p <- (1:9)/10 pexp(qExtrVal(p, distn = "exp", rate = 1.2, mlen = 1), rate = 1.2) ## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
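The underlying relation can be checked directly: for largest = TRUE the distribution function of the maximum of mlen independent variables is the base distribution function raised to the power mlen (a small sketch):
q <- 2:4
pExtrVal(q, distn = "norm", mean = 0.5, sd = 1.2, mlen = 5)
pnorm(q, mean = 0.5, sd = 1.2)^5    # same values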
Compute the prime factorization(s) of integer(s) n.
Factorize(n)
n |
vector of integers to factorize. |
Works via Primes, currently in a cheap way that is sub-optimal for large composite n.
A named list of the same length as n, each element a 2-column matrix with column "p" containing the prime factors and column "m" their respective exponents (or multiplicities), i.e., for a prime number n, the resulting matrix is cbind(p = n, m = 1).
Martin Maechler, Jan. 1996.
GCD
, LCM
, Primes
, IsPrime
, Divisors
For factorization of moderately or really large numbers, see the gmp
package, and its factorize()
(which is ~20x faster!).
Factorize(47) Factorize(seq(101, 120, by=2))
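As a quick check, multiplying the prime factors raised to their exponents reproduces the input (a small sketch):
f <- Factorize(360)[[1]]      # 2-column matrix with columns "p" and "m"
prod(f[, "p"]^f[, "m"])       # 360 again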
Retrieve a function's arguments and default values in a list.
FctArgs(name, sort = FALSE)
name |
name of the function. |
sort |
logical. Should the function arguments be sorted? Defaults to |
a data.frame with the default in the first columns and with row.names as argument names.
Andri Signorell <[email protected]>
formalArgs
, formals
, args
, alist
, body
formalArgs(PlotFdist) formals(PlotFdist) # compare: FctArgs(PlotFdist) # alternative also spotting unexported functions GetArgs <- function(FUN) { a <- formals(getAnywhere(FUN)$objs[[1]]) arg.labels <- names(a) arg.values <- as.character(a) char <- sapply(a, is.character) arg.values[char] <- paste("\"", arg.values[char], "\"", sep="") c(fname=FUN, args=paste(StrTrim(gsub("= $", "", paste(arg.labels, arg.values, sep=" = "))), collapse=", ")) } fcts <- grep("plot.Desc", unclass(lsf.str(envir = asNamespace("DescTools"), all.names = TRUE)), value=TRUE) fargs <- t(unname(sapply(fcts, GetArgs))) fargs
Generates Fibonacci numbers.
Fibonacci(n)
n |
nonnegative integer or vector of nonnegative integers. |
Generates the n
-th Fibonacci number, where Fibonacci(0) = 0.
A single integer, or a vector of integers.
Andri Signorell <[email protected]>
https://en.wikipedia.org/wiki/Fibonacci_number
Fibonacci(0) # 1 Fibonacci(2) # 2 Fibonacci(0:3) # 0 1 1 2 # Golden ratio F <- Fibonacci(1:25) # ... 75025 121393 f25 <- F[25]/F[24] # 1.618033989 phi <- (sqrt(5) + 1)/2 abs(f25 - phi) # 7.945178e-11 # Fibonacci numbers without iteration fibo <- function(n) { phi <- (sqrt(5) + 1)/2 fib <- (phi^(n+1) - (1-phi)^(n+1)) / (2*phi - 1) round(fib) } fibo(30:33) # 1346269 2178309 3524578 5702887
Find a color on a defined color range depending on the value of x. This is helpful for colorcoding numeric values.
FindColor(x, cols = rev(heat.colors(100)), min.x = NULL, max.x = NULL, all.inside = FALSE)
x |
numeric. |
cols |
a vector of colors. |
min.x |
the x-value to be used for the left edge of the first color. If left to the default |
max.x |
the x-value to be used for the right edge of the last color. If left to the default |
all.inside |
logical; if true, the returned indices are coerced into |
For the selection of colors the option rightmost.closed
in the used function
findInterval
is set to TRUE.
This will ensure that all values on the right edge of the range are assigned a color.
How values outside the boundaries of min.x and max.x should be handled can be controlled by all.inside
. Set this value to TRUE, if those values
should get the colors at the edges or set it to FALSE, if they should remain white (which is the default).
Note that findInterval
closes the intervals on the left side, e.g. [0, 1). This option can't be changed. Consequently, x-values lying exactly on the edge between two colors will get the color of the upper interval.
Andri Signorell <[email protected]>
Canvas(7, main="Use of function FindColor()") # get some data x <- c(23, 56, 96) # get a color range from blue via white to red cols <- colorRampPalette(c("blue","white","red"))(100) ColorLegend(x="bottomleft", cols=cols, labels=seq(0, 100, 10), cex=0.8) # and now the color coding of x: (xcols <- FindColor(x, cols, min.x=0, max.x=100)) # this should be the same as cols[x+1] # how does it look like? y0 <- c(-5, -2, 1) text(x=1, y=max(y0)+2, labels="Color coding of x:") text(x=1.5, y=y0, labels=x) DrawRegPolygon(x=3, y=y0, nv=4, rot=pi/4, col=xcols) text(x=6, y=y0, labels=xcols) # how does the function select colors? Canvas(xlim = c(0,1), ylim = c(0,1)) cols <- c(red="red", yellow="yellow", green="green", blue="blue") ColorLegend(x=0, y=1, width=1, cols=rev(cols), horiz = TRUE, labels=Format(seq(0, 1, .25), digits=2), frame="grey", cex=0.8 ) x <- c(-0.2, 0, 0.15, 0.55, .75, 1, 1.3) arrows(x0 = x, y0 = 0.6, y1 = 0.8, angle = 15, length = .2) text(x=x, y = 0.5, labels = x, adj = c(0.5,0.5)) text(x=x, y = 0.4, labels = names(FindColor(x, cols=cols, min.x = 0, max.x = 1, all.inside = TRUE)), adj = c(0.5,0.5)) text(x=x, y = 0.3, labels = names(FindColor(x, cols=cols, min.x = 0, max.x = 1, all.inside = FALSE)), adj = c(0.5,0.5))
This function searches through a correlation matrix and returns a vector of integers corresponding to columns to remove to reduce pair-wise correlations.
FindCorr(x, cutoff = .90, verbose = FALSE)
x |
A correlation matrix |
cutoff |
A numeric value for the pair-wise absolute correlation cutoff |
verbose |
A boolean for printing the details |
The absolute values of pair-wise correlations are considered. If two variables have a high correlation, the function looks at the mean absolute correlation of each variable and removes the variable with the largest mean absolute correlation.
There are several functions in the subselect package that can also be used to accomplish the same goal. However, the package was removed from CRAN and is only available in the archives.
A vector of indices denoting the columns to remove. If no correlations meet the criteria, numeric(0)
is returned.
Original R code by Dong Li, modified by Max Kuhn
Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer and the R Core Team (2014). caret: Classification and Regression Training. R package version 6.0-35. https://cran.r-project.org/package=caret
corrMatrix <- diag(rep(1, 5)) corrMatrix[2, 3] <- corrMatrix[3, 2] <- .7 corrMatrix[5, 3] <- corrMatrix[3, 5] <- -.7 corrMatrix[4, 1] <- corrMatrix[1, 4] <- -.67 corrDF <- expand.grid(row = 1:5, col = 1:5) corrDF$correlation <- as.vector(corrMatrix) PlotCorr(xtabs(correlation ~ ., corrDF), las=1, border="grey") FindCorr(corrMatrix, cutoff = .65, verbose = TRUE) FindCorr(corrMatrix, cutoff = .99, verbose = TRUE) # d.pizza example m <- cor(data.frame(lapply(d.pizza, as.numeric)), use="pairwise.complete.obs") FindCorr(m, verbose = TRUE) m[, FindCorr(m)]
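Continuing the corrMatrix example above, the removal rule can be traced by hand (a small sketch of the rule described in the details):
m <- abs(corrMatrix)
diag(m) <- NA
rowMeans(m, na.rm=TRUE)   # column 3 has the largest mean |r| and is flagged first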
Convert a correlation to a z score or z to r using the Fisher transformation or find the confidence intervals for a specified correlation.
FisherZ(rho) FisherZInv(z) CorCI(rho, n, conf.level = 0.95, alternative = c("two.sided", "less", "greater"))
rho |
the Pearson's correlation coefficient |
z |
a Fisher z transformed value |
n |
sample size used for calculating the confidence intervals |
alternative |
is a character string, one of |
conf.level |
confidence level for the returned confidence interval, restricted to lie between zero and one. |
The sampling distribution of Pearson's r is not normally distributed. Fisher developed a transformation now called "Fisher's z-transformation" that converts Pearson's r to the normally distributed variable z. The formula for the transformation is: z = 0.5 * log((1 + r)/(1 - r)), which is the inverse hyperbolic tangent atanh(r).
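A one-line check of the transformation against FisherZ():
r <- 0.5
0.5 * log((1 + r)/(1 - r))   # by hand
atanh(r)                     # base R equivalent
FisherZ(r)                   # same value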
z value corresponding to r (in FisherZ)
r corresponding to z (in FisherZInv)
rho, lower and upper confidence intervals (CorCI)
William Revelle <[email protected]>,
slight modifications Andri Signorell <[email protected]> based on R-Core code
cors <- seq(-.9, .9, .1) zs <- FisherZ(cors) rs <- FisherZInv(zs) round(zs, 2) n <- 30 r <- seq(0, .9, .1) rc <- t(sapply(r, CorCI, n=n)) t <- r * sqrt(n-2) / sqrt(1-r^2) p <- (1 - pt(t, n-2)) / 2 r.rc <- data.frame(r=r, z=FisherZ(r), lower=rc[,2], upper=rc[,3], t=t, p=p) round(r.rc,2)
Convert a text to a table by using complete columns of spaces (or any other separator) as delimiting point.
FixToTable(txt, sep = " ", delim = "\t", trim = TRUE, header = TRUE)
txt |
the text to be partitioned. Works best, if txt is a matrix. |
sep |
the separator to use. Will frequently be " ". |
delim |
the new delimiter to insert. (default tab) |
trim |
logical. Should the separated text be trimmed from whitespace? Defaults to TRUE. |
header |
logical. Should the first line be interpreted as header? |
Only a complete appearance of the separator character in the same position over all rows will be accepted as column delimiter.
a matrix of the separated text.
Andri Signorell <[email protected]>
# let's get some tabbed text txt <- matrix(capture.output(Titanic[,,2,1])) FixToTable(txt[-1,])
Formatting numbers with base R tools often degenerates into a major intellectual challenge for us little minds down here in the valley of tears. There are a number of options available and quite often it's hard to work out which one to use, when a more uncommon setting is needed.
The Format()
function wraps all these functions and tries to offer a simpler, less technical, but still flexible interface.
There's also an easygoing interface for format templates, defined as a list consisting of any accepted format features. This makes it possible to define templates globally and to change or modify them easily later.
Format(x, digits = NULL, sci = NULL, big.mark = NULL, ldigits = NULL, zero.form = NULL, na.form = NULL, fmt = NULL, align = NULL, width = NULL, lang = NULL, eps = NULL, ...) ## S3 method for class 'table' Format(x, digits = NULL, sci = NULL, big.mark = NULL, ldigits = NULL, zero.form = NULL, na.form = NULL, fmt = NULL, align = NULL, width = NULL, lang = NULL, eps = NULL, ...) ## S3 method for class 'matrix' Format(x, digits = NULL, sci = NULL, big.mark = NULL, ldigits = NULL, zero.form = NULL, na.form = NULL, fmt = NULL, align = NULL, width = NULL, lang = NULL, eps = NULL, ...) ## Default S3 method: Format(x, digits = NULL, sci = NULL, big.mark = NULL, ldigits = NULL, zero.form = NULL, na.form = NULL, fmt = NULL, align = NULL, width = NULL, lang = NULL, eps = NULL, ...) Fmt(...) as.fmt(...) as.CDateFmt(fmt)
x |
an atomic numerical, typically a vector of real numbers or a matrix of numerical values. Factors will be converted to strings. |
digits |
integer, the desired (fixed) number of digits after the decimal point. Unlike |
sci |
integer. The power of 10 to be set when deciding to print numeric values in exponential notation. Fixed notation will be preferred unless the number is larger than 10^scipen. If just one value is set it will be used for the left border 10^(-scipen) as well as for the right one (10^scipen). A negative and a positive value can also be set independently. Default is
|
big.mark |
character; if not empty used as mark between every 3 decimals before the decimal point. Default is "" (none). |
ldigits |
number of leading zeros. |
zero.form |
character, string specifying how zeros should be specially formatted. Useful for pretty printing 'sparse' objects.
If set to |
na.form |
character, string specifying how |
fmt |
either a format string, allowing to flexibly define special formats or an object of class |
align |
the character on whose position the strings will be aligned. Left alignment can be requested by setting |
width |
integer, the defined fixed width of the strings. |
lang |
optional value setting the language for the months and daynames. Can be either |
eps |
a numerical tolerance used mainly for formatting p values, those less than eps are formatted as " |
... |
further arguments to be passed to or from methods. |
Format()
is the workhorse here and formats numbers and dates.
The argument fmt
is very flexible and is used to generate a variety of different formats. When x
is a date, it can take ISO-8601-date-and-time-format codes consisting of (d
, m
and y
for day, month or year) and defining the combination of day month and year representation. Repeating the specific code defines the degree of abbreviation. The format 'yyyy-mm-dd'
would yield a date as 2020-10-12
.
Date Codes | |
d |
day of the month without leading zero (1 - 31) |
dd |
day of the month with leading zero (01 - 31) |
ddd |
abbreviated name for the day of the week (e.g. Mon) in the current user's language |
dddd |
full name for the day of the week (e.g. Monday) in the current user's language |
m |
month without leading zero (1 - 12) |
mm |
month with leading zero (01 - 12) |
mmm |
abbreviated month name (e.g. Jan) in the current user's language |
mmmm |
full month name (e.g. January) in the current user's language |
y |
year without century, without leading zero (0 - 99) |
yy |
year without century, with leading zero (00 - 99) |
yyyy |
year with century. For example: 2005 |
The function as.CDateFmt()
converts ISO-8601 codes into the C-format codes used in base R.
So
as.CDateFmt("yyyy mm dd")
yields "%Y %m %d"
.
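The converted string can then be used directly with base R's format(), for example:
fmt <- as.CDateFmt("yyyy mm dd")
fmt                                  # "%Y %m %d"
format(as.Date("2014-11-28"), fmt)   # "2014 11 28"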
Even more variability is needed to display numeric values. For the most frequently used formats there are the following special codes available:
Code | ||
e |
scientific | forces scientific representation of x, e.g. 3.141e-05. The number of digits, |
alignment and zero values are further respected. | ||
eng |
engineering | forces scientific representation of x , but only with powers that are a multiple of 3. |
engabb |
engineering abbr. | same as eng , but replaces the exponential representation by codes, |
e.g. M for mega (1e6). See d.prefix . |
||
% |
percent | will divide the given number by 100 and append the %-sign (without a separator). |
p |
p-value | will wrap the function format.pval and return a p-value format. |
Use eps to define the threshold to switch to a < 000 representation. |
||
frac |
fractions | will (try to) convert numbers to fractions. So 0.1 will be displayed as 1/10. |
See fractions() . |
||
* |
significance | will produce a significance representation of a p-value consisting of * and ., |
while the breaks are set according to the used defaults e.g. in lm as |
||
[0, 0.001] = *** |
||
(0.001, 0.01] = ** |
||
(0.01, 0.05] = * |
||
(0.05, 0.1] = . |
||
(0.1,1] = | ||
p*
|
p-value stars | will produce p-value and significance stars |
fmt
can as well be an object of class fmt
consisting of a list out of the arguments above.
This allows to store and manage the full format in variables or as options (in DescToolsOptions()
) and use it as format template subsequently.
Finally fmt
can also be a function in x, which makes formatting very flexible.
New formats can be created by means of as.fmt()
. This works quite straight on. We can use any of the arguments from Format()
and combine them to a list.
The following code will define a new format template named "myNumFmt
" of the class "fmt"
. Provided to Format()
this will result in a number displayed with 2 fixed digits and a comma as big mark:
myNumFmt <- as.fmt(digits=2, big.mark=",") Format(12222.89345, fmt=myNumFmt) = 12,222.89
The latter returns the same result as if the arguments would have been supplied directly: Format(12222.89345, digits=2, big.mark=",")
.
Many report functions (e.g. TOne()
) in DescTools use three default formats for counts (named "abs"
), numeric values ("num"
) and percentages ("per"
). These formats can be set by the user as options (see DescToolsOptions()
. For other purposes any number of any named formats can be defined.
Fmt()
is used to access and edit already defined Formats. It can directly adapt defined properties and returns the format template. Fmt("num", digits=1, sci=10)
will use the current version of the numeric format and change the digits to 1 and the threshold to switch to scientifc presentation to numbers >1e10 and <1e-10.
Format templates can be altered using their names. With Fmt(abs=Fmt("abs", big.mark=" "))
the format template for count values "abs"
will be overwritten with the new values and stored as option for the current session.
The formats can as well be organized as options. DescToolsOptions("fmt")
would display the currently defined formats. This mechanic works analogously to the options()
procedure of base R. So to store the current settings we can use
opt <- DescToolsOptions("fmt") ... do some stuff like redefining the global formats ... DescToolsOptions(opt)
The last command resets the options and so we have again the initial definitions for the format templates.
the formatted values as characters.
If x was a matrix, then the result will also be a matrix. (Hope this will not surprise you...)
Andri Signorell <[email protected]>
format
, formatC
, prettyNum
, sprintf
, symnum
,StrAlign
, StrPad
, Sys.setlocale
,Weekday
, Month
,
DescToolsOptions
Format(as.Date(c("2014-11-28", "2014-1-2")), fmt="ddd, d mmmm yyyy") Format(as.Date(c("2014-11-28", "2014-1-2")), fmt="ddd, d mmmm yyyy", lang="engl") x <- pi * 10^(-10:10) Format(x, digits=3, fmt="%", sci=NA) Format(x, digits=4, sci=c(4, 6), ldigits=0, width=9, align=".") # format a matrix m <- matrix(runif(100), nrow=10, dimnames=list(LETTERS[1:10], LETTERS[1:10])) Format(m, digits=1) # engineering format Format(x, fmt="eng", digits=2) Format(x, fmt="engabb", ldigits=2, digits=2) # combine with grams [g] paste(Format(x, fmt="engabb", ldigits=2, digits=2), "g", sep="") # example form symnum pval <- rev(sort(c(outer(1:6, 10^-(1:3))))) noquote(cbind(Format(pval, fmt="p"), Format(pval, fmt="*"))) # use Fmt() to get and define new formats stored as option Fmt() # all defined formats Fmt("abs") # only format named "abs" Fmt("nexist") # only format named "nexist" (nonexisting) Fmt("abs", "per", "nexist") Fmt("abs", digits=3) # get Fmt("abs") and overwrite digits Fmt("abs", na.form="-") # get Fmt("abs") and add user defined na.form # define totally new format and store as option Fmt(nob=as.fmt(digits=10, na.form="nodat")) # overwrite an existing format Fmt(nob=Fmt("nob", digits=5)) Fmt("nob") # change the character to be used as the decimal point opt <- options(OutDec=",") Format(1200, digits=2, big.mark = ".") options(opt)
Frac() returns the fractional part of a numeric value. MaxDigits() returns the number of digits in x. Ndec() returns the number of decimals. Prec() returns the precision of a number x.
Frac(x, dpwr = NA) MaxDigits(x) Ndec(x) Prec(x)
x |
the numeric value (or a vector of numerics), whose fractional part is to be calculated. |
dpwr |
power of 10 for a factor z, the fractional part will be multiplied with. The result will be returned rounded to integer. Defaults to |
Andri Signorell <[email protected]>
format.info
, as.integer
, trunc
x <- rnorm(5)*100 x Frac(x) # multiply by 10^4 Frac(x, dpwr=4) MaxDigits(c(1.25, 1.8, 12.0, 1.00000)) x <- c("0.0000", "0", "159.283", "1.45e+10", "1.4599E+10" ) Ndec(x) Prec(as.numeric(x))
Density function, distribution function, quantile function and random generation for the Frechet distribution with location, scale and shape parameters.
dFrechet(x, loc=0, scale=1, shape=1, log = FALSE) pFrechet(q, loc=0, scale=1, shape=1, lower.tail = TRUE) qFrechet(p, loc=0, scale=1, shape=1, lower.tail = TRUE) rFrechet(n, loc=0, scale=1, shape=1)
x , q
|
Vector of quantiles. |
p |
Vector of probabilities. |
n |
Number of observations. |
loc , scale , shape
|
Location, scale and shape parameters (can be given as vectors). |
log |
Logical; if |
lower.tail |
Logical; if |
The Frechet distribution function with parameters loc = a, scale = b and shape = s is
G(z) = exp(-((z - a)/b)^(-s))
for z > a and zero otherwise, where b > 0 and s > 0.
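A small numeric check of this distribution function against pFrechet():
q <- 2:4; loc <- 1; scale <- 0.5; shape <- 0.8
exp(-((q - loc)/scale)^(-shape))    # by hand
pFrechet(q, loc, scale, shape)      # same values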
dFrechet
gives the density function, pFrechet
gives
the distribution function, qFrechet
gives the quantile
function, and rFrechet
generates random deviates.
Alec Stephenson <[email protected]>
rGenExtrVal
, rGumbel
, rRevWeibull
dFrechet(2:4, 1, 0.5, 0.8) pFrechet(2:4, 1, 0.5, 0.8) qFrechet(seq(0.9, 0.6, -0.1), 2, 0.5, 0.8) rFrechet(6, 1, 0.5, 0.8) p <- (1:9)/10 pFrechet(qFrechet(p, 1, 2, 0.8), 1, 2, 0.8) ## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Calculates absolute and relative frequencies of a vector x
. Continuous (numeric) variables
will be cut using the same logic as used by the function hist
.
Categorical variables will be aggregated by table
. The result will contain single and cumulative frequencies for both, absolute values and percentages.
Freq(x, breaks = hist(x, plot = FALSE)$breaks, include.lowest = TRUE, ord = c("level", "desc", "asc", "name"), useNA = c("no", "ifany", "always"), ...) ## S3 method for class 'Freq' print(x, digits = NULL, ...)
x |
the variable to be described, can be any atomic type. |
breaks |
either a numeric vector of two or more cut points or a single number (greater than or equal to 2)
giving the number of intervals into which x is to be cut. Default taken from the function |
include.lowest |
logical, indicating if an x[i] equal to the lowest (or highest, for |
ord |
how should the result be ordered? Default is |
useNA |
one out of |
digits |
integer, determining the number of digits used to format the relative frequencies. |
... |
further arguments are passed to the function |
By default only the valid cases are considered for the frequencies, say NA
values are excluded. (This is in accordance with the default behavior of the R function table
, which seemed a reasonable reference.) If the NA
s should be included you can set the useNA
argument to either "ifany"
or "always"
.
For numeric variables, if breaks
is specified as a single number, the range of the data is divided into breaks pieces of equal length,
and then the outer limits are moved away by 0.1% of the range to ensure that the extreme values both fall
within the break intervals.
(If x
is a constant vector, equal-length intervals are created that cover the single value.) See cut
.
an object of type "Freq"
, which is basically a data.frame with 5 columns (earning a specific print routine), containing the following components:
level |
factor. The levels of the grouping variable. |
freq |
integer. The absolute frequencies. |
perc |
numeric. The relative frequencies (percent). |
cumfreq |
integer. The cumulative sum of the absolute frequencies. |
cumperc |
numeric. The cumulative sum of the relative frequencies. |
Andri Signorell <[email protected]>
cut
, hist
, cumsum
, table
, prop.table
,
PercTable
, Freq2D
data(d.pizza) # result is a data.frame d.freq <- Freq(d.pizza$price) d.freq # it is printed by default with 3 digits for the percent values, # but the number of digits can be defined in the print function print(d.freq, digits=5) # sorted by frequency Freq(d.pizza$driver, ord="desc") # sorted by name using all the observations, say including NAs Freq(d.pizza$driver, ord="name", useNA="ifany") # percentages and cumulative frequencies for a vector of count data Freq(as.table(c(2,4,12,8)))
Calculate a frequency distribution for two continuous variables.
Freq2D(x, ...) ## S3 method for class 'formula' Freq2D(formula, data, subset, ...) ## Default S3 method: Freq2D(x, y, n=20, pad=0, dnn=NULL, ...)
x |
a vector of x values, or a data frame whose first two columns contain the x and y values. |
y |
a vector of y values. |
formula |
a |
data |
a |
subset |
an optional vector specifying a subset of observations to be used. |
n |
the desired number of bins for the output, a scalar or a vector of length 2. |
pad |
number of rows and columns to add to each margin, containing only zeros. |
dnn |
the names to be given to the dimensions in the result. |
... |
named arguments to be passed to the default method. |
The exact number of bins is determined by the
pretty
function, based on the value of n
.
Padding the margins with zeros can be helpful for subsequent analysis, such as smoothing.
The print
logical flag only has an effect when layout=1
.
The layout
argument specifies one of the following formats for
the binned frequency output:
matrix
that is easy to read, aligned like a
scatterplot.
list
with three elements (x, y, matrix) that can be
passed to various plotting functions.
data.frame
with three columns (x, y, frequency) that
can be analyzed further.
Arni Magnusson <[email protected]>
cut
, table
, and print.table
are the basic underlying functions.Freq
, PercTable
Freq2D(quakes$long, quakes$lat, dnn="") Freq2D(lat ~ long, quakes, n=c(10, 20), pad=1) # range(Freq2D(saithe, print=FALSE)) # Layout, plot # Freq2D(saithe, layout=2) # Freq2D(saithe, layout=3) # contour(Freq2D(saithe, layout=2)) # lattice::contourplot(Freq ~ Bio + HR, Freq2D(saithe,layout=3))
Calculates the greatest common divisor (GCD) and least common multiple (LCM) of all the values present in its arguments.
GCD(..., na.rm = FALSE) LCM(..., na.rm = FALSE)
... |
integer or logical vectors. |
na.rm |
logical. Should missing values (including NaN) be removed? |
The computation is based on the Euclidean algorithm without using the extended
version. The greatest common divisor for
all numbers in the integer vector x
will be computed (the multiple GCD).
A numeric (integer) value.
The following relation is always true:
n * m = GCD(n, m) * LCM(n, m)
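This relation can be verified directly:
n <- 12; m <- 10
n * m                    # 120
GCD(n, m) * LCM(n, m)    # 2 * 60 = 120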
Dirk Eddelbuettel <[email protected]> (RCPP part), Andri Signorell <[email protected]>, originally based on code in package numbers by Hans W Borchers <[email protected]>
Eddelbuettel, D. (2013). Seamless R and C++ Integration with Rcpp. New York, NY: Springer.
GCD(12, 10) GCD(144, 233) # Fibonacci numbers are relatively prime to each other LCM(12, 10) LCM(144, 233) # = 144 * 233 # all elements will be flattened by unlist GCD(2, 3, c(5, 7) * 11) GCD(c(2*3, 3*5, 5*7)) LCM(c(2, 3, 5, 7) * 11) LCM(2*3, 3*5, 5*7)
Density function, distribution function, quantile function and random generation for the generalized Extreme value (GenExtrVal) distribution with location, scale and shape parameters.
dGenExtrVal(x, loc=0, scale=1, shape=0, log = FALSE) pGenExtrVal(q, loc=0, scale=1, shape=0, lower.tail = TRUE) qGenExtrVal(p, loc=0, scale=1, shape=0, lower.tail = TRUE) rGenExtrVal(n, loc=0, scale=1, shape=0)
x , q
|
Vector of quantiles. |
p |
Vector of probabilities. |
n |
Number of observations. |
loc , scale , shape
|
Location, scale and shape parameters; the
|
log |
Logical; if |
lower.tail |
Logical; if |
The GenExtrVal distribution function with parameters loc = a, scale = b and shape = s is
G(z) = exp(-(1 + s * (z - a)/b)^(-1/s))
for 1 + s*(z - a)/b > 0, where b > 0.
If s = 0 the distribution is defined by continuity.
If 1 + s*(z - a)/b <= 0, the value z is either greater than the upper end point (if s < 0), or less than the lower end point (if s > 0).
The parametric form of the GenExtrVal encompasses that of the Gumbel, Frechet and reverse Weibull distributions, which are obtained for s = 0, s > 0 and s < 0 respectively. It was first introduced by Jenkinson (1955).
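A small numeric check of this distribution function against pGenExtrVal() for a positive shape:
q <- 2:4; loc <- 1; scale <- 0.5; shape <- 0.8
exp(-(1 + shape * (q - loc)/scale)^(-1/shape))   # by hand
pGenExtrVal(q, loc, scale, shape)                # same values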
dGenExtrVal
gives the density function, pGenExtrVal
gives the
distribution function, qGenExtrVal
gives the quantile function,
and rGenExtrVal
generates random deviates.
Alec Stephenson <[email protected]>
Jenkinson, A. F. (1955) The frequency distribution of the annual maximum (or minimum) of meteorological elements. Quart. J. R. Met. Soc., 81, 158–171.
rFrechet
,
rGumbel
, rRevWeibull
dGenExtrVal(2:4, 1, 0.5, 0.8) pGenExtrVal(2:4, 1, 0.5, 0.8) qGenExtrVal(seq(0.9, 0.6, -0.1), 2, 0.5, 0.8) rGenExtrVal(6, 1, 0.5, 0.8) p <- (1:9)/10 pGenExtrVal(qGenExtrVal(p, 1, 2, 0.8), 1, 2, 0.8) ## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Density function, distribution function, quantile function and random generation for the generalized Pareto distribution (GenPareto) with location, scale and shape parameters.
dGenPareto(x, loc=0, scale=1, shape=0, log = FALSE) pGenPareto(q, loc=0, scale=1, shape=0, lower.tail = TRUE) qGenPareto(p, loc=0, scale=1, shape=0, lower.tail = TRUE) rGenPareto(n, loc=0, scale=1, shape=0)
x , q
|
Vector of quantiles. |
p |
Vector of probabilities. |
n |
Number of observations. |
loc , scale , shape
|
Location, scale and shape parameters; the
|
log |
Logical; if |
lower.tail |
Logical; if |
The generalized Pareto distribution function (Pickands, 1975) with parameters loc = a, scale = b and shape = s is
G(z) = 1 - (1 + s * (z - a)/b)^(-1/s)
for z > a and 1 + s*(z - a)/b > 0, where b > 0.
If s = 0 the distribution is defined by continuity.
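A small numeric check of this distribution function against pGenPareto():
q <- 2:4; loc <- 1; scale <- 0.5; shape <- 0.8
1 - (1 + shape * (q - loc)/scale)^(-1/shape)   # by hand
pGenPareto(q, loc, scale, shape)               # same values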
dGenPareto
gives the density function, pGenPareto
gives the
distribution function, qGenPareto
gives the quantile function,
and rGenPareto
generates random deviates.
Alec Stephenson <[email protected]>
Pickands, J. (1975) Statistical inference using Extreme Order statistics. Annals of Statistics, 3, 119–131.
dGenPareto(2:4, 1, 0.5, 0.8) pGenPareto(2:4, 1, 0.5, 0.8) qGenPareto(seq(0.9, 0.6, -0.1), 2, 0.5, 0.8) rGenPareto(6, 1, 0.5, 0.8) p <- (1:9)/10 pGenPareto(qGenPareto(p, 1, 2, 0.8), 1, 2, 0.8) ## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Generates a random grouping from a given data vector, where the group sizes correspond to the numeric vector grp_n.
GenRandGroups(x, grp_n)
x |
a vector containing the objects which should be grouped |
grp_n |
an integer vector with the required group sizes |
For group divisions in class, it is often useful to have a function available that randomizes these divisions.
a list sized length of grp_n
with the x elements assigned to their group.
Andri Signorell <[email protected]>
# say we have 12 students and want 3 groups with sizes 4,3, and 5 GenRandGroups(x=LETTERS[1:12], grp_n=c(4,3,5))
A geometric sequence is a sequence in which each term is obtained by multiplying the previous one by a constant factor q. A geometric series is the sum of the terms of a geometric sequence.
GeomSn(a1, q, n)
a1 |
the first element of the sequence |
q |
the factor of the sequence |
n |
number of elements to include in the sum |
the sum as numeric value
Andri Signorell <[email protected]>
GeomSn(a1=3, q=2, n=5) # calculates the sum of the first 5 elements of the sequence (gseq <- 3 * (2^(0:5))) sum(gseq) GeomSn(a1=3, q=2, n=0:5)
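For comparison, a sketch of the closed form of the series, assuming (as the example above suggests) that the sum runs over the exponents 0..n:
a1 <- 3; q <- 2; n <- 5
a1 * (q^(n + 1) - 1) / (q - 1)   # 189, closed form
GeomSn(a1=3, q=2, n=5)           # same value
sum(3 * 2^(0:5))                 # 189 again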
This function transforms geometric structures by translating, scaling and/or rotating them.
GeomTrans(x, y = NULL, trans = 0, scale = 1, theta = 0)
x , y
|
vectors containing the coordinates of the vertices of the polygon
, which has to be transformed. The coordinates can be passed in a plotting structure (a list with x and y components), a two-column matrix, .... See |
trans |
a vector of two values for the translation in x-, resp. y-direction. If only one value is supplied it will be recycled. |
scale |
a vector of two values for the scaling factor in x-, resp. y-direction. If only one value is supplied it will be recycled. |
theta |
angle of the rotation in radians starting from 3 o'clock counterclockwise. |
The function invisibly returns a list of the coordinates for the transformed shape(s).
Andri Signorell <[email protected]>
polygon
, DrawRegPolygon
, DrawEllipse
, DrawArc
# let's have a triangle Canvas(main="Rotation") x <- DrawRegPolygon(nv=3)[[1]] xt <- GeomTrans(x, trans=c(1, 3), scale=c(2, 2), theta=pi/4) polygon(xt)
For screening purposes it can be useful to get a list of all function calls our function may depend on. GetCalls() parses the function source and returns all function calls found, grouped by their package.
GetCalls(fun, alphabetic = TRUE, package = NULL)
fun |
the name of the function to be parsed |
alphabetic |
logic, determining the order of the result |
package |
name of the package, if only functions of this specific package should be returned. |
a list of vectors structured by package
Nicholas Cooper <njcooper at gmx.co.uk> (in package NCmisc) with some tweaking by Andri Signorell <[email protected]>
LsFct()
GetCalls("t.test.default") sapply(c("Closest", "Format"), function(x) paste(unname(unlist(GetCalls(x))), collapse=", "))
GetCalls("t.test.default") sapply(c("Closest", "Format"), function(x) paste(unname(unlist(GetCalls(x))), collapse=", "))
Look for a running Word, resp. Excel instance and return its handle. If no running instance is found a new instance will be created (which will be communicated with a warning).
GetCurrWrd() GetCurrXL()
a handle (pointer) to the running Word, resp. Excel instance.
When closing an application instance, the value of the pointer in R is not somehow automatically invalidated. In such cases the corresponding variable contains an invalid address.
Whether the pointer still refers to a valid running application instance can be checked by IsValidHwnd
.
Andri Signorell <[email protected]>
## Not run: # Windows-specific example # Start a new instance GetNewWrd() # grab the handle to this instance wrd <- GetCurrWrd() # this should be valid IsValidHwnd(wrd) # close the instance wrd$quit() # now it should be gone and the pointer invalid if(IsValidHwnd(wrd)){ print("Ouups! Still there?") } else { print("GetCurrWrd: no running word instance found...") } ## End(Not run)
Start a new instance of Word and return its handle.
By means of this handle we can then control the Word application. WrdKill
ends a running MS-Word task.
GetNewWrd(visible = TRUE, template = "Normal", header = FALSE,
          main = "Descriptive report")

WrdKill()
visible |
logical, should Word be made visible? Defaults to TRUE. |
template |
the name of the template to be used for creating a new document. |
header |
logical, should a caption and a list of contents be inserted? Default is FALSE. |
main |
the main title of the report |
The package RDCOMClient exposes the whole VBA world of MS-Word, so generally speaking any VBA code can be run fully controlled by R. In practice, it might be a good idea to record a macro and translate the VB code into R.
Here's a list of some frequently used commands. Let's assume we have a handle to the application and a handle to the current selection defined as:
wrd <- GetNewWrd()
sel <- wrd$Selection()
Then we can access the most common properties as follows:
new document | wrd[["Documents"]]$Add(template, FALSE, 0) , template is the templatename. |
open document | wrd[["Documents"]]$Open(Filename="C:/MyPath/MyDocument.docx") . |
save document | wrd$ActiveDocument()$SaveAs2(FileName="P:/MyFile.docx") |
quit word | wrd$quit() |
kill word task | WrdKill kills a running word task (which might not be ended with quit.) |
normal text | Use ToWrd which offers many arguments such as font name, size, color, alignment etc., e.g. ToWrd("Lorem ipsum dolor sit amet, consetetur", font=list(name="Arial", size=10, col=wdConst$wdColorRed)) |
simple text | sel$TypeText("sed diam nonumy eirmod tempor invidunt ut labore") |
heading | WrdCaption("My Word-Story", index=1) |
insert R output | ToWrd(capture.output(str(d.diamonds))) |
pagebreak | sel$InsertBreak(wdConst$wdPageBreak) |
sectionbreak | sel$InsertBreak(wdConst$wdSectionBreakContinuous) (or wdConst$wdSectionBreakNextPage) |
move cursor right | sel$MoveRight(Unit=wdConst$wdCharacter, Count=2, Extend=wdConst$wdExtend) |
goto end | sel$EndKey(Unit=wdConst$wdStory) |
pagesetup | sel[["PageSetup"]][["Bottommargin"]] <- 4 * 72 |
orientation | sel[["PageSetup"]][["Orientation"]] <- wdConst$wdOrientLandscape |
add bookmark | wrd[["ActiveDocument"]][["Bookmarks"]]$Add("myBookmark") |
goto bookmark | sel$GoTo(wdConst$wdGoToBookmark, 0, 0, "myBookmark") |
update bookmark | WrdUpdateBookmark("myBookmark", "New text for my bookmark") |
show document map | wrd[["ActiveWindow"]][["DocumentMap"]] <- TRUE |
create table | WrdTable() which allows defining the table's geometry |
insert caption | sel$InsertCaption(Label="Abbildung", TitleAutoText="InsertCaption", Title="My Title") |
tables of figures | wrd$ActiveDocument()$TablesOfFigures()$Add(Range=sel$range(), Caption="Figure") |
insert header | wview <- wrd[["ActiveWindow"]][["ActivePane"]][["View"]][["SeekView"]]; wview <- ifelse(header, wdConst$wdSeekCurrentPageHeader, wdConst$wdSeekCurrentPageFooter); ToWrd(x, ..., wrd=wrd) |
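The following is a minimal workflow sketch (Windows only, requires RDCOMClient) combining a few of the commands listed above; the calls are taken verbatim from the table and may need adjusting to the document at hand.

## Not run: 
wrd <- GetNewWrd()
sel <- wrd$Selection()

WrdCaption("My Word-Story", index=1)
ToWrd("Lorem ipsum dolor sit amet, consetetur",
      font=list(name="Arial", size=10, col=wdConst$wdColorRed))
sel$TypeText("sed diam nonumy eirmod tempor invidunt ut labore")
sel$InsertBreak(wdConst$wdPageBreak)

wrd$quit()
## End(Not run)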
a handle (pointer) to the created Word instance.
Note that the list of contents has to be refreshed by hand after inserting text (if inserted by header = TRUE
).
Andri Signorell <[email protected]>
## Not run: 
# Windows-specific example
wrd <- GetNewWrd()
Desc(d.pizza[,1:4], wrd=wrd)

wrd <- GetNewWrd(header=TRUE)
Desc(d.pizza[,1:4], wrd=wrd)

# enumerate all bookmarks in active document
for(i in 1:wrd[["ActiveDocument"]][["Bookmarks"]]$count()){
  print(wrd[["ActiveDocument"]][["Bookmarks"]]$Item(i)$Name())
}
## End(Not run)
Start a new instance of Excel and return its handle. This is needed to address the Excel application and objects afterwards.
GetNewXL(visible = TRUE, newdoc = TRUE)
visible |
logical, should Excel be made visible? Defaults to TRUE. |
newdoc |
logical, determining if a new workbook should be created. Defaults to TRUE. |
Here's a list of some frequently used commands.
Let's assume:
xl <- GetNewXL()
workbooks | xl$workbooks()$count() |
quit excel | xl$quit() |
Andri Signorell <[email protected]>
XLView
, XLGetRange
, XLGetWorkbook
## Not run: 
# Windows-specific example
# get a handle to a new excel instance
xl <- GetNewXL()
## End(Not run)
Compute the Gini coefficient, the most commonly used measure of inequality.
Gini(x, weights = NULL, unbiased = TRUE, conf.level = NA, R = 10000,
     type = "bca", na.rm = FALSE)
x |
a vector containing at least non-negative elements. The result will be |
weights |
a numerical vector of weights the same length as |
unbiased |
logical. In order for G to be an unbiased estimate of the true population value,
the calculated Gini is multiplied by n/(n-1). |
conf.level |
confidence level for the confidence interval, restricted to lie between 0 and 1.
If set to |
R |
number of bootstrap replicates. Usually this will be a single positive integer.
For importance resampling, some resamples may use one set of weights and others use a different set of weights. In this case R would be a vector of
integers where each component gives the number of resamples from each of the rows of weights. |
type |
character string representing the type of interval required.
The value should be one out of the c( |
na.rm |
logical. Should missing values be removed? Defaults to FALSE. |
The range of the Gini coefficient goes from 0 (no concentration) to (n-1)/n (maximal concentration). The bias-corrected Gini coefficient goes from 0 to 1.
The small sample variance properties of the Gini coefficient are not known, and large sample approximations to the variance of the coefficient are poor (Mills and Zandvakili, 1997; Glasser, 1962; Dixon et al., 1987),
therefore confidence intervals are calculated via bootstrap re-sampling methods (Efron and Tibshirani, 1997).
Two types of bootstrap confidence intervals are commonly used: percentile and bias-corrected (Mills and Zandvakili, 1997; Dixon et al., 1987; Efron and Tibshirani, 1997).
The bias-corrected intervals are most appropriate for most applications and are set as default for the type
argument ("bca"
).
Dixon (1987) describes a refinement of the bias-corrected method known as 'accelerated';
this produces values very close to conventional bias-corrected intervals.
(Iain Buchan (2002) Calculating the Gini coefficient of inequality, see: https://www.statsdirect.com/help/default.htm#nonparametric_methods/gini.htm)
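As a small cross-check sketch, the (not bias-corrected) Gini coefficient can be computed directly from its definition as the mean absolute difference divided by twice the mean:

x <- c(541, 1463, 2445, 3438, 4437, 5401, 6392, 8304, 11904, 22261)
mean(abs(outer(x, x, "-"))) / (2 * mean(x))   # should agree with Gini(x, unbiased=FALSE)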
If conf.level
is set to NA
then the result will be
a |
single numeric value |
and
if a conf.level
is provided, a named numeric vector with 3 elements:
gini |
Gini coefficient |
lwr.ci |
lower bound of the confidence interval |
upr.ci |
upper bound of the confidence interval |
Andri Signorell <[email protected]>
Cowell, F. A. (2000) Measurement of Inequality in Atkinson, A. B. / Bourguignon, F. (Eds): Handbook of Income Distribution. Amsterdam.
Cowell, F. A. (1995) Measuring Inequality Harvester Wheatshef: Prentice Hall.
Marshall, Olkin (1979) Inequalities: Theory of Majorization and Its Applications. New York: Academic Press.
Glasser C. (1962) Variance formulas for the mean difference and coefficient of concentration. Journal of the American Statistical Association 57:648-654.
Mills JA, Zandvakili A. (1997). Statistical inference via bootstrapping for measures of inequality. Journal of Applied Econometrics 12:133-150.
Dixon, PM, Weiner J., Mitchell-Olds T, Woodley R. (1987) Boot-strapping the Gini coefficient of inequality. Ecology 68:1548-1551.
Efron B, Tibshirani R. (1997) Improvements on cross-validation: The bootstrap method. Journal of the American Statistical Association 92:548-560.
See Herfindahl
, Rosenbluth
for concentration measures,
Lc
for the Lorenz curve; ineq()
in the package ineq contains additional inequality measures.
# generate vector (of incomes)
x <- c(541, 1463, 2445, 3438, 4437, 5401, 6392, 8304, 11904, 22261)

# compute Gini coefficient
Gini(x)

# working with weights
fl <- c(2.5, 7.5, 15, 35, 75, 150)    # midpoints of classes
n  <- c(25, 13, 10, 5, 5, 2)          # frequencies

# with confidence intervals
Gini(x=fl, weights=n, conf.level=0.95, unbiased=FALSE)

# some special cases
x <- c(10, 10, 0, 0, 0)
plot(Lc(x))

Gini(x, unbiased=FALSE)

# the same with weights
Gini(x=c(10, 0), weights=c(2, 3), unbiased=FALSE)

# perfect balance
Gini(c(10, 10, 10))
Calculate the Gini-Simpson coefficient, the Gini variant proposed by Deltas and the Hunter-Gaston Index.
GiniSimpson(x, na.rm = FALSE)

GiniDeltas(x, na.rm = FALSE)

HunterGaston(x, na.rm = FALSE)
x |
a factor containing at least non-negative elements. |
na.rm |
logical. Should missing values be removed? Defaults to |
The original Simpson index equals the probability that two entities taken at random from the dataset of interest (with replacement) represent the same type.
The Simpson index was introduced in 1949 by Edward H. Simpson to measure the degree of concentration when individuals are classified into types. The same index was rediscovered by Orris C. Herfindahl in 1950. The square root of the index had already been introduced in 1945 by the economist Albert O. Hirschman. As a result, the same measure is usually known as the Simpson index in ecology, and as the Herfindahl index or the Herfindahl-Hirschman index (HHI) in economics.
Its transformation, 1 minus the Simpson index, therefore equals the probability that the two entities represent different types.
This measure is also known in ecology as the probability of interspecific encounter (PIE) and the Gini-Simpson index.
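A small sketch of this probability interpretation, computing the index directly from the class proportions as one minus the sum of squared proportions:

x <- c(261, 29, 33, 15, 39, 28, 95, 5, 6, 28, 69, 8, 105, 38, 15)   # class counts
p <- x / sum(x)
1 - sum(p^2)          # should agree with GiniSimpson(x)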
a numeric value.
Andri Signorell <[email protected]>
Cover Thomas M. and Thomas Joy A. (1991) Elements of Information Theory. Wiley.
Hunter, P., Gaston, A. G. (1988) Numerical Index of the Discriminatory Ability of Typing Systems: an Application of Simpson's Index of Diversity, Journal of Clinical Microbiology, Nov. 1988, p. 2465-2466.
Deltas (2003) DOI:10.1162/rest.2003.85.1.226.
DivCoef
, Entropy
, Gini
, Herfindahl
x <- c(261,29,33,15,39,28,95,5,6,28,69,8,105,38,15)
GiniSimpson(x)

# is the same as
1 - Herfindahl(x)

GiniSimpson(c(783,121,112,70,201,153,425,19,37,126,325,51,442,193,41))
Calculates the geometric mean, its confidence interval and the geometric standard deviation of a vector x.
Gmean(x, method = c("classic", "boot"), conf.level = NA,
      sides = c("two.sided", "left", "right"), na.rm = FALSE, ...)

Gsd(x, na.rm = FALSE)
x |
a positive numeric vector. An object which is not a vector is coerced (if possible) by as.vector. |
method |
a vector of character strings representing the type of intervals required. The value should be any subset of the values |
conf.level |
confidence level of the interval. Default is |
sides |
a character string specifying the side of the confidence interval, must be one of |
na.rm |
logical, indicating whether |
... |
further arguments are passed to the |
The geometric mean is defined as the n-th root of the product of the values: Gmean(x) = (x_1 * x_2 * ... * x_n)^(1/n).
The geometric mean and geometric standard deviation are restricted to positive inputs (because otherwise the answer can have an imaginary component). Hence if any argument is negative, the result will be NA
. If any argument is zero, then the geometric mean is zero.
For strictly positive values the geometric mean is computed as exp(MeanCI(log(x))).
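A small cross-check sketch of this identity for strictly positive values:

x <- c(2, 8, 32)
prod(x)^(1/length(x))   # n-th root of the product, here 8
exp(mean(log(x)))       # same value computed on the log scale; should agree with Gmean(x)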
Considerations (Roenfeldt 2018)
"The calculation of the geometric mean requires that all values are non-zero and positive. So what should you do if you have data that do not meet this requirement? If you have values that equal zero, you have a few options:
Adjust your scale so that you add 1 to every number in the data set, and then subtract 1 from the resulting geometric mean.
Ignore zeros or missing data in your calculations.
Convert zeros to a very small number (often called "below the detection limit") that is less than the next smallest number in the data set.
If you have negative numbers, you will need to convert those numbers to a positive value before calculating the geometric mean. You can then assign the resulting geometric mean a negative value. If your data set contains both positive and negative values, you will have to separate them and find the geometric means for each group, and you can then find the weighted average of their individual geometric means to find the total geometric mean for the full data set. If none of these options appeals to you, you are not alone! There is controversy among statisticians about what is the best method for dealing with these values. You may want to calculate several types of averages and decide what makes the most sense for you and the results you are trying to report."
a numeric value.
Andri Signorell <[email protected]>
Snedecor, G. W., Cochran, W. G. (1989) Statistical Methods, 8th ed. Ames, IA: Iowa State University Press
Roenfeldt K. (2018) Better than Average: Calculating Geometric Means Using SAS, Henry M. Jackson Foundation for the Advancement of Military Medicine, https://www.lexjansen.com/wuss/2018/56_Final_Paper_PDF.pdf
x <- runif(5)
Gmean(x)

m <- matrix(runif(50), nrow = 10)
apply(m, 2, Gmean)
sapply(as.data.frame(m), Gmean)

# ......................................................
# example in https://www.stata.com/manuals13/rameans.pdf
x <- c(5, 4, -4, -5, 0, 0, NA, 7)

# positives only
Gmean(x[x>0], na.rm=TRUE, conf.level=0.95)

# add 5 to original values and remove zeros
Gmean(NAIfZero(x+5), na.rm=TRUE, conf.level=0.95)
Density, distribution function, quantile function and random generation for the Gompertz distribution with unrestricted shape.
dGompertz(x, shape, rate = 1, log = FALSE)
pGompertz(q, shape, rate = 1, lower.tail = TRUE, log.p = FALSE)
qGompertz(p, shape, rate = 1, lower.tail = TRUE, log.p = FALSE)
rGompertz(n, shape = 1, rate = 1)
x , q
|
vector of quantiles. |
shape , rate
|
vector of shape and rate parameters. |
log , log.p
|
logical; if TRUE, probabilities p are given as log(p). |
lower.tail |
logical; if TRUE (default), probabilities are |
p |
vector of probabilities. |
n |
number of observations. If |
The Gompertz distribution with shape parameter a and rate parameter b has probability density function

f(x | a, b) = b * exp(a*x) * exp(-(b/a) * (exp(a*x) - 1))

For a = 0 the Gompertz is equivalent to the exponential distribution with constant hazard and rate b.

The probability distribution function is

F(x | a, b) = 1 - exp(-(b/a) * (exp(a*x) - 1))

Thus if a is negative, letting x tend to infinity shows that there is a non-zero probability exp(b/a) of living forever. On these occasions qGompertz and rGompertz will return Inf.
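A small illustration sketch of this behaviour, assuming the parameterisation described above:

# with a negative shape the distribution function levels off below 1 ...
pGompertz(c(1, 10, 100), shape=-1, rate=1)
1 - exp(1/-1)                          # its limit 1 - exp(rate/shape), about 0.632

# ... so quantiles beyond that limit are infinite
qGompertz(0.99, shape=-1, rate=1)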
dGompertz
gives the density, pGompertz
gives the
distribution function, qGompertz
gives the quantile function,
and rGompertz
generates random deviates.
Some implementations of the Gompertz restrict the shape parameter a to be strictly positive, which ensures that the probability of survival decreases to zero as x increases to infinity. The more flexible implementation given here is consistent with streg in Stata.
The functions dGompertz
and similar available in the
package eha label the parameters the other way round, so that what is
called the shape
there is called the rate
here, and what is
called 1 / scale
there is called the shape
here. The
terminology here is consistent with the exponential dexp
and
Weibull dweibull
distributions in R.
Christopher Jackson <[email protected]>
Stata Press (2007) Stata release 10 manual: Survival analysis and epidemiological tables.
Calculate Goodman Kruskal's Gamma statistic, a measure of
association for ordinal factors in a two-way table.
The function has interfaces for a contingency table (matrix) and for single vectors (which will then be tabulated).
GoodmanKruskalGamma(x, y = NULL, conf.level = NA, ...)
x |
a numeric vector or a contingency table. A matrix will be treated as a table. |
y |
NULL (default) or a vector with compatible dimensions to |
conf.level |
confidence level of the interval. If set to |
... |
further arguments are passed to the function |
The estimator of gamma is based only on the number of concordant and discordant pairs of observations. It ignores tied pairs (that is, pairs of observations that have equal values of X or equal values of Y). Gamma is appropriate only when both variables lie on an ordinal scale.
It has the range [-1, 1]. If the two variables are independent, then the estimator of gamma tends to be close to zero.
For 2x2 tables, gamma is equivalent to Yule's Q (
YuleQ
).
Gamma is estimated by G = (P - Q) / (P + Q),
where P equals twice the number of concordances and Q twice the number of discordances.
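As a small cross-check sketch (assuming ConDisPairs() returns the numbers of concordant and discordant pairs as list elements C and D):

tab <- as.table(rbind(c(26, 26, 23, 18, 9),
                      c( 6,  7,  9, 14, 23)))
cd <- ConDisPairs(tab)
(cd$C - cd$D) / (cd$C + cd$D)     # should agree with GoodmanKruskalGamma(tab)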
a single numeric value if no confidence intervals are requested,
and otherwise a numeric vector with 3 elements for the estimate, the lower and the upper confidence interval
Andri Signorell <[email protected]>
Agresti, A. (2002) Categorical Data Analysis. John Wiley & Sons, pp. 57-59.
Brown, M.B., Benedetti, J.K.(1977) Sampling Behavior of Tests for Correlation in Two-Way Contingency Tables, Journal of the American Statistical Association, 72, 309-315.
Goodman, L. A., & Kruskal, W. H. (1954) Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732-764.
Goodman, L. A., & Kruskal, W. H. (1963) Measures of association for cross classifications III: Approximate sampling theory. Journal of the American Statistical Association, 58, 310-364.
There's another implementation of gamma in vcdExtra GKgamma
ConDisPairs
yields concordant and discordant pairs
Other association measures: KendallTauA
(tau-a), KendallTauB
(tau-b), cor
(method="kendall") for tau-b, StuartTauC
(tau-c), SomersDelta
Lambda
, GoodmanKruskalTau
(tau), UncertCoef
, MutInf
# example in:
# http://support.sas.com/documentation/cdl/en/statugfreq/63124/PDF/default/statugfreq.pdf
# pp. S. 1821 (149)
tab <- as.table(rbind(c(26,26,23,18, 9),
                      c( 6, 7, 9,14,23)))
GoodmanKruskalGamma(tab, conf.level=0.95)
Calculate Goodman Kruskal's tau statistic, a measure of
association for nominal variables in a two-way table.
The function has interfaces for a table (matrix) and for single vectors.
GoodmanKruskalTau(x, y = NULL, direction = c("row", "column"), conf.level = NA, ...)
x |
a numeric vector or a table. A matrix will be treated as table. |
y |
NULL (default) or a vector with compatible dimensions to |
direction |
direction of the calculation. Can be |
conf.level |
confidence level of the interval. If set to |
... |
further arguments are passed to the function |
Goodman-Kruskal tau measures association for cross tabulations of nominal level variables. Goodman-Kruskal tau is based on random category assignment. It measures the percentage improvement in predictability of the dependent variable (column or row variable) given the value of other variables (row or column variables). Goodman-Kruskal tau is the same as Goodman-Kruskal lambda except the calculations of the tau statistic are based on assignment probabilities specified by marginal or conditional proportions. Misclassification probabilities are based on random category assignment with probabilities specified by marginal or conditional proportion.
Goodman Kruskal tau reduces to phi^2 (see:
Phi
) in the 2x2-table case.
a single numeric value if no confidence intervals are requested,
and otherwise a numeric vector with 3 elements for the estimate, the lower and the upper confidence interval
Andri Signorell <[email protected]>, based on code from Antti Arppe <[email protected]>
Agresti, A. (2002) Categorical Data Analysis. John Wiley & Sons, pp. 57-59.
Goodman, L. A., & Kruskal, W. H. (1954) Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732-764.
Somers, R. H. (1962) A New Asymmetric Measure of Association for Ordinal Variables, American Sociological Review, 27, 799-811.
Goodman, L. A., & Kruskal, W. H. (1963) Measures of association for cross classifications III: Approximate sampling theory. Journal of the American Statistical Association, 58, 310-364.
Liebetrau, A. M. (1983) Measures of Association, Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-004. Newbury Park, CA: Sage, pp. 24–30
ConDisPairs
yields concordant and discordant pairs
Other association measures: KendallTauA
(Tau a), cor
(method="kendall") for Tau b, StuartTauC
, GoodmanKruskalGamma
Lambda
, UncertCoef
, MutInf
# example in:
# http://support.sas.com/documentation/cdl/en/statugfreq/63124/PDF/default/statugfreq.pdf
# pp. S. 1821
tab <- as.table(rbind(c(26,26,23,18,9), c(6,7,9,14,23)))

# Goodman Kruskal's tau C|R
GoodmanKruskalTau(tab, direction="column", conf.level=0.95)
# Goodman Kruskal's tau R|C
GoodmanKruskalTau(tab, direction="row", conf.level=0.95)

# http://support.sas.com/documentation/cdl/en/statugfreq/63124/PDF/default/statugfreq.pdf
# pp. 1814 (143)
tab <- as.table(cbind(c(11,2), c(4,6)))
GoodmanKruskalTau(tab, direction="row", conf.level=0.95)
GoodmanKruskalTau(tab, direction="column", conf.level=0.95)
# reduce both to:
Phi(tab)^2

# example 1 in Liebetrau (1983)
tt <- matrix(c(549,93,233,119,225,455,402,
               212,124,78,42,41,12,132,
               54,54,33,13,46,7,153),
             ncol=3,
             dimnames=list(rownames=c("Gov", "Mil", "Edu", "Eco", "Intel", "Rel", "For"),
                           colnames=c("One", "Two", "Multi")))
GoodmanKruskalTau(tt, direction = "row", conf.level = 0.95)
GoodmanKruskalTau(tt, direction = "column", conf.level = 0.95)

# SPSS
ttt <- matrix(c(225,53,206,3,1,12), nrow=3,
              dimnames=list(rownames=c("right","center", "left"),
                            colnames=c("us","ussr")))
round(GoodmanKruskalTau(ttt, direction = "r", con=0.95), d=3)
round(GoodmanKruskalTau(ttt, direction = "c"), d=3)
GTest
performs chi-squared contingency table tests
and goodness-of-fit tests.
GTest(x, y = NULL, correct = c("none", "williams", "yates"),
      p = rep(1/length(x), length(x)), rescale.p = FALSE)
x |
a numeric vector or matrix. |
y |
a numeric vector; ignored if |
correct |
one out of |
p |
a vector of probabilities of the same length of |
rescale.p |
a logical scalar; if |
The G-test is also called "Likelihood Ratio Test" and is asymptotically equivalent to the Pearson ChiSquare-test but not usually used when analyzing 2x2 tables. It is used in logistic regression and loglinear modeling which involves contingency tables. The G-test is also reported in the standard summary of Desc
for tables.
If x
is a matrix with one row or column, or if x
is a
vector and y
is not given, then a goodness-of-fit test
is performed (x
is treated as a one-dimensional
contingency table). The entries of x
must be non-negative
integers. In this case, the hypothesis tested is whether the
population probabilities equal those in p
, or are all equal if
p
is not given.
If x
is a matrix with at least two rows and columns, it is
taken as a two-dimensional contingency table: the entries of x
must be non-negative integers. Otherwise, x
and y
must
be vectors or factors of the same length; cases with missing values
are removed, the objects are coerced to factors, and the contingency
table is computed from these. Then G-test is
performed on the null hypothesis that the joint distribution of the
cell counts in a 2-dimensional contingency table is the product of the
row and column marginals.
The Yates' correction for the test of independence is taken from Mike Camann's 2x2 G-test function. The goodness-of-fit Yates' correction is implemented as described in Zar (2000).
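The statistic itself is the likelihood ratio statistic G = 2 * sum(observed * log(observed/expected)); a small cross-check sketch using the expected counts from chisq.test:

M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
E <- chisq.test(M)$expected
2 * sum(M * log(M / E))     # should agree with GTest(M)$statistic (correct = "none")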
A list with class "htest"
containing the following
components:
statistic |
the value of the chi-squared test statistic. |
parameter |
the degrees of freedom of the approximate
chi-squared distribution of the test statistic, |
p.value |
the p-value for the test. |
method |
a character string indicating the type of test performed, and whether Monte Carlo simulation or continuity correction was used. |
data.name |
a character string giving the name(s) of the data. |
observed |
the observed counts. |
expected |
the expected counts under the null hypothesis. |
Pete Hurd <[email protected]>, Andri Signorell <[email protected]> (tiny tweaks)
Hope, A. C. A. (1968) A simplified Monte Carlo significance test procedure. J. Roy, Statist. Soc. B 30, 582–598.
Patefield, W. M. (1981) Algorithm AS159. An efficient method of generating r x c tables with given row and column totals. Applied Statistics 30, 91–97.
Agresti, A. (2007) An Introduction to Categorical Data Analysis, 2nd ed., New York: John Wiley & Sons. Page 38.
Sokal, R. R., F. J. Rohlf (2012) Biometry: the principles and practice of statistics in biological research. 4th edition. W. H. Freeman and Co.: New York. 937 pp.
## From Agresti(2007) p.39
M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("M","F"),
                    party  = c("Democrat","Independent", "Republican"))

(Xsq <- GTest(M))   # Prints test summary

Xsq$observed        # observed counts (same as M)
Xsq$expected        # expected counts under the null

## Testing for population probabilities
## Case A. Tabulated data
x <- c(A = 20, B = 15, C = 25)
GTest(x)
GTest(as.table(x))  # the same

x <- c(89, 37, 30, 28, 2)
p <- c(40, 20, 20, 15, 5)
try(
  GTest(x, p = p)   # gives an error
)

# works
p <- c(0.40, 0.20, 0.20, 0.19, 0.01)
# Expected count in category 5
# is 1.86 < 5  ==>  chi square approx.
GTest(x, p = p)     # maybe doubtful, but is ok!

## Case B. Raw data
x <- trunc(5 * runif(100))
GTest(table(x))     # NOT 'GTest(x)'!
Density function, distribution function, quantile function and random generation for the Gumbel distribution with location and scale parameters.
dGumbel(x, loc=0, scale=1, log = FALSE)
pGumbel(q, loc=0, scale=1, lower.tail = TRUE)
qGumbel(p, loc=0, scale=1, lower.tail = TRUE)
rGumbel(n, loc=0, scale=1)
x , q
|
Vector of quantiles. |
p |
Vector of probabilities. |
n |
Number of observations. |
loc , scale
|
Location and scale parameters (can be given as vectors). |
log |
Logical; if |
lower.tail |
Logical; if |
The Gumbel distribution function with location parameter loc = a and scale parameter scale = b is

G(z) = exp(-exp(-(z - a)/b))

for all real z, where b > 0.
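A small cross-check sketch of this distribution function:

q <- c(-1, 0, 2); loc <- -1; scale <- 0.5
exp(-exp(-(q - loc)/scale))     # should agree with pGumbel(q, loc, scale)
pGumbel(q, loc, scale)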
dGumbel
gives the density function, pGumbel
gives
the distribution function, qGumbel
gives the quantile
function, and rGumbel
generates random deviates.
Alec Stephenson <[email protected]>
rFrechet
, rGenExtrVal
, rRevWeibull
dGumbel(-1:2, -1, 0.5)
pGumbel(-1:2, -1, 0.5)
qGumbel(seq(0.9, 0.6, -0.1), 2, 0.5)
rGumbel(6, -1, 0.5)

p <- (1:9)/10
pGumbel(qGumbel(p, -1, 2), -1, 2)
## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Computes the concentration within a vector according to the specified concentration measure.
Herfindahl(x, n = rep(1, length(x)), parameter = 1, na.rm = FALSE)

Rosenbluth(x, n = rep(1, length(x)), na.rm = FALSE)
x |
a vector containing non-negative elements |
n |
a vector of frequencies (weights), must be same length as x. |
parameter |
parameter of the concentration measure (if set to |
na.rm |
logical. Should missing values be removed? Defaults to FALSE. |
the value of the concentration measure
The same measure is usually known as the Simpson index in ecology, and as the Herfindahl index or the Herfindahl-Hirschman index (HHI) in economics.
These functions were previously published as conc()
in the ineq package and have been
integrated here without logical changes. NA
and weights support were added.
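A small cross-check sketch: with the default parameter = 1 (and unit weights) the Herfindahl index is simply the sum of the squared shares:

x <- c(541, 1463, 2445, 3438, 4437, 5401, 6392, 8304, 11904, 22261)
sum((x / sum(x))^2)     # should agree with Herfindahl(x)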
Achim Zeileis <[email protected]>
Cowell, F. A. (2000) Measurement of Inequality, in Atkinson, A. B., Bourguignon, F. Handbook of Income Distribution. (Eds) Amsterdam
Cowell, F. A. (1995) Measuring Inequality. Prentice Hall/Harvester Wheatshef
Hall, M., Tidemann, N. (1967) Measures of Concentration, JASA 62, 162-168.
See Gini
, Atkinson
and ineq()
for additional inequality measures
# generate vector (of sales)
x <- c(541, 1463, 2445, 3438, 4437, 5401, 6392, 8304, 11904, 22261)

# compute Herfindahl coefficient with parameter 1
Herfindahl(x)

# compute coefficient of Hall/Tiedemann/Rosenbluth
Rosenbluth(x)

# Some more examples
Herfindahl(c(261,29,33,15,39,28,95,5,6,28,69,8,105,38,15))
Herfindahl(c(783,121,112,70,201,153,425,19,37,126,325,51,442,193,41))
Given a color as a hex string #rrggbb, find the closest match in the table of known (named) colors.
HexToCol(hexstr, method = "rgb", metric = "euclidean")
hexstr |
a color or a vector of colors specified as hexadecimal string of the form "#RRGGBB" or "#RRGGBBAA" |
method |
character string specifying the color space to be used. Can be "rgb" (default) or "hsv". |
metric |
character string specifying the metric to be used for calculating distances between the colors. Available options are "euclidean" (default) and "manhattan". Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences. |
Finds the color with the minimum squared distance in RGB space.
The colorname(s) of the closest match(es) (if more than one).
Ben Bolker, vector support Andri Signorell <[email protected]>
ColToHex(c("lightblue", "salmon"))

HexToCol(c("#ADD8E6", "#FA1572"))
HexToCol(Pal("Helsana"))

x <- ColToRgb("darkmagenta")
x[2,] <- x[2,] + 155
RgbToCol(x)
HexToRgb()
converts a hex string color to its red/green/blue representation.
HexToRgb(hex)
hex |
a color or a vector of colors specified as hexadecimal string of the form "#RRGGBB" or "#RRGGBBAA" |
A hex color is written as a hash character, "#", followed by 3 or 4 pairs of hexadecimal digits (0-9A-F), i.e. 6 or 8 digits in total. The first 3 pairs of digits specify the red, green and blue components. When there are 8 digits, the last pair is interpreted as the alpha channel defining transparency, where 00 represents a fully transparent and FF a fully opaque color.
The result will be returned as a matrix having 3 or 4 rows, depending on whether the input contained an RRGGBBAA definition or not. No distinction is made between upper and lower case. A missing leading # is tolerated.
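A small sketch of what the conversion does for a single 6-digit value, reading each pair of digits as a base-16 integer:

hex <- "#ADD8E6"
strtoi(substring(hex, c(2, 4, 6), c(3, 5, 7)), base = 16L)   # red, green, blue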
a matrix with 3 or 4 rows.
Andri Signorell <[email protected]>
HexToRgb(c("#ADD8E6", "#FA1572"))

# 4-digit representation returns a 4 row matrix
HexToRgb(hex=c("#A52A2ABB", "#A52A3B", "C52A3B"))
Calculates the harmonic mean and its confidence interval of a vector x.
Hmean(x, method = c("classic", "boot"), conf.level = NA,
      sides = c("two.sided", "left", "right"), na.rm = FALSE, ...)
x |
a positive numeric vector. An object which is not a vector is coerced (if possible) by as.vector. |
method |
a vector of character strings representing the type of intervals required. The value should be any subset of the values |
conf.level |
confidence level of the interval. Default is |
sides |
a character string specifying the side of the confidence interval, must be one of |
na.rm |
logical, indicating whether |
... |
further arguments are passed to the |
To compute the harmonic mean, 1/x
is first calculated, before the arithmetic mean and its confidence interval are computed by MeanCI
. The harmonic mean is then the reciprocal of the arithmetic mean of the reciprocals of the values. The same applies to the confidence interval.
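A small cross-check sketch of the point estimate:

x <- c(2, 4, 8)
1 / mean(1 / x)     # should agree with Hmean(x)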
The harmonic mean is restricted to strictly positive inputs, if any argument is negative, then the result will be NA
.
If the lower bound of the confidence interval is not greater than zero, then the confidence interval is not defined, and thus NA
will be reported.
Use sapply
to calculate the measures from a data frame, resp. from a matrix.
a numeric value.
Andri Signorell <[email protected]>
Snedecor, G. W., Cochran, W. G. (1989) Statistical Methods, 8th ed. Ames, IA: Iowa State University Press
x <- runif(5)
Hmean(x)

m <- matrix(runif(50), nrow = 10)
apply(m, 2, Hmean)
sapply(as.data.frame(m), Hmean)
HmsToSec
- Converts a vector of h:m:s to seconds.
SecToHms
- Converts a vector of seconds to h:m:s.
HmsToSec(x)

SecToHms(x, digits = NULL)
x |
A vector of times in h:m:s (for |
digits |
the number of digits to use for potential fractions of seconds. |
HmsToSec
- Returns a vector of times in seconds.
SecToHms
- Returns a vector of times in h:m:s format.
Tyler Rinker <[email protected]>
HmsToSec(c("02:00:03", "04:03:01"))
HmsToSec(SecToHms(c(222, 1234, 55)))
SecToHms(c(256, 3456, 56565))
Function to compute the Hodges-Lehmann estimator of location in the one and two sample case following a clever fast algorithm by John Monahan (1984).
HodgesLehmann(x, y = NULL, conf.level = NA, na.rm = FALSE)
x |
a numeric vector. |
y |
an optional numeric vector of data values: as with x non-finite values will be omitted. |
conf.level |
confidence level of the interval. |
na.rm |
logical. Should missing values be removed? Defaults to |
The Hodges-Lehmann estimator is the median of the combined data points and Walsh averages.
It is the same as the Pseudo Median returned as a by-product of the function wilcox.test
(which, however, does not calculate it correctly as soon as ties are present).
Note that in the two-sample case the estimator for the difference in location parameters does not estimate the difference in medians (a common misconception) but rather the median of the difference between a sample from x and a sample from y.
(The calculation of the confidence intervals is not yet implemented.)
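A small cross-check sketch for the one-sample case, computing the median of all pairwise Walsh averages (x_i + x_j)/2 with i <= j:

x <- c(1, 5, 2, 8, 7)
w <- outer(x, x, "+") / 2
median(w[upper.tri(w, diag = TRUE)])     # should agree with HodgesLehmann(x)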
the Hodges-Lehmann estimator of location as a single numeric value if no confidence intervals are requested,
and otherwise a numeric vector with 3 elements for the estimate, the lower and the upper confidence interval
Cyril Flurin Moser (Cyril did the lion's share and coded Monahan's algorithm in C++), Andri Signorell <[email protected]>
Hodges, J.L., and Lehmann, E.L. (1963), Estimates of location based on rank tests. The Annals of Mathematical Statistics, 34, 598–611.
Monahan, J. (1984), Algorithm 616: Fast Computation of the Hodges-Lehmann Location Estimator, ACM Transactions on Mathematical Software, Vol. 10, No. 3, pp. 265-270
set.seed(1)
x <- rt(100, df = 3)
y <- rt(100, df = 5)

HodgesLehmann(x)
HodgesLehmann(x, y)

# same as
wilcox.test(x, conf.int = TRUE)$estimate
Computes a matrix of Hoeffding's (1948) D
statistics for all possible
pairs of columns of a matrix. D
is a measure of the distance
between F(x,y)
and G(x)H(y)
, where F(x,y)
is the joint CDF of X
and Y
,
and G
and H
are marginal CDFs. Missing values are deleted in pairs rather than deleting all rows
of x
having any missing variables.
The D
statistic is robust against a wide
variety of alternatives to independence, such as non-monotonic relationships.
The larger the value of D
, the more dependent are X
and Y
(for many types
of dependencies). D
used here is 30 times Hoeffding's original D
, and
ranges from -0.5 to 1.0 if there are no ties in the data.
print.HoeffD
prints the information derived by HoeffD
. The higher
the value of D
, the more dependent are x
and y
.
HoeffD(x, y)

## S3 method for class 'HoeffD'
print(x, ...)
x |
a numeric matrix with at least 5 rows and at least 2 columns (if
|
y |
a numeric vector or matrix which will be concatenated to |
... |
ignored |
Uses midranks in case of ties, as described by Hollander and Wolfe.
P-values are approximated by linear interpolation on the table
in Hollander and Wolfe, which uses the asymptotically equivalent
Blum-Kiefer-Rosenblatt statistic. For P<.0001
or >0.5
, P
values are
computed using a well-fitting linear regression function in log P
vs.
the test statistic.
Ranks (but not bivariate ranks) are computed using efficient
algorithms (see reference 3).
a list with elements D
, the
matrix of D statistics, n
the
matrix of number of observations used in analyzing each pair of variables,
and P
, the asymptotic P-values.
Pairs with fewer than 5 non-missing values have the D statistic set to NA.
The diagonals of n
are the number of non-NAs for the single variable
corresponding to that row and column.
Frank Harrell <[email protected]>
Department of Biostatistics
Vanderbilt University
Hoeffding W. (1948) A non-parametric test of independence. Ann Math Stat 19:546–57.
Hollander M., Wolfe D.A. (1973) Nonparametric Statistical Methods, pp. 228–235, 423. New York: Wiley.
Press W.H., Flannery B.P., Teukolsky S.A., Vetterling, W.T. (1988) Numerical Recipes in C Cambridge: Cambridge University Press.
x <- c(-2, -1, 0, 1, 2)
y <- c(4, 1, 0, 1, 4)
z <- c(1, 2, 3, 4, NA)
q <- c(1, 2, 3, 4, 5)
HoeffD(cbind(x, y, z, q))

# Hoeffding's test can detect even one-to-many dependency
set.seed(1)
x <- seq(-10, 10, length=200)
y <- x * sign(runif(200, -1, 1))
plot(x, y)
HoeffD(x, y)
The function computes Hosmer-Lemeshow goodness of fit tests for C and H statistic as well as the le Cessie-van Houwelingen-Copas-Hosmer unweighted sum of squares test for global goodness of fit.
HosmerLemeshowTest(fit, obs, ngr = 10, X, verbose = FALSE)
fit |
numeric vector with fitted probabilities. |
obs |
numeric vector with observed values. |
ngr |
number of groups for C and H statistic. |
X |
covariate(s) for le Cessie-van Houwelingen-Copas-Hosmer global goodness of fit test. |
verbose |
logical, print intermediate results. |
Hosmer-Lemeshow goodness of fit tests are computed; see Lemeshow and Hosmer (1982).
If X
is specified, the le Cessie-van Houwelingen-Copas-Hosmer
unweighted sum of squares test for global goodness of fit is additionally
determined; see Hosmer et al. (1997).
A list of tests.
Matthias Kohl <[email protected]>
Lemeshow, S., Hosmer, D.W. (1982) A review of goodness of fit statistics for use in the development of logistic regression models. American Journal of Epidemiology, 115(1), 92-106.
Hosmer, D.W., Hosmer, T., le Cessie, S., Lemeshow, S. (1997). A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine, 16, 965-980.
set.seed(111)
x1  <- factor(sample(1:3, 50, replace = TRUE))
x2  <- rnorm(50)
obs <- sample(c(0,1), 50, replace = TRUE)

fit <- glm(obs ~ x1 + x2, family = binomial)

HosmerLemeshowTest(fit = fitted(fit), obs = obs, X = cbind(x1, x2))
Hotelling's T2 test is the multivariate generalisation of the Student's t test. A one-sample Hotelling's T2 test can be used to test if a set of vectors of data (which should be a sample of a single statistical population) has a mean equal to a hypothetical mean. A two-sample Hotelling's T2 test may be used to test whether the mean vectors (multivariate means) of two multivariate data sets differ significantly.
HotellingsT2Test(x, ...)

## Default S3 method:
HotellingsT2Test(x, y = NULL, mu = NULL, test = "f", ...)

## S3 method for class 'formula'
HotellingsT2Test(formula, data, subset, na.action, ...)
x |
a numeric data frame or matrix. |
y |
an optional numeric data frame or matrix for the two sample test. If |
mu |
a vector indicating the hypothesized value of the mean (or difference
in means if a two sample test is performed). |
test |
if |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain NAs. Defaults to |
... |
further arguments to be passed to or from methods. |
The classical test for testing the location of a multivariate population or for testing the mean
difference for two multivariate populations. When test = "f"
the F-distribution is used for
the test statistic and it is assumed that the data are normally distributed. If the chisquare
approximation is used, the normal assumption can be relaxed to existence of second moments.
In the two sample case both populations are assumed to have the same covariance matrix.
The formula interface is only applicable for the 2-sample tests.
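A small cross-check sketch for the one-sample case, assuming the usual definition T2 = n * (xbar - mu)' S^(-1) (xbar - mu) and the F scaling (n-p)/(p*(n-1)) * T2 that is reported when test = "f":

X   <- as.matrix(iris[1:50, 1:2])
mu0 <- c(5, 3.4)
n <- nrow(X); p <- ncol(X)
d  <- colMeans(X) - mu0
T2 <- n * drop(t(d) %*% solve(cov(X)) %*% d)
(n - p) / (p * (n - 1)) * T2     # compare with HotellingsT2Test(X, mu=mu0)$statistic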
A list with class 'htest' containing the following components:
statistic |
the value of the T2-statistic. (That is the scaled value of the statistic that has an
F distribution or a chisquare distribution depending on the value of |
parameter |
the degrees of freedom for the T2-statistic. |
p.value |
the p-value for the test. |
null.value |
the specified hypothesized value of the mean or mean difference depending on whether it was a one-sample test or a two-sample test. |
alternative |
a character string with the value 'two.sided'. |
method |
a character string indicating what type of test was performed. |
data.name |
a character string giving the name of the data (and grouping vector). |
Klaus Nordhausen, <[email protected]>
Nordhausen K., Sirkia S., Oja H. and Tyler D. E. (2012) ICSNP: Tools for
Multivariate Nonparametrics. R package version 1.0-9.
https://cran.r-project.org/package=ICSNP
Anderson, T.W. (2003), An introduction to multivariate analysis, New Jersey: Wiley.
math.teach <- data.frame(
  teacher = factor(rep(1:2, c(3, 6))),
  satis   = c(1, 3, 2, 4, 6, 6, 5, 5, 4),
  know    = c(3, 7, 2, 6, 8, 8, 10, 10, 6))

with(math.teach,
     HotellingsT2Test(cbind(satis, know) ~ teacher))
(Generalized) Huber M-estimator of location with MAD scale, being
sensible also when the scale is zero where huber()
returns an error.
HuberM(x, k = 1.345, mu = median(x), s = mad(x, center = mu),
       na.rm = FALSE, conf.level = NA, ci.type = c("wald", "boot"), ...)
x |
numeric vector. |
k |
positive factor; the algorithm winsorizes at |
mu |
initial location estimator. |
s |
scale estimator held constant through the iterations. |
na.rm |
logical, indicating whether |
conf.level |
confidence level of the interval. If set to |
ci.type |
The type of confidence interval required. The value should be any subset
of the values |
... |
the dots are passed to the function |
The standard error is computed using the tau correction factor but no finite sample correction.
The original function is not exported, but can be accessed as DescTools::.huberM
.
If conf.level
is set to NA
then the result will be
a |
single numeric value |
and
if a conf.level
is provided, a named numeric vector with 3 elements:
huberm |
the estimate for location |
lwr.ci |
lower bound of the confidence interval |
upr.ci |
upper bound of the confidence interval |
Martin Maechler, building on the MASS code mentioned.
Andri Signorell <[email protected]> (confidence intervals and interface)
Huber, P. J. (1981) Robust Statistics. Wiley.
hubers
(and huber
) in package MASS;
mad
.
HuberM(c(1:9, 1000))
mad(c(1:9, 1000))

set.seed(7)
x <- c(round(rnorm(1000), 1), round(rnorm(50, m=10, sd=10)))
HuberM(x, conf.level=0.95)

## Not run: 
# scale zero
HuberM(rep(9, 100))
mad(rep(9, 100))

# bootstrap confidence intervals
HuberM(x, conf.level=0.95, ci.type="boot")
## End(Not run)
The intraclass correlation is used as a measure of association when studying the reliability of raters. Shrout and Fleiss (1979) outline 6 different estimates that depend upon the particular experimental design. All are implemented and given confidence limits.
ICC(x, type = c("all", "ICC1", "ICC2", "ICC3", "ICC1k", "ICC2k", "ICC3k"),
    conf.level = NA, na.rm = FALSE)

## S3 method for class 'ICC'
print(x, digits = 3, ...)
x |
|
type |
one out of "all", "ICC1", "ICC2", "ICC3", "ICC1k", "ICC2k", "ICC3k". See details. |
conf.level |
confidence level of the interval. If set to |
na.rm |
logical, indicating whether |
digits |
number of digits to use in printing |
... |
further arguments to be passed to or from methods. |
Shrout and Fleiss (1979) consider six cases of reliability of ratings done by k raters on n targets.
ICC1 | Each target is rated by a different judge and the judges are selected at random. |
(This is a one-way ANOVA fixed effects model and is found by (MSB- MSW)/(MSB+ (nr-1)*MSW)) | |
ICC2 | A random sample of k judges rate each target. The measure is one of absolute agreement |
in the ratings. Found as (MSB- MSE)/(MSB + (nr-1)*MSE + nr*(MSJ-MSE)/nc) | |
ICC3 | A fixed set of k judges rate each target. There is no generalization to a larger population |
of judges. (MSB - MSE)/(MSB+ (nr-1)*MSE) | |
Then, for each of these cases, is reliability to be estimated for a single rating or for the average of k ratings? (The 1 rating case is equivalent to the average intercorrelation, the k rating case to the Spearman Brown adjusted reliability.)
ICC1 is sensitive to differences in means between raters and is a measure of absolute agreement.
ICC2 and ICC3 remove mean differences between judges, but are sensitive to interactions of raters by judges.
The difference between ICC2 and ICC3 is whether raters are seen as fixed or random effects.
ICC1k, ICC2k, ICC3K reflect the means of k raters.
The intraclass correlation is used if raters are all of the same "class". That is, there is no logical way of distinguishing them. Examples include correlations between pairs of twins, correlations between raters. If the variables are logically distinguishable (e.g., different items on a test), then the more typical coefficient is based upon the inter-class correlation (e.g., a Pearson r) and a statistic such as alpha or omega might be used.
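A small cross-check sketch for ICC1, recomputed from the one-way ANOVA mean squares with the formula quoted above (k = number of judges):

sf <- matrix(c(9,2,5,8, 6,1,3,2, 8,4,6,8, 7,1,2,6, 10,5,6,9, 6,2,4,7),
             ncol=4, byrow=TRUE)
k <- ncol(sf)
d <- data.frame(score  = as.vector(t(sf)),
                target = factor(rep(seq_len(nrow(sf)), each = k)))
ms  <- anova(aov(score ~ target, data = d))
MSB <- ms["target", "Mean Sq"]
MSW <- ms["Residuals", "Mean Sq"]
(MSB - MSW) / (MSB + (k - 1) * MSW)     # should agree with ICC1 in ICC(sf)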
if method is set to "all", then the result will be
results |
A matrix of 6 rows and 8 columns, including the ICCs, F test, p values, and confidence limits |
summary |
The anova summary table |
stats |
The anova statistics |
MSW |
Mean Square Within based upon the anova |
if a specific type has been defined, the function will first check, whether no confidence intervals are requested:
if so, the result will be the estimate as numeric value
else a named numeric vector with 3 elements
ICCx |
estimate (name is the selected type of coefficient) |
lwr.ci |
lower confidence interval |
upr.ci |
upper confidence interval |
The results for the lower and upper Bounds for ICC(2,k) do not match those of SPSS 9 or 10, but do match the definitions of Shrout and Fleiss. SPSS seems to have been using the formula in McGraw and Wong, but not the errata on p 390. They seem to have fixed it in more recent releases (15).
William Revelle <[email protected]>, some editorial amendments Andri Signorell <[email protected]>
Shrout, P. E., Fleiss, J. L. (1979) Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86, 420-3428.
McGraw, K. O., Wong, S. P. (1996) Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30-46. + errata on page 390.
Revelle, W. (in prep) An introduction to psychometric theory with applications in R. Springer. (working draft available at http://personality-project.org/r/book/)
sf <- matrix(c( 9, 2, 5, 8,
                6, 1, 3, 2,
                8, 4, 6, 8,
                7, 1, 2, 6,
               10, 5, 6, 9,
                6, 2, 4, 7), ncol=4, byrow=TRUE,
             dimnames=list(paste("S", 1:6, sep=""), paste("J", 1:4, sep="")))

sf   # example from Shrout and Fleiss (1979)

ICC(sf)
The function identify
reads the position of the graphics pointer when
the (first) mouse button is pressed. It then searches the
coordinates given in x and y for the point closest to the
pointer. If this point is close enough to the pointer,
its index will be returned as part of the value of the
call.
## S3 method for class 'formula'
identify(formula, data, subset, na.action, ...)
formula |
a formula of the form |
data |
The data frame from which the formula should be evaluated. |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain NAs. Defaults to
|
... |
Other arguments to be passed to
|
This function is meant to make it easier to call
identify
after plot
has been called
using a formula and the data
argument.
A two dimensional plot must be active and the vectors in
x
and data frame in data
must correspond to
the x- and y-axes and the data of the plot.
If pos
is FALSE
, an integer vector containing the
indices of the identified points, in the order they were identified.
If pos
is TRUE
, a list containing a component
ind
, indicating which points were identified and a component
pos
, indicating where the labels were placed relative to the
identified points (1=below, 2=left, 3=above, 4=right and 0=no offset,
used if atpen = TRUE
).
Derek Ogle <[email protected]>
identify
, locator
, text
https://www.rforge.net/NCStats/files/
## Not run: 
## Copy and try in an interactive R session
plot(dist ~ speed, data = cars, subset = speed < 17)
identify(dist ~ speed, data = cars, subset = speed < 17)
## End(Not run)
Find all the points lying either in a rectangle area spanned by an upper left and a bottom-right point or by a polygon area consisting of any number of points defined by point and click.
IdentifyA(x, ...)

## S3 method for class 'formula'
IdentifyA(formula, data, subset, poly = FALSE, ...)

## Default S3 method:
IdentifyA(x, y = NULL, poly = FALSE, ...)
x , y
|
x and y values of the points used to create the plot. |
formula |
a |
data |
a data frame (or list) from which the variables in
|
subset |
an optional vector specifying a subset of observations to be used. |
poly |
logical. Defines if a polygon or a rectangle should be used to select the points. Default is rectangle.
If a polygon should be used, set this argument to |
... |
Other arguments to be passed to |
Index vector with the points lying within the selected area.
The coordinates are returned as text in the attribute "cond"
.
Andri Signorell <[email protected]>
## Not run: 
# run the example via copy and paste
plot(temperature ~ delivery_min, data=d.pizza)
idx <- IdentifyA(temperature ~ delivery_min, data=d.pizza)

# you selected the following points
d.pizza[idx,]
points(temperature ~ delivery_min, data = d.pizza[idx,], col="green")

# use the attr("cond") for subsets in code
attr(idx, "cond")

# create a group variable for the found points
d.pizza$grp <- seq(nrow(d.pizza)) %in% idx

# try the polygon option
idx <- IdentifyA(temperature ~ delivery_min, data=d.pizza, poly=TRUE)
points(temperature ~ delivery_min, data = d.pizza[idx,], col="red")
## End(Not run)
Function that fills in all NA values using the k nearest neighbours of each case with NA values. By default it uses the values of the neighbours and obtains a weighted average (weighted by the distance to the case) of their values to fill in the unknowns. If meth='median' it uses the median/most frequent value instead.
ImputeKnn(data, k = 10, scale = TRUE, meth = "weighAvg", distData = NULL)
data |
A data frame with the data set |
k |
The number of nearest neighbours to use (defaults to 10) |
scale |
Boolean setting whether the data should be scaled before finding the nearest neighbours (defaults to TRUE) |
meth |
String indicating the method used to calculate the value to fill in each NA. Available values are 'median' or 'weighAvg' (the default). |
distData |
Optionally you may specify here a data frame containing the data set
that should be used to find the neighbours. This is useful when
filling in NA values on a test set, where you should use only
information from the training set. This defaults to NULL, which means
that the neighbours will be searched in |
This function uses the k-nearest neighbours to fill in the unknown (NA) values in a data set. For each case with any NA value it will search for its k most similar cases and use the values of these cases to fill in the unknowns.
If meth='median'
the function will use either the median (in
case of numeric variables) or the most frequent value (in case of
factors), of the neighbours to fill in the NAs. If
meth='weighAvg'
the function will use a weighted average of the values of the neighbours. The weights are given by exp(-dist(k, x)), where dist(k, x) is the Euclidean distance between the case with NAs (x) and the neighbour k.
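To make the weighting scheme concrete, here is a purely illustrative sketch with made-up neighbour values and distances; it does not reproduce the internal code of ImputeKnn():

neigh_vals <- c(12.5, 14.0, 13.2)   # values of the k nearest neighbours (illustrative)
neigh_dist <- c(0.4, 1.1, 2.3)      # Euclidean distances to the case with the NA
w <- exp(-neigh_dist)               # weights as described above
weighted.mean(neigh_vals, w)        # weighted average used to fill in the NA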
A data frame without NA values
Luis Torgo [email protected]
Torgo, L. (2010) Data Mining using R: learning with case studies, CRC Press (ISBN: 9781439810187).
cleanPizza <- ImputeKnn(d.pizza[, -2])   # no dates allowed
summary(cleanPizza)
Returns the value of a specific named argument if it was included in the dots, or a default value if it wasn't.
InDots(..., arg, default)
... |
the dots arguments to be checked. |
arg |
the name of argument to test for. |
default |
the default value to return, if the argument |
the value of the argument if it exists, else the specified default value.
Andri Signorell <[email protected]>
# Function returns the argument A, if supplied, or 99 otherwise
foobar <- function(...){
  DescTools::InDots(..., arg="A", default=99)
}

foobar(A=5)
foobar(B=5, C=8)
Computes the interquartile range of the x values. Weights are supported.
IQRw(x, weights = NULL, na.rm = FALSE, type = 7)
x |
a numeric vector. |
weights |
an optional numeric vector giving the sample weights. |
na.rm |
logical. Should missing values be removed? |
type |
an integer selecting one of the many quantile algorithms, see |
This implementation is based on the Quantile() function, which allows weights to be defined.
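As a rough cross-check of that relationship (a sketch; it assumes Quantile() accepts the probs argument documented on its own help page), the weighted IQR should equal the difference of the weighted quartiles:

x <- c(3.7, 3.3, 3.5, 2.8)
w <- c(5, 5, 4, 1)/15

IQRw(x, weights=w)
diff(Quantile(x, weights=w, probs=c(0.25, 0.75)))   # expected to give the same value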
Andri Signorell <[email protected]>
Median()
, Quantile()
, IQR()
, quantile()
x <- c(3.7,3.3,3.5,2.8)
w <- c(5, 5, 4, 1)/15

IQRw(x=x, weights=w)
Check if the given x is of any known Date type.
IsDate(x, what = c("either", "both", "timeVaries"))
x |
a vector or values to be checked. |
what |
can be any value out of " |
This checks for many known Date and Time classes: "POSIXt", "POSIXct", "dates", "times", "chron", "Date".
logical vector of the same dimension as x.
Frank E Harrell
IsDate(as.Date("2013-04-10"))
IsDate(31002)
Test if a variable contains only two values. The variable does not need to be a numerical value, factors and logicals are supported as well. NA
s can be skipped by setting na.rm
to TRUE
.
IsDichotomous(x, strict = FALSE, na.rm = FALSE)

Flags(x, na.rm = FALSE)
x |
a numeric or integer vector, a logical vector or a factor (ordered and unordered) |
strict |
logical. If set to |
na.rm |
logical, indicating whether |
IsDichotomous tests a single variable. Flags returns the names of all the dichotomous variables in a list or data.frame.
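The examples below only demonstrate IsDichotomous(); the following small sketch with a made-up data frame shows how Flags() might be used to list the dichotomous columns:

d.frm <- data.frame(sex    = factor(c("m", "f", "f", "m")),
                    status = c(1, 2, 3, 1),
                    ok     = c(TRUE, FALSE, TRUE, TRUE))

Flags(d.frm)   # names of the dichotomous variables, here presumably "sex" and "ok"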
TRUE if x contains only two unique values, FALSE otherwise
Andri Signorell <[email protected]>
IsDichotomous(sample(10, 5, replace=TRUE))
Confirmation of the Euclidean nature of a distance matrix by Gower's theorem. IsEuclid is used in summary.dist.
IsEuclid(distmat, plot = FALSE, print = FALSE, tol = 1e-07)
distmat |
an object of class 'dist' |
plot |
a logical value indicating whether the eigenvalues bar plot of the matrix of the term |
print |
a logical value indicating whether the eigenvalues of the matrix of the term |
tol |
a tolerance threshold : an eigenvalue is considered positive if it is larger than |
returns a logical value indicating if all the eigenvalues are positive or equal to zero
Daniel Chessel
Stephane Dray [email protected]
Gower, J.C. and Legendre, P. (1986) Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3, 5–48.
w <- matrix(runif(10000), 100, 100)
w <- dist(w)
summary(w)
IsEuclid(w)   # TRUE
Checks if the elements of an integer vector x
are even or odd.
IsOdd(x)
x |
vector of integers |
a logical vector
Andri Signorell <[email protected]>
IsOdd(1:10)
Returns for a vector or matrix of positive integers a logical object of the same dimension(s) containing TRUE
for the elements that are prime and FALSE
otherwise.
IsPrime(x)
x |
vector or matrix of nonnegative integers |
Given a vector or a matrix of positive integers returns a vector of the same size
of FALSE
and TRUE
. Use which(IsPrime(1:21))
to get the positions.
logical vector
Hans W. Borchers <[email protected]>
x <- matrix(1:10, nrow=10, ncol=10, byrow=TRUE)
x * IsPrime(x)

# Find first prime number octett:
octett <- c(0, 2, 6, 8, 30, 32, 36, 38) - 19

while (TRUE) {
  octett <- octett + 210
  if (all(IsPrime(octett))) {
    cat(octett, "\n", sep=" ")
    break
  }
}
Check if a pointer points to a valid and running MS-Office instance. The function does this by first checking for NULL
and nil
pointer and then
trying to get the current selection of the application.
IsValidHwnd(hwnd)
hwnd |
the pointer to a word instance as created by |
logical value, TRUE
if hwnd is a valid pointer to a running application
Andri Signorell <[email protected]>
GetCurrWrd
(), GetCurrXL
(), GetCurrPP
()
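This help entry has no example section; the following hedged sketch shows the typical check on Windows with a running MS Word instance, using GetCurrWrd() from the see-also list (not run):

## Not run: 
wrd <- GetCurrWrd()   # pointer to the currently running Word instance, if any
IsValidHwnd(wrd)      # TRUE if the pointer refers to a valid, running instance
## End(Not run)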
This function performs the Jarque-Bera test of normality, either the robust or the classical way.
JarqueBeraTest(x, robust = TRUE, method = c("chisq", "mc"), N = 0, na.rm = FALSE)
x |
a numeric vector of data values. |
robust |
defines, whether the robust version should be used.
Default is |
method |
a character string out of |
N |
number of Monte Carlo simulations for the empirical critical values |
na.rm |
defines if |
The test is based on a joint statistic using skewness and kurtosis coefficients. The robust Jarque-Bera (RJB) version utilizes the robust standard deviation (namely the mean absolute deviation from the median, as provided e.g. by MeanAD(x, FUN=median)) to estimate sample kurtosis and skewness. For more details see Gel and Gastwirth (2006).
Setting robust
to FALSE
will perform the original Jarque-Bera test (see
Jarque, C. and Bera, A (1980)).
A list with class htest
containing the following components:
statistic |
the value of the test statistic. |
parameter |
the degrees of freedom. |
p.value |
the p-value of the test. |
method |
type of test was performed. |
data.name |
a character string giving the name of the data. |
This function is melted from the jarque.bera.test
(in tseries
package) and the rjb.test
from the package lawstat
.
W. Wallace Hui, Yulia R. Gel, Joseph L. Gastwirth, Weiwen Miao
Gastwirth, J. L.(1982) Statistical Properties of A Measure
of Tax Assessment Uniformity, Journal of Statistical Planning
and Inference 6, 1-12.
Gel, Y. R. and Gastwirth, J. L. (2008) A robust modification of
the Jarque-Bera test of normality, Economics Letters 99, 30-32.
Jarque, C. and Bera, A. (1980) Efficient tests for normality, homoscedasticity and serial independence of regression residuals, Economics Letters 6, 255-259.
Alternative tests for normality as
shapiro.test
,
AndersonDarlingTest
, CramerVonMisesTest
, LillieTest
, PearsonTest
, ShapiroFranciaTest
qqnorm
, qqline
for producing a normal quantile-quantile plot
x <- rnorm(100)    # null hypothesis
JarqueBeraTest(x)

x <- runif(100)    # alternative hypothesis
JarqueBeraTest(x, robust=TRUE)
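Following the usage above, the chi-squared approximation can also be replaced by Monte Carlo critical values via the method and N arguments; a short sketch (N kept small here only to keep the run time low):

x <- rnorm(100)
JarqueBeraTest(x, robust=TRUE, method="mc", N=1000)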
Jonckheere-Terpstra test to test for ordered differences among classes.
JonckheereTerpstraTest(x, ...)

## Default S3 method:
JonckheereTerpstraTest(x, g,
                       alternative = c("two.sided", "increasing", "decreasing"),
                       nperm = NULL, exact = NULL, ...)

## S3 method for class 'formula'
JonckheereTerpstraTest(formula, data, subset, na.action, ...)
x |
a numeric vector of data values, or a list of numeric data vectors. |
g |
a vector or factor object giving the group for the corresponding elements of x. Ignored if x is a list. |
alternative |
means are monotonic ( |
nperm |
number of permutations for the reference distribution.
The default is |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain NAs. Defaults to |
exact |
logical, defining if the exact test should be calculated. If left to |
... |
further argument to be passed to methods. |
JonckheereTerpstraTest is the exact (permutation) version of the Jonckheere-Terpstra test. It uses the statistic

\sum_{k < l} \sum_{i,j} \left( I(X_{ik} < X_{jl}) + 0.5 \, I(X_{ik} = X_{jl}) \right),

where X_{ik} and X_{jl} are observations in groups k and l respectively. The asymptotic version is equivalent to cor.test(x, g, method="k"). The exact calculation requires that there be no ties and that the sample size is less than 100. When data are tied and the sample size is at most 100, a permutation p-value is returned.
If x is a list, its elements are taken as the samples to be compared, and hence have to be numeric data vectors.
In this case, g is ignored, and one can simply use JonckheereTerpstraTest(x) to perform the test.
If the samples are not yet contained in a list, use JonckheereTerpstraTest(list(x, ...)).
Otherwise, x
must be a numeric data vector, and g
must be a vector or factor object of the
same length as x
giving the group for the corresponding elements of x
.
The function was previously published as jonckheere.test()
in the clinfun package and has been
integrated here without logical changes. Some argument checks and a formula interface were added.
Venkatraman E. Seshan <[email protected]>, minor adaptations Andri Signorell
Jonckheere, A. R. (1954). A distribution-free k-sample test against ordered alternatives. Biometrika 41:133-145.
Terpstra, T. J. (1952). The asymptotic normality and consistency of Kendall's test against trend, when ties are present in one ranking. Indagationes Mathematicae 14:327-333.
set.seed(1234)
g <- ordered(rep(1:5, rep(10,5)))
x <- rnorm(50) + 0.3 * as.numeric(g)

JonckheereTerpstraTest(x, g)

x[1:2] <- mean(x[1:2])    # tied data
JonckheereTerpstraTest(x, g)
JonckheereTerpstraTest(x, g, nperm=5000)

# Duller, S. 222
coffee <- list(
  c_4=c(447,396,383,410),
  c_2=c(438,521,468,391,504,472),
  c_0=c(513,543,506,489,407))

# the list interface:
JonckheereTerpstraTest(coffee)

# the formula interface
breaking <- data.frame(
  speed=c(20,25,25,25,25,30,30,30,35,35),
  distance=c(48,33,59,48,56,60,101,67,85,107))

JonckheereTerpstraTest(distance ~ speed, breaking, alternative="increasing")
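The details note that the asymptotic version is equivalent to the Kendall correlation test; the two can be run side by side for comparison (a sketch only; g is coerced to numeric for cor.test()):

set.seed(1234)
g <- ordered(rep(1:5, rep(10, 5)))
x <- rnorm(50) + 0.3 * as.numeric(g)

JonckheereTerpstraTest(x, g)
cor.test(x, as.numeric(g), method="kendall")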
Computes kappa as an index of interrater agreement between m raters on categorical data.
KappaM(x, method = c("Fleiss", "Conger", "Light"), conf.level = NA)
x |
|
method |
defines which Kappa is computed: the exact Kappa (Conger, 1980), the Kappa described by Fleiss (1971, the default) or Light's Kappa (1971). |
conf.level |
confidence level of the interval. If set to |
Missing data are omitted in a listwise way.
The coefficient described by Fleiss (1971) does not reduce to Cohen's Kappa (unweighted) for m=2 raters. Therefore, the exact Kappa coefficient, which is slightly higher in most cases, was proposed by Conger (1980).
Light's Kappa equals the average of all possible combinations of bivariate Kappas between raters.
The confidence levels can only be reported using Fleiss' formulation of Kappa.
a single numeric value if no confidence intervals are requested,
and otherwise a numeric vector with 3 elements for the estimate, the lower and the upper confidence interval
This function was previously published as kappam.fleiss()
in the irr package and has been integrated here with some changes in the interface.
Matthias Gamer, with some modifications by Andri Signorell <[email protected]>
Conger, A.J. (1980): Integration and generalisation of Kappas for multiple raters. Psychological Bulletin, 88, 322-328
Fleiss, J.L. (1971): Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382
Fleiss, J.L., Levin, B., & Paik, M.C. (2003): Statistical Methods for Rates and Proportions, 3rd Edition. New York: John Wiley & Sons
Light, R.J. (1971): Measures of response agreement for qualitative data: Some generalizations and alternatives. Psychological Bulletin, 76, 365-377.
statement <- data.frame(
  A=c(2,3,1,3,1,2,1,2,3,3,3,3,3,2,1,3,3,2,2,1,
      2,1,3,3,2,2,1,2,1,1,2,3,3,3,3,3,1,2,1,1),
  B=c(2,2,2,1,1,2,1,2,3,3,2,3,1,3,1,1,3,2,1,2,
      2,1,3,2,2,2,3,2,1,1,2,2,3,3,3,3,2,2,2,3),
  C=c(2,2,2,1,1,2,1,2,3,3,2,3,3,3,3,2,2,2,2,3,
      2,2,3,3,2,2,3,2,2,2,2,3,3,3,3,3,3,2,2,2),
  D=c(2,2,2,1,1,2,1,2,3,3,2,3,3,3,3,3,2,2,2,2,
      3,1,3,2,2,2,1,2,2,1,2,3,3,3,3,3,3,2,2,1),
  E=c(2,2,2,3,3,2,3,1,3,3,2,3,3,3,3,3,2,2,2,3,
      2,3,3,2,2,2,3,2,1,3,2,3,3,1,3,3,3,2,2,1)
)

KappaM(statement)

KappaM(statement, method="Conger")   # Exact Kappa
KappaM(statement, conf.level=0.95)   # Fleiss' Kappa and confidence intervals
KappaM(statement, method="Light")    # Light's Kappa
Calculate Kendall's tau-a statistic, a measure of
association for ordinal factors in a two-way table.
The function has interfaces for a table (matrix) and for single vectors.
KendallTauA(x, y = NULL, direction = c("row", "column"), conf.level = NA, ...)
x |
a numeric vector or a table. A matrix will be treated as table. |
y |
NULL (default) or a vector with compatible dimensions to |
direction |
direction of the calculation. Can be |
conf.level |
confidence level of the interval. If set to |
... |
further arguments are passed to the function |
Kendall's tau coefficient (sometimes called "Kendall rank correlation coefficient"), is a statistic used to measure the association between two measured quantities. It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities.
Kendall's tau-a is computed as

\tau_a = \frac{P - Q}{n (n - 1)},

where P equals twice the number of concordances, Q twice the number of discordances and n the number of observations. Its range is [-1, 1].
(Note that Kendall's tau-a does not take ties into consideration, which makes it impractical. Consider using KendallTauB (tau-b) when ties are present.)
a single numeric value if no confidence intervals are requested,
and otherwise a numeric vector with 3 elements for the estimate, the lower and the upper confidence interval
Andri Signorell <[email protected]>
Agresti, A. (2002) Categorical Data Analysis. John Wiley & Sons, pp. 57-59.
Hollander, M, Wolfe, D. A., Chicken, E. (2014) Nonparametric Statistical Methods, Third edition, Wiley,
Liebetrau, A. M. (1983) Measures of Association, Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-004. Newbury Park, CA: Sage, pp. 49-56
ConDisPairs
yields concordant and discordant pairs
Other association measures: cor (method="kendall")
for Tau b, StuartTauC
, GoodmanKruskalGamma
Lambda
, UncertCoef
, MutInf
# example in:
# http://support.sas.com/documentation/cdl/en/statugfreq/63124/PDF/default/statugfreq.pdf
# pp. S. 1821

tab <- as.table(rbind(c(26,26,23,18,9), c(6,7,9,14,23)))

# Kendall's tau-a C|R
KendallTauA(tab, direction="column", conf.level=0.95)
# Kendall's tau-a R|C
KendallTauA(tab, direction="row", conf.level=0.95)

# http://support.sas.com/documentation/cdl/en/statugfreq/63124/PDF/default/statugfreq.pdf
# pp. 1814 (143)
tab <- as.table(cbind(c(11,2), c(4,6)))

KendallTauA(tab, direction="row", conf.level=0.95)
KendallTauA(tab, direction="column", conf.level=0.95)

# Liebetrau, pp. 52
x <- c(1,2,2,3,3,3,4,5)
y <- c(1,3,2,1,5,3,4,5)

ConDisPairs(table(x, y))
KendallTauA(x, y, conf.level=0.95)
Calculate Kendall's tau-b. The estimator could also be calculated with cor(..., method="kendall")
.
The calculation of confidence intervals, however, is not available there.
KendallTauB(x, y = NULL, conf.level = NA, ...)
x |
a numeric vector, matrix or data.frame. |
y |
NULL (default) or a vector with compatible dimensions to |
conf.level |
confidence level of the interval. If set to |
... |
further arguments are passed to the function |
a single numeric value if no confidence intervals are requested,
and otherwise a numeric vector with 3 elements for the estimate, the lower and the upper confidence interval
Andri Signorell <[email protected]>
Agresti, A. (2002) Categorical Data Analysis. John Wiley & Sons, pp. 57-59.
Kendall, M. (1955) Rank Correlation Methods, Second Edition, London: Charles Griffin and Co.
Brown, M.B. and Benedetti, J.K. (1977) Sampling Behavior of Tests for Correlation in Two-Way Contingency Tables, Journal of the American Statistical Association, 72, 309-315.
ConDisPairs
yields concordant and discordant pairs
Other association measures: GoodmanKruskalGamma
, KendallTauA
(tau-a), cor
(method="kendall") for tau-b, StuartTauC
(tau-c), SomersDelta
Lambda
, GoodmanKruskalTau
, UncertCoef
, MutInf
# example in:
# http://support.sas.com/documentation/cdl/en/statugfreq/63124/PDF/default/statugfreq.pdf
# pp. S. 1821

tab <- as.table(rbind(c(26,26,23,18,9), c(6,7,9,14,23)))

KendallTauB(tab, conf.level=0.95)
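As the description notes, the point estimate can be cross-checked against cor(), while the confidence interval is only returned by KendallTauB(); a small sketch on two vectors:

x <- c(1, 2, 2, 3, 3, 3, 4, 5)
y <- c(1, 3, 2, 1, 5, 3, 4, 5)

cor(x, y, method="kendall")
KendallTauB(x, y, conf.level=0.95)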
Computes Kendall's coefficient of concordance, a popular measure of association. It is an index of interrater reliability of ordinal data. The coefficient could be corrected for ties within raters.
KendallW(x, correct = FALSE, test = FALSE, na.rm = FALSE)
x |
|
correct |
a logical indicating whether the coefficient should be corrected for ties within raters. |
test |
a logical indicating whether the test statistic and p-value should be reported. |
na.rm |
logical, indicating whether |
The test for Kendall's W is completely equivalent to friedman.test
. The only advantage of this test over Friedman's is that Kendall's W has an interpretation as the coefficient of concordance. The test itself is only valid for large samples.
Kendall's W should be corrected for ties, if raters did not use a true ranking order for the subjects.
Either a single value if test is set to FALSE
or else
a list with class “htest” containing the following components:
statistic |
the value of the chi-square statistic. |
p.value |
the p-value for the test. |
method |
the character string “Kendall's coefficient of concordance W”. |
data.name |
a character string giving the name(s) of the data. |
estimate |
the coefficient of concordance. |
parameter |
the degrees of freedom df, the number of subjects examined and the number of raters. |
This function was previously published as kendall()
in the irr package and has been
integrated here without logical changes, but with some adaptations in the result structure.
Matthias Gamer <[email protected]>
Kendall, M.G. (1948) Rank correlation methods. London: Griffin.
cor
, KappaM
, CronbachAlpha
, ICC
, friedman.test
anxiety <- data.frame(rater1=c(3,3,3,4,5,5,2,3,5,2,2,6,1,5,2,2,1,2,4,3),
                      rater2=c(3,6,4,6,2,4,2,4,3,3,2,3,3,3,2,2,1,3,3,4),
                      rater3=c(2,1,4,4,3,2,1,6,1,1,1,2,3,3,1,1,3,3,2,2))

KendallW(anxiety, TRUE)

# with test results
KendallW(anxiety, TRUE, test=TRUE)

# example from Siegel and Castellan (1988)
d.att <- data.frame(
  id        = c(4,21,11),
  airfare   = c(5,1,4),
  climate   = c(6,7,5),
  season    = c(7,6,1),
  people    = c(1,2,3),
  program   = c(2,3,2),
  publicity = c(4,5,7),
  present   = c(3,4,6),
  interest  = c(8,8,8)
)

KendallW(t(d.att[, -1]), test = TRUE)

# which is perfectly the same as
friedman.test(y=as.matrix(d.att[,-1]), groups = d.att$id)
List the keywords for specific R man pages or return a list of valid R keywords.
Keywords(topic)
topic |
optional, object or man page topic |
If topic
is provided, return a list of the Keywords associated
with topic
. Otherwise, display the list of valid R Keywords
from the R doc/Keywords file.
Gregory R. Warnes [email protected]
## Show all valid R Keywords
Keywords()

## Show Keywords associated with the 'merge' function
Keywords(merge)
Keywords("merge")
Calculate the alpha coefficient of reliability proposed by Krippendorff.
KrippAlpha(x, method=c("nominal", "ordinal", "interval", "ratio"))
x |
a classifier x object matrix of classifications or scores (i.e. raters in rows, subjects in columns) |
method |
data level of x |
A list with class '"irrlist"' containing the following components:
method |
a character string describing the method. |
subjects |
the number of data objects. |
raters |
the number of raters. |
irr.name |
a character string specifying the name of the coefficient. |
value |
value of alpha. |
stat.name |
here "nil" as there is no test statistic. |
statistic |
the value of the test statistic (NULL). |
p.value |
the probability of the test statistic (NULL). |
cm |
the concordance/discordance matrix used in the calculation of alpha |
data.values |
a character vector of the unique data values |
levx |
the unique values of the ratings |
nmatchval |
the count of matches, used in calculation |
data.level |
the data level of the ratings ("nominal","ordinal", "interval","ratio") |
Krippendorff's alpha coefficient is particularly useful where the level of measurement of classification data is higher than nominal or ordinal. https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-016-0200-9
This function was previously published as kripp.alpha()
in the irr package and has been
integrated here without logical changes, but with some adaptations in the result structure.
Jim Lemon <[email protected]>
Krippendorff, K. (1980) Content analysis: An introduction to its methodology. Beverly Hills, CA: Sage.
CronbachAlpha
, KappaM
, CohenKappa
# the "C" data from Krippendorff
nmm <- matrix(c(1,1,NA,1,2,2,3,2,3,3,3,3,3,3,3,3,2,2,2,2,1,2,3,4,4,4,4,4,
                1,1,2,1,2,2,2,2,NA,5,5,5,NA,NA,1,1,NA,NA,3,NA), nrow=4)

# first assume the default nominal classification
KrippAlpha(nmm)

# now use the same data with the other three methods
KrippAlpha(nmm, "ordinal")
KrippAlpha(nmm, "interval")
KrippAlpha(nmm, "ratio")
Set and retrieve the label
, resp. unit
attribute of x
. This can be helpful for documenting the specific meaning of a variable, of an entire data.frame or any other object. For single vectors it can be useful to store the unit.
Label(x)
Label(x) <- value

Labels(x)
Labels(x) <- value

Unit(x)
Unit(x) <- value
x |
any object |
value |
a single string describing the object |
The label should consist of a single text (length of 1). The text may contain line feeds.
It can be deleted by setting the label to NULL
.
Labels()
can be used to retrieve and assign vectorized labels to data.frames or lists.
Label
and Unit
return the label attribute of x, if any; otherwise, NULL.
Andri Signorell <[email protected]>
A more elaborated label version can be found in package Hmisc label()
.
# add a descriptive label to a variable
Label(d.diamonds$colour) <- "The rating scale applied to diamonds ranges from colorless to yellow, as any other color is extremely rare."

# technically just appending the text as attribute to the variable
attributes(d.diamonds$colour)

# label is supported while describing data
Desc(d.diamonds$colour)

# The label can be deleted by setting it to NULL
Label(d.diamonds$colour) <- NULL

# Labelling the columns of a data.frame is best done with a loop
# (all so far seen *apply approaches lead to more complicated code...)
lbl <- RndWord(16, 7)
for(i in seq_along(lbl))
  Label(d.pizza[, i]) <- lbl[i]

Str(d.pizza)
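The examples above only use Label(); a short sketch of the Unit() and Labels() variants from the usage section (the unit and label texts are just illustrations):

# store a measurement unit for a single variable
Unit(d.pizza$temperature) <- "degrees Celsius"
Unit(d.pizza$temperature)

# assign vectorized labels to the columns of a small data.frame
d.frm <- d.pizza[, c("temperature", "price")]
Labels(d.frm) <- c("temperature at delivery", "price in dollars")
Labels(d.frm)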
Calculate symmetric and asymmetric Goodman Kruskal lambda and their confidence intervals. Lambda is a measure of proportional reduction in error in cross tabulation analysis. For any sample with a nominal independent variable and dependent variable (or ones that can be treated nominally), it indicates the extent to which the modal categories and frequencies for each value of the independent variable differ from the overall modal category and frequency, i.e. for all values of the independent variable together.
Lambda(x, y = NULL, direction = c("symmetric", "row", "column"), conf.level = NA, ...)
x |
a numeric vector, a matrix or a table. |
y |
|
direction |
type of lambda. Can be one out of |
conf.level |
confidence level for the returned confidence interval, restricted to lie between 0 and 1. |
... |
further arguments are passed to the function |
Asymmetric lambda is interpreted as the probable improvement in predicting the column variable Y given knowledge of the row variable X.
The nondirectional lambda is the average of the two asymmetric lambdas, Lambda(C|R) and Lambda(R|C).
Lambda (asymmetric and symmetric) has a scale ranging from 0 to 1.
Data can be passed to the function either as matrix or data.frame in x
, or as two numeric vectors x
and y
. In the latter case table(x, y, ...)
is calculated. Thus NA
s are handled the same way as table
does. Note that tables are by default calculated without NAs (which breaks the package's law to in general not omit NAs silently). The specific argument useNA
can be passed via the ... argument.PairApply
can be used to calculate pairwise lambdas.
if no confidence intervals are requested:
the estimate as numeric value
else a named numeric vector with 3 elements
lambda |
estimate |
lwr.ci |
lower confidence interval |
upr.ci |
upper confidence interval |
Andri Signorell <[email protected]> based on code from Antti Arppe <[email protected]>,
Nanina Anderegg (confidence interval symmetric lambda)
Agresti, A. (2002) Categorical Data Analysis. John Wiley & Sons
Goodman, L. A., Kruskal W. H. (1979) Measures of Association for Cross Classifications. New
York: Springer-Verlag (contains articles appearing in J. Amer. Statist. Assoc. in 1954,
1959, 1963, 1972).
http://www.nssl.noaa.gov/users/brooks/public_html/feda/papers/goodmankruskal1.pdf (might be outdated)
Liebetrau, A. M. (1983) Measures of Association, Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-004. Newbury Park, CA: Sage, pp. 17–24
GoodmanKruskalTau
, GoodmanKruskalGamma
, KendallTauA
, KendallTauB
, StuartTauC
, SomersDelta
, cor
# example from Goodman Kruskal (1954)
m <- as.table(cbind(c(1768,946,115), c(807,1387,438), c(189,746,288), c(47,53,16)))
dimnames(m) <- list(paste("A", 1:3), paste("B", 1:4))
m

# direction default is "symmetric"
Lambda(m)
Lambda(m, conf.level=0.95)

Lambda(m, direction="row")
Lambda(m, direction="column")
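The details above also describe the two-vector interface, where table(x, y) is built internally; a minimal sketch with made-up nominal codes:

x <- c(1, 1, 2, 2, 2, 3, 3, 1)
y <- c(1, 2, 2, 2, 1, 1, 2, 1)

Lambda(x, y)                      # symmetric lambda
Lambda(x, y, direction="row")     # asymmetric lambda
Lambda(x, y, conf.level=0.95)     # with confidence interval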
Lc computes the (empirical) ordinary and generalized Lorenz curve of a vector x. Desc calculates some key figures for a Lorenz curve and produces a quick description.
Lc(x, ...)

## Default S3 method:
Lc(x, n = rep(1, length(x)), na.rm = FALSE, ...)

## S3 method for class 'formula'
Lc(formula, data, subset, na.action, ...)

## S3 method for class 'Lc'
plot(x, general = FALSE, lwd = 2, type = "l", xlab = "p", ylab = "L(p)",
     main = "Lorenz curve", las = 1, pch = NA, ...)

## S3 method for class 'Lclist'
plot(x, col = 1, lwd = 2, lty = 1, main = "Lorenz curve",
     xlab = "p", ylab = "L(p)", ...)

## S3 method for class 'Lc'
lines(x, general = FALSE, lwd = 2, conf.level = NA, args.cband = NULL, ...)

## S3 method for class 'Lc'
predict(object, newdata, conf.level=NA, general=FALSE, n=1000, ...)
x |
a vector containing non-negative elements, or a Lc-object for plot and lines. |
n |
a vector of frequencies, must be same length as |
na.rm |
logical. Should missing values be removed? Defaults to FALSE. |
general |
logical. If |
col |
color of the curve |
lwd |
the linewidth of the curve |
lty |
the linetype of the curve |
type |
type of the plot, default is line ( |
xlab , ylab
|
label of the x-, resp. y-axis. |
pch |
the point character (default is |
main |
main title of the plot. |
las |
las of the axis. |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain NAs. Defaults to |
conf.level |
confidence level for the bootstrap confidence interval. Set this to |
args.cband |
list of arguments for the confidence band, such as color or border (see |
object |
object of class inheriting from "Lc" |
newdata |
an optional vector of percentages p for which to predict. If omitted, the original values of the object are used. |
... |
further argument to be passed to methods. |
Lc(x)
computes the empirical ordinary Lorenz curve of x
as well as the generalized Lorenz curve (= ordinary Lorenz curve *
mean(x)). The result can be interpreted like this: p
*100 percent
have L(p)
*100 percent of x
.
If n
is changed to anything but the default x
is
interpreted as a vector of class means and n
as a vector of
class frequencies: in this case Lc
will compute the minimal
Lorenz curve (= no inequality within each group).
A list of class "Lc"
with the following components:
p |
vector of percentages |
L |
vector with values of the ordinary Lorenz curve |
L.general |
vector with values of the generalized Lorenz curve |
x |
the original x values (needed for computing confidence intervals) |
n |
the original n values |
These functions were previously published as Lc()
in the ineq package and have been
integrated here without logical changes.
Achim Zeileis <[email protected]>, extensions Andri Signorell <[email protected]>
Arnold, B. C. (1987) Majorization and the Lorenz Order: A Brief Introduction, Springer
Cowell, F. A. (2000) Measurement of Inequality in Atkinson, A. B. / Bourguignon, F. (Eds): Handbook of Income Distribution. Amsterdam.
Cowell, F. A. (1995) Measuring Inequality Harvester Wheatshef: Prentice Hall.
The original location Lc()
,
inequality measures Gini()
, Atkinson()
priceCarpenter <- d.pizza$price[d.pizza$driver=="Carpenter"]
priceMiller <- d.pizza$price[d.pizza$driver=="Miller"]

# compute the Lorenz curves
Lc.p <- Lc(priceCarpenter, na.rm=TRUE)
Lc.u <- Lc(priceMiller, na.rm=TRUE)
plot(Lc.p)
lines(Lc.u, col=2)

# the picture becomes even clearer with generalized Lorenz curves
plot(Lc.p, general=TRUE)
lines(Lc.u, general=TRUE, col=2)

# inequality measures emphasize these results, e.g. Atkinson's measure
Atkinson(priceCarpenter, na.rm=TRUE)
Atkinson(priceMiller, na.rm=TRUE)

# income distribution of the USA in 1968 (in 10 classes)
# x vector of class means, n vector of class frequencies
x <- c(541, 1463, 2445, 3438, 4437, 5401, 6392, 8304, 11904, 22261)
n <- c(482, 825, 722, 690, 661, 760, 745, 2140, 1911, 1024)

# compute minimal Lorenz curve (= no inequality in each group)
Lc.min <- Lc(x, n=n)
plot(Lc.min)

# input of frequency tables with midpoints of classes
fl <- c(2.5,7.5,15,35,75,150)   # midpoints
n  <- c(25,13,10,5,5,2)         # frequencies

plot(Lc(fl, n),                 # Lorenz-Curve
     panel.first=grid(10, 10),
     main="Lorenzcurve Farmers",
     xlab="Percent farmers (cumulative)", ylab="Percent of area (%)"
)
# add confidence band
lines(Lc(fl, n), conf.level=0.95,
      args.cband=list(col=SetAlpha(DescToolsOptions("col")[2], 0.3)))

Gini(fl, n)

# find specific function values using predict
x <- c(1,1,4)
lx <- Lc(x)
plot(lx)

# get interpolated function value at p=0.55
y0 <- predict(lx, newdata=0.55)
abline(v=0.55, h=y0$L, lty="dotted")

# and for the inverse question use approx
y0 <- approx(x=lx$L, y=lx$p, xout=0.6)
abline(h=0.6, v=y0$y, col="red")

text(x=0.1, y=0.65, label=expression(L^{-1}*(0.6) == 0.8), col="red")
text(x=0.65, y=0.2, label=expression(L(0.55) == 0.275))

# input of frequency tables with midpoints of classes
fl <- c(2.5,7.5,15,35,75,150)   # midpoints
n  <- c(25,13,10,5,5,2)         # frequencies

# the formula interface for Lc
lst <- Lc(count ~ cut(price, breaks=5), data=d.pizza)
plot(lst, col=1:length(lst), panel.first=grid(), lwd=2)
legend(x="topleft", legend=names(lst), fill=1:length(lst))

# Describe with Desc-function
lx <- Lc(fl, n)
Desc(lx)
Performs Lehmacher's chi-squared test for marginal homogeneity in a symmetric two-dimensional contingency table.
LehmacherTest(x, y = NULL)

## S3 method for class 'mtest'
print(x, digits = 4L, ...)
x |
either a two-dimensional contingency table in matrix form, or a factor object. |
y |
a factor object; ignored if x is a matrix. |
digits |
a non-null value for digits specifies the minimum number of significant digits to be printed in values. See details in |
... |
further arguments to be passed to or from other methods. They are ignored in this function. |
The null hypothesis is that the probabilities of being classified into cells [i,j] and [j,i] are the same.
If x is a matrix, it is taken as a two-dimensional contingency table, and hence its entries should be nonnegative integers. Otherwise, both x and y must be vectors or factors of the same length. Incomplete cases are removed, vectors are coerced into factors, and the contingency table is computed from these.
A list with class "mtest"
containing the following components:
statistic |
a vector with the value of the test statistics. |
parameter |
the degrees of freedom, which is always 1 in LehmacherTest. |
p.value |
a vector with the p-values of the single tests. |
p.value.corr |
a vector with the "hochberg" adjusted p-values of the single tests. (See |
method |
a character string indicating what type of test was performed. |
data.name |
a character string giving the name of the data. |
Andri Signorell <[email protected]>
Lehmacher, W. (1980) Simultaneous sign tests for marginal homogeneity of square contingency tables Biometrical Journal, Volume 22, Issue 8, pages 795-798
mcnemar.test
(resp. BowkerTest for a CxC-matrix), StuartMaxwellTest
, WoolfTest
x <- matrix(c(400,  40,  20, 10,
               50, 300,  60, 20,
               10,  40, 120,  5,
                5,  90,  50, 80), nrow=4, byrow=TRUE)

LehmacherTest(x)
Computes Levene's test for homogeneity of variance across groups.
LeveneTest(y, ...)

## S3 method for class 'formula'
LeveneTest(formula, data, ...)

## S3 method for class 'lm'
LeveneTest(y, ...)

## Default S3 method:
LeveneTest(y, group, center=median, ...)
y |
response variable for the default method, or a |
group |
factor defining groups. |
center |
The name of a function to compute the center of each group;
|
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see
|
... |
arguments to be passed down, e.g., |
returns an object meant to be printed showing the results of the test.
This function was previously published as leveneTest() in the library(car) and has been integrated here without logical changes.
John Fox [email protected]; original generic version
contributed by Derek Ogle
adapted from a response posted by Brian Ripley to the r-help email list.
Fox, J. (2008) Applied Regression Analysis and Generalized Linear Models, Second Edition. Sage.
Fox, J. and Weisberg, S. (2011) An R Companion to Applied Regression, Second Edition, Sage.
fligner.test
for a rank-based (nonparametric)
k-sample test for homogeneity of variances;
mood.test
for another rank-based two-sample test for a
difference in scale parameters;
var.test
and bartlett.test
for parametric
tests for the homogeneity in variance.
ansari_test
in package coin
for exact and approximate conditional p-values for the
Ansari-Bradley test, as well as different methods for handling ties.
## example from ansari.test:
## Hollander & Wolfe (1973, p. 86f):
## Serum iron determination using Hyland control sera
ramsay <- c(111, 107, 100, 99, 102, 106, 109, 108, 104, 99,
            101, 96, 97, 102, 107, 113, 116, 113, 110, 98)
jung.parekh <- c(107, 108, 106, 98, 105, 103, 110, 105, 104, 100,
                 96, 108, 103, 104, 114, 114, 113, 108, 106, 99)

LeveneTest(c(ramsay, jung.parekh),
           factor(c(rep("ramsay", length(ramsay)),
                    rep("jung.parekh", length(jung.parekh)))))

LeveneTest(c(rnorm(10), rnorm(10, 0, 2)), factor(rep(c("A","B"), each=10)))

## Not run: 
# original example from package car
with(Moore, LeveneTest(conformity, fcategory))
with(Moore, LeveneTest(conformity, interaction(fcategory, partner.status)))

LeveneTest(conformity ~ fcategory * partner.status, data = Moore)
LeveneTest(conformity ~ fcategory * partner.status, data = Moore, center = mean)
LeveneTest(conformity ~ fcategory * partner.status, data = Moore, center = mean, trim = 0.1)

LeveneTest(lm(conformity ~ fcategory*partner.status, data = Moore))
## End(Not run)
Performs the Lilliefors (Kolmogorov-Smirnov) test for the composite hypothesis of normality, see e.g. Thode (2002, Sec. 5.1.1).
LillieTest(x)
x |
a numeric vector of data values, the number of which must be greater than 4. Missing values are allowed. |
The Lilliefors (Kolmogorov-Smirnov) test is an EDF omnibus test for the composite hypothesis of normality. The test statistic is the maximal absolute difference between the empirical and the hypothetical cumulative distribution function. It may be computed as

D = \max\{D^{+}, D^{-}\} \quad \text{with} \quad D^{+} = \max_{i=1,\ldots,n}\{i/n - p_{(i)}\}, \quad D^{-} = \max_{i=1,\ldots,n}\{p_{(i)} - (i-1)/n\},

where p_{(i)} = \Phi\left((x_{(i)} - \bar{x})/s\right). Here, \Phi is the cumulative distribution function of the standard normal distribution, and \bar{x} and s are the mean and standard deviation of the data values.
The p-value is computed from the Dallal-Wilkinson (1986) formula, which is claimed to be only reliable when the p-value is smaller than 0.1. If the Dallal-Wilkinson p-value turns out to be greater than 0.1, then the p-value is computed from the distribution of the modified statistic

Z = D \left(\sqrt{n} - 0.01 + 0.85/\sqrt{n}\right),

see Stephens (1974), the actual p-value formula being obtained by a simulation and approximation process.
A list of class htest
, containing the following components:
statistic |
the value of the Lilliefors (Kolmogorov-Smirnov) statistic. |
p.value |
the p-value for the test. |
method |
the character string “Lilliefors (Kolmogorov-Smirnov) normality test”. |
data.name |
a character string giving the name(s) of the data. |
The Lilliefors (Kolmogorov-Smirnov) test is the most famous EDF omnibus test for normality.
Compared to the Anderson-Darling test and the Cramer-von Mises test it is known to perform worse.
Although the test statistic obtained from LillieTest(x)
is the same as that obtained from
ks.test(x, "pnorm", mean(x), sd(x))
, it is not correct to use the p-value from the latter
for the composite hypothesis of normality (mean and variance unknown),
since the distribution of the test statistic is different when the parameters are estimated.
The function call LillieTest(x)
essentially produces
the same result as the S-PLUS function call ks.gof(x)
with the distinction that the p-value is not set to 0.5 when
the Dallal-Wilkinson approximation yields a p-value greater than 0.1. (Actually,
the alternative p-value approximation is provided for the complete range of test statistic values,
but is only used when the Dallal-Wilkinson approximation fails.)
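The point about the p-values can be checked directly: the statistics coincide while the p-values differ, and only the one from LillieTest() applies to the composite hypothesis (a small sketch):

set.seed(1)
x <- rnorm(50)

LillieTest(x)$statistic
ks.test(x, "pnorm", mean(x), sd(x))$statistic   # same statistic

LillieTest(x)$p.value
ks.test(x, "pnorm", mean(x), sd(x))$p.value     # not valid for the composite hypothesis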
Juergen Gross <[email protected]>
Dallal, G.E. and Wilkinson, L. (1986) An analytic approximation to the distribution of Lilliefors' test for normality. The American Statistician, 40, 294–296.
Stephens, M.A. (1974) EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69, 730–737.
Thode Jr., H.C. (2002) Testing for Normality Marcel Dekker, New York.
shapiro.test
for performing the Shapiro-Wilk test for normality.
AndersonDarlingTest
, CramerVonMisesTest
,
PearsonTest
, ShapiroFranciaTest
for performing further tests for normality.
qqnorm
for producing a normal quantile-quantile plot.
LillieTest(rnorm(100, mean = 5, sd = 3))
LillieTest(runif(100, min = 2, max = 4))
Add a linear regression line to an existing plot. The function first calculates the prediction of a lm
object for a reasonable amount of points, then adds the line to the plot and inserts a polygon with the confidence and, if required, the prediction intervals.
In addition to abline
the function will also display polynomial models.
## S3 method for class 'lm'
lines(x, col = Pal()[1], lwd = 2, lty = "solid", type = "l", n = 100,
      conf.level = 0.95, args.cband = NULL,
      pred.level = NA, args.pband = NULL, xpred = NULL, ...)
x |
linear model object as result from lm(y~x). |
col |
linecolor of the line. Default is the color returned by |
lwd |
line width of the line. |
lty |
line type of the line. |
type |
character indicating the type of plotting; actually any of the |
n |
number of points used for plotting the fit. |
conf.level |
confidence level for the confidence interval. Set this to |
args.cband |
list of arguments for the confidence band, such as color or border (see |
pred.level |
confidence level for the prediction interval. Set this to NA, if no prediction band should be plotted.
Default is |
args.pband |
list of arguments for the prediction band, such as color or border (see |
xpred |
a numeric vector |
... |
further arguments are not used specifically. |
It's sometimes illuminating to plot a regression line with its prediction, resp. confidence intervals over an existing scatterplot. This only makes sense, if just a simple linear model explaining a target variable by (a function of) one single predictor is to be visualized.
nothing
Andri Signorell <[email protected]>
opar <- par(mfrow=c(1,2))

plot(hp ~ wt, mtcars)
lines(lm(hp ~ wt, mtcars), col="steelblue")

# add the prediction intervals in different color
plot(hp ~ wt, mtcars)
r.lm <- lm(hp ~ wt, mtcars)
lines(r.lm, col="red", pred.level=0.95, args.pband=list(col=SetAlpha("grey", 0.3)))

# works with transformations too
plot(dist ~ sqrt(speed), cars)
lines(lm(dist ~ sqrt(speed), cars), col=DescTools::hred)

plot(dist ~ log(speed), cars)
lines(lm(dist ~ log(speed), cars), col=DescTools::hred)

# and with more specific variables based on only one predictor
plot(dist ~ speed, cars)
lines(lm(dist ~ poly(speed, degree=2), cars), col=DescTools::hred)

par(opar)
Add a loess smoother to an existing plot. The function first calculates the prediction of a loess object for a reasonable amount of points, then adds the line to the plot and inserts a polygon with the confidence intervals.
## S3 method for class 'loess'
lines(x, col = Pal()[1], lwd = 2, lty = "solid", type = "l", n = 100,
      conf.level = 0.95, args.band = NULL, ...)

## S3 method for class 'smooth.spline'
lines(x, col = Pal()[1], lwd = 2, lty = "solid", type = "l",
      conf.level = 0.95, args.band = NULL, ...)

## S3 method for class 'SmoothSpline'
lines(x, col = Pal()[1], lwd = 2, lty = "solid", type = "l",
      conf.level = 0.95, args.band = NULL, ...)
x |
the loess or smooth.spline object to be plotted. |
col |
linecolor of the smoother. Default is DescTools's |
lwd |
line width of the smoother. |
lty |
line type of the smoother. |
type |
type of plot, defaults to |
n |
number of points used for plotting the fit. |
conf.level |
confidence level for the confidence interval. Set this to NA, if no confidence band should be plotted. Default is 0.95. |
args.band |
list of arguments for the confidence band, such as color or border (see |
... |
further arguments are passed to the smoother ( |
Loess can result in heavy computational load if there are many points!
Andri Signorell <[email protected]>
loess
, scatter.smooth
, smooth.spline
, SmoothSpline
par(mfrow=c(1,2))

x <- runif(100)
y <- rnorm(100)
plot(x, y)
lines(loess(y~x))

plot(temperature ~ delivery_min, data=d.pizza)
lines(loess(temperature ~ delivery_min, data=d.pizza))

plot(temperature ~ delivery_min, data=d.pizza)
lines(loess(temperature ~ delivery_min, data=d.pizza), conf.level = 0.99,
      args.band = list(col=SetAlpha("red", 0.4), border="black"))

# the default values from scatter.smooth
lines(loess(temperature ~ delivery_min, data=d.pizza,
            span=2/3, degree=1, family="symmetric"), col="red")
Functions like mtext
or axis
use the line
argument to set the distance from the plot. Sometimes it's useful to have this distance in user coordinates. LineToUser()
does this nontrivial conversion.
LineToUser(line, side)
line |
the number of lines |
side |
the side of the plot |
For the LineToUser
function to work, there must be an open plot.
the user coordinates for the given lines
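As a quick illustration (a sketch, not taken from the package documentation), the returned coordinate can be handed to text() to place a label at a given margin line, which is what mtext() would otherwise do via its line argument:

plot(1:10)
# user x-coordinate of margin line 2 on the right-hand side (side = 4)
x.usr <- LineToUser(line=2, side=4)
# place a rotated label there with text() instead of mtext()
text(x=x.usr, y=mean(par("usr")[3:4]), labels="a label", srt=90, xpd=NA)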
Andri Signorell <[email protected]>
plot(1:10) LineToUser(line=2, side=4)
This will scale the numeric vector x
linearly from an old scale between low
and high
to a new one between newlow
and newhigh
.
LinScale(x, low = NULL, high = NULL, newlow = 0, newhigh = 1)
x |
a numeric matrix(like object). |
low |
numeric. The minimum value of the scale, defaults to min(x).
This is calculated columnwise by default; defined |
high |
numeric. The maximum value of the scale, defaults to max(x). This is calculated columnwise by default; when a maxval is entered, it will be recycled. |
newlow |
numeric. The minimum value of the new scale, defaults to 0, resulting in a 0-1 scale for x. |
newhigh |
numeric. The maximum value of the scale, defaults to 1.
|
Hmm, hardly worth coding...
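The underlying mapping is plain linear interpolation; a minimal sketch (ignoring the columnwise handling and recycling of the actual function, and using a made-up helper name):

lin <- function(x, low, high, newlow, newhigh)
  newlow + (x - low) * (newhigh - newlow) / (high - low)

lin(c(0, 50, 100), low=0, high=100, newlow=0, newhigh=1)   # 0.0 0.5 1.0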
The centered and scaled matrix. The numeric centering and scalings used (if any) are returned as attributes "scaled:center
" and "scaled:scale
"
Andri Signorell <[email protected]>
# transform the temperature from Celsius to Fahrenheit
LinScale(d.pizza[1:20, "temperature"], 0, 100, -17.8, 37.8)

# and the price from Dollar to Euro
LinScale(d.pizza[1:20, "price"], 0, 1, 0, 0.76)

# together
LinScale(d.pizza[1:20, c("temperature", "price")],
         0, c(100, 1), c(-17.8, 0), c(37.8, 0.76))

## Not run:
par(mfrow=c(3,1), mar=c(0,5,0,3), oma=c(5,0,5,0))
plot(LinScale(d.frm[,1]), ylim=c(-2,2), xaxt="n", ylab="LinScale")
plot(RobScale(d.frm[,1]), ylim=c(-2,2), xaxt="n", ylab="RobScale")
plot(scale(d.frm[,1]), ylim=c(-2,2), ylab="scale")
title("Compare scales", outer = TRUE)
## End(Not run)
List all the objects, functions or data in a package.
LsObj(package) LsFct(package)
package |
the name of the package |
This is just a wrapper for ls
, ls.str
and lsf.str
with the appropriate arguments (as I always forgot how to do the trick).
LsObj()
lists all objects, LsFct()
just the functions in a package.
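Based on the description above, the calls being wrapped are essentially the following (shown as an assumption; the exact arguments used internally may differ, and the package must be attached):

# all exported objects of a package
ls("package:DescTools")
# only the functions
lsf.str("package:DescTools")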
Andri Signorell <[email protected]>
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
LsFct("DescTools")
In longitudinal studies it's common that individuals drop out before all responses can be obtained. Measurements obtained before the individual dropped out can be used to impute the unknown measurement(s). The last observation carried forward method is one way to impute values for the missing observations. For the last observation carried forward (LOCF) approach the missing values are replaced by the last observed value of that variable for each individual regardless of when it occurred.
LOCF()
replaces NA
s with the most recent non-NA value prior to them.
LOCF(x)

## Default S3 method:
LOCF(x)

## S3 method for class 'data.frame'
LOCF(x)

## S3 method for class 'matrix'
LOCF(x)
x |
a vector, a data.frame or a matrix containing NAs. |
The function will replace all NAs found in a vector with the last earlier value not being NA. In data.frames each column will be treated as described.
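For a plain numeric vector the idea can be sketched in base R as follows (a sketch of the principle, not the package's implementation; the helper name is made up):

locf_sketch <- function(x) {
  # index of the most recent non-NA value at or before each position (0 if none yet)
  i <- cummax(seq_along(x) * !is.na(x))
  # leading NAs (i == 0) stay NA
  c(NA, x)[i + 1]
}
locf_sketch(c(1, NA, NA, 4, NA))   # 1 1 1 4 4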
It should be noted that the last observation carried forward approach may result in biased estimates and may underestimate the variability.
a vector with the same dimension as x.
Daniel Wollschlaeger <[email protected]>
See also the package Hmisc for less coarse imputation functions.
d.frm <- data.frame( tag=rep(c("mo", "di", "mi", "do", "fr", "sa", "so"), 4) , val=rep(c(runif(5), rep(NA,2)), 4) ) d.frm$locf <- LOCF( d.frm$val ) d.frm
A function that finds the local outlier factor (Breunig et al.,2000) of the matrix "data" using k neighbours. The local outlier factor (LOF) is a measure of outlyingness that is calculated for each observation. The user decides whether or not an observation will be considered an outlier based on this measure. The LOF takes into consideration the density of the neighborhood around the observation to determine its outlyingness.
LOF(data, k)
data |
The data set to be explored |
k |
The kth-distance to be used to calculate the LOF's. |
The LOFs are calculated over a range of values, and the max local outlier factor is determined over this range.
lof |
A vector with the local outlier factor of each observation |
This function was originally published in the library dprep.
Caroline Rodriguez
Breuning, M., Kriegel, H., Ng, R.T, and Sander. J. (2000). LOF: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data
# Detecting the top 10 outliers using the LOF algorithm (iris.lof <- LOF(iris[,-5], 10))
Compute generalized logit and generalized inverse logit functions.
Logit(x, min = 0, max = 1) LogitInv(x, min = 0, max = 1)
x |
value(s) to be transformed |
min |
lower end of logit interval |
max |
upper end of logit interval |
The generalized logit function takes values on [min, max] and transforms them to span the whole real line.
It is defined as:

y = log((x - min) / (max - x))

The generalized inverse logit function provides the inverse transformation:

x = min + (max - min) * exp(y) / (1 + exp(y))
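Written out directly (using the definitions as reconstructed above):

x <- 2.5
y <- log((x - 0) / (10 - x))                 # cf. Logit(x, min=0, max=10)
0 + (10 - 0) * exp(y) / (1 + exp(y))         # recovers x, cf. LogitInv(y, min=0, max=10)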
Transformed value(s).
Gregory R. Warnes [email protected]
x <- seq(0,10, by=0.25) xt <- Logit(x, min=0, max=10) cbind(x,xt) y <- LogitInv(xt, min=0, max=10) cbind(x, xt, y)
Transforms the data by a log transformation, modifying small and zero observations such that the transformation is linear for x <= threshold and logarithmic for x > threshold. So the transformation yields finite values and is continuously differentiable.
LogSt(x, base = 10, calib = x, threshold = NULL, mult = 1) LogStInv(x, base = NULL, threshold = NULL)
x |
a vector or matrix of data, which is to be transformed |
base |
a positive or complex number: the base with respect to which logarithms are computed. Defaults to 10. Use base=exp(1) for the natural log. |
calib |
a vector or matrix of data used to calibrate the transformation(s), i.e., to determine
the constant |
threshold |
constant |
mult |
a tuning constant affecting the transformation of small values, see |
In order to avoid infinite values for x = 0 in log-transformations, a constant is often added to the variable before taking the logarithm. This is not always a pleasable strategy. The function LogSt handles this problem based on the following ideas:
The modification should only affect the values for "small" arguments.
What "small" is should be determined in connection with the non-zero values of the original variable, since it should behave well (be equivariant) with respect to a change in the "unit of measurement".
The function must remain monotone, and it should remain (weakly) convex.
These criteria are implemented here as follows: The shape is determined by a threshold c at which - coming from above - the log function switches to a linear function with the same slope at this point.

This is obtained by

log10(x) for x >= c, and log10(c) - (c - x) / (c * log(10)) for x < c

Small values are determined by the threshold c. If not given by the argument threshold, it is determined from the quartiles q1 and q3 of the non-zero data as

c = q1^(1 + mult) / q3^mult

where the exponent can be set by the argument mult.
The rationale is that, for lognormal data, this constant identifies 2 percent of the data as small.
Beyond this limit, the transformation continues linear with the derivative of the log curve at this point.
Another idea for choosing the threshold was: median(x) / (median(x) / quantile(x, 0.25))^2.9
The function uses base-10 logarithms by default rather than natural logs because they can be back-transformed relatively easily in one's head.
A generalized log (see: Rocke 2003) can be calculated in order to stabilize the variance as:
function (x, a) { return(log((x + sqrt(x^2 + a^2)) / 2)) }
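To visualise the piecewise shape described above, here is a minimal sketch with a fixed threshold (LogSt() itself derives the threshold from the data as described; the helper name is made up):

log_st_sketch <- function(x, c)
  ifelse(x >= c, log10(x), log10(c) - (c - x) / (c * log(10)))

# logarithmic above the threshold, linear (with matching slope) below it
curve(log_st_sketch(x, c=1), from=-1, to=5)
abline(v=1, lty=2)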
the transformed data. The value used for the transformation and needed for inverse transformation is returned as
attr(.,"threshold")
and the used base as attr(.,"base")
.
Werner A. Stahel, ETH Zurich
slight modifications Andri Signorell <[email protected]>
Rocke, D M, Durbin B (2003): Approximate variance-stabilizing transformations for gene-expression microarray data, Bioinformatics. 22;19(8):966-72.
dd <- c(seq(0,1,0.1), 5 * 10^rnorm(100, 0, 0.2)) dd <- sort(dd) r.dl <- LogSt(dd) plot(dd, r.dl, type="l") abline(v=attr(r.dl, "threshold"), lty=2) x <- rchisq(df=3, n=100) # should give 0 (or at least something small): LogStInv(LogSt(x)) - x
Compute the median absolute deviation, i.e., the (lo-/hi-) median of
the absolute deviations from the median, and (by default) adjust by a
factor for asymptotically normal consistency. This function wraps the specific base R function mad
and extends it for the use of weights.
MAD(x, weights = NULL, center = Median, constant = 1.4826, na.rm = FALSE, low = FALSE, high = FALSE)
x |
a numeric vector. |
weights |
a numerical vector of weights the same length as |
center |
the centre given either as numeric value or as a function to be applied to |
constant |
scale factor (default is |
na.rm |
if |
low |
if |
high |
if |
The actual value calculated is constant * cMedian(abs(x - center))
with the default value of center
being median(x)
, and
cMedian
being the usual, the ‘low’ or ‘high’ median, see
the arguments description for low
and high
above.
The default constant = 1.4826
(approximately
=
1/qnorm(3/4)
)
ensures consistency, i.e.,

E[MAD(X_1, ..., X_n)] = sigma

for X_i distributed as N(mu, sigma^2) and large n.
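The consistency factor can be illustrated with simulated normal data, where MAD() and sd() should give similar values:

set.seed(1)
x <- rnorm(1e5)
c(MAD(x), sd(x))   # both close to 1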
If na.rm
is TRUE
then NA
values are stripped from x
before computation takes place.
If this is not done then an NA
value in
x
will cause MAD
to return NA
.
IQR
which is simpler but less robust, IQRw
for weights,
mad
, median
, var
, MADCI
(confidence intervals).
MAD(c(1:9))
print(MAD(c(1:9), constant = 1)) == MAD(c(1:8, 100), constant = 1)  # = 2 ; TRUE

x <- c(1,2,3,5,7,8)
sort(abs(x - median(x)))
c(MAD(x, constant = 1),
  MAD(x, constant = 1, low = TRUE),
  MAD(x, constant = 1, high = TRUE))

# use weights
x <- sample(20, 30, replace = TRUE)
z <- as.numeric(names(w <- table(x)))
(m1 <- MAD(z, weights=w))
(m2 <- MAD(x))
stopifnot(identical(m1, m2))
A function for the median absolute deviation is included in base R, mad
, but there's no function for calculating confidence intervals. Arachchige/Prendergast introduce interval estimators of the MAD to make reliable inferences for dispersion for a single population and ratios and differences of MADs for comparing two populations.
MADCI(x, y = NULL, two.samp.diff = TRUE, gld.est = "TM", conf.level = 0.95, sides = c("two.sided","left","right"), na.rm = FALSE, ...)
x |
a (non-empty) numeric vector of data values. |
y |
a second (non-empty) numeric vector of data values. |
two.samp.diff |
logical, defining if the confidence intervals for the difference (mad(x)-mad(y)) (default) or for the squared ratio ((mad(x)/mad(y))^2) should be calculated. Ignored if y is not given. |
gld.est |
A character string, to select the estimation method for the generalized lambda distribution. One of: |
conf.level |
confidence level of the interval. |
sides |
a character string specifying the side of the confidence interval, must be one of |
na.rm |
logical. Should missing values be removed? Defaults to |
... |
further arguments, not used here |
a numeric vector with 3 elements:
mad |
median absolute deviation |
lwr.ci |
lower bound of the confidence interval |
upr.ci |
upper bound of the confidence interval |
Arachchige Chandima N. P. G., Prendergast Luke A., Andri Signorell <[email protected]> (only interface)
Arachchige Chandima N. P. G., Prendergast Luke A. (2019) Confidence intervals for median absolute deviations, arXiv:1910.00229 [math.ST]
x <- rlnorm(100) y <- rlnorm(200, meanlog=1.2) MADCI(x) # single sample MADCI(x, y) # two sample difference MADCI(x, y, two.samp.diff = FALSE) # two sample squared ratio
Plot margins are normally set by par("mar")
. However one is forced to always define all margins, even if just one should be altered. The convenience function Mar()
allows to set one single margin (or several) while leaving the others unchanged.
Mgp()
does the same for the distances of axis title, labels and line.
Mar(bottom = NULL, left = NULL, top = NULL, right = NULL, outer = FALSE, reset = FALSE) Mgp(title = NULL, labels = NULL, line = NULL, reset = FALSE)
bottom |
the bottom margin, if set to |
left |
the left margin, if set to |
top |
the top margin, if set to |
right |
the right margin, if set to |
outer |
logical, defining if inner margins ( |
reset |
if set to |
title |
margin line for the axis title (default 3) |
labels |
margin line for the axis labels (default 1) |
line |
margin line for the axis line (default 0) |
Running Mar()
without any arguments will return the current settings, either par("mar")
, when outer is set to FALSE
or par("oma")
for outer = TRUE
.
Andri Signorell <[email protected]>
# enlarge the left margin only
Mar(left=10.1)
# or as alternative: Mar(, 10.1)

Mgp(title=6)   # ylab must be placed a little further to the left

barplot(1:7, names=levels(d.pizza$driver), horiz=TRUE, las=1,
        ylab="driver", col=Pal("Helsana"))
Compute the k-th power of a matrix. Whereas x^k computes element wise powers, x %^% k corresponds to k - 1 matrix multiplications, x %*% x %*% ... %*% x.
x %^% k
x |
a square |
k |
an integer, |
Argument k is coerced to integer using as.integer.

The algorithm uses O(log2(k)) matrix multiplications.
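The repeated-squaring idea behind the O(log2(k)) multiplications can be sketched as follows (a sketch of the principle, not the package's implementation; the helper name is made up):

mat_pow_sketch <- function(x, k) {
  res <- diag(nrow(x))
  while (k > 0) {
    if (k %% 2 == 1) res <- res %*% x   # multiply in the current bit
    x <- x %*% x                        # square
    k <- k %/% 2
  }
  res
}
A <- cbind(1, 2 * diag(3)[,-1])
all.equal(mat_pow_sketch(A, 3), A %*% A %*% A)   # TRUE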
A matrix of the same dimension as x
.
If you think you need x^k for k < 0, then consider instead solve(x %^% (-k)).
Based on an R-help posting of Vicente Canto Casasola, and Vincent Goulet's C implementation in actuar.
%*%
for matrix multiplication.
A <- cbind(1, 2 * diag(3)[,-1]) A A %^% 2 stopifnot(identical(A, A %^% 1), A %^% 2 == A %*% A)
Generic function for the (trimmed) arithmetic mean, possibly with given weights.
Mean(x, ...)

## S3 method for class 'Freq'
Mean(x, breaks, ...)

## Default S3 method:
Mean(x, weights = NULL, trim = 0, na.rm = FALSE, ...)
x |
An object. Currently there are methods for numeric/logical vectors
and date, date-time and time interval objects.
Complex vectors are allowed for |
... |
further arguments passed to or from other methods. |
breaks |
breaks for calculating the mean for classified data as
composed by |
weights |
a numerical vector of weights the same length as |
trim |
the fraction (0 to 0.5) of observations to be trimmed from each
end of |
na.rm |
a logical value indicating whether |
If trim
is zero (the default), the arithmetic mean of the
values in x
is computed, as a numeric or complex vector of length
one. If x
is not logical (coerced to numeric), numeric (including
integer) or complex, NA_real_
is returned, with a warning.
If trim
is non-zero, a symmetrically trimmed mean is computed with a
fraction of trim
observations deleted from each end before the mean
is computed.
trim
and weights
can't be used together at the same time.
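For an unweighted vector the default method should agree with base mean(), including the trimmed case, e.g.:

x <- c(0:10, 50)
c(Mean(x, trim = 0.10), mean(x, trim = 0.10))   # same results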
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
weighted.mean
, mean.POSIXct
,
colMeans
for row and column means.
x <- c(0:10, 50) xm <- Mean(x) c(xm, Mean(x, trim = 0.10))
Calculates the mean absolute deviation from a center point, typically the sample mean or the median.
MeanAD(x, weights = NULL, center = Mean, na.rm = FALSE)
x |
a vector containing the observations. |
weights |
a numerical vector of weights the same length as |
center |
a single numerical value or the name of a function applied to |
na.rm |
a logical value indicating whether or not missing values should
be removed. Defaults to |
The MeanAD function calculates the mean absolute deviation from the mean value (or from another supplied center point) of x, after having removed NA values (if requested):

MeanAD(x) = 1/n * sum(abs(x_i - center))
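Written out for the unweighted case (cf. the examples below):

x <- c(2, 3, 5, 3, 1, 15, 23)
mean(abs(x - mean(x)))    # same value as MeanAD(x)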
The function
supports the use of weights. The default function for the center value
Mean()
has a weights argument, too. If a user-defined
function is used, it must be ensured that it has a weights argument.
Numeric value.
Andri Signorell [email protected] following an idea of Danielle
Navarro (aad
in the lsr package)
x <- runif(100) MeanAD(x) speed <- c(58, 88, 40, 60, 72, 66, 80, 48, NA) MeanAD(speed) MeanAD(speed, na.rm=TRUE) # using the median as centerpoint x <- c(2,3,5,3,1,15,23) MeanAD(x, center=mean) MeanAD(x, center=median) # define a fixed center MeanAD(x, center=4) # use of weights MeanAD(x=0:6, weights=c(21,46,54,40,24,10,5))
Collection of several approaches to determine confidence intervals for the mean. Both, the classical way and bootstrap intervals are implemented for both, normal and trimmed means.
MeanCI( x, sd = NULL, trim = 0, conf.level = 0.95, sides = c("two.sided", "left", "right"), method = c("classic", "boot"), na.rm = FALSE, ... )
x |
a (non-empty) numeric vector of data values. |
sd |
the standard deviation of x. If provided it's interpreted as sd of
the population and the normal quantiles will be used for constructing the
confidence intervals. If left to |
trim |
the fraction (0 to 0.5) of observations to be trimmed from each
end of |
conf.level |
confidence level of the interval. |
sides |
a character string specifying the side of the confidence
interval, must be one of |
method |
A vector of character strings representing the type of
intervals required. The value should be any subset of the values
|
na.rm |
a logical value indicating whether |
... |
further arguments are passed to the |
The confidence intervals for the trimmed means use winsorized variances as described in the references.
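For the untrimmed case with method="classic" the interval is the usual t-based interval; a minimal sketch of what this amounts to (assuming no population sd is supplied):

x  <- na.omit(d.pizza$price[1:20])
m  <- mean(x)
se <- sd(x) / sqrt(length(x))
m + qt(c(0.025, 0.975), df = length(x) - 1) * se   # cf. MeanCI(x, na.rm=TRUE)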
a numeric vector with 3 elements:
mean |
mean |
lwr.ci |
lower bound of the confidence interval |
upr.ci |
upper bound of the confidence interval |
Andri Signorell [email protected]
Wilcox, R. R., Keselman H. J. (2003) Modern robust data analysis methods: measures of central tendency Psychol Methods, 8(3):254-74
Wilcox, R. R. (2005) Introduction to robust estimation and hypothesis testing Elsevier Academic Press
Mean
, t.test
, MeanDiffCI
,
MedianCI
, VarCI
, MeanCIn
x <- d.pizza$price[1:20]

MeanCI(x, na.rm=TRUE)
MeanCI(x, conf.level=0.99, na.rm=TRUE)

MeanCI(x, sides="left")
# same as: t.test(x, alternative="greater")

MeanCI(x, sd=25, na.rm=TRUE)

# the different types of bootstrap confints
MeanCI(x, method="boot", type="norm", na.rm=TRUE)
MeanCI(x, trim=0.1, method="boot", type="norm", na.rm=TRUE)
MeanCI(x, trim=0.1, method="boot", type="basic", na.rm=TRUE)
MeanCI(x, trim=0.1, method="boot", type="stud", na.rm=TRUE)
MeanCI(x, trim=0.1, method="boot", type="perc", na.rm=TRUE)
MeanCI(x, trim=0.1, method="boot", type="bca", na.rm=TRUE)
MeanCI(x, trim=0.1, method="boot", type="bca", R=1999, na.rm=TRUE)

# Getting the MeanCI for more than 1 column
round(t(sapply(d.pizza[, 1:4], MeanCI, na.rm=TRUE)), 3)
Returns the required sample size to obtain a given width of a confidence interval for the sample mean. The function uses uniroot()
to find a numeric solution. The t distribution is used.
MeanCIn(ci, sd, interval = c(2, 100000), conf.level = 0.95, norm = FALSE, tol = .Machine$double.eps^0.5)
ci |
the left and right bound of the interval, which is presumed to be symmetric. |
sd |
the standard deviation of the sample. |
interval |
the interval for the sample size to be searched into, (default is c(2, 100000)). |
conf.level |
confidence level, defaults to |
norm |
logical, determining whether the t- or the normal distribution should be used. |
tol |
the desired accuracy (convergence tolerance). |
The required sample size for a specific width of the confidence interval for the mean depends recursively on the sample size, as the sample size defines the degrees of freedom of the t-distribution. Although in most practical cases it will be sufficient to use the normal distribution, we might be interested in exact results.
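The condition being solved can be written down directly; a sketch of the root-finding problem for the example below (half-width 1, sd 5):

halfwidth <- 1
s <- 5
f <- function(n) qt(0.975, df = n - 1) * s / sqrt(n) - halfwidth
uniroot(f, interval = c(2, 100000))$root   # compare MeanCIn(ci=c(25, 27), sd=5)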
a numeric value
Andri Signorell <[email protected]>
BinomCIn()
MeanCIn(ci=c(25, 27), sd=5)
Calculates the confidence interval for the difference of two means either the classical way or with the bootstrap approach.
MeanDiffCI(x, ...)

## Default S3 method:
MeanDiffCI(x, y, method = c("classic", "norm", "basic", "stud", "perc", "bca"),
           conf.level = 0.95, sides = c("two.sided", "left", "right"),
           paired = FALSE, na.rm = FALSE, R = 999, ...)

## S3 method for class 'formula'
MeanDiffCI(formula, data, subset, na.action, ...)
x |
a (non-empty) numeric vector of data values. |
y |
a (non-empty) numeric vector of data values. |
method |
a vector of character strings representing the type of intervals required. The value should be any subset of the values
|
conf.level |
confidence level of the interval. |
sides |
a character string specifying the side of the confidence interval, must be one of |
paired |
a logical indicating whether you want confidence intervals for a paired design. Defaults to |
na.rm |
logical. Should missing values be removed? Defaults to |
R |
the number of bootstrap replicates. Usually this will be a single positive integer. For importance resampling, some resamples may use one set of weights and others use a different set of weights. In this case R would be a vector of integers where each component gives the number of resamples from each of the rows of weights.
See |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain |
... |
further argument to be passed to or from methods. |
This function collects code from two sources. The classical confidence interval is calculated by means of t.test
.
The bootstrap intervals are strongly based on the example in boot
.
a numeric vector with 3 elements:
meandiff |
the difference: mean(x) - mean(y) |
lwr.ci |
lower bound of the confidence interval |
upr.ci |
upper bound of the confidence interval |
Andri Signorell <[email protected]>
MeanCI
, VarCI
, MedianCI
,
boot.ci
x <- d.pizza$price[d.pizza$driver=="Carter"] y <- d.pizza$price[d.pizza$driver=="Miller"] MeanDiffCI(x, y, na.rm=TRUE) MeanDiffCI(x, y, conf.level=0.99, na.rm=TRUE) # the different types of bootstrap confints MeanDiffCI(x, y, method="norm", na.rm=TRUE) MeanDiffCI(x, y, method="basic", na.rm=TRUE) # MeanDiffCI(x, y, method="stud", na.rm=TRUE) MeanDiffCI(x, y, method="perc", na.rm=TRUE) MeanDiffCI(x, y, method="bca", na.rm=TRUE) # the formula interface MeanDiffCI(price ~ driver, data=d.pizza, subset=driver %in% c("Carter","Miller"))
Calculates the standard error of mean.
MeanSE(x, sd = NULL, na.rm = FALSE)
x |
a (non-empty) numeric vector of data values. |
sd |
the standard deviation of |
na.rm |
logical. Should missing values be removed? Defaults to |
MeanSE calculates the standard error of the mean, defined as:

sd(x) / sqrt(n)

with sd(x) being the standard deviation of x and n the length of x.
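Written out for a single vector:

x <- na.omit(d.pizza$price)
sd(x) / sqrt(length(x))   # same value as MeanSE(d.pizza$price, na.rm=TRUE)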
the standard error as numeric value.
Andri Signorell <[email protected]>
data(d.pizza) MeanSE(d.pizza$price, na.rm=TRUE) # evaluate data.frame sapply(d.pizza[,1:4], MeanSE, na.rm=TRUE)
Some measures of model accuracy like mean absolute error (MAE), mean absolute percentage error (MAPE), symmetric mean absolute percentage error (SMAPE), mean squared error (MSE) and root mean squared error (RMSE).
MAE(x, ...)
## Default S3 method:
MAE(x, ref, na.rm = FALSE, ...)
## S3 method for class 'lm'
MAE(x, ...)

MAPE(x, ...)
## Default S3 method:
MAPE(x, ref, na.rm = FALSE, ...)
## S3 method for class 'lm'
MAPE(x, ...)

SMAPE(x, ...)
## Default S3 method:
SMAPE(x, ref, na.rm = FALSE, ...)
## S3 method for class 'lm'
SMAPE(x, ...)

MSE(x, ...)
## Default S3 method:
MSE(x, ref, na.rm = FALSE, ...)
## S3 method for class 'lm'
MSE(x, ...)

RMSE(x, ...)
## Default S3 method:
RMSE(x, ref, na.rm = FALSE, ...)
## S3 method for class 'lm'
RMSE(x, ...)

NMAE(x, ref, train.y)
NMSE(x, ref, train.y)
x |
the predicted values of a model or a model-object itself. |
ref |
the observed true values. |
train.y |
the observed true values in a train dataset. |
na.rm |
a logical value indicating whether or not missing values should be removed. Defaults to FALSE. |
... |
further arguments |
The function will remove NA
values first (if requested).
MAE calculates the mean absolute error: mean(abs(ref - x))

MAPE calculates the mean absolute percentage error: mean(abs(ref - x) / abs(ref))

SMAPE calculates the symmetric mean absolute percentage error: 2 * mean(abs(ref - x) / (abs(ref) + abs(x)))

MSE calculates the mean squared error: mean((ref - x)^2)

RMSE calculates the root mean squared error: sqrt(mean((ref - x)^2))
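For a fitted lm object the predictions are compared against the observed response (cf. the example below); some of the measures written out by hand, using the definitions as reconstructed above:

obs <- swiss$Fertility
p   <- predict(lm(Fertility ~ ., data = swiss))
c(MAE  = mean(abs(obs - p)),
  MAPE = mean(abs(obs - p) / abs(obs)),
  MSE  = mean((obs - p)^2),
  RMSE = sqrt(mean((obs - p)^2)))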
the specific numeric value
Andri Signorell <[email protected]>
Armstrong, J. S. (1985) Long-range Forecasting: From Crystal Ball to Computer, 2nd. ed. Wiley. ISBN 978-0-471-82260-8
https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error
Torgo, L. (2010) Data Mining with R: Learning with Case Studies, Chapman and Hall/CRC Press
r.lm <- lm(Fertility ~ ., data=swiss) MAE(r.lm) # the same as: MAE(predict(r.lm), swiss$Fertility) MAPE(r.lm) MSE(r.lm) RMSE(r.lm)
Skew
computes the skewness, Kurt
the excess kurtosis of the values in x.
Skew(x, weights = NULL, na.rm = FALSE, method = 3, conf.level = NA, ci.type = "bca", R = 1000, ...) Kurt(x, weights = NULL, na.rm = FALSE, method = 3, conf.level = NA, ci.type = "bca", R = 1000, ...)
x |
a numeric vector. An object which is not a vector is coerced (if possible) by |
weights |
a numerical vector of weights the same length as |
na.rm |
logical, indicating whether |
method |
integer out of 1, 2 or 3 (default). See Details. |
conf.level |
confidence level of the interval. If set to |
ci.type |
The type of confidence interval required. The value should be any subset
of the values |
R |
The number of bootstrap replicates. Usually this will be a single positive integer. For importance resampling,
some resamples may use one set of weights and others use a different set of weights. In this case |
... |
the dots are passed to the function |
Kurt()
returns the excess kurtosis, therefore the kurtosis calculates as Kurt(x) + 3
if required.
If na.rm
is TRUE
then missing values are removed before computation proceeds.
The methods for calculating the skewness can either be:

method = 1: g_1 = m_3 / m_2^(3/2)
method = 2: G_1 = g_1 * sqrt(n(n-1)) / (n-2)
method = 3: b_1 = m_3 / s^3 = g_1 ((n-1)/n)^(3/2)

and the ones for the kurtosis:

method = 1: g_2 = m_4 / m_2^2 - 3
method = 2: G_2 = ((n+1) g_2 + 6) * (n-1) / ((n-2)(n-3))
method = 3: b_2 = m_4 / s^4 - 3 = (g_2 + 3) (1 - 1/n)^2 - 3
method = 1 is the typical definition used in Stata and in many older textbooks.
method = 2 is used in SAS and SPSS.
method = 3 is used in MINITAB and BMDP.
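As an illustration, the default method = 3 for the skewness can be written out with plain moments (a hand calculation following the formulas above):

x  <- na.omit(d.pizza$price)
n  <- length(x)
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
m3 / m2^(3/2) * ((n - 1)/n)^(3/2)   # compare Skew(x, method = 3)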
Cramer (1997) mentions the asymptotic standard error of the skewness, resp. kurtosis:
ASE.skew = sqrt( 6n(n-1)/((n-2)(n+1)(n+3)) ) ASE.kurt = sqrt( (n^2 - 1)/((n-3)(n+5)) )
to be used for calculating the confidence intervals. This is implemented here with ci.type="classic"
. However, Joanes and Gill (1998) advise against this approach, pointing out that the normal assumptions would virtually always be violated.
They suggest using the bootstrap method. That's why the default method for the confidence interval type is set to "bca"
.
This implementation of the two functions is comparably fast, as the expensive sums are coded in C.
If conf.level
is set to NA
then the result will be
a |
single numeric value |
and
if a conf.level
is provided, a named numeric vector with 3 elements:
skew , kurt
|
the specific estimate, either skewness or kurtosis |
lwr.ci |
lower bound of the confidence interval |
upr.ci |
upper bound of the confidence interval |
Andri Signorell <[email protected]>, David Meyer <[email protected]> (method = 3)
Cramer, D. (1997): Basic Statistics for Social Research Routledge.
Joanes, D. N., Gill, C. A. (1998): Comparing measures of sample skewness and kurtosis. The Statistician, 47, 183-189.
mean
, sd
, similar code in library(e1071)
Skew(d.pizza$price, na.rm=TRUE) Kurt(d.pizza$price, na.rm=TRUE) # use sapply to calculate skewness for a data.frame sapply(d.pizza[,c("temperature","price","delivery_min")], Skew, na.rm=TRUE) # or apply to do that columnwise with a matrix apply(as.matrix(d.pizza[,c("temperature","price","delivery_min")]), 2, Skew, na.rm=TRUE)
Compute the sample median. The function basically wraps the function Quantile()
, which offers the option to define weights.
For grouped data the median can be estimated by linear interpolation within the class containing the median, which is implemented in the interface for Freq
-objects.
Median(x, ...)

## S3 method for class 'factor'
Median(x, na.rm = FALSE, ...)

## S3 method for class 'Freq'
Median(x, breaks, ...)

## Default S3 method:
Median(x, weights = NULL, na.rm = FALSE, ...)
x |
an object for which a method has been defined, or a numeric vector containing the values whose median is to be computed. |
weights |
a numerical vector of weights the same length as |
breaks |
breaks for calculating the mean for classified data as composed by |
na.rm |
a logical value indicating whether |
... |
further arguments passed to or from other methods. |
This is a generic function for which methods can be written. However,
the default method makes use of is.na
, sort
and
mean
from package base all of which are generic, and so
the default method will work for most classes
(e.g., "Date"
) for which a median is a reasonable
concept.
Calculating the median for ordered factors is not implemented in standard R, as it's not well defined (it is not clear what to do if the median sits between two levels in factors of even length). This function returns the high median and prints a warning if the low median would be different (which is supposed to be a rare event). There's a vivid discussion between experts going on whether this should be defined or not. We'll wait for definitive results and enjoy the function's comfort so far...
Note that there are alternative approaches for calculating a weighted median (e.g. matrixStats::weightedMedian
).
The default method returns a length-one object of the same type as
x
, except when x
is integer of even length, when the
result will be double.
If there are no values or if na.rm = FALSE
and there are NA
values the result is NA
of the same type as x
(or more
generally the result of x[FALSE][NA]
).
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
quantile
for general quantiles.
https://stat.ethz.ch/pipermail/r-help/2003-November/042684.html
https://stackoverflow.com/questions/7925102/idiomatic-method-of-finding-the-median-of-an-ordinal
Median(1:4)                # = 2.5 [even number]
Median(c(1:3, 100, 1000))  # = 3 [odd, robust]

# Approximation for classified data
breaks <- seq(10, 70, 10)
Median(Freq(cut(d.pizza$temperature, breaks=breaks)), breaks=breaks)

# compared to
Median(d.pizza$temperature)

# starting from a classified table
# from    to  income
#    0  4000      20
# 4000  6000      42
# 6000  8000      31
# 8000 10000      12

# Freq(as.table(c(20,42,31,12)))
#   level  freq   perc  cumfreq  cumperc
# 1     A    20  19.0%       20    19.0%
# 2     B    42  40.0%       62    59.0%
# 3     C    31  29.5%       93    88.6%
# 4     D    12  11.4%      105   100.0%

Median(Freq(as.table(c(20,42,31,12))), breaks=c(0,4000,6000,8000,10000))

# use weights
x <- sample(20, 30, replace = TRUE)
z <- as.numeric(names(w <- table(x)))
(m1 <- Median(z, weights=w))
(m2 <- Median(x))
stopifnot(identical(m1, m2))
Calculate the confidence interval for the median.
MedianCI( x, conf.level = 0.95, sides = c("two.sided", "left", "right"), method = c("exact", "boot"), na.rm = FALSE, ... )
x |
a (non-empty) numeric vector of data values. |
conf.level |
confidence level of the interval |
sides |
a character string specifying the side of the confidence
interval, must be one of |
method |
defining the type of interval that should be calculated (one
out of |
na.rm |
logical. Should missing values be removed? Defaults to
|
... |
the dots are passed on to |
The "exact"
method is the way SAS is said to calculate the confidence
interval. This is also implemented in SignTest
. The boot
confidence interval type is calculated by means of boot.ci
with default type "perc"
.
Use sapply
,
resp. apply
, to get the confidence intervals from a data.frame
or from a matrix.
a numeric vector with 3 elements:
median |
median |
lwr.ci |
lower bound of the confidence interval |
upr.ci |
upper bound of the confidence interval |
Andri Signorell [email protected]
wilcox.test
, MeanCI
,
median
, HodgesLehmann
MedianCI(d.pizza$price, na.rm=TRUE)
MedianCI(d.pizza$price, conf.level=0.99, na.rm=TRUE)

t(round(sapply(d.pizza[,c("delivery_min","temperature","price")],
               MedianCI, na.rm=TRUE), 3))

MedianCI(d.pizza$price, na.rm=TRUE, method="exact")
MedianCI(d.pizza$price, na.rm=TRUE, method="boot")

x <- runif(100)

set.seed(448)
MedianCI(x, method="boot")
# ... the same as
set.seed(448)
MedianCI(x, method="boot", type="bca")

MedianCI(x, method="boot", type="basic")
MedianCI(x, method="boot", type="perc")
MedianCI(x, method="boot", type="norm", R=499)
# not supported:
MedianCI(x, method="boot", type="stud")

MedianCI(x, method="boot", sides="right")
Performs multiple substitutions in (a) string(s).
Mgsub(pattern, replacement, x, ...)
pattern |
character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. |
replacement |
a replacement for matched pattern as in |
x |
a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported. |
... |
all dots are passed on to gsub. |
a character vector of the same length and with the same attributes as x (after possible coercion to character).
Andri Signorell <[email protected]>
x <- c("ABC", "BCD", "CDE") Mgsub(pattern=c("B", "C"), replacement=c("X","Y"), x)
The Mantel-Haenszel chi-square statistic tests the alternative hypothesis that there is a linear association between the row variable and the column variable. Both variables must lie on an ordinal scale.
MHChisqTest(x, srow = 1:nrow(x), scol = 1:ncol(x))
x |
a frequency table or a matrix. |
srow |
scores for the row variable, defaults to 1:nrow. |
scol |
scores for the colummn variable, defaults to 1:ncol. |
The statistic is computed as Q_MH = (n - 1) * r^2, where r
is the Pearson correlation between the row variable and the column variable. The Mantel-Haenszel chi-square statistic uses the scores specified by srow and scol.
Under the null hypothesis of no association, Q_MH
has an asymptotic chi-square distribution with one degree of freedom.
A list with class "htest"
containing the following
components:
statistic |
the value of the Mantel-Haenszel chi-squared test statistic. |
parameter |
the degrees of freedom of the approximate chi-squared distribution of the test statistic. |
p.value |
the p-value for the test. |
method |
a character string indicating the type of test performed. |
data.name |
a character string giving the name(s) of the data. |
Andri Signorell <[email protected]>
Agresti, A. (2002) Categorical Data Analysis. John Wiley & Sons, pp 86 ff.
chisq.test
,
for calculating correlation of a table: corr
## A r x c table Agresti (2002, p. 57) Job Satisfaction Job <- matrix(c(1,2,1,0, 3,3,6,1, 10,10,14,9, 6,7,12,11), 4, 4, dimnames = list(income = c("< 15k", "15-25k", "25-40k", "> 40k"), satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS")) ) MHChisqTest(Job, srow=c(7.5,20,32.5,60))
Calculate the midpoints of a sequence of numbers. This is e.g. useful for labelling stacked barplots.
Midx(x, incl.zero = FALSE, cumulate = FALSE)
x |
the numeric vector |
incl.zero |
should zero be appended to x before proceeding? If |
cumulate |
should the result be calculated as cumulative sum? Default is FALSE. |
numeric vector with the calculated midpoints
Andri Signorell <[email protected]>
x <- c(1, 3, 6, 7) Midx(x) Midx(x, incl.zero = TRUE) Midx(x, incl.zero = TRUE, cumulate = TRUE) # an alternative to head(MoveAvg(c(0, x), order = 2, align = "l"), n = -1) tab <- matrix(c(401,216,221,254,259,169), nrow=2, byrow=TRUE) b <- barplot(tab, beside = FALSE, horiz=TRUE) x <- t(apply(tab, 2, Midx, incl.zero=TRUE, cumulate=TRUE)) text(tab, x=x, y=b, col="red")
This function can be used to compute the result of color mixing (it assumes additive mixing).
MixColor(col1, col2, amount1 = 0.5)
col1 |
the first color. |
col2 |
the second color. |
amount1 |
the amount of color1. The amount of color2 results in (1-amount1). |
The mixed color as hexstring
Andri Signorell <[email protected]>
# a mix between red and yellow with rates 3:7 MixColor("red", "yellow", 0.3)
Calculate the mode, the most frequent value, of a numeric or character vector x.
Mode(x, na.rm = FALSE)
x |
a (non-empty) numeric vector of data values. |
na.rm |
logical. Should missing values be removed? Defaults to |
The mode is usually useful for qualitative data, sometimes still for an integer vector. For numerical vectors, it is not so much the central tendency property of the mode that is interesting as the information about conspicuous accumulation points, which sometimes can indicate data errors. In Desc()
it is integrated in the numeric description to draw the analyst's attention to strikingly high frequencies of a single value as soon as they exceed a certain threshold. (In a numeric vector we would in general rather expect low numbers of tied values, or we should be aware of the process properties that generate them.)
The handling of NA
values follows the standards of the package. As soon as a single NA
value occurs, NA
is returned as result. This approach can sometimes be conservative when calculating the mode. The mode could be determined unambiguously in cases where the number of missing values is small enough that - regardless of what value they have - they cannot alter the sample mode. The modal frequency could then be determined within a lower and upper range. In the example of x=c(1,1,1,1,2,2,NA)
we know that the mode of x is 1 regardless of what the true value is for the one missing value; and we know that the modal frequency must be between 4 and 5. However this is not implemented in the function and further considerations in this direction are left to the user here.
The mode is elsewhere often calculated in a crude and wasteful way by tabulating the frequency for all elements of the vector and returning the most frequent one. This function uses a sophisticated data structure in C++ and is limited to determining the most frequent element only. Therefore it is orders of magnitude faster than other implementations, especially for large numeric vectors with large numbers of distinct values.
You might furthermore consider using density(x)$x[which.max(density(x)$y)]
for quantitative data or alternatively use hist()
.
Another interesting idea for a more robust estimation of the mode:
peak <- optimize(function(x, model) predict(model, data.frame(x = x)), c(min(x), max(x)), maximum = TRUE, model = y.loess) points(peak$maximum, peak$objective, pch=FILLED.CIRCLE <- 19)
The most frequent value as number or character, depending of class(x)
. If there is more than one, all are returned in a vector.
The modal frequency is attached as attribute named "freq"
.
Andri Signorell <[email protected]>, great Rcpp part by Joseph Wood and Ralf Stubner
https://stackoverflow.com/questions/55212746/rcpp-fast-statistical-mode-function-with-vector-input-of-any-type/ https://stackoverflow.com/a/55213471/8416610
# normal mode
Mode(c(0:5, 5))
Mode(5)
Mode(NA)
Mode(c(NA, NA))
Mode(c(NA, 0:5))
Mode(c(NA, 0:5), na.rm=TRUE)
Mode(c(NA, 0:5, 5), na.rm=TRUE)

# returns all encountered modes, if several exist
Mode(c(0:5, 4, 5, 6))

Mode(d.pizza$driver)
Mode(d.pizza$driver, na.rm=TRUE)
Mode(as.character(d.pizza$driver), na.rm=TRUE)

# use sapply for evaluating data.frames (resp. apply for matrices)
sapply(d.pizza[,c("driver", "temperature", "date")], Mode, na.rm=TRUE)
Perform Moses test of extreme reactions, which is a distribution-free non-parametric test for the difference between two independent groups in the extremity of scores (in both directions) that the groups contain. Scores from both groups are pooled and converted to ranks, and the test statistic is the span of scores (the range plus 1) in one of the groups chosen arbitrarily. An exact probability is computed for the span and then recomputed after dropping a specified number of extreme scores from each end of its range. The exact one-tailed probability is calculated.
MosesTest(x, ...)

## Default S3 method:
MosesTest(x, y, extreme = NULL, ...)

## S3 method for class 'formula'
MosesTest(formula, data, subset, na.action, ...)
x |
numeric vector of data values. |
y |
numeric vector of data values. |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain |
extreme |
integer, defines the number of extreme values to be dropped from the control group before calculating the
span. Default ( |
... |
further arguments to be passed to or from methods. |
For two independent samples from a continuous field, this tests whether extreme values are equally likely in both populations or if they are more likely to occur in the population from which the sample with the larger range was drawn.
Note that the ranks are calculated in decreasing order.
A list with class “htest” containing the following components:
statistic |
the value of the Moses Test statistic. |
p.value |
the p-value for the test. |
method |
the character string “Moses Test of Extreme Reactions”. |
data.name |
a character string giving the name(s) of the data. |
Andri Signorell <[email protected]>
Moses, L.E. (1952) A Two-Sample Test, Psychometrika, 17, 239-247.
x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91, 1.64, 0.73, 1.46) y <- c(1.15, 0.88, 0.90, 0.74, 1.21) MosesTest(x, y) set.seed(1479) x <- sample(1:20, 10, replace=TRUE) y <- sample(5:25, 6, replace=TRUE) MosesTest(x, y)
Compute a simple moving average (running mean).
MoveAvg(x, order, align = c("center", "left", "right"), endrule = c("NA", "keep", "constant"))
x |
univariate time series. |
order |
order of moving average. |
align |
specifies whether result should be centered (default), left-aligned or right-aligned. |
endrule |
character string indicating how the values at the beginning and the end (of the data) should be treated.
|
The implementation uses the function filter
to calculate the moving average.
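A centred moving average of order 5 via stats::filter, the mechanism named above (a sketch; the handling of the ends by endrule is not reproduced here):

head(as.numeric(stats::filter(AirPassengers, rep(1/5, 5), sides = 2)), 10)
# compare: head(MoveAvg(AirPassengers, order=5), 10)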
Returns a vector of the same size and same class as x.
Andri Signorell <[email protected]>
There's a faster implementation of running mean in the package caTools runmean()
and a slower one in forecast ma()
. There's similar code in Midx()
.
MoveAvg(AirPassengers, order=5)
Confidence intervals for multinomial proportions are often approximated by single binomial confidence intervals, which might in practice often yield satisfying results, but is properly speaking not correct. This function calculates simultaneous confidence intervals for multinomial proportions either according to the methods of Sison and Glaz, Goodman, Wald, Wald with continuity correction or Wilson.
MultinomCI(x, conf.level = 0.95, sides = c("two.sided", "left", "right"), method = c("sisonglaz", "cplus1", "goodman", "wald", "waldcc", "wilson", "qh", "fs"))
x |
A vector of positive integers representing the number of occurrences of each class. The total number of samples equals the sum of such elements. |
conf.level |
confidence level, defaults to 0.95. |
sides |
a character string specifying the side of the confidence interval, must be one of |
method |
character string specifying which method to use; can be one out of
|
Given a vector of observations with the number of samples falling in each class of a multinomial distribution,
the function builds simultaneous confidence intervals for the multinomial probabilities according to the method proposed by the mentioned authors.
The R code for Sison and Glaz (1995) has been translated from the SAS code written by May and Johnson (2000). See the references for the other methods (qh = Quesensberry-Hurst, fs = Fitzpatrick-Scott).
Some approaches for the confidence intervals can potentially yield negative results or values beyond 1. These are reset so as not to exceed the range of [0, 1].
A matrix with 3 columns:
est |
estimate |
lwr.ci |
lower bound of the confidence interval |
upr.ci |
upper bound of the confidence interval |
The number of rows corresponds to the dimension of x.
Pablo J. Villacorta Iglesias <[email protected]>
Department of Computer Science and Artificial Intelligence, University of Granada (Spain)
(Sison-Glaz)
Andri Signorell <[email protected]> (Goodman, Wald, Wilson, Fitzpatrick-Scott, Quesensberry-Hurst)
Fitzpatrick, S. and Scott, A. (1987). Quick simultaneous confidence interval for multinomial proportions. Journal of American Statistical Association 82(399): 875-878.
Glaz, J., Sison, C.P. (1999) Simultaneous confidence intervals for multinomial proportions. Journal of Statistical Planning and Inference 82:251-262.
Goodman, L. A. (1965) On Simultaneous Confidence Intervals for Multinomial Proportions Technometrics, 7, 247-254.
May, W.L., Johnson, W.D.(2000) Constructing two-sided simultaneous confidence intervals for multinomial proportions for small counts in a large number of cells. Journal of Statistical Software 5(6) . Paper and code available at https://www.jstatsoft.org/v05/i06.
Quesensberry, C.P. and Hurst, D.C. (1964). Large Sample Simultaneous Confidence Intervals for Multinomial Proportions. Technometrics, 6: 191-195.
Sangeetha, U., Subbiah, M., Srinivasan, M. R. (2013) Mathematical Analysis of propensity of aberration on the methods for interval estimation of the multinomial proportions. IOSR Journal of Mathematics, e-ISSN: 2278-5728,p-ISSN: 2319-765X, Volume 7, Issue 4 (Jul. - Aug. 2013), PP 23-28
Sison, C.P and Glaz, J. (1995) Simultaneous confidence intervals and sample size determination for multinomial proportions. Journal of the American Statistical Association, 90:366-369.
Wald, A. Tests of statistical hypotheses concerning several parameters when the number of observations is large, Trans. Am. Math. Soc. 54 (1943) 426-482.
Wilson, E. B. Probable inference, the law of succession and statistical inference, J.Am. Stat. Assoc. 22 (1927) 209-212.
# Multinomial distribution with 3 classes, from which a sample of 79 elements
# were drawn: 23 of them belong to the first class, 12 to the
# second class and 44 to the third class. Punctual estimations
# of the probabilities from this sample would be 23/79, 12/79
# and 44/79 but we want to build 95% simultaneous confidence intervals
# for the true probabilities

MultinomCI(c(23, 12, 44), conf.level=0.95)

# single sided
MultinomCI(c(23, 12, 44), conf.level=0.95, sides="left")
MultinomCI(c(23, 12, 44), conf.level=0.95, sides="right")

x <- c(35, 74, 22, 69)

MultinomCI(x, method="goodman")
MultinomCI(x, method="sisonglaz")
MultinomCI(x, method="cplus1")
MultinomCI(x, method="wald")
MultinomCI(x, method="waldcc")
MultinomCI(x, method="wilson")

# compare to
BinomCI(x, n=sum(x))

# example in Goodman (1965)
MultinomCI(x = c(91,49,37,43), conf.level = 0.95, method="goodman")

# example from Sison, Glaz (1999) in Sangeetha (2013) - Table 2
x <- c(56, 72, 73, 59, 62, 87, 58)
do.call(cbind, lapply(c("wald", "waldcc", "wilson", "qh", "goodman", "fs", "sisonglaz"),
                      function(m) round(MultinomCI(x, method=m)[,-1], 3)))
Merge multiple data frames by row names, or do other versions of database join operations.
MultMerge(..., all.x = TRUE, all.y = TRUE, by = NULL)
... |
data frames to be coerced to one. |
all.x |
logical; if |
all.y |
logical; analogous to |
by |
column used for merging, if this is not defined rownames will be used by default. The column must be included in all the provided data frames. |
A data frame. The rows are sorted according to the first appearance of each row name: the row names appearing in the first data frame come first, followed by those row names of the second data frame that have no correspondence in the first data frame, and so on. The columns are the remaining columns of x1, then those of x2 and then those of x3. The result has the row names resulting from the merge.
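A roughly equivalent base-R construction (a sketch only, assuming the data frames x1, x2, x3 from the example below; MultMerge's column naming and row ordering may differ) is to reduce merge() over the list of data frames:

dfs <- list(x1, x2, x3)
merged <- Reduce(function(a, b) {
  m <- merge(a, b, by = 0, all = TRUE)   # by = 0 merges on row names
  rownames(m) <- m$Row.names             # restore row names for the next merge
  m$Row.names <- NULL
  m
}, dfs)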
Andri Signorell <[email protected]>
x1 <- SetNames(data.frame(v=letters[1:6], w=1:6),
               rownames=c("A", "B", "C", "D", "E", "F"))
x2 <- SetNames(data.frame(v=letters[1:3], ww=11:13),
               rownames=c("B", "C", "D"))
x3 <- SetNames(data.frame(v=letters[12:16], wwww=22:26),
               rownames=c("A", "C", "E", "G", "J"))

# default is "merge by rownames"
MultMerge(x1, x2, x3)
# ... which does not really make sense here

# merge by column v
MultMerge(x1, x2, x3, by="v")
In order to replace the NAs in a factor, an additional level has to be defined first. This function does that and then replaces the NAs by the given level.
NALevel(x, level)
x |
a vector which will be turned into a factor. |
level |
the name for the new level |
the vector x with the NAs replaced by level
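In base R the same effect can be achieved along these lines (a minimal sketch of the idea, not necessarily the package's exact implementation):

x <- c(LETTERS[1:5], NA)
f <- factor(x)                               # NA is not a level yet
levels(f) <- c(levels(f), "something else")  # define the additional level
f[is.na(f)] <- "something else"              # replace the NAs
table(f)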
Andri Signorell <[email protected]>
x <- c(LETTERS[1:5], NA)
table(NALevel(x, "something else"))
Performs Nemenyi's test of multiple comparisons.
NemenyiTest(x, ...)

## Default S3 method:
NemenyiTest(x, g, dist = c("tukey", "chisq"), out.list = TRUE, ...)

## S3 method for class 'formula'
NemenyiTest(formula, data, subset, na.action, ...)
x |
a numeric vector of data values, or a list of numeric data vectors. |
g |
a vector or factor object giving the group for the
corresponding elements of |
dist |
the distribution used for the test. Can be |
out.list |
logical, defining whether the output should be organized in list form. |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see
|
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when
the data contain |
... |
further arguments to be passed to or from methods. |
Nemenyi proposed a test based on rank sums and the application of the family-wise error method to control Type I error inflation when multiple comparisons are made. The Tukey and Kramer approach uses mean rank sums and can be employed for equally as well as unequally sized samples without ties.
A list of class htest
, containing the following components:
statistic |
Nemenyi test |
p.value |
the p-value for the test |
null.value |
is the value of the median specified by the null hypothesis. This
equals the input argument |
alternative |
a character string describing the alternative hypothesis. |
method |
the type of test applied |
data.name |
a character string giving the names of the data. |
Andri Signorell <[email protected]>
Nemenyi, P. B. (1963) Distribution-Free Multiple Comparisons New York, State University of New York, Downstate Medical Center
Hollander, M., Wolfe, D.A. (1999) Nonparametric Statistical Methods New York, Wiley, pp. 787
Friedman, M. (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance Journal of the American Statistical Association, 32:675-701
Friedman, M. (1940) A comparison of alternative tests of significance for the problem of m rankings Annals of Mathematical Statistics, 11:86-92
## Hollander & Wolfe (1973), 116.
## Mucociliary efficiency from the rate of removal of dust in normal
## subjects, subjects with obstructive airway disease, and subjects
## with asbestosis.
x <- c(2.9, 3.0, 2.5, 2.6, 3.2)   # normal subjects
y <- c(3.8, 2.7, 4.0, 2.4)        # with obstructive airway disease
z <- c(2.8, 3.4, 3.7, 2.2, 2.0)   # with asbestosis

NemenyiTest(list(x, y, z))

## Equivalently,
x <- c(x, y, z)
g <- factor(rep(1:3, c(5, 4, 5)),
            labels = c("Normal subjects",
                       "Subjects with obstructive airway disease",
                       "Subjects with asbestosis"))

NemenyiTest(x, g)

## Formula interface.
boxplot(Ozone ~ Month, data = airquality)
NemenyiTest(Ozone ~ Month, data = airquality)

# Hedderich & Sachs, 2012, p. 555
d.frm <- data.frame(x=c(28,30,33,35,38,41, 36,39,40,43,45,50, 44,45,47,49,53,54),
                    g=c(rep(LETTERS[1:3], each=6)),
                    stringsAsFactors=TRUE)

NemenyiTest(x~g, d.frm)
Encode a vector x to a factor and then to a numeric value. It's a simple shortcut for as.numeric(factor(x, ...))
Nf(x, ...)
x |
a vector of data, usually taking a small number of distinct values. |
... |
the dots are passed on to |
numeric vector
Andri Signorell <[email protected]>
x <- LETTERS[10:15]
Nf(x)

# same as ..
as.numeric(factor(x))
Calculate the one period returns, the net present value (NPV()) and the internal rate of return (IRR()) of a sequence of payments. NPVFixBond() returns the net present value for a fixed-rate bond, YTM() the yield to maturity for a bond.
OPR(K, D = NULL, log = FALSE)

NPV(i, cf, t = seq(along = cf) - 1)

IRR(cf, t = seq(along = cf) - 1, interval = c(-1.5, 1.5), ...)

NPVFixBond(i, Co, RV, n)

YTM(Co, PP, RV, n)
i |
the interest rate |
cf |
numeric vector with the payments |
t |
periods |
K |
the capital at time t |
D |
dividend at time t |
log |
logical, determining if the simple returns (default) or log returns are to be calculated. |
interval |
a vector containing the end-points of the interval to
be searched for the root in the function |
Co |
coupon payments of a fixed-rate bond |
PP |
purchase price for a fixed-rate bond |
RV |
redemption value |
n |
the term of the bond, total number of periods |
... |
the dots are passed to the |
The one period returns are calculated as r_t = (K_t + D_t - K_{t-1}) / K_{t-1}, i.e. the change of the capital K plus the dividend D paid in period t, relative to the capital in the preceding period. With log = TRUE the corresponding log returns log((K_t + D_t) / K_{t-1}) are returned instead.
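Both NPV() and IRR() follow the usual textbook definitions; a minimal sketch of the underlying computation (hypothetical cash flows; the IRR is found as the root of the NPV, which is what the interval argument passed to uniroot() refers to):

cf <- c(-900, 200, 300, 400, 500)               # cash flows at t = 0, 1, ..., 4
t  <- seq(along = cf) - 1

npv <- function(i, cf, t) sum(cf / (1 + i)^t)   # sum of discounted cash flows
npv(0.05, cf, t)

uniroot(npv, interval = c(-0.9, 1.5), cf = cf, t = t)$root   # internal rate of return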
a numeric value
Andri Signorell <[email protected]>
# one root
IRR(cf <- c(-900, -250+450-90, 460-100, 500-120, 550-140))

# several IRR solutions
IRR(cf = c(-100, 500, -600))

# no solution
IRR(cf = c(-100, 400, -600))

# negative and huge solution
IRR(cf = c(-100, 1000, -600), interval = c(-1.5, 1000))
Return a vector with all zero elements removed.
NZ(x)
x |
numeric vector |
numeric vector with the zero elements removed
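The operation is essentially simple subsetting; a one-line sketch (ignoring possible NA handling):

x <- c(1, 2, 0, 3)
x[x != 0]    # same idea as NZ(x)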
Andri Signorell <[email protected]>
x <- c(1, 2, 0, 3)
NZ(x)
Calculates odds ratio by unconditional maximum likelihood estimation (wald), conditional maximum likelihood estimation (mle) or median-unbiased estimation (midp). Confidence intervals are calculated using normal approximation (wald) and exact methods (midp, mle).
OddsRatio(x, conf.level = NULL, ...)

## S3 method for class 'glm'
OddsRatio(x, conf.level = NULL, digits = 3, use.profile = FALSE, ...)

## S3 method for class 'multinom'
OddsRatio(x, conf.level = NULL, digits = 3, ...)

## S3 method for class 'zeroinfl'
OddsRatio(x, conf.level = NULL, digits = 3, ...)

## Default S3 method:
OddsRatio(x, conf.level = NULL, y = NULL, method = c("wald", "mle", "midp"),
          interval = c(0, 1000), ...)
x |
a vector or a |
y |
NULL (default) or a vector with compatible dimensions to |
digits |
the number of fixed digits to be used for printing the odds ratios. |
method |
method for calculating odds ratio and confidence intervals. Can be one out of
" |
conf.level |
confidence level. Default is |
interval |
interval for the function |
use.profile |
logical. Defines if profile approach should be used, which normally is a good choice. Calculating profile can however take ages for large datasets and not be necessary there. So we can fallback to normal confidence intervals. |
... |
further arguments are passed to the function |
If a table is provided the following table structure is preferred:
              disease=1   disease=0
exposed=1        n11         n10
exposed=0        n01         n00
however, for odds ratios the following table is equivalent:
                   disease=0   disease=1
exposed=0 (ref)       n00         n01
exposed=1             n10         n11
If the table to be provided to this function is not in the preferred form, the function Rev() can be used to "reverse" the table rows and/or columns. Note that reversing either the rows or the columns (but not both) will lead to the inverse of the odds ratio.
In case of zero entries, 0.5 will be added to the table.
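For a 2x2 table in the preferred orientation, the default Wald computation can be sketched by hand as follows (made-up counts; the function itself additionally handles zero cells, one-sided intervals and the other methods):

tab <- matrix(c(11, 4, 2, 6), nrow = 2, byrow = TRUE)      # n11, n10 / n01, n00
or  <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])   # sample odds ratio
se  <- sqrt(sum(1 / tab))                                  # standard error of log(OR)
ci  <- exp(log(or) + qnorm(c(0.025, 0.975)) * se)          # Wald confidence interval
c(est = or, lwr.ci = ci[1], upr.ci = ci[2])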
a single numeric value if conf.level is set to NA
a numeric vector with 3 elements for estimate, lower and upper confidence interval if conf.level is provided
Andri Signorell <[email protected]>, strongly based on code from Tomas Aragon, <[email protected]>
Kenneth J. Rothman and Sander Greenland (1998): Modern Epidemiology, Lippincott-Raven Publishers
Kenneth J. Rothman (2002): Epidemiology: An Introduction, Oxford University Press
Nicolas P. Jewell (2004): Statistics for Epidemiology, 1st Edition, 2004, Chapman & Hall, pp. 73-81
Agresti, Alan (2013) Categorical Data Analysis. NY: John Wiley and Sons, Chapt. 3.1.1
# Case-control study assessing whether exposure to tap water
# is associated with cryptosporidiosis among AIDS patients
tab <- matrix(c(2, 29, 35, 64, 12, 6), 3, 2, byrow=TRUE)
dimnames(tab) <- list("Tap water exposure" = c("Lowest", "Intermediate", "Highest"),
                      "Outcome" = c("Case", "Control"))
tab <- Rev(tab, margin=2)

OddsRatio(tab[1:2,])
OddsRatio(tab[c(1,3),])

OddsRatio(tab[1:2,], method="mle")
OddsRatio(tab[1:2,], method="midp")
OddsRatio(tab[1:2,], method="wald", conf.level=0.95)

# in case of zeros consider using glm for calculating OR
dp <- data.frame(a=c(20, 7, 0, 0), b=c(0, 0, 0, 12), t=c(1, 0, 1, 0))
fit <- glm(cbind(a, b) ~ t, data=dp, family=binomial)
exp(coef(fit))

# calculation of log oddsratios in a 2x2xk table
migraine <- xtabs(freq ~ .,
                  cbind(expand.grid(treatment=c("active","placebo"),
                                    response=c("better","same"),
                                    gender=c("female","male")),
                        freq=c(16,5,11,20,12,7,16,19)))
log(apply(migraine, 3, OddsRatio))

# OddsRatio table for logistic regression models
r.glm <- glm(type ~ ., data=MASS::Pima.tr2, family=binomial)
OddsRatio(r.glm)

plot(OddsRatio(r.glm), xlim=c(0.5, 2), main="OddsRatio - glm", pch=NA,
     lblcolor=DescTools::hred,
     args.errbars=list(col=DescTools::horange, pch=21, col.pch=DescTools::hblue,
                       bg.pch=DescTools::hyellow, cex.pch=1.5))
Density function, distribution function and random generation for a selected order statistic of a given number of independent variables from a specified distribution.
dOrder(x, densfun, distnfun, ..., distn, mlen = 1, j = 1,
       largest = TRUE, log = FALSE)

pOrder(q, distnfun, ..., distn, mlen = 1, j = 1,
       largest = TRUE, lower.tail = TRUE)

rOrder(n, quantfun, ..., distn, mlen = 1, j = 1, largest = TRUE)
x , q
|
Vector of quantiles. |
n |
Number of observations. |
densfun , distnfun , quantfun
|
Density, distribution and
quantile function of the specified distribution. The density
function must have a |
... |
Parameters of the specified distribution. |
distn |
A character string, optionally specified as an
alternative to |
mlen |
The number of independent variables. |
j |
The Order statistic, taken as the |
largest |
Logical; if |
log |
Logical; if |
lower.tail |
Logical; if |
dOrder gives the density function, pOrder gives the distribution function and qOrder gives the quantile function of a selected order statistic from a sample of size mlen, from a specified distribution. rOrder generates random deviates.
Alec Stephenson <[email protected]>
dOrder(2:4, dnorm, pnorm, mean = 0.5, sd = 1.2, mlen = 5, j = 2)
dOrder(2:4, distn = "norm", mean = 0.5, sd = 1.2, mlen = 5, j = 2)
dOrder(2:4, distn = "exp", mlen = 2, j = 2)
pOrder(2:4, distn = "exp", rate = 1.2, mlen = 2, j = 2)
rOrder(5, qgamma, shape = 1, mlen = 10, j = 2)
The odds ratio is a common measure when comparing two groups in terms of an outcome that is either present or absent. As the odds ratio is in general poorly understood, odds ratios are often discussed in terms of risks, relying on the approximation that odds ratio and relative risk are about the same when the outcome is rare. However, the relative risk also depends on the risk of the baseline group, and if the outcome is not rare there can be large differences between the two measures; the odds ratio may then substantially overestimate the relative risk. In fact, the same odds ratio could imply very different relative risks for subgroups of the population with different baseline risks.
The present function transforms a given odds ratio (OR) to the respective relative risk (RR), both for simple odds ratios and for odds ratios resulting from a logistic model.
ORToRelRisk(...)

## S3 method for class 'OddsRatio'
ORToRelRisk(x, ...)

## Default S3 method:
ORToRelRisk(or, p0, ...)
x |
the odds ratios of a logistic model as returned by |
or |
numeric vector, containing odds-ratios. |
p0 |
numeric vector, incidence of the outcome of interest in the nonexposed group ("baseline risk"). |
... |
further arguments, are not used here. |
The function transforms a given odds-ratio (or) to the respective relative risk (rr). It can also be used to transform the limits of confidence intervals.
The formula for converting an odds ratio to a relative risk is RR = OR / (1 - p0 + p0 * OR), where p0 is the baseline risk.
For transformation of odds ratios resulting from a logit model, we use the formula of Zhang and Yu (1998).
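In code the conversion is a one-liner; a sketch (vectorized over or, so confidence limits can be transformed as well):

or_to_rr <- function(or, p0) or / (1 - p0 + p0 * or)   # conversion formula from above
or_to_rr(c(14.1, 7.8, 27.5), p0 = 0.05)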
relative risk.
Matthias Kohl <[email protected]>
Zhang, J. and Yu, K. F. (1998). What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA, 280(19):1690-1691.
Grant, R. L. (2014) Converting an odds ratio to a range of plausible relative risks for better communication of research findings. BMJ 2014;348:f7450 doi: 10.1136/bmj.f7450
(heart <- as.table(matrix(c(11, 2, 4, 6), nrow=2,
                          dimnames = list(Exposure = c("High", "Low"),
                                          Response = c("Yes", "No")))))

RelRisk(heart)     # calculated as (11/15)/(2/8)
OddsRatio(heart)   # calculated as (11/4)/(2/6)

ORToRelRisk(OddsRatio(heart), p0 = 2/8)
# Relative risk = odds ratio / (1 - p0 + (p0 * odds ratio))
# where p0 is the baseline risk

## single OR to RR
ORToRelRisk(14.1, 0.05)

## OR and 95% confidence interval
ORToRelRisk(c(14.1, 7.8, 27.5), 0.05)

## Logistic OR and 95% confidence interval
logisticOR <- rbind(c(14.1, 7.8, 27.5),
                    c(8.7, 5.5, 14.3),
                    c(27.4, 17.2, 45.8),
                    c(4.5, 2.7, 7.8),
                    c(0.25, 0.17, 0.37),
                    c(0.09, 0.05, 0.14))
colnames(logisticOR) <- c("OR", "2.5%", "97.5%")
rownames(logisticOR) <- c("7.4", "4.2", "3.0", "2.0", "0.37", "0.14")
logisticOR

## p0
p0 <- c(0.05, 0.12, 0.32, 0.27, 0.40, 0.40)

## Compute corrected RR
## helper function
ORToRelRisk.mat <- function(or, p0){
  res <- matrix(NA, nrow = nrow(or), ncol = ncol(or))
  for(i in seq_len(nrow(or)))
    res[i,] <- ORToRelRisk(or[i,], p0[i])
  dimnames(res) <- dimnames(or)
  res
}
RR <- ORToRelRisk.mat(logisticOR, p0)
round(RR, 2)

## Results are not completely identical to Zhang and Yu (1998),
## which is probably caused by the fact that the logistic OR values
## provided in the table are rounded and not true values.
Return outliers following Tukey's boxplot and Hampel's median/mad definition.
Outlier(x, method = c("boxplot", "hampel"), value = TRUE, na.rm = FALSE)
x |
a (non-empty) numeric vector of data values. |
method |
the method to be used. So far Tukey's boxplot and Hampel's rule are implemented. |
value |
logical. If |
na.rm |
logical. Should missing values be removed? Defaults to |
Outlier detection is a tricky problem and should be handled with care. Tukey's boxplot rule is implemented here as a rough way of spotting extreme values.
Hampel considers values outside of median +/- 3 * (median absolute deviation) to be outliers.
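Both rules can be written down in a few lines; a sketch (the function's actual handling of ties, NAs and quantile types may differ):

x <- c(rnorm(100), 8, -7)                               # data with two planted extreme values

q <- quantile(x, c(0.25, 0.75))
x[x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x)]    # Tukey's boxplot rule

x[abs(x - median(x)) > 3 * mad(x)]                      # Hampel's rule (scaled MAD)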
the values of x lying outside the whiskers in a boxplot
or the indices of them
Andri Signorell <[email protected]>
Hampel F. R. (1974) The influence curve and its role in robust estimation, Journal of the American Statistical Association, 69, 382-393
Outlier(d.pizza$temperature, na.rm=TRUE)

# it's the same as the result from boxplot
sort(d.pizza$temperature[Outlier(d.pizza$temperature, value=FALSE, na.rm=TRUE)])
b <- boxplot(d.pizza$temperature, plot=FALSE)
sort(b$out)

# nice to find the corresponding rows
d.pizza[Outlier(d.pizza$temperature, value=FALSE, na.rm=TRUE), ]

# compare to Hampel's rule
Outlier(d.pizza$temperature, method="hampel", na.rm=TRUE)

# outliers for each driver
tapply(d.pizza$temperature, d.pizza$driver, Outlier, na.rm=TRUE)

# the same as:
boxplot(temperature ~ driver, d.pizza)$out
Performs a Page test for ordered alternatives using an exact algorithm by Stefan Wellek (1989) with unreplicated blocked data.
PageTest(y, ...)

## Default S3 method:
PageTest(y, groups, blocks, ...)

## S3 method for class 'formula'
PageTest(formula, data, subset, na.action, ...)
y |
either a numeric vector of data values, or a data matrix. |
groups |
a vector giving the group for the corresponding
elements of |
blocks |
a vector giving the block for the corresponding
elements of |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see
|
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when
the data contain |
... |
further arguments to be passed to or from methods. |
PageTest can be used for analyzing unreplicated complete block designs (i.e., there is exactly one observation in y for each combination of levels of groups and blocks) where the normality assumption may be violated.

The null hypothesis is that, apart from an effect of blocks, the location parameter of y is the same in each of the groups.
The implemented alternative is that the location parameter increases monotonically along the groups, i.e. theta_1 <= theta_2 <= ... <= theta_k, where at least one inequality is strict.
If the other direction is required, the order of the groups has to be reversed.
The Page test for ordered alternatives is slightly more powerful than
the Friedman analysis of variance by ranks.
If y is a matrix, groups and blocks are obtained from the column and row indices, respectively. NAs are not allowed in groups or blocks; if y contains NAs, corresponding blocks are removed.
For small values of k (methods) or N (data objects), PageTest will calculate the exact p-values. For larger values (k, N > 15) a normal approximation is returned. Only one of these values is returned.
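The test statistic itself is easy to compute: the observations are ranked within each block, the ranks are summed per group and weighted by the group's position in the hypothesized order. A sketch, assuming the columns of a matrix y are already ordered according to the alternative:

page_L <- function(y) {
  R  <- apply(y, 1, rank)    # rank within each block (row); result is groups x blocks
  Rj <- rowSums(R)           # rank sum per group
  sum(seq_along(Rj) * Rj)    # L = sum over groups j of j * R_j
}
# e.g. page_L(soa.mat) for the Siegel & Castellan data in the examples below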
A list with class "htest"
containing the following components:
statistic |
the L-statistic with names attribute “L”. |
p.value |
the p-value of the test. |
method |
the character string |
data.name |
a character string giving the names of the data. |
Special thanks to Prof. S. Wellek for porting old GAUSS code to R.
Stefan Wellek <[email protected]> (exact p-values), Andri Signorell <[email protected]> (interface) (strongly based on R-Core code)
Page, E. (1963): Ordered hypotheses for multiple treatments: A significance test for linear ranks. Journal of the American Statistical Association, 58, 216-230.
Siegel, S. & Castellan, N. J. Jr. (1988): Nonparametric statistics for the behavioral sciences. Boston, MA: McGraw-Hill.
Wellek, S. (1989): Computing exact p-values in Page's nonparametric test against trend. Biometrie und Informatik in Medizin und Biologie 20, 163-170
# Craig's data from Siegel & Castellan, p 186
soa.mat <- matrix(c(.797,.873,.888,.923,.942,.956,
                    .794,.772,.908,.982,.946,.913,
                    .838,.801,.853,.951,.883,.837,
                    .815,.801,.747,.859,.887,.902), nrow=4, byrow=TRUE)
PageTest(soa.mat)

# Duller, pg. 236
pers <- matrix(c(
  1, 72, 72, 71.5, 69, 70, 69.5, 68, 68, 67, 68,
  2, 83, 81, 81, 82, 82.5, 81, 79, 80.5, 80, 81,
  3, 95, 92, 91.5, 89, 89, 90.5, 89, 89, 88, 88,
  4, 71, 72, 71, 70.5, 70, 71, 71, 70, 69.5, 69,
  5, 79, 79, 78.5, 77, 77.5, 78, 77.5, 76, 76.5, 76,
  6, 80, 78.5, 78, 77, 77.5, 77, 76, 76, 75.5, 75.5
), nrow=6, byrow=TRUE)

colnames(pers) <- c("person", paste("week",1:10))

# Alternative: week10 < week9 < week8 ...
PageTest(pers[, 11:2])

# Sachs, pg. 464
pers <- matrix(c(
  3,2,1,4,
  4,2,3,1,
  4,1,2,3,
  4,2,3,1,
  3,2,1,4,
  4,1,2,3,
  4,3,2,1,
  3,1,2,4,
  3,1,4,2), nrow=9, byrow=TRUE, dimnames=list(1:9, LETTERS[1:4]))

# Alternative: B < C < D < A
PageTest(pers[, c("B","C","D","A")])

# long shape and formula interface
plng <- data.frame(expand.grid(1:9, c("B","C","D","A")),
                   as.vector(pers[, c("B","C","D","A")]))
colnames(plng) <- c("block","group","x")

PageTest(plng$x, plng$group, plng$block)

PageTest(x ~ group | block, data = plng)

score <- matrix(c(
  3,4,6,9,
  4,3,7,8,
  3,4,4,6,
  5,6,8,9,
  4,4,9,9,
  6,7,11,10
), nrow=6, byrow=TRUE)

PageTest(score)
Implements a logic to run pairwise calculations on the columns of a data.frame or a matrix.
PairApply(x, FUN = NULL, ..., symmetric = FALSE)
x |
a list, a data.frame or a matrix with columns to be processed pairwise. |
FUN |
a function to be calculated. It is assumed that the first 2 arguments denote x and y. |
... |
the dots are passed to FUN. |
symmetric |
logical. Does the function yield the same result for FUN(x, y) and FUN(y, x)? |
This code is based on the logic of cor() and extended for asymmetric functions.
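Conceptually the function does something like the following double loop (a simplified sketch that ignores the shortcut for symmetric functions and further dimname handling):

pair_apply_sketch <- function(x, FUN, ...) {
  k <- length(x)
  res <- matrix(NA_real_, k, k, dimnames = list(names(x), names(x)))
  for (i in seq_len(k))
    for (j in seq_len(k))
      res[i, j] <- FUN(x[[i]], x[[j]], ...)   # apply FUN to every pair of columns
  res
}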
a matrix with the results of FUN.
Andri Signorell <[email protected]>
outer, CombPairs, pairwise.table
PairApply(d.diamonds[,c("colour","clarity","cut","polish")], FUN = CramerV,
          symmetric=TRUE)

# user defined functions are ok as well
PairApply(d.diamonds[,c("clarity","cut","polish","symmetry")],
          FUN = function(x,y) wilcox.test(as.numeric(x), as.numeric(y))$p.value,
          symmetric=TRUE)

# asymmetric measure
PairApply(d.diamonds[,c("colour", "clarity", "cut", "polish")],
          FUN = Lambda, direction = "row")

# ... compare to:
Lambda(x=d.diamonds$colour, y=d.diamonds$clarity, direction="row")
Lambda(x=d.diamonds$colour, y=d.diamonds$clarity, direction="column")

# the data.frame
dfrm <- d.diamonds[, c("colour","clarity","cut","polish")]
PercTable <- NULL  # (placeholder removed)
Create a model frame for a formula object, by handling the left hand side the same way the right hand side is handled in model.frame. Especially variables separated by + are interpreted as separate variables.
ParseFormula(formula, data = parent.frame(), drop = TRUE)
formula |
an object of class "formula" (or one that can be coerced to that class): a symbolic description for the variables to be described. |
data |
an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called. |
drop |
if |
This is used by Desc.formula for describing data by groups while remaining flexible for using I(...) constructions, functions or interaction terms.
a list of 3 elements
formula |
the formula which had to be parsed |
lhs |
a list of 3 elements: |
rhs |
a list of 3 elements: |
Andri Signorell <[email protected]>
The functions used to handle formulas: model.frame, terms, formula. Used in: Desc.formula
set.seed(17)
piz <- d.pizza[sample(nrow(d.pizza), 10), c("temperature","price","driver","weekday")]

f1 <- formula(. ~ driver)
f2 <- formula(temperature ~ .)
f3 <- formula(temperature + price ~ .)
f4 <- formula(temperature ~ . - driver)
f5 <- formula(temperature + price ~ driver)
f6 <- formula(temperature + price ~ driver * weekday)
f7 <- formula(I(temperature^2) + sqrt(price) ~ driver + weekday)
f8 <- formula(temperature + price ~ 1)
f9 <- formula(temperature + price ~ driver * weekday - price)

ParseFormula(f1, data=piz)
ParseFormula(f2, data=piz)
ParseFormula(f3, data=piz)
ParseFormula(f4, data=piz)
ParseFormula(f5, data=piz)
ParseFormula(f6, data=piz)
ParseFormula(f7, data=piz)
ParseFormula(f8, data=piz)
A parser for simple SAS dataline command texts. A data.frame
is being built with the columnnames listed in the input section. The data object will be created in the given environment.
ParseSASDatalines(x, env = .GlobalEnv, overwrite = FALSE)
x |
the SAS text |
env |
environment in which the dataset should be created. |
overwrite |
logical. If set to TRUE, the function will silently overwrite a potentially existing object in |
The SAS function DATA
is designed for quickly creating a dataset from scratch. The whole step normally consists out of the DATA
part defining the name of the dataset, an INPUT
line declaring the variables and a DATALINES
command followed by the values.
The default delimiter used to separate the different variables is a space (thus each variable should be one word). The $ after a variable name indicates that the preceding variable contains character values rather than numeric values. Without specific instructions, SAS assumes that variables are numeric. The function will fail if it encounters a character value in the place of an expected numeric value.
Each new row in datalines will create a corresponding unique row in the dataset. Notice that a ; is not needed after every row, rather it is included at the end of the entire data step.
More complex command structures, i.e. other delimiters (dlm), in the INPUT
-section are not (yet) supported.
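The essence of such a parser can be illustrated in base R: take the variable names from the INPUT line and read the block between DATALINES; and the closing ; with read.table(). This is a simplified sketch (using the txt string from the example below and ignoring the $ type markers), not the function's actual implementation:

lines <- trimws(strsplit(txt, "\n")[[1]])
input <- grep("^INPUT", lines, value = TRUE)
vars  <- strsplit(trimws(sub("^INPUT", "", sub(";", "", input))), " +")[[1]]
vars  <- vars[vars != "$"]                          # drop the character-type markers
body  <- lines[(grep("^DATALINES", lines) + 1):length(lines)]
body  <- body[!body %in% c("", ";")]                # keep only the data rows
read.table(text = body, col.names = vars, stringsAsFactors = FALSE)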
a data.frame
Andri Signorell <[email protected]>
txt <- "
DATA asurvey;
INPUT id sex $ age inc r1 r2 r3 ;
DATALINES;
1  F 35 17 7 2 2
17 M 50 14 5 5 3
33 F 45  6 7 2 7
49 M 24 14 7 5 7
65 F 52  9 4 7 7
81 M 44 11 7 7 7
2  F 34 17 6 5 3
18 M 40 14 7 5 2
34 F 47  6 6 5 6
50 M 35 17 5 7 5
;
"

(d.frm <- ParseSASDatalines(txt))
Brings up a tcltk dialog centered on the screen, designed for entering passwords while displaying only ****.
PasswordDlg(option_txt = NULL)
option_txt |
an optional text, if it is defined, there will be a checkbox added to the dialog with the label being set with |
the entered password
the status of the optional checkbox will be returned as attribute:
attr(pw, "option")
Markus Naepflin <[email protected]>
## Not run:
pw <- PasswordDlg()
pw
## End(Not run)
PDF versions of the manual are usually not included as vignettes in R packages. Still this format is convenient for reading and doing full text search.
This function creates the appropriate link to the pdf file on CRAN and opens the pdf manual in a browser window.
PDFManual(package)
package |
name of the package. |
Andri Signorell <[email protected]>
## Not run:
PDFManual(DescTools)
## End(Not run)
Performs the Pearson chi-square test for the composite hypothesis of normality.
PearsonTest(x, n.classes = ceiling(2 * (n^(2/5))), adjust = TRUE)
x |
a numeric vector of data values. Missing values are allowed. |
n.classes |
The number of classes. The default is due to Moore (1986). |
adjust |
logical; if |
The Pearson test statistic is P = sum((C_i - E_i)^2 / E_i), where C_i is the number of counted and E_i the number of expected observations (under the hypothesis) in class i. The classes are built in such a way that they are equiprobable under the hypothesis of normality. The p-value is computed from a chi-square distribution with n.classes - 3 degrees of freedom if adjust is TRUE and from a chi-square distribution with n.classes - 1 degrees of freedom otherwise. In both cases this is not (!) the correct p-value, lying somewhere between the two, see also Moore (1986).
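A hand computation of the statistic illustrates the definition (a sketch for adjust = TRUE, with the parameters estimated from the data):

x <- rnorm(100)
k <- ceiling(2 * length(x)^(2/5))                                 # default number of classes
breaks <- qnorm(seq(0, 1, length.out = k + 1), mean(x), sd(x))    # equiprobable classes
C <- table(cut(x, breaks))                                        # counted observations per class
E <- length(x) / k                                                # expected observations per class
P <- sum((C - E)^2 / E)                                           # Pearson statistic
pchisq(P, df = k - 3, lower.tail = FALSE)                         # p-value with adjusted df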
A list of class htest
, containing the following components:
statistic |
the value of the Pearson chi-square statistic. |
p.value |
the p-value for the test. |
method |
the character string “Pearson chi-square normality test”. |
data.name |
a character string giving the name(s) of the data. |
n.classes |
the number of classes used for the test. |
df |
the degrees of freedom of the chi-square distribution used to compute the p-value. |
The Pearson chi-square test is usually not recommended for testing the composite hypothesis of normality
due to its inferior power properties compared to other tests. It is common practice to compute the p-value
from the chi-square distribution with n.classes
- 3 degrees of freedom, in order to adjust for the
additional estimation of two parameters. (For the simple hypothesis of normality (mean and variance known)
the test statistic is asymptotically chi-square distributed with
n.classes
- 1 degrees of freedom.)
This is, however, not correct as long as the parameters are estimated by mean(x)
and var(x)
(or sd(x)
), as it is usually done, see Moore (1986) for details.
Since the true p-value is somewhere between the two, it is suggested to run PearsonTest
twice, with
adjust = TRUE
(default) and with adjust = FALSE
.
It is also suggested to slightly change the default number of classes, in order
to see the effect on the p-value. Eventually, it is suggested not to rely upon the result of the test.
The function call PearsonTest(x)
essentially produces
the same result as the S-PLUS function call chisq.gof((x-mean(x))/sqrt(var(x)), n.param.est=2)
.
Juergen Gross <[email protected]>
Moore, D.S., (1986) Tests of the chi-squared type. In: D'Agostino, R.B. and Stephens, M.A., eds.: Goodness-of-Fit Techniques. Marcel Dekker, New York.
Thode Jr., H.C., (2002) Testing for Normality. Marcel Dekker, New York. Sec. 5.2
shapiro.test for performing the Shapiro-Wilk test for normality. AndersonDarlingTest, CramerVonMisesTest, LillieTest, ShapiroFranciaTest for performing further tests for normality. qqnorm for producing a normal quantile-quantile plot.
PearsonTest(rnorm(100, mean = 5, sd = 3))
PearsonTest(runif(100, min = 2, max = 4))
PercentRank() takes a vector x and returns the percentile that elements of x correspond to.
PercentRank(x)
x |
a numeric, complex, character or logical vector. |
A numeric vector of the same length as x with names copied from x (unless na.last = NA, when missing values are removed), containing the percent ranks of the corresponding elements.
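A common way to define such percent ranks is to divide the (ties-averaged) ranks by the number of non-missing values; a sketch of that idea (the package's exact treatment of ties and NAs may differ):

percent_rank_sketch <- function(x) rank(x, na.last = "keep") / sum(!is.na(x))
percent_rank_sketch(c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5))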
Andri Signorell <[email protected]>
Rank, rank, factor, order, sort
(r1 <- rank(x1 <- c(3, 1, 4, 15, 92)))

x2 <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5)
names(x2) <- letters[1:11]
(r2 <- rank(x2))   # ties are averaged

PercentRank(x2)
Prints a 2-way contingency table along with percentages, marginal, and conditional distributions. All the frequencies are nested into one single table.
## Default S3 method:
PercTable(x, y = NULL, ...)

## S3 method for class 'table'
PercTable(tab, row.vars = NULL, col.vars = NULL, justify = "right",
          freq = TRUE, rfrq = "100", expected = FALSE, residuals = FALSE,
          stdres = FALSE, margins = NULL, digits = NULL, ...)

## S3 method for class 'formula'
PercTable(formula, data, subset, na.action, ...)

## S3 method for class 'PercTable'
print(x, vsep = NULL, ...)

Margins(tab, ...)
x , y
|
objects which can be interpreted as factors (including character strings).
x and y will be tabulated via |
tab |
a r x c-contingency table |
row.vars |
a vector of row variables (see Details). |
col.vars |
a vector of column variables (see Details). If this is left to |
justify |
either |
freq |
boolean. Should absolute frequencies be included? Defaults to TRUE. |
rfrq |
a string with 3 characters, each of them being 1 or 0. The first position means total percentages, the second means row percentages and the third column percentages. "011" produces a table output with row and column percentages. |
expected |
the expected counts under the null hypothesis. |
residuals |
the Pearson residuals, (observed - expected) / sqrt(expected). |
stdres |
standardized residuals, (observed - expected) / sqrt(V), where V is the residual cell variance (for the case where x is a matrix, n * p * (1 - p) otherwise). |
margins |
a vector, consisting out of 1 and/or 2. Defines the margin sums to be included.
1 stands for row margins, 2 for column margins, c(1,2) for both. Default is |
digits |
integer. With how many digits should the relative frequencies be formatted? Default can be set by |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain NAs. Defaults to |
vsep |
logical, defining if an empty row should be introduced between the table rows. Default is FALSE, if only a table with one single description (either frequencies or percents) should be returned and |
... |
the dots are passed from |
PercTable prints a 2-dimensional table. The absolute and relative frequencies are nested into one flat table by means of ftable. row.vars, resp. col.vars, can be used to define the structure of the table. row.vars can either be the names of the dimensions (included percentages are named "idx") or numbers (1:3, where 1 is the first dimension of the table, 2 the second and 3 the percentages). Use Sort() if you want to have your table sorted by rows.

The style in which numbers are formatted is selected by Fmt() from the DescTools options. Absolute frequencies will use Fmt("abs") and Fmt("per") will do it for the percentages. The options can be changed with Fmt(abs=as.fmt(...)), which is basically a "fmt"-object containing any format information used in Format.

Margins() returns a list containing all the one-dimensional margin tables of an n-dimensional table along the given dimensions. It uses margin.table() for all the dimensions and adds the appropriate percentages.
Returns an object of class "ftable"
.
Andri Signorell <[email protected]>
Agresti, Alan (2007) Introduction to categorical data analysis. NY: John Wiley and Sons, Section 2.4.5
Freq, table, ftable, prop.table, addmargins, DescToolsOptions, Fmt

There are similar functions in package sfsmisc (printTable2) and package vcd (table2d_summary), both lacking some of the flexibility we needed here.
tab <- table(driver=d.pizza$driver, area=d.pizza$area)

PercTable(tab=tab, col.vars=2)

PercTable(tab=tab, col.vars=2, margins=c(1,2))
PercTable(tab=tab, col.vars=2, margins=2)
PercTable(tab=tab, col.vars=2, margins=1)
PercTable(tab=tab, col.vars=2, margins=NULL)

PercTable(tab=tab, col.vars=2, rfrq="000")

# just the percentages without absolute values
PercTable(tab=tab, col.vars=2, rfrq="110", freq=FALSE)

# just the row percentages in percent format (pfmt = TRUE)
PercTable(tab, freq=FALSE, rfrq="010", pfmt=TRUE, digits=1)

# just the expected frequencies and the standard residuals
PercTable(tab=tab, rfrq="000", expected = TRUE, stdres = TRUE)

# rearrange output such that freq are inserted as columns instead of rows
PercTable(tab=tab, col.vars=c(3,2), rfrq="111")

# putting the areas in rows
PercTable(tab=tab, col.vars=c(3,1), rfrq="100", margins=c(1,2))

# formula interface with subset
PercTable(driver ~ area, data=d.pizza, subset=wine_delivered==0)

# sort the table by rows, order first column (Zurich), then third, then row.names (0)
PercTable(tab=Sort(tab, ord=c(1,3,0)))

# reverse the row variables, so that absolute frequencies and percents
# are not nested together
PercTable(tab, row.vars=c(3, 1))

# the vector interface
PercTable(x=d.pizza$driver, y=d.pizza$area)
PercTable(x=d.pizza$driver, y=d.pizza$area, margins=c(1,2), rfrq="000", useNA="ifany")

# one dimensional x falls back to the function Freq()
PercTable(x=d.pizza$driver)

# the margin tables
Margins(Titanic)
Return the set of permutations for a given set of values. The values can be numeric values, characters or factors. CombN computes the number of combinations with and without replacement and order, whereas CombSet returns the value sets.
Permn(x, sort = FALSE)

CombN(n, m, repl = FALSE, ord = FALSE)

CombSet(x, m, repl = FALSE, ord = FALSE, as.list = FALSE)
x |
a vector of numeric values or characters. Characters need not be unique. |
n |
number of elements from which to choose. |
m |
number of elements to choose. For |
repl |
logical. Should repetition of the same element be allowed? Defaults to FALSE |
ord |
logical. Does the order matter? Default is FALSE. |
sort |
logical, defining if the result set should be sorted. Default is FALSE. |
as.list |
logical, defining if the results should be returned in a flat list, say every sample is a single element of the resulting list. Default is FALSE. |
The vector x need not contain unique values. The permutations will automatically be filtered for unique sets if the same element is given twice or more.
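The counts returned by CombN correspond to the four classical counting formulas (n available elements, m chosen); a short sketch of the arithmetic:

n <- 4; m <- 2

choose(n, m)                       # no replacement, order irrelevant
factorial(n) / factorial(n - m)    # no replacement, order matters
n^m                                # replacement, order matters
choose(n + m - 1, m)               # replacement, order irrelevant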
a matrix with all possible permutations or combinations of the values in x for Permn and CombSet; if m contains more than one element, the result will be a list of matrices, or a flat list if as.list is set to TRUE; an integer value for CombN.
Friederich Leisch <[email protected]>
Andri Signorell <[email protected]> (CombSet, CombN)
combn, choose, factorial, CombPairs

vignette("Combinatorics")
Permn(letters[2:5])
Permn(2:5)

# containing the same element more than once
Permn(c("a", "b", "c", "a"))

# only combinations of 2, but in every possible order
x <- letters[1:4]
n <- length(x)
m <- 2

# the samples
CombSet(x, m, repl=TRUE, ord=FALSE)
CombSet(x, m, repl=TRUE, ord=TRUE)
CombSet(x, m, repl=FALSE, ord=TRUE)
CombSet(x, m, repl=FALSE, ord=FALSE)

# the number of the samples
CombN(n, m, repl=TRUE, ord=FALSE)
CombN(n, m, repl=TRUE, ord=TRUE)
CombN(n, m, repl=FALSE, ord=TRUE)
CombN(n, m, repl=FALSE, ord=FALSE)

# build all subsets of length 1, 3 and 5 and return a flat list
x <- letters[1:5]
CombSet(x=x, m=c(1, 3, 5), as.list=TRUE)
Formulating the results of a comparison of means is quite common. This function assembles a descriptive text about the results of a t-test, describing group sizes, means, p-values and confidence intervals.
Phrase(x, g, glabels = NULL, xname = NULL, unit = NULL, lang = "engl", na.rm = FALSE)
x |
a (non-empty) numeric vector of data values. |
g |
a vector or factor object giving the group for the corresponding elements of x. The number of levels must equal 2. |
glabels |
the labels of the two groups, if left to NULL, the levels will be used. |
xname |
the name of the variable to be used in the text. |
unit |
an optional unit for be appended to the numeric results. |
lang |
the language to be used. Only english (default) and german implemented (so far). |
na.rm |
logical, should |
a text
Andri Signorell <[email protected]>
data("cats", package = "MASS") cat(Phrase(cats$Bwt, cats$Sex, xname="weight", unit="grams", glabels=c("female", "male"))) # oder auf deutsch cat(Phrase(cats$Bwt, cats$Sex, xname="Geburtsgewicht", glabels=c("weiblich", "maennlich"), lang="german"))
data("cats", package = "MASS") cat(Phrase(cats$Bwt, cats$Sex, xname="weight", unit="grams", glabels=c("female", "male"))) # oder auf deutsch cat(Phrase(cats$Bwt, cats$Sex, xname="Geburtsgewicht", glabels=c("weiblich", "maennlich"), lang="german"))
Combined plot of a time series and its autocorrelation and partial autocorrelation.
PlotACF(series, lag.max = 10 * log10(length(series)), main = NULL, cex = NULL, ...)

PlotGACF(series, lag.max = 10 * log10(length(series)), type = "cor", ylab = NULL, ...)
series |
univariate time series. |
lag.max |
integer. Defines the number of lags to be displayed. The default is 10 * log10(length(series)). |
main |
an overall title for the plot |
cex |
numerical value giving the amount by which plotting text and symbols should be magnified relative to the default. |
type |
character string giving the type of acf to be computed. Allowed values are |
ylab |
a title for the y axis: see |
... |
the dots are passed to the plot command. |
PlotACF plots a combination of the time series and its autocorrelation and partial autocorrelation. PlotGACF is used as a subfunction to produce the acf and pacf plots.
Markus Huerzeler (ETH Zurich), some minor modifications Andri Signorell <[email protected]>
PlotACF(AirPassengers)
Produce a stacked area plot, or add polygons to an existing plot.
## Default S3 method:
PlotArea(x, y = NULL, prop = FALSE, add = FALSE, xlab = NULL, ylab = NULL,
         col = NULL, frame.plot = FALSE, ...)

## S3 method for class 'formula'
PlotArea(formula, data, subset, na.action, ...)
x |
numeric vector of x values, or if |
y |
numeric vector of y values, or a matrix containing y values in columns. |
prop |
whether data should be plotted as proportions, so stacked areas equal 1. |
add |
whether polygons should be added to an existing plot. |
xlab |
label for x axis. |
ylab |
label for y axis. |
col |
fill color of polygon(s). The default is a vector of gray colors. |
frame.plot |
a logical indicating whether a box should be drawn around the plot. |
formula |
a |
data |
a data frame (or list) from which the variables in
|
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain |
... |
further arguments are passed to |
Matrix of cumulative sums that was used for plotting.
Arni Magnusson <[email protected]>
# PlotArea with stacked areas
tab <- table(d.pizza$date, d.pizza$driver)

PlotArea(x=as.Date(rownames(tab)), y=tab, xaxt="n", xlab="Date",
         ylab="Pizzas delivered")

# add x-axis and some text labels
xrng <- pretty(range(as.Date(rownames(tab))))
axis(side=1, at=xrng, labels=xrng)
text(x=min(d.pizza$date + .5, na.rm=TRUE), y=cumsum(tab[2,])-2.5,
     label=levels(d.pizza$driver), adj=c(0,0.5),
     col=TextContrastColor(gray.colors(7)))

# formula
PlotArea(Armed.Forces~Year, data=longley)
PlotArea(cbind(Armed.Forces,Unemployed)~Year, data=longley)

# add=TRUE
plot(1940:1970, 500*runif(31), ylim=c(0,500))
PlotArea(Armed.Forces~Year, data=longley, add=TRUE)

# matrix
PlotArea(WorldPhones)
PlotArea(WorldPhones, prop=TRUE, col=rainbow(10))

# table
PlotArea(table(d.pizza$weekday))
PlotArea(table(d.pizza$weekday, d.pizza$driver))

# ts/mts
PlotArea(austres)
PlotArea(Seatbelts[,c("drivers","front","rear")],
         ylab="Killed or seriously injured")
abline(v=1983+1/12, lty=3)
PlotBag()
creates a two-dimensional boxplot called a "bagplot", based on two numerical variables x and y. plot.bagplot()
is the plotting routine for a bagplot object. compute.bagplot()
contains the computation logic for the object.
PlotBag(x, y, factor = 3, na.rm = FALSE, approx.limit = 300,
        show.outlier = TRUE, show.whiskers = TRUE,
        show.looppoints = TRUE, show.bagpoints = TRUE,
        show.loophull = TRUE, show.baghull = TRUE,
        create.plot = TRUE, add = FALSE, pch = 16, cex = 0.4,
        dkmethod = 2, precision = 1, verbose = FALSE, debug.plots = "no",
        col.loophull = "#aaccff", col.looppoints = "#3355ff",
        col.baghull = "#7799ff", col.bagpoints = "#000088",
        transparency = FALSE, ...)

PlotBagPairs(dm, trim = 0.0, main, numeric.only = TRUE,
             factor = 3, approx.limit = 300, pch = 16, cex = 0.8,
             precision = 1,
             col.loophull = "#aaccff", col.looppoints = "#3355ff",
             col.baghull = "#7799ff", col.bagpoints = "#000088", ...)

compute.bagplot(x, y, factor = 3, na.rm = FALSE, approx.limit = 300,
                dkmethod = 2, precision = 1, verbose = FALSE, debug.plots = "no")

## S3 method for class 'bagplot'
plot(x, show.outlier = TRUE, show.whiskers = TRUE,
     show.looppoints = TRUE, show.bagpoints = TRUE,
     show.loophull = TRUE, show.baghull = TRUE,
     add = FALSE, pch = 16, cex = .4, verbose = FALSE,
     col.loophull = "#aaccff", col.looppoints = "#3355ff",
     col.baghull = "#7799ff", col.bagpoints = "#000088",
     transparency = FALSE, ...)
x |
x values of a data set;
in |
y |
y values of the data set |
factor |
factor defining the loop |
na.rm |
if |
approx.limit |
if the number of data points exceeds
|
show.outlier |
if |
show.whiskers |
if |
show.looppoints |
if |
show.bagpoints |
if |
show.loophull |
if |
show.baghull |
if |
create.plot |
if |
add |
if |
pch |
sets the plotting character |
cex |
sets the character size |
dkmethod |
1 or 2, there are two methods of approximating the bag; method 1 is very rough (only based on observations |
precision |
precision of approximation, default: 1 |
verbose |
automatic commenting of calculations |
debug.plots |
if |
col.loophull |
color of loop hull |
col.looppoints |
color of the points of the loop |
col.baghull |
color of bag hull |
col.bagpoints |
color of the points of the bag |
transparency |
see section details |
dm |
x |
trim |
x |
main |
x |
numeric.only |
x |
... |
additional graphical parameters |
A bagplot is a bivariate generalization of the well-known boxplot. It has been proposed by Rousseeuw, Ruts, and Tukey. In the bivariate case the box of the boxplot changes to a convex polygon, the bag of the bagplot, which contains 50 percent of all points. The fence separates points within the fence from points outside. It is computed by increasing the bag. The loop is defined as the convex hull containing all points inside the fence. If all points are on a straight line you get a classical boxplot.
PlotBag()
plots bagplots that are very similar
to the one described in Rousseeuw et al.
Remarks:
The two-dimensional median is approximated.
For large data sets the error will be very small.
On the other hand it is not very wise to make a (graphical)
summary of e.g. 10 bivariate data points.
In case you want to plot multiple (overlapping) bagplots, you may want plots that are semi-transparent. For this you can use the transparency flag. If transparency==TRUE the alpha layer is set to '99' (hex). This causes the bagplots to appear semi-transparent, but ONLY if the output device is PDF and opened using pdf(file="filename.pdf", version="1.4"). For this reason, the default is transparency==FALSE. This feature as well as the arguments to specify different colors has been proposed by Wouter Meuleman.
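Below is a minimal sketch of the semi-transparent overlay described above. The data and the file name are made up; note that the alpha channel only shows up on a PDF device opened with version = "1.4".

pdf(file = "bagplots.pdf", version = "1.4")   # file name is arbitrary
set.seed(1)
x1 <- cbind(rnorm(80), rnorm(80))
x2 <- cbind(rnorm(80, mean = 1.5), rnorm(80, mean = 1))
# first bagplot with the default colours ...
PlotBag(x1, transparency = TRUE)
# ... second one added on top with different colours, the overlap stays visible
PlotBag(x2, transparency = TRUE, add = TRUE,
        col.loophull = "#ffccaa", col.baghull = "#ff9977",
        col.looppoints = "#cc6600", col.bagpoints = "#884400")
dev.off()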
compute.bagplot returns an object of class bagplot that could be plotted by plot.bagplot().
An object of the bagplot class is a list with the following
elements: center
is a two dimensional vector with
the coordinates of the center. hull.center
is a
two column matrix, the rows are the coordinates of the
corners of the center region. hull.bag
and
hull.loop
contain the coordinates of the hull of the bag
and the hull of the loop. pxy.bag
shows you the
coordinates of the points of the bag. pxy.outer
is
the two column matrix of the points that are within the
fence. pxy.outlier
represent the outliers. The vector
hdepths
shows the depths of data points. is.one.dim
is TRUE
if the data set is (nearly) one dimensional.
The dimensionality is decided by analysing the result of prcomp
which is stored in the element prdata
. xy
shows you
the data that are used for the bagplot. In the case of very large
data sets subsets of the data are used for constructing the
bagplot. A data set is very large if there are more data points
than approx.limit
. xydata
are the input data structured
in a two column matrix.
Version of bagplot: 10/2012
Hans Peter Wolf <[email protected]>
P. J. Rousseeuw, I. Ruts, J. W. Tukey (1999): The bagplot: a bivariate boxplot, The American Statistician, vol. 53, no. 4, 382–387
# example: 100 random points and one outlier dat <- cbind(rnorm(100) + 100, rnorm(100) + 300) dat <- rbind(dat, c(105,295)) PlotBag(dat,factor=2.5,create.plot=TRUE,approx.limit=300, show.outlier=TRUE,show.looppoints=TRUE, show.bagpoints=TRUE,dkmethod=2, show.whiskers=TRUE,show.loophull=TRUE, show.baghull=TRUE,verbose=FALSE) # example of Rousseeuw et al., see R-package rpart cardata <- structure(as.integer( c(2560,2345,1845,2260,2440, 2285, 2275, 2350, 2295, 1900, 2390, 2075, 2330, 3320, 2885, 3310, 2695, 2170, 2710, 2775, 2840, 2485, 2670, 2640, 2655, 3065, 2750, 2920, 2780, 2745, 3110, 2920, 2645, 2575, 2935, 2920, 2985, 3265, 2880, 2975, 3450, 3145, 3190, 3610, 2885, 3480, 3200, 2765, 3220, 3480, 3325, 3855, 3850, 3195, 3735, 3665, 3735, 3415, 3185, 3690, 97, 114, 81, 91, 113, 97, 97, 98, 109, 73, 97, 89, 109, 305, 153, 302, 133, 97, 125, 146, 107, 109, 121, 151, 133, 181, 141, 132, 133, 122, 181, 146, 151, 116, 135, 122, 141, 163, 151, 153, 202, 180, 182, 232, 143, 180, 180, 151, 189, 180, 231, 305, 302, 151, 202, 182, 181, 143, 146, 146)), .Dim = as.integer(c(60, 2)), .Dimnames = list(NULL, c("Weight", "Disp."))) PlotBag(cardata,factor=3,show.baghull=TRUE, show.loophull=TRUE,precision=1, dkmethod=2) title("car data Chambers/Hastie 1992") # points of y=x*x PlotBag(x=1:30,y=(1:30)^2,verbose=FALSE,dkmethod=2) # one dimensional subspace PlotBag(x=1:50,y=1:50) # pairwise bagplots par(las=1) PlotBagPairs(swiss[, 1:2], main="Swiss Fertility and Socioeconomic Indicators (1888) Data")
Draw a bubble plot, defined by a pair of coordinates x, y to place the bubbles, an area
definition determining the bubble sizes and a color vector setting the color of the bubbles.
The justification for defining a new function instead of just using plot(symbols(...))
is the automated calculation of the axis limits, ensuring that
all bubbles will be fully visible.
PlotBubble(x, ...)

## Default S3 method:
PlotBubble(x, y, area, col = NA, cex = 1, border = par("fg"),
           xlim = NULL, ylim = NULL, na.rm = FALSE, ...)

## S3 method for class 'formula'
PlotBubble(formula, data = parent.frame(), ..., subset,
           ylab = varnames[response])
x , y
|
the x and y co-ordinates for the centres of the bubbles. They can be specified in any way which is accepted by |
area |
a vector giving the area of the bubbles. |
col |
colors for the bubbles, passed to |
cex |
extension factor for the area. |
border |
the border color for the bubbles. The default means |
xlim , ylim
|
axes limits. |
na.rm |
logical, should NAs be omitted? Defaults to FALSE. |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see |
subset |
an optional vector specifying a subset of observations to be used. |
ylab |
the y-label for the plot used in the formula interface. |
... |
the dots are passed to the |
Argument inches
controls the sizes of the symbols. If TRUE
(the default), the symbols are scaled so that the largest dimension of any symbol is one inch. If a positive number is given the symbols are scaled to make largest dimension this size in inches (so TRUE
and 1
are equivalent). If inches
is FALSE
, the units are taken to be those of the appropriate axes.
This behaviour is the same as in symbols
.
A legend can be added with BubbleLegend
.
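A short sketch of the inches behaviour described above, with made-up data (inches is not a formal argument of PlotBubble itself, it travels through the dots as described):

x <- 1:5
y <- c(2, 4, 3, 5, 1)
a <- c(10, 40, 90, 160, 250)    # bubble areas in arbitrary units

# default (inches = TRUE): the largest bubble is scaled to one inch
PlotBubble(x, y, area = a, col = SetAlpha("steelblue", 0.5))

# cap the largest bubble at half an inch instead
PlotBubble(x, y, area = a, col = SetAlpha("steelblue", 0.5), inches = 0.5)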
Andri Signorell <[email protected]>
BubbleLegend
, symbols
, sunflowerplot
PlotBubble(latitude ~ longitude, area=(smoky+1)*2e8, col=SetAlpha(1, 0.5), data=d.whisky)

cols <- c("olivedrab1","orange","green","mediumturquoise","mediumorchid2","firebrick1")

PlotBubble(x = state.x77[,"Income"], y = state.x77[,"Life Exp"], cex=.00004,
           area = state.x77[,"Population"], col = cols[state.region],
           border="grey50", panel.first=grid(),
           xlab="Income", ylab="Life Exp.", las=1)

BubbleLegend(x = "topright", area = c(20000, 10000, 1000), cex=.00004, frame=NA,
             cols=cols[1:3], labels = c(20000, 10000, 1000), cex.names=0.7)

legend(x="bottomright", fill=cols[1:4], legend=levels(state.region))
Plot a candlestick chart. This is used primarily to describe price movements of a security, derivative, or currency over time. Candlestick charts are a visual aid for decision making in stock, foreign exchange, commodity, and option trading.
PlotCandlestick(x, y, vol = NA, xlim = NULL, ylim = NULL, col = c("springgreen4","firebrick"), border = NA, args.bar = NULL, args.grid = NULL, ...)
x |
a numeric vector for the x-values. Usually a date. |
y |
the y-values in a matrix (or a data.frame that can be coerced to a matrix) with 4 columns, where the first column contains the open price, the second the high, the third the low and the fourth the close price of daily stock prices. |
vol |
the volume, if it should be included in the plot as separate part. |
xlim |
the x limits (x1, x2) of the plot. The default value, |
ylim |
the y limits of the plot. |
col |
color for the body. To better highlight price movements, modern candlestick charts often replace the black or white of the candlestick body with colors such as red for a lower closing and blue or green for a higher closing. |
border |
the border color of the rectangles. Default is NA, meaning no border will be plotted. |
args.grid |
the arguments of a potential grid. Default is |
args.bar |
optional additional arguments for the volume barplot. |
... |
the dots are passed to |
Candlesticks are usually composed of the body (black or white), and an upper and a lower shadow (wick): the area between the open and the close is called the real body, price excursions above and below the real body are called shadows. The wick illustrates the highest and lowest traded prices of a security during the time interval represented. The body illustrates the opening and closing trades. If the security closed higher than it opened, the body is white or unfilled, with the opening price at the bottom of the body and the closing price at the top. If the security closed lower than it opened, the body is black, with the opening price at the top and the closing price at the bottom. A candlestick need not have either a body or a wick.
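A small made-up example; following the default col = c("springgreen4", "firebrick"), the first colour is assumed to mark days closing higher and the second days closing lower:

d <- as.Date("2020-01-06") + 0:4
ohlc <- cbind(open  = c(10.0, 10.4, 10.2, 10.6, 10.3),
              high  = c(10.6, 10.7, 10.8, 10.9, 10.5),
              low   = c( 9.8, 10.1, 10.0, 10.2,  9.9),
              close = c(10.4, 10.2, 10.6, 10.3, 10.1))

# bodies span open..close, wicks span low..high
PlotCandlestick(x = d, y = ohlc, col = c("forestgreen", "firebrick"),
                border = NA, las = 1, ylab = "price")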
Andri Signorell <[email protected]>
nov <- rbind( "2013-05-28"= c(70.99,71.82,70.49,71.49), "2013-05-29"= c(71.13,71.90,70.81,71.57), "2013-05-30"= c(71.25,71.53,70.90,71.01), "2013-05-31"= c(70.86,70.92,70.30,70.30), "2013-06-03"= c(70.56,70.89,70.05,70.74), "2013-06-04"= c(70.37,71.11,69.67,69.90), "2013-06-05"= c(69.76,69.76,68.92,68.99), "2013-06-06"= c(69.13,70.02,68.56,70.02), "2013-06-07"= c(70.45,70.52,69.51,70.20), "2013-06-10"= c(70.53,70.75,70.05,70.20), "2013-06-11"= c(69.36,69.66,69.01,69.17), "2013-06-12"= c(69.65,70.03,68.85,69.21), "2013-06-13"= c(69.21,70.18,69.13,70.10), "2013-06-14"= c(70.17,70.48,69.30,69.58), "2013-06-17"= c(70.14,70.96,69.98,70.44), "2013-06-18"= c(70.55,71.97,70.55,71.49), "2013-06-19"= c(71.33,72.00,70.89,70.97), "2013-06-20"= c(70.04,70.06,68.40,68.55), "2013-06-21"= c(69.15,69.27,67.68,68.21) ) colnames(nov) <- c("open","high","low","close") PlotCandlestick(x=as.Date(rownames(nov)), y=nov, border=NA, las=1, ylab="") # include some volume information v <- c(213,108,310,762,70,46,411,652,887,704,289,579,934,619,860,35,215,211,8) PlotCandlestick(x=as.Date(rownames(nov)), y=nov, vol=v, border=NA, las=1, ylab="")
A cash flow plot is used in finance to graphically depict the timing of cash flows as well as their nature as either inflows or outflows. An "up" arrow represents money received and a "down" arrow money paid out.
PlotCashFlow(x, y, xlim = NULL, labels = y, mar = NULL, cex.per = par("cex"), cex.tck = par("cex") * 0.8, cex.cash = par("cex"))
x |
time period of the cashflows (in and out) |
y |
amount of the cashflows |
xlim |
range of the x-axis, defaults to |
labels |
the labels of the cashflows will be printed outside the arrows. |
mar |
a vector with 4 elements, defining the margins for the plot |
cex.per |
the character extension for the period labels |
cex.tck |
character extension for the tick labels, typically years |
cex.cash |
the character extension for the labels of the cashflows |
Andri Signorell <[email protected]>
PlotCashFlow(x=c(6:9, 13:15), y=-c(rep(40, 4), rep(50,3)),
             xlim=c(6,17), labels=c(rep(40, 4), rep(50,3)))

PlotCashFlow(x=c(6,8,9,12,17), y=c(10,30,40,50,70))
This visualisation scheme represents the unidirectional relationship between the rows and the columns of a contingency table.
PlotCirc(tab, acol = rainbow(sum(dim(tab))), aborder = "darkgrey", rcol = SetAlpha(acol[1:nrow(tab)], 0.5), rborder = "darkgrey", gap = 5, main = "", labels = NULL, cex.lab = 1.0, las = 1, adj = NULL, dist = 2)
tab |
a table to be visualised. |
acol |
the colors for the peripheral annuli. |
aborder |
the border colors for the peripheral annuli. |
rcol |
the colors for the ribbons. |
rborder |
the border colors for the ribbons. |
gap |
the gap between the entities in degrees. |
main |
the main title, defaults to "". |
labels |
the labels. Defaults to the column names and rownames of the table. |
las |
alignment of the labels, 1 means horizontal, 2 radial and 3 vertical. |
adj |
adjustments for the labels. (Left: 0, Right: 1, Mid: 0.5) |
dist |
gives the distance of the labels from the outer circle. Default is 2. |
cex.lab |
the character extension for the labels. |
The visual scheme of representing relationships can be applied to a table, given the observation that a table cell is a relationship (with a value) between a row and column. By representing the row and columns as segments along the circle, the information in the corresponding cell can be encoded as a link between the segments. In general, the cell represents a unidirectional relationship (e.g. row->column) - in this relationship the role of the segments is not interchangeable (e.g. (row,col) and (col,row) are different cells). To identify the role of the segment, as a row or column, the ribbon is made to terminate at the row segment but slightly away from the column segment. In this way, for a given ribbon, it is easy to identify which segment is the row and which is the column.
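A minimal sketch of this row -> column reading, using a small made-up table:

# rows are the sources, columns the targets of the relationship
m <- matrix(c(30,  5,
               2, 20),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("c1", "c2")))

# each ribbon ends at its row segment and stops slightly short of its column segment;
# the calculated label points are returned invisibly
pts <- PlotCirc(m)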
the calculated points for the labels, which can be used to place user-defined labels.
Andri Signorell <[email protected]>
Inspired by https://circos.ca/presentations/articles/vis_tables1/
tab <- matrix(c(2,5,8,3,10,12,5,7,15), nrow=3, byrow=FALSE)
dimnames(tab) <- list(c("A","B","C"), c("D","E","F"))
tab

PlotCirc(tab,
         acol = c("dodgerblue","seagreen2","limegreen","olivedrab2","goldenrod2","tomato2"),
         rcol = SetAlpha(c("red","orange","olivedrab1"), 0.5))

tab <- table(d.pizza$weekday, d.pizza$operator)
par(mfrow=c(1,2))
PlotCirc(tab, main="weekday ~ operator")
PlotCirc(t(tab), main="operator ~ weekday")
Plot conditional densities by group. For describing how the conditional distribution of a categorical variable y changes over a numerical variable x we have the function cdplot. But if we want to compare such conditional densities across multiple groups, considerably more work is required. PlotConDens makes it easy to supply a grouping variable.
PlotConDens(formula, data, col = NULL, lwd = 2, lty = 1, xlim = NULL, rev = TRUE, args.dens = NULL, ...)
formula |
a |
data |
a data frame containing values for any variables in the formula. By default the environment where |
col |
a vector of colors to be used to plot the lines. If too short, the values are recycled. |
lwd |
a vector of linewidths to be used to plot the lines. If too short, the values are recycled. |
lty |
a vector of linetypes to be used to plot the lines. If too short, the values are recycled. |
xlim |
the range for the x axis. |
rev |
logical, should the values of the response variable be reversed? Default is |
args.dens |
additional arguments for the density curves. |
... |
the dots are passed on to |
Especially when modelling binary response variables, we might want to know how the binary variable behaves along some numeric predictors.
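A short sketch with a different data set (mtcars, chosen arbitrarily): the binary transmission indicator is plotted along horsepower, with one pair of conditional densities per cylinder group, something that would otherwise require one cdplot() call per group:

d <- mtcars
d$am  <- factor(d$am, labels = c("automatic", "manual"))
d$cyl <- factor(d$cyl)

PlotConDens(am ~ hp | cyl, data = d,
            col = c("steelblue", "forestgreen", "firebrick"),
            panel.first = quote(grid()))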
the functions for the curves
Andri Signorell <[email protected]>
cdplot
, spineplot
, density
, PlotMultiDens
data(Pima.tr2, package="MASS")

PlotConDens(type ~ age | I((npreg > 0)*1L), data=Pima.tr2,
            col=c(DescTools::hblue, DescTools::hred), rev=FALSE,
            panel.first=quote(grid()))
This function produces a graphical display of a correlation matrix. The cells of the matrix can be shaded or colored to show the correlation value.
PlotCorr(x, cols = colorRampPalette(c(Pal()[2], "white", Pal()[1]), space = "rgb")(20), breaks = seq(-1, 1, length = length(cols) + 1), border = "grey", lwd = 1, args.colorlegend = NULL, xaxt = par("xaxt"), yaxt = par("yaxt"), cex.axis = 0.8, las = 2, mar = c(3, 8, 8, 8), mincor = 0, main = "", clust = FALSE, ...)
x |
x is a correlation matrix to be visualized. |
cols |
the colors for shading the matrix. Uses the package's option |
breaks |
a set of breakpoints for the colours: must give one more breakpoint than colour. These are passed to |
border |
color for borders. The default is |
lwd |
line width for borders. Default is 1. |
args.colorlegend |
list of arguments for the |
xaxt |
parameter defining whether to draw an x-axis, defaults to |
yaxt |
parameter defining whether to draw a y-axis, defaults to |
cex.axis |
character extension for the axis labels. |
las |
the style of axis labels. |
mar |
sets the margins, defaults to mar = c(3, 8, 8, 8) as we need a bit more room on the right. |
mincor |
numeric value between 0 and 1, defining the smallest correlation that is to be displayed. If this is >0 then all correlations with a lower value are suppressed. |
main |
character, the main title. |
clust |
logical. If set to |
... |
the dots are passed to the function |
no values returned.
Andri Signorell <[email protected]>
image
, ColorLegend
, corrgram()
, PlotWeb()
m <- cor(d.pizza[,sapply(d.pizza, IsNumeric, na.rm=TRUE)], use="pairwise.complete.obs") PlotCorr(m, cols=colorRampPalette(c("red", "black", "green"), space = "rgb")(20)) PlotCorr(m, cols=colorRampPalette(c("red", "black", "green"), space = "rgb")(20), args.colorlegend=NA) m <- PairApply(d.diamonds[, sapply(d.diamonds, is.factor)], CramerV, symmetric=TRUE) PlotCorr(m, cols = colorRampPalette(c("white", "steelblue"), space = "rgb")(20), breaks=seq(0, 1, length=21), border="black", args.colorlegend = list(labels=sprintf("%.1f", seq(0, 1, length = 11)), frame=TRUE) ) title(main="Cramer's V", line=2) text(x=rep(1:ncol(m),ncol(m)), y=rep(1:ncol(m),each=ncol(m)), label=sprintf("%0.2f", m[,ncol(m):1]), cex=0.8, xpd=TRUE) # Spearman correlation on ordinal factors csp <- cor(data.frame(lapply(d.diamonds[,c("carat", "clarity", "cut", "polish", "symmetry", "price")], as.numeric)), method="spearman") PlotCorr(csp) m <- cor(mtcars) PlotCorr(m, col=Pal("RedWhiteBlue1", 100), border="grey", args.colorlegend=list(labels=Format(seq(-1,1,.25), digits=2), frame="grey")) # display only correlation with a value > 0.7 PlotCorr(m, mincor = 0.7) x <- matrix(rep(1:ncol(m),each=ncol(m)), ncol=ncol(m)) y <- matrix(rep(ncol(m):1,ncol(m)), ncol=ncol(m)) txt <- Format(m, digits=3, ldigits=0) idx <- upper.tri(matrix(x, ncol=ncol(m)), diag=FALSE) # place the text on the upper triagonal matrix text(x=x[idx], y=y[idx], label=txt[idx], cex=0.8, xpd=TRUE) # or let's get rid of all non significant correlations p <- PairApply(mtcars, function(x, y) cor.test(x, y)$p.value, symmetric=TRUE) # or somewhat more complex with outer p0 <- outer(1:ncol(m), 1:ncol(m), function(a, b) mapply( function(x, y) cor.test(mtcars[, x], mtcars[, y])$p.value, a, b)) # ok, got all the p-values, now replace > 0.05 with NAs m[p > 0.05] <- NA PlotCorr(m) # the text n <- ncol(m) text(x=rep(seq(n), times=n), y=rep(rev(seq(n)), rep.int(n, n)), labels=Format(m, digits=2, na.form=""), cex=0.8, xpd=TRUE) # the text could also be set with outer, but this function returns an error, # based on the fact that text() does not return some kind of result # outer(X = 1:nrow(m), Y = ncol(m):1, # FUN = "text", labels = Format(m, digits=2, na.form = ""), # cex=0.8, xpd=TRUE) # put similiar correlations together PlotCorr(m, clust=TRUE) # same as idx <- order.dendrogram(as.dendrogram( hclust(dist(m), method = "mcquitty") )) PlotCorr(m[idx, idx]) # plot only upper triangular matrix and move legend to bottom m <- cor(mtcars) m[lower.tri(m, diag=TRUE)] <- NA p <- PairApply(mtcars, function(x, y) cor.test(x, y)$p.value, symmetric=TRUE) m[p > 0.05] <- NA PlotCorr(m, mar=c(8,8,8,8), yaxt="n", args.colorlegend = list(x="bottom", inset=-.15, horiz=TRUE, height=abs(LineToUser(line = 2.5, side = 1)), width=ncol(m))) mtext(text = rev(rownames(m)), side = 4, at=1:ncol(m), las=1, line = -5, cex=0.8) text(1:ncol(m), ncol(m):1, colnames(m), xpd=NA, cex=0.8, font=2) n <- ncol(m) text(x=rep(seq(n), times=n), y=rep(rev(seq(n)), rep.int(n, n)), labels=Format(t(m), digits=2, na.form=""), cex=0.8, xpd=TRUE)
Draw a Cleveland dot plot. This is an extended version of dotchart
with an added option for error bars, an add
argument and several more options. PlotDotCI()
is a small helper function to facilitate CI plots of several models.
PlotDot(x, labels = NULL, groups = NULL, gdata = NULL,
        cex = par("cex"), pch = 21, gpch = 21,
        bg = par("bg"), color = par("fg"), gcolor = par("fg"),
        lcolor = "gray", lblcolor = par("fg"),
        xlim = NULL, main = NULL, xlab = NULL, ylab = NULL,
        xaxt = NULL, yaxt = NULL, add = FALSE, args.errbars = NULL,
        cex.axis = par("cex.axis"), cex.pch = 1.2, cex.gpch = 1.2,
        gshift = 2, automar = TRUE, ...)

PlotDotCI(..., grp = 1,
          cex = par("cex"), pch = 21, gpch = 21,
          bg = par("bg"), color = par("fg"), gcolor = par("fg"),
          lcolor = "gray", lblcolor = par("fg"),
          xlim = NULL, main = NULL, xlab = NULL, ylab = NULL,
          xaxt = NULL, yaxt = NULL,
          cex.axis = par("cex.axis"), cex.pch = 1.2, cex.gpch = 1.2,
          gshift = 2, automar = TRUE)
x |
either a vector or matrix of numeric values ( |
labels |
a vector of labels for each point.
For vectors the default is to use |
groups |
an optional factor indicating how the elements of
|
gdata |
data values for the groups. This is typically a summary such as the median or mean of each group. |
cex |
the character size to be used. Setting |
pch |
the plotting character or symbol to be used. Default is 21. |
gpch |
the plotting character or symbol to be used for group values. |
bg |
the background color of plotting characters or symbols to be
used; use |
color |
the color(s) to be used for points and labels. |
gcolor |
the single color to be used for group labels and values. |
lcolor |
the color(s) to be used for the horizontal lines. |
lblcolor |
the color(s) to be used for labels. |
xlim |
horizontal range for the plot, see
|
main |
overall title for the plot, see |
xlab , ylab
|
axis annotations as in |
xaxt |
a character which specifies the x axis type. Specifying |
yaxt |
a character which specifies the y axis type. Specifying |
add |
logical specifying if bars should be added to an already existing plot; defaults to |
args.errbars |
optional arguments for adding error bars. All arguments for |
cex.axis |
The magnification to be used for axis annotation relative to the current setting of cex. |
cex.pch |
The magnification to be used for plot symbols relative to the current setting of cex. |
cex.gpch |
The magnification to be used for group symbols relative to the current setting of cex. |
gshift |
the number of characters by which the group labels should be shifted to the left compared to the sublabels. |
automar |
logical (default |
... |
graphical parameters can also be specified as arguments. |
grp |
an integer defining whether the coefficients should be grouped along the first or the second dimension (default is 1). |
Dot plots are a reasonable substitute for bar plots. This function is invoked to produce dotplots as described in Cleveland (1985).
For PlotDotCI()
the dots are a list of matrices with 3 columns, where the first is the coefficient, the second the lower and the third the upper end of the confidence interval.
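A sketch of how such a list of matrices could be built from two fitted models (the models and the names are made up; whether the argument names are picked up as labels is an assumption):

r0 <- lm(mpg ~ wt + hp, data = mtcars, subset = am == 0)
r1 <- lm(mpg ~ wt + hp, data = mtcars, subset = am == 1)

# 3-column matrices: estimate, lower and upper confidence limit
# (the intercepts are dropped so that the slopes stay on a comparable scale)
ci0 <- cbind(coef(r0), confint(r0))[-1, , drop = FALSE]
ci1 <- cbind(coef(r1), confint(r1))[-1, , drop = FALSE]

PlotDotCI(automatic = ci0, manual = ci1, grp = 1)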
Returns the y-values used for plotting.
R-Core with some extensions by Andri Signorell <[email protected]>
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Cleveland, W. S. (1985) The Elements of Graphing Data. Monterey, CA: Wadsworth.
Murrell, P. (2005) R Graphics. Chapman & Hall/CRC Press.
PlotDot(VADeaths, main = "Death Rates in Virginia - 1940") op <- par(xaxs = "i") # 0 -- 100% PlotDot(t(VADeaths), xlim = c(0,100), main = "Death Rates in Virginia - 1940") par(op) # add some error bars PlotDot(VADeaths, main="Death Rates in Virginia - 1940", col="red", pch=21, args.errbars = list(from=VADeaths-2, to=VADeaths+2, mid=VADeaths, cex=1.4)) # add some other values PlotDot(VADeaths+3, pch=15, col="blue", add=TRUE) # same as PlotDotCI xci <- do.call(rbind, tapply( d.pizza$delivery_min, d.pizza$driver, MeanCI, conf.level=0.99, na.rm=TRUE)) PlotDot(xci[,1], main="delivery_min ~ driver", pch=21, bg="grey80", col="black", args.errbars = list(from=xci[,2], to=xci[,3], mid=xci[,1], lwd=2, col="grey40", cex=1.5), xlim=c(15,35), panel.before=grid()) # with group data x <- with(d.pizza, tapply(temperature, list(area, driver), mean, na.rm=TRUE)) PlotDot(x, gdata = tapply(d.pizza$temperature, d.pizza$driver, mean, na.rm=TRUE), gpch = 15) # special format par(lend=1) PlotDot(VADeaths, main="Death Rates in Virginia - 1940", pch="|", lcolor = DescTools::hecru, col=DescTools::hred, args.errbars = list(from=VADeaths-2, to=VADeaths+2, mid=VADeaths, cex=1.3, lwd=8, code=0, col=DescTools::hgreen)) # Error bars for binomial confidence intervals tab <- table(d.pizza$driver, d.pizza$wine_delivered) xci <- SetNames(BinomCI(tab[,1], rowSums(tab)), rownames=rownames(tab)) PlotDot(xci[,1], main="wine delivered ~ driver ", xlim=c(0,1), args.errbars=list(from=xci[,-1], mid=xci[,1], pch=21)) # Error bars for confidence intervals for means xci <- do.call(rbind, tapply(d.pizza$delivery_min, d.pizza$driver, MeanCI, conf.level=0.99, na.rm=TRUE)) PlotDot(xci[, 1], main="delivery_min ~ driver", args.errbars=list(from=xci)) # Setting the colours # define some error bars first lci <- sweep(x = VADeaths, MARGIN = 2, FUN = "-", 1:4) uci <- sweep(x = VADeaths, MARGIN = 1, FUN = "+", 1:5) PlotDot(VADeaths, main="This should only show how to set the colours, not be pretty", pch=21, col=c("blue","grey"), bg=c("red", "yellow"), gcolor = c("green", "blue", "orange", "magenta"), gdata=c(10,20,30,40), gpch = c(15:18), lcolor = "orange", args.errbars = list(from=lci, to=uci, mid=VADeaths, cex=1.4))
Faster alternative for plotting the empirical cumulative distribution function (ecdf). The function offers the option to construct the ecdf on the basis of a histogram, which makes sense when x is large. The plot process is then much faster, without losing much precision in the details.
PlotECDF(x, breaks = NULL, col = Pal()[1], ylab = "", lwd = 2, xlab = NULL, ...)
x |
numeric vector of the observations for ecdf. |
breaks |
will be passed directly to |
col |
color of the line. |
ylab |
label for the y-axis. |
lwd |
line width. |
xlab |
label for the x-axis. |
... |
arguments to be passed to subsequent functions. |
The stats function plot.ecdf
is fine for vectors that are not too large. However for n ~ 1e7 we would observe a dramatic performance breakdown (possibly in combination with the use of do.call
).
PlotECDF
is designed as an alternative for quicker plotting of the ecdf for larger vectors. If breaks
are provided as argument, a histogram with that number of breaks will be calculated and the ecdf will use those frequencies instead of respecting every single point.
Note that a plot will rarely need more than ~1'000 points on x to have sufficient resolution. PlotFdist
will also use this number of breaks by default.
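A rough sketch of the idea (the timings will of course depend on the machine):

set.seed(42)
x <- rexp(2e6)    # a largish vector

# ecdf built from 1000 histogram bins: fast and visually indistinguishable
system.time(PlotECDF(x, breaks = 1000))

# full-resolution alternative from stats, noticeably slower for large n
# system.time(plot(ecdf(x)))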
no value returned, use plot.ecdf
if any results are required.
Andri Signorell <[email protected]>
PlotECDF(d.pizza$temperature)

# make large vector
x <- rnorm(n=1e7)

# plot only 1000 points instead of 1e7
PlotECDF(x, breaks=1000)
Plot Chernoff faces. The rows of a data matrix represent cases and the columns the variables.
PlotFaces(xy, which.row, fill = FALSE, nr, nc, scale = TRUE, byrow = FALSE, main, labels, col = "white")
xy |
|
which.row |
defines a permutation of the rows of the input matrix. |
fill |
logical. If set to |
nr |
number of columns of faces on graphics device |
nc |
number of rows of faces |
scale |
logical. If set to |
byrow |
|
main |
title. |
labels |
character strings to use as names for the faces. |
col |
a vector of colors used for the parts of the faces. Colors are recycled in the order: "nose", "eyes", "hair", "face", "lips", "ears". Default is NA, which will omit colors. |
The feature parameters of this implementation are:
1 height of face
2 width of face
3 shape of face
4 height of mouth
5 width of mouth
6 curve of smile
7 height of eyes
8 width of eyes
9 height of hair
10 width of hair
11 styling of hair
12 height of nose
13 width of nose
14 width of ears
15 height of ears
For details look at the literate program of faces
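A small sketch (data chosen arbitrarily): the columns of the matrix are assigned to the face features in the order listed above, and the invisibly returned object documents the mapping that was actually used:

m <- mtcars[1:6, c("mpg", "hp", "wt", "qsec", "drat")]
z <- PlotFaces(m, nr = 2, nc = 3, labels = rownames(m))

# which variable drives which face element
print(z$info, right = FALSE)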
information about usage of variables for face elements is returned invisibly
based on version 12/2009
H. P. Wolf, with some changes by Andri Signorell <[email protected]>
Chernoff, H. (1973) The use of faces to represent statistical association,
JASA, 68, pp 361–368.
The smooth curves are computed by an algorithm found in:
Ralston, A. and Rabinowitz, P. (1985)
A first course in numerical analysis, McGraw-Hill, pp 76ff.
PlotFaces(rbind(1:3,5:3,3:5,5:7))

data(longley)
PlotFaces(longley[1:9,])

set.seed(17)
PlotFaces(matrix(sample(1:1000,128,), 16, 8), main="random faces")

means <- lapply(iris[,-5], tapply, iris$Species, mean)
m <- t(do.call(rbind, means))
m <- cbind(m, matrix(rep(1, 11*3), nrow=3))

# define the colors, first for all faces the same
col <- replicate(3, c("orchid1", "olivedrab", "goldenrod4",
                      "peachpuff", "darksalmon", "peachpuff3"))
rownames(col) <- c("nose","eyes","hair","face","lips","ears")

# change haircolor individually for each face
col[3, ] <- c("lightgoldenrod", "coral3", "sienna4")

z <- PlotFaces(m, nr=1, nc=3, col=col)

# print the used coding
print(z$info, right=FALSE)
This function is designed to give a univariate graphic representation of a numeric vector's frequency distribution.
It combines a histogram, a density curve, a boxplot and the empirical cumulative distribution function (ecdf) in one single plot. A rug as well as a model distribution curve (e.g. a normal curve) can optionally be superposed. This results in a dense and informative picture of the facts.
Still the function remains flexible as all possible arguments can be passed to the single components (hist
, boxplot
etc.) as a list (see examples).
PlotFdist(x, main = deparse(substitute(x)), xlab = "", xlim = NULL, args.hist = NULL, args.rug = NA, args.dens = NULL, args.curve = NA, args.boxplot = NULL, args.ecdf = NULL, args.curve.ecdf = NA, heights = NULL, pdist = NULL, na.rm = FALSE, cex.axis = NULL, cex.main = NULL, mar = NULL, las = 1)
x |
the numerical variable, whose distribution is to be plotted. |
main |
main title of the plot. |
xlab |
label of the x-axis, defaults to |
xlim |
range of the x-axis, defaults to a pretty |
args.hist |
list of additional arguments to be passed to the histogram |
args.rug |
list of additional arguments to be passed to the function |
args.dens |
list of additional arguments to be passed to |
args.curve |
list of additional arguments to be passed to |
args.boxplot |
list of additional arguments to be passed to the boxplot |
args.ecdf |
list of additional arguments to be passed to |
args.curve.ecdf |
list of additional arguments to be passed to |
heights |
heights of the plotparts, defaults to |
pdist |
distances of the plotparts, defaults to |
na.rm |
logical, should |
cex.axis |
character extension factor for the axes. |
cex.main |
character extension factor for the main title. Must be set in dependence of the plot parts in order to get a harmonic view. |
mar |
A numerical vector of the form |
las |
numeric in |
Performance has been significantly improved, but if x grows large (n > 1e7) the function will take its time to complete. Especially the density curve and the ecdf, but also the boxplot (due to the chosen alpha channel), will take their time to calculate and plot.
In such cases consider taking a sample, e.g. PlotFdist(x[sample(length(x), size=5000)]); the big picture of the distribution won't usually change much.
Andri Signorell <[email protected]>
hist
, boxplot
, ecdf
, density
, rug
, layout
PlotFdist(x=d.pizza$delivery_min, na.rm=TRUE) # define additional arguments for hist, dens and boxplot # do not display the mean and its CI on the boxplot PlotFdist(d.pizza$delivery_min, args.hist=list(breaks=50), args.dens=list(col="olivedrab4"), na.rm=TRUE, args.boxplot=list(col="olivedrab2", pch.mean=NA, col.meanci=NA)) # do a "h"-plot instead of a histogram for integers x <- sample(runif(10), 100, replace = TRUE) PlotFdist(x, args.hist=list(type="mass")) pp <- rpois(n = 100, lambda = 3) PlotFdist(pp, args.hist = list(type="mass", pch=21, col=DescTools::horange, cex.pch=2.5, col.pch=DescTools::hred, lwd=3, bg.pch="white"), args.boxplot = NULL, args.ecdf = NA, main="Probability mass function") # special arguments for hist, density and ecdf PlotFdist(x=faithful$eruptions, args.hist=list(breaks=20), args.dens=list(bw=.1), args.ecdf=list(cex=1.2, pch=16, lwd=1), args.rug=TRUE) # no density curve, no ecdf but add rug instead, make boxplot a bit higher PlotFdist(x=d.pizza$delivery_min, na.rm=TRUE, args.dens=NA, args.ecdf=NA, args.hist=list(xaxt="s"), # display x-axis on the histogram args.rug=TRUE, heights=c(3, 2.5), pdist=2.5, main="Delivery time") # alpha channel on rug is cool, but takes its time for being drawn... PlotFdist(x=d.pizza$temperature, args.rug=list(col=SetAlpha("black", 0.1)), na.rm=TRUE) # plot a normal density curve, but no boxplot nor ecdf x <- rnorm(1000) PlotFdist(x, args.curve = NULL, args.boxplot=NA, args.ecdf=NA) # compare with a t-distribution PlotFdist(x, args.curve = list(expr="dt(x, df=2)", col="darkgreen"), args.boxplot=NA, args.ecdf=NA) legend(x="topright", legend=c("kernel density", "t-distribution (df=2)"), fill=c(getOption("col1", DescTools::hred), "darkgreen"), xpd=NA) # add a gamma distribution curve to both, histogram and ecdf ozone <- airquality$Ozone; m <- mean(ozone, na.rm = TRUE); v <- var(ozone, na.rm = TRUE) PlotFdist(ozone, args.hist = list(breaks=15), args.curve = list(expr="dgamma(x, shape = m^2/v, scale = v/m)", col=DescTools::hecru), args.curve.ecdf = list(expr="pgamma(x, shape = m^2/v, scale = v/m)", col=DescTools::hecru), na.rm = TRUE, main = "Airquality - Ozone") legend(x="topright", xpd=NA, legend=c(expression(plain("gamma: ") * Gamma * " " * bgroup("(", k * " = " * over(bar(x)^2, s^2) * " , " * theta * plain(" = ") * over(s^2, bar(x)), ")") ), "kernel density"), fill=c(DescTools::hecru, getOption("col1", DescTools::hred)), text.width = 0.25)
Plots mathematical expressions in one variable using the formula syntax.
PlotFun(FUN, args = NULL, from = NULL, to = NULL, by = NULL, xlim = NULL, ylim = NULL, polar = FALSE, type = "l", col = par("col"), lwd = par("lwd"), lty = par("lty"), pch = NA, mar = NULL, add = FALSE, ...)
FUN |
a mathematical expression defined using the formula syntax: |
args |
a list of additional parameters defined in the expression besides the independent variable. |
from , to
|
the range over which the function will be plotted. |
by |
number: increment of the sequence. |
xlim , ylim
|
|
polar |
logical. Should polar coordinates be used? Defaults to |
type |
plot type: see |
col |
colors of the lines. |
lwd |
line widths for the lines. |
lty |
line type of the lines. |
pch |
plotting 'character', i.e., symbol to use. |
mar |
A numerical vector of the form |
add |
logical; if |
... |
the dots are passed to the plot, resp. lines function. |
A function can be plotted with curve
. This function adds some more features, one of them being the use of a formula for defining the function to plot.
This also allows a parametric equation to be entered straightforwardly. Parameters of a function can be set separately.
The aspect ratio y/x will be set to 1 by default. (See plot.window
for details.)
If the argument axes
(given in the dots) is not set to FALSE
centered axis at a horizontal and vertical position of 0 will be drawn, containing major and minor ticks.
Andri Signorell <[email protected]>
# simple quadratic function y = x^2 PlotFun(x^2 ~ x) par(mfrow=c(3,4)) # Cartesian leaf PlotFun(3*a*z^2 /(z^3+1) ~ 3*a*z /(z^3+1+b), args=list(a=2, b=.1), from=-10, to=10, by=0.1, xlim=c(-5,5), ylim=c(-5,5), col="magenta", asp=1, lwd=2 ) # family of functions PlotFun(a*exp(-x/5)*sin(n*x) ~ x, args=list(n=4, a=3), from=0, to=10, by=0.01, col="green") PlotFun(a*exp(-x/5)*sin(n*x) ~ x, args=list(n=6, a=3), from=0, to=10, by=0.01, col="darkgreen", add=TRUE) # cardioid PlotFun(a*(1+cos(t)) ~ t, args=list(a=2), polar=TRUE, from=0, to=2*pi+0.1, by=0.01, asp=1) PlotFun(13*cos(t) - 5*cos(2*t) - 2*cos(3*t) - cos(4*t) ~ 16*sin(t)^3, from=0, to=2*pi, by=0.01, asp=1, xlim=c(-20,20), col="red", lwd=2) PlotFun(a*sin(2*t)*cos(2*t) ~ t, args=list(a=6), polar=TRUE, from=0, to=2*pi+0.1, by=0.01, col="orange") # astroid PlotFun(a*sin(t)^3 ~ a*cos(t)^3, args=list(a=2), from=0, to=2*pi+0.1, lwd=3, by=0.01, col="red") # lemniscate of Bernoulli PlotFun((2*a^2*cos(2*t))^2 ~ t, args=list(a=1), polar=TRUE, from=0, to=2*pi+0.1, by=0.01, col="darkblue") # Cycloid PlotFun(a*(1-cos(t)) ~ a*(t-sin(t)), args=list(a=0.5), from=0, to=30, by=0.01, col="orange") # Kreisevolvente PlotFun(a*(sin(t) - t*cos(t)) ~ a*(cos(t) + t*sin(t)), args=list(a=0.2), from=0, to=50, by=0.01, col="brown") PlotFun(sin(2*t) ~ sin(t), from=0, to=2*pi, by=0.01, col="blue", lwd=2) # multiple values for one parameter sapply(1:3, function(a) PlotFun(sin(a*x) ~ x, args=list(a=a), from=0, to=2*pi, by=0.01, add=(a!=1), col=a)) PlotFun(sin(3*x) ~ x, polar=TRUE, from=0, to=pi, by=0.001, col=DescTools::hred, lwd=2) PlotFun(1 + 1/10 * sin(10*x) ~ x, polar=TRUE, from=0, to=2*pi, by=0.001, col=DescTools::hred) PlotFun(sin(x) ~ cos(x), polar=FALSE, from=0, to=2*pi, by=0.01, add=TRUE, col="blue")
Plot the columns of one matrix against the columns of another. Adds a legend on the right at the endpoints of lines.
PlotLinesA(x, y, col = 1:5, lty = 1, lwd = 1, lend = par("lend"), xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, xaxt = NULL, yaxt = NULL, cex = 1, args.legend = NULL, main = NULL, grid = TRUE, mar = NULL, pch = NA, pch.col = par("fg"), pch.bg = par("bg"), pch.cex = 1, ...)
x , y
|
vectors or matrices of data for plotting. The number of rows should match. If one of them is missing, the other is taken as y and an x vector of 1:n is used. |
col |
vector of colors. Colors are used cyclically. |
lty , lwd , lend
|
vector of line types, widths, and end styles. The first element is for the first column, the second element for the second column, etc., even if lines are not plotted for all columns. Line types will be used cyclically until all plots are drawn. |
xlab , ylab
|
titles for x and y axes, as in plot. |
xlim , ylim
|
ranges of x and y axes, as in plot. |
xaxt , yaxt
|
a character which specifies the x resp. y axis type. Specifying "n" suppresses plotting of the axis. |
cex |
character expansion factor relative to current |
args.legend |
list of additional arguments for the legend; names of the list are used as argument names. If set to NA, the legend will be suppressed. See details. |
main |
an overall title for the plot. |
grid |
logical adds an nx by ny rectangular grid to an existing plot. |
mar |
the margins of the plot. |
pch |
character string or vector of 1-characters or integers for plotting characters, see |
pch.col |
vector of colors for the points. Colors are used cyclically. Ignored if |
pch.bg |
vector of background (fill) colors for the open plot symbols given by |
pch.cex |
vector of character expansion sizes, used cyclically. This works as a multiple of |
... |
the dots are sent to |
.
This function is more a template than a function. It wraps matplot to generate a lines plot and adds a rather sophisticated legend on the right side, while calculating appropriate margins. A grid option is included (as panel.first does not work in matplot).
As in matplot, the first column of x is plotted against the first column of y, the second column of x against the second column of y, etc. If one matrix has fewer columns, plotting will cycle back through the columns again. (In particular, either x or y may be a vector, against which all columns of the other argument will be plotted.)
The legend can be controlled by the following arguments:
list(line = c(1, 1), width = 1, y = SpreadOut(unlist(last), mindist = 1.2 * strheight("M") * par("cex")), labels = names(last), cex = par("cex"), col = col[ord], lwd = lwd[ord], lty = lty[ord])
All arguments are recycled.
Andri Signorell <[email protected]>
m <- matrix(c(3,4,5,1,5,4,2,6,2), nrow = 3,
            dimnames = list(dose = c("A","B","C"), age = c("2010","2011","2012")))
PlotLinesA(m, col=c(Pal("Helsana")), main="Dose ~ age_grp", lwd=3, ylim=c(1, 10))

# example from MASS
shoes <- list(
  A = c(13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3),
  B = c(14, 8.8, 11.2, 14.2, 11.8, 6.4, 9.8, 11.3, 9.3, 13.6))
PlotLinesA(do.call(rbind, shoes), xlim=c(0.75,2.25), col=1, main="shoes",
           pch=21, pch.bg="white", pch.col=1, pch.cex=1.5)

# let's define some arbitrary x-coordinates
PlotLinesA(x=c(1,2,6,8,15), y=VADeaths)
The base function grid() does not support logarithmic scales very well. Especially when more lines are required, grids have to be created manually. PlotLog creates a plot with at least one logarithmic axis and places a logarithmic grid in the background of the data.
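To illustrate what such a grid involves when built by hand, the following sketch draws major lines at powers of ten and minor lines within each decade. This is only an assumed, simplified approach for comparison and not the code used by PlotLog.

plot(brain ~ body, data=MASS::Animals, log="xy", type="n")
# major gridlines at powers of ten, minor ones at 2..9 within each decade
abline(v=10^(-2:5), h=10^(-1:4), col="grey80")
abline(v=c(outer(2:9, 10^(-2:5))), h=c(outer(2:9, 10^(-1:4))), col="grey90", lty="dotted")
points(brain ~ body, data=MASS::Animals, pch=21, bg="grey")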
PlotLog(x, ..., args.grid = NULL, log = "xy")
x |
the coordinates of points in the plot. Alternatively, a single plotting structure, function or any R object with a plot method can be provided. |
... |
the dots are passed on to the function |
args.grid |
a list of arguments for the grid. This contains line type, line width and line color, separately for major gridlines and for minor gridlines. |
log |
a character string which contains "x" if the x axis is to be logarithmic, "y" if the y axis is to be logarithmic, and "xy" or "yx" if both axes are to be logarithmic. |
nothing
Andri Signorell <[email protected]>
PlotLog(brain ~ body, data=MASS::Animals, log="xy", xlim=c(.01, 1e5), ylim=c(.1, 1e4), main="Animal brain/body size", pch=21, bg="grey", cex=1.5)
Draw a scatter plot with marginal densities on the x- and y-axis. Groups can be defined by grp.
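The following is a hedged sketch of the general layout idea behind such a plot (a main scatter panel plus two marginal density panels); it is an assumed, simplified construction and not the implementation of PlotMarDens.

x <- d.pizza$delivery_min; y <- d.pizza$temperature
ok <- complete.cases(x, y); x <- x[ok]; y <- y[ok]
op <- par(no.readonly=TRUE)
layout(matrix(c(2, 0, 1, 3), nrow=2, byrow=TRUE), widths=c(3, 1), heights=c(1, 3))
par(mar=c(4, 4, 1, 1)); plot(x, y, xlab="delivery_min", ylab="temperature")    # main scatter
par(mar=c(0, 4, 1, 1)); plot(density(x), main="", xlab="", axes=FALSE)          # density of x, on top
par(mar=c(4, 0, 1, 1)); d <- density(y)
plot(d$y, d$x, type="l", xlab="", ylab="", axes=FALSE)                          # density of y, on the right
par(op)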
PlotMarDens(x, y, grp = 1, xlim = NULL, ylim = NULL, col = rainbow(nlevels(factor(grp))), mardens = c("all","x","y"), pch = 1, pch.cex = 1, main = "", na.rm = FALSE, args.legend = NULL, args.dens = NULL, ...)
x |
numeric vector of x values. |
y |
numeric vector of y values (of same length as x). |
grp |
grouping variable(s), typically factor(s), all of the same length as x. |
xlim |
the x limits of the plot. |
ylim |
the y limits of the plot. |
col |
the colors for lines and points. Uses |
mardens |
which marginal densities to plot. Can be set to either just x or y, or both ( |
pch |
a vector of plotting characters or symbols. |
pch.cex |
magnification to be used for plotting characters relative to the current setting of cex. |
main |
a main title for the plot, see also title. |
na.rm |
logical, should NAs be omitted? Defaults to FALSE. |
args.legend |
list of additional arguments for the legend. |
args.dens |
list of additional arguments to be passed to |
... |
further arguments are passed to the function |
Andri Signorell <[email protected]>
# best seen with: x11(7.5, 4.7)

# just one variable with marginal densities
PlotMarDens(y=d.pizza$temperature, x=d.pizza$delivery_min, grp=1,
            xlab="delivery_min", ylab="temperature", col=SetAlpha("brown", 0.4),
            pch=15, lwd=3, panel.first=grid(), args.legend=NA,
            main="Temp ~ delivery")

# use a group variable
PlotMarDens(y=d.pizza$temperature, x=d.pizza$delivery_min, grp=d.pizza$area,
            xlab="delivery_min", ylab="temperature",
            col=c("brown","orange","lightsteelblue"),
            panel.first=list(grid()),
            main="temperature ~ delivery_min | area")

# reset layout
par(mfrow=c(1,1))
Takes a data frame and displays the location of missing data. The missings can be clustered and be displayed together.
PlotMiss(x, col = DescTools::hred, bg = SetAlpha(DescTools::hecru, 0.3), clust = FALSE, main = NULL, ...)
x |
a data.frame to be analysed. |
col |
the colour of the missings. |
bg |
the background colour of the plot. |
clust |
logical, defining if the missings should be clustered. Default is FALSE. |
main |
the main title. |
... |
the dots are passed to |
A graphical display of the position of the missings can help to detect dependencies or patterns within the missings.
If clust is set to TRUE, the new order will be returned invisibly.
Andri Signorell <[email protected]>, following an idea of Henk Harmsen <[email protected]>
PlotMiss(d.pizza, main="Missing pizza data")
Plot seasonal effects of a univariate time series following Cleveland's definition for cycle plots.
PlotMonth(x, type = "l", labels, xlab = "", ylab = deparse(substitute(x)), ...)
x |
univariate time series |
type |
one out of |
labels |
the labels for the cyclic component to be displayed on the x-axis |
xlab |
a title for the x axis: see |
ylab |
a title for the y axis: see |
... |
the dots are passed to the plot command. |
A cycle plot is a graphical method invented to study the behaviour of a seasonal time series. The seasonal component of a univariate series is graphed. First the January values are graphed for successive years, then the February values and so forth. For each monthly subseries the mean of the values is portrayed by a horizontal line.
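As a hedged sketch of that idea (not the PlotMonth implementation), a cycle plot can be assembled by hand from the monthly subseries and their means:

x   <- AirPassengers
sub <- split(as.numeric(x), cycle(x))           # 12 monthly subseries
plot(NA, xlim=c(0.5, 12.5), ylim=range(x), xlab="", ylab="AirPassengers", xaxt="n")
axis(1, at=1:12, labels=month.abb, las=2)
for (m in 1:12) {
  yy <- sub[[m]]
  lines(m + seq(-0.3, 0.3, length.out=length(yy)), yy)      # yearly values within month m
  segments(m - 0.35, mean(yy), m + 0.35, mean(yy), lwd=2)   # monthly mean
}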
Markus Huerzeler (ETH Zurich), slight changes Andri Signorell <[email protected]>
Cleveland, W. S. (1985) The Elements of Graphing Data. Monterey, CA: Wadsworth.
PlotMonth(AirPassengers)
Plots a mosaic on the current graphics device.
PlotMosaic(x, main = deparse(substitute(x)), horiz = TRUE, cols = NULL, off = 0.02, mar = NULL, xlab = NULL, ylab = NULL, cex = par("cex"), las = 2, ...)
x |
a contingency table in array form, with optional category labels specified in the dimnames(x) attribute. The table is best created by the table() command. So far only 2-way tables are allowed. |
main |
character string for the mosaic title. |
horiz |
logical, defining the orientation of the mosaicplot. |
cols |
the colors of the plot. |
off |
the offset between the rectangles. Default is 0.02. |
mar |
the margin for the plot. |
xlab , ylab
|
x- and y-axis labels used for the plot; by default, the first and second element of names(dimnames(X)) (i.e., the name of the first and second variable in X). |
cex |
numeric character expansion factor; multiplied by |
las |
the style of axis labels. 0 - parallel to the axis, 1 - horizontal, 2 - perpendicular, 3 - vertical. |
... |
additional arguments are passed to the text function. |
The reason for this function to exist is the unsatisfying labelling in the base mosaicplot.
list with the midpoints of the rectangles
Andri Signorell <[email protected]>
Friendly, M. (1994) Mosaic displays for multi-way contingency tables. Journal of the American Statistical Association, 89, 190-200.
PlotMosaic(HairEyeColor[,,1])
Multiple density curves are plotted on the same plot. The function plots the density curves in the defined colors and linetypes, after having calculated the globally appropriate xlim- and ylim-values. A legend can directly be included.
PlotMultiDens(x, ...) ## Default S3 method: PlotMultiDens(x, xlim = NULL, ylim = NULL, col = Pal(), lty = "solid", lwd = 2, fill = NA, xlab = "x", ylab = "density", args.dens = NULL, args.legend = NULL, na.rm = FALSE, flipxy = FALSE, ...) ## S3 method for class 'formula' PlotMultiDens(formula, data, subset, na.action, ...)
x |
a list of vectors whose densities are to be plotted. Use |
xlim , ylim
|
xlim, ylim of the plot. |
col |
colors of the lines, defaults to Pal(). |
lty |
line type of the lines. |
lwd |
line widths for the lines. |
fill |
colors for filling the areas under the density curves. If set to NA (default), the areas will not be filled. |
xlab , ylab
|
a title for the x, resp. y axis. Defaults to |
args.dens |
list of additional arguments to be passed to the |
args.legend |
list of additional arguments to be passed to the |
na.rm |
should |
flipxy |
logical, should x- and y-axis be flipped? Defaults to FALSE. |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain NAs. Defaults to |
... |
the dots are passed to |
All style arguments, density arguments and data list elements will be recycled if necessary.
The argument flipxy leads to exchanged x- and y-values. This option can be used to plot density curves with a vertical orientation, for displaying marginal densities.
a data.frame with 3 columns, containing the bw, n and kernel parameters used for the list elements. The number of rows corresponds to the length of the list x.
Consider using
library(lattice)
densityplot(~ delivery_min | driver, data=d.pizza)
as an alternative when not all curves should be plotted in the same plot.
Andri Signorell <[email protected]>
x <- rnorm(1000, 0, 1)
y <- rnorm(1000, 0, 2)
z <- rnorm(1000, 2, 1.5)

# the input of the following function MUST be a numeric list
PlotMultiDens(list(x=x, y=y, z=z))

# use area fill
PlotMultiDens(list(x=x, y=y, z=z), fill=SetAlpha(c("red","green","blue"), 0.4))

PlotMultiDens(x=split(d.pizza$delivery_min, d.pizza$driver), na.rm=TRUE,
              main="delivery time ~ driver", xlab="delivery time [min]", ylab="density",
              lwd=1:7, lty=1:7, panel.first=grid())
# this example demonstrates the definition of different line types and colors
# and is NOT meant as a recommendation for good plotting practice... :-)

# the formula interface
PlotMultiDens(delivery_min ~ driver, data=d.pizza)

# recycling of the density parameters
res <- PlotMultiDens(x=split(d.pizza$temperature, d.pizza$driver),
                     args.dens=list(bw=c(5,2), kernel=c("rect","epanechnikov")),
                     na.rm=TRUE)
res

# compare bandwidths
PlotMultiDens(x=split(d.pizza$temperature, d.pizza$driver)[1],
              args.dens=list(bw=c(1:5)), na.rm=TRUE, args.legend=NA,
              main="Compare bw")
legend(x="topright", legend=gettextf("bw = %s", 1:5), fill=rainbow(5))
A matrix of scatterplots is produced. The upper triangular panels contain only the correlation coefficients, the diagonal displays a histogram of each variable, and the lower triangular panels display a scatterplot superposed by a smoother. It is possible to define groups, which are differentiated by color and by individual smoothers.
The code used is not much more than the pairs() code and some of its examples, condensed to a practical amount.
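For orientation, a minimal sketch of the underlying pairs() mechanics follows; the panel helpers (panel.hist, panel.cor) are adapted from the pairs() examples and are assumptions for illustration, not part of this package.

panel.hist <- function(x, ...) {
  usr <- par("usr"); on.exit(par(usr=usr))
  par(usr=c(usr[1:2], 0, 1.5))
  h <- hist(x, plot=FALSE)
  rect(h$breaks[-length(h$breaks)], 0, h$breaks[-1], h$counts/max(h$counts), col="grey")
}
panel.cor <- function(x, y, ...) {
  usr <- par("usr"); on.exit(par(usr=usr))
  par(usr=c(0, 1, 0, 1))
  text(0.5, 0.5, sprintf("%.2f", cor(x, y, use="complete.obs")))
}
pairs(iris[1:4], lower.panel=panel.smooth, upper.panel=panel.cor, diag.panel=panel.hist)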
PlotPairs(x, g = NULL, col = 1, pch = 19, col.smooth = 1, main = "", upper = FALSE, ...)
x |
the coordinates of points given as numeric columns of a matrix or data frame. Logical and factor columns are converted to numeric in the same way that data.matrix does. Will directly be passed on to |
g |
a group variable |
col |
the color for the point characters |
pch |
point character |
col.smooth |
color for the smoother(s) |
main |
the main title |
upper |
logical, determines if the correlation coefficients should be displayed in the upper triangular matrix (default) or in the lower one. |
... |
additional arguments passed to |
Andri Signorell <[email protected]>
PlotPairs(iris[1:4], g=iris$Species, main = "Anderson's Iris Data -- 3 species", col=c(DescTools::hred, DescTools::hblue, DescTools::hgreen), col.smooth=c("black", DescTools::hred, DescTools::hblue, DescTools::hgreen))
PlotPolar creates a polar coordinate plot of the radius r as a function of the angle theta. 0 degrees is drawn at the 3 o'clock position, and angular values increase in a counterclockwise direction.
PlotPolar(r, theta = NULL, type = "p", rlim = NULL, main = "", lwd = par("lwd"), lty = par("lty"), col = par("col"), pch = par("pch"), fill = NA, cex = par("cex"), mar = c(2, 2, 5, 2), add = FALSE, ...)
r |
a vector of radial data. |
theta |
a vector of angular data specified in radians. |
type |
one out of |
rlim |
the r limits (r1, r2) of the plot |
main |
a main title for the plot, see also |
lwd |
a vector of line widths, see |
lty |
a vector of line types, see |
col |
The colors for lines and points. Multiple colors can be specified so that each point can be given its own color. If there are fewer colors than points they are recycled in the standard fashion. Lines will all be plotted in the first colour specified. |
pch |
a vector of plotting characters or symbols: see |
fill |
fill color, defaults to NA (no fill). |
cex |
a numerical vector giving the amount by which plotting characters and symbols should be scaled relative to the default. This works as a multiple of par("cex"). |
mar |
A numerical vector of the form c(bottom, left, top, right) which gives the number of lines of margin to be specified on the four sides of the plot. |
add |
defines whether points should be added to an existing plot. |
... |
further arguments are passed to the plot command. |
The function is rather flexible and can produce quite a lot of different plots, so it is also possible to create spider webs or radar plots.
Andri Signorell <[email protected]>
testlen <- c(sin(seq(0, 1.98*pi, length=100)) + 2 + rnorm(100)/10)
testpos <- seq(0, 1.98*pi, length=100)

PlotPolar(testlen, testpos, type="l", main="Test Polygon", col="blue")
PolarGrid(ntheta=9, col="grey", lty="solid", lblradians=TRUE)

# start at 12 o'clock and plot clockwise
PlotPolar(testlen, -(testpos - pi/2), type="p", main="Test Polygon", col="green", pch=16)
PolarGrid(ntheta=rev(seq(0, 2*pi, by=2*pi/9) + pi/2),
          alabels=Format(seq(0, 2*pi, by=2*pi/9), digits=2)[-10],
          col="grey", lty="solid", lblradians=TRUE)

# just because of its beauty
t <- seq(0, 2*pi, 0.01)
PlotPolar(r=sin(2*t)*cos(2*t), theta=t, type="l", lty="dashed", col="red")
PolarGrid()

# use some filled polygons
ions <- c(3.2, 5, 1, 3.1, 2.1, 5)
ion.names <- c("Na", "Ca", "Mg", "Cl", "HCO3", "SO4")
PlotPolar(r=ions, type="l", fill="yellow")

# the same, but let's have a grid first
PlotPolar(r=ions, type="l", lwd=2, col="blue", main="Ions",
          panel.first=PolarGrid(nr=seq(0, 6, 1)))

# leave the radial grid out
PlotPolar(r=ions, type="l", fill="yellow")
PolarGrid(nr=NA, ntheta=length(ions), alabels=ion.names, col="grey", lty="solid")

# display radial lines
PlotPolar(r=ions, type="h", col="blue", lwd=3)
# add some points
PlotPolar(r=ions, type="p", pch=16, add=TRUE, col="red", cex=1.5)

# spiderweb (not really recommended...)
posmat <- matrix(sample(2:9, 30, TRUE), nrow=3)
PlotPolar(posmat, type="l", main="Spiderweb plot", col=2:4, lwd=1:3)
PolarGrid(nr=NA, ntheta=ncol(posmat), alabels=paste("X", 1:ncol(posmat), sep=""),
          col="grey", lty="solid")

# example from: The Grammar of Graphics (L. Wilkinson)
data("UKgas")
m <- matrix(UKgas, ncol=4, byrow=TRUE)
cols <- c(SetAlpha(rep("green", 10), seq(0, 1, 0.1)),
          SetAlpha(rep("blue", 10), seq(0, 1, 0.1)),
          SetAlpha(rep("orange", 10), seq(0, 1, 0.1)))
PlotPolar(r=m, type="l", col=cols, lwd=2)
PolarGrid(ntheta=4, alabels=c("Winter","Spring","Summer","Autumn"), lty="solid")
legend(x="topright", legend=c(1960, 1970, 1980), fill=c("green","blue","orange"))

# radar plot (same here, consider alternatives...)
data(mtcars)
d.car <- scale(mtcars[1:6, 1:7], center=FALSE)
# let's have a palette with transparent colors (alpha = 0.25)
cols <- SetAlpha(colorRampPalette(c("red","yellow","blue"), space="rgb")(6), 0.25)
PlotPolar(d.car, type="l", fill=cols, main="Cars in radar")
PolarGrid(nr=NA, ntheta=ncol(d.car), alabels=colnames(d.car), lty="solid", col="black")

# a polar barplot
x <- c(4, 8, 2, 8, 2, 6, 5, 7, 3, 3, 5, 3)
theta <- (0:12) * pi / 6
PlotPolar(x, type="n", main="Some data")
PolarGrid(nr=0:9, ntheta=24, col="grey", lty=1, rlabels=NA, alabels=NA)
DrawCircle(x=0, y=0, r.in=0, r.out=x, theta.1=theta[-length(theta)], theta.2=theta[-1],
           col=SetAlpha(rainbow(12), 0.7), border=NA)
segments(x0=-10:10, y0=-.2, y1=0.2)
segments(x0=-10, x1=10, y0=0)
segments(y0=-10:10, x0=-.2, x1=0.2)
segments(y0=-10, y1=10, x0=0)
BoxedText(x=0, y=c(0,3,6,9), labels=c(0,3,6,9), xpad=.3, ypad=.3, border="grey35")

# USJudgeRatings
PlotPolar(USJudgeRatings[1,], type="l", col=DescTools::hblue, lwd=2, cex=0.8,
          panel.first=PolarGrid(ntheta=ncol(USJudgeRatings), col="grey", lty="solid",
                                las=1, alabels=colnames(USJudgeRatings), lblradians=TRUE))
PlotPolar(USJudgeRatings[2,], type="l", col=DescTools::hred, lwd=2, add=TRUE)
PlotPolar(USJudgeRatings[5,], type="l", col=DescTools::horange, lwd=2, add=TRUE)
legend(x="topright", inset=-0.18,
       col=c(DescTools::hblue, DescTools::hred, DescTools::horange), lwd=2,
       legend=rownames(USJudgeRatings)[c(1, 2, 5)])
Produce a plot from a probability distribution with shaded areas. This is often needed in theory texts for classes in statistics.
PlotProbDist(breaks, FUN, blab = NULL, main = "", xlim = NULL, col = NULL, density = 7, alab = LETTERS[1:(length(breaks) - 1)], alab_x = NULL, alab_y = NULL, ylab = "density", ...)
breaks |
a numeric vector containing the breaks of different areas. The start and end must not be infinity. |
FUN |
the (typically) distribution function |
blab |
text for labelling the breaks |
main |
main title for the plot |
xlim |
the x-limits for the plot |
col |
the color for the shaded areas |
density |
the density for the shaded areas |
alab |
the labels for areas |
alab_x |
the x-coord for the area labels |
alab_y |
the y-coord for the area labels, if left to default they will be placed in the middle of the plot |
ylab |
the label for they y-axis |
... |
further parameters passed to internally used function |
The function sets up a two-step plot procedure based on curve() and Shade() with additional labelling for convenience.
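As a hedged sketch of that two-step idea (draw the curve, then shade sub-areas), the following uses polygon() as a stand-in for the package's Shade() helper; it is an assumed simplification, not the PlotProbDist code.

f <- function(x) dnorm(x)
curve(f, from=-4, to=4, ylab="density")
xx <- seq(qnorm(0.95), 4, length.out=200)          # shade the upper 5% tail
polygon(c(xx[1], xx, xx[length(xx)]), c(0, f(xx), 0), density=10, col="red", border=NA)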
nothing returned
Andri Signorell <[email protected]>
# plot t-distribution
PlotProbDist(breaks=c(-6, -2.3, 1.5, 6), function(x) dt(x, df=8),
             blab=c("A","B"), xlim=c(-4,4), alab=NA,
             main="t-Distribution (df=8)",
             col=c(DescTools::hred, DescTools::hblue, DescTools::horange),
             density=c(20, 7))

# Normal
PlotProbDist(breaks=c(-10, -1, 12), function(x) dnorm(x, mean=2, sd=2),
             blab="A", xlim=c(-7,10),
             main="Normal-Distribution N(2,2)",
             col=c(DescTools::hred, DescTools::hblue), density=c(20, 7))

# same for Chi-square
PlotProbDist(breaks=c(0, 15, 35), function(x) dchisq(x, df=8),
             blab="B", xlim=c(0, 30),
             main=expression(paste(chi^2-Distribution, " (df=8)")),
             col=c(DescTools::hblue, DescTools::hred), density=c(0, 20))
Pyramid plots are a common way to display the distribution of age groups.
PlotPyramid(lx, rx = NA, ylab = "", ylab.x = 0, col = c("red", "blue"), border = par("fg"), main = "", lxlab = "", rxlab = "", xlim = NULL, gapwidth = NULL, xaxt = TRUE, args.grid = NULL, cex.axis = par("cex.axis"), cex.lab = par("cex.axis"), cex.names = par("cex.axis"), adj = 0.5, rev = FALSE, ...)
lx |
either a vector or matrix of values describing the bars which make up the plot. If lx is a vector, it will be used to construct the left barplot. If lx is a matrix, the first column will be plotted to the left side and the second to the right side. Other columns are ignored. |
rx |
a vector with the values used to build the right barplot. lx and rx should be of equal length. |
ylab |
a vector of names to be plotted either in the middle or at the left side of the plot. If this argument is omitted, then the names are taken from the names attribute of lx if this is a vector. |
ylab.x |
the x-position of the y-labels. |
col |
the color(s) of the bars. If there are more than one the colors will be recycled. |
border |
the border color of the bars. Set this to |
main |
overall title for the plot. |
lxlab |
a label for the left x axis. |
rxlab |
a label for the right x axis. |
xlim |
limits for the x axis. The first value will determine the limit on the left, the second the one on the right. |
gapwidth |
the width of a gap in the middle of the plot. If set to 0, no gap will be plotted. Default is NULL, which makes the gap just as wide as necessary to plot the longest ylab. |
xaxt |
a character which specifies the x axis type. Specifying "n" suppresses plotting of the axis. |
args.grid |
list of additional arguments for the grid. Set this argument to |
cex.axis |
expansion factor for numeric axis labels. |
cex.lab |
expansion factor for numeric variable labels. |
cex.names |
expansion factor for y labels (names). |
adj |
one or two values in [0, 1] which specify the x (and optionally y) adjustment of the labels. |
rev |
logical, if set to |
... |
the dots are passed to the |
Pyramid plots are a common way to display the distribution of age groups in a human population. The percentages of people within a given age category are arranged in a barplot, typically back to back. Such displays can be used to distinguish males vs. females, differences between two different countries or the distribution of age at different timepoints. The plot type can also be used to display other types of opposed bar charts with suitable modification of the arguments.
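A hedged sketch of the back-to-back principle (negate one side's counts and draw both sides as horizontal bars) is given below for illustration; it is an assumed simplification, not the PlotPyramid implementation.

m.pop <- c(3.2, 3.5, 3.6, 3.6, 3.5)
f.pop <- c(3.2, 3.4, 3.5, 3.5, 3.5)
barplot(-m.pop, horiz=TRUE, xlim=c(-5, 5), space=0, axes=FALSE, col="cornflowerblue")  # left side
barplot(f.pop, horiz=TRUE, add=TRUE, space=0, axes=FALSE, col="indianred")             # right side
axis(1, at=seq(-5, 5, 2.5), labels=abs(seq(-5, 5, 2.5)))                               # unsigned axis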
A numeric vector giving the coordinates of all the bar midpoints drawn, useful for adding to the graph.
Andri Signorell <[email protected]>
d.sda <- data.frame(
  kt_x  = c("ZH","BL","ZG","SG","LU","AR","SO","GL","SZ",
            "NW","TG","UR","AI","OW","GR","BE","SH","AG",
            "BS","FR","GE","JU","NE","TI","VD","VS"),
  apo_n = c(18,16,13,11,9,12,11,8,9,8,11,9,7,9,24,19,
            19,20,43,27,41,31,37,62,38,39),
  sda_n = c(235,209,200,169,166,164,162,146,128,127,
            125,121,121,110,48,34,33,0,0,0,0,0,0,0,0,0)
)

PlotPyramid(lx=d.sda[, c("apo_n","sda_n")], ylab=d.sda$kt_x,
            col=c("lightslategray", "orange2"), border=NA, ylab.x=0,
            xlim=c(-110,250), gapwidth=NULL, cex.lab=0.8, cex.axis=0.8, xaxt=TRUE,
            lxlab="Drugstores", rxlab="General practitioners",
            main="Density of general practitioners and drugstores in CH (2010)",
            space=0.5, args.grid=list(lty=1))

par(mfrow=c(1,3))

m.pop <- c(3.2,3.5,3.6,3.6,3.5,3.5,3.9,3.7,3.9,3.5,
           3.2,2.8,2.2,1.8,1.5,1.3,0.7,0.4)
f.pop <- c(3.2,3.4,3.5,3.5,3.5,3.7,4,3.8,3.9,3.6,3.2,
           2.5,2,1.7,1.5,1.3,1,0.8)
age <- c("0-4","5-9","10-14","15-19","20-24","25-29",
         "30-34","35-39","40-44","45-49","50-54",
         "55-59","60-64","65-69","70-74","75-79","80-84","85+")

PlotPyramid(m.pop, f.pop, ylab=age, space=0,
            col=c("cornflowerblue", "indianred"),
            main="Age distribution at baseline of HELP study",
            lxlab="male", rxlab="female")

PlotPyramid(m.pop, f.pop, ylab=age, space=0,
            col=c("cornflowerblue", "indianred"), xlim=c(-5,5),
            main="Age distribution at baseline of HELP study",
            lxlab="male", rxlab="female", gapwidth=0, ylab.x=-5)

PlotPyramid(c(1,3,5,2,0.5), c(2,4,6,1,0), ylab=LETTERS[1:5], space=0.3,
            col=rep(rainbow(5), each=2), xlim=c(-10,10), args.grid=NA,
            cex.names=1.5, adj=1, lxlab="Group A", rxlab="Group B",
            gapwidth=0, ylab.x=-8, xaxt="n")
Create a QQ-plot for a variable of any distribution. The assumed underlying distribution can be defined as a function of f(p), including all required parameters. Confidence bands are provided by default.
PlotQQ(x, qdist=qnorm, main = NULL, xlab = NULL, ylab = NULL, datax = FALSE, add = FALSE, args.qqline = NULL, conf.level = 0.95, args.cband = NULL, ...)
x |
the data sample |
qdist |
the quantile function of the assumed distribution. Can either be given as a simple function name or defined as a custom function using the required arguments. Default is qnorm(). |
main |
the main title for the plot. This will be "Q-Q-Plot" by default |
xlab |
the xlab for the plot |
ylab |
the ylab for the plot |
datax |
logical. Should data values be on the x-axis? Default is FALSE. |
add |
logical specifying if the points should be added to an already existing plot; defaults to FALSE. |
args.qqline |
arguments for the qqline. This will be estimated
as a line through the 25% and 75% quantiles by default, which is the same procedure as |
conf.level |
confidence level for the confidence interval. Set this to |
args.cband |
list of arguments for the confidence band, such as color or border (see |
... |
the dots are passed to the plot function. |
The function generates a sequence of points between 0 and 1 and transforms those into quantiles by means of the defined assumed distribution.
The code is inspired by tip 10.22 "Creating other Quantile-Quantile plots" from the R Cookbook and based on R-Core code from the function qqline. The calculation of the confidence bands is rewritten based on an algorithm published in the package BoutrosLab.plotting.general.
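A minimal sketch of the mechanism described above, assuming the usual ppoints() probabilities (this is not the package code):

x <- rexp(100, 1/10)
p <- ppoints(length(x))                       # probabilities between 0 and 1
plot(qexp(p, rate=1/10), sort(x), xlab="theoretical quantiles", ylab="sample quantiles")
abline(a=0, b=1, lty="dotted")                # reference line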
Andri Signorell <[email protected]>, Ying Wu <[email protected]>
Teetor, P. (2011) R Cookbook. O'Reilly, pp. 254-255.
y <- rexp(100, 1/10)
PlotQQ(y, function(p) qexp(p, rate=1/10))

w <- rweibull(100, shape=2)
PlotQQ(w, qdist=function(p) qweibull(p, shape=4))

z <- rchisq(100, df=5)
PlotQQ(z, function(p) qchisq(p, df=5),
       args.qqline=list(col=2, probs=c(0.1, 0.6)),
       main=expression("Q-Q plot for" ~~ {chi^2}[nu == 3]))
abline(0, 1)

# add 5 random sets
for(i in 1:5){
  z <- rchisq(100, df=5)
  PlotQQ(z, function(p) qchisq(p, df=5), add=TRUE, args.qqline=NA,
         col="grey", lty="dotted")
}
PlotTernary plots the values of three variables in a triangle. Useful for mixtures (chemistry etc.).
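For background, the following hedged sketch shows one common barycentric mapping of three proportions to triangle coordinates (the corner convention is an assumption for illustration and need not match PlotTernary's internals).

tern_xy <- function(x, y, z) {
  s <- x + y + z                              # normalise each observation to sum 1
  x <- x/s; y <- y/s; z <- z/s
  cbind(px = y + x/2, py = x * sqrt(3)/2)     # x on top, y bottom right, z bottom left
}
xy <- tern_xy(runif(25), runif(25), runif(25))
plot(xy, asp=1, xlim=c(0, 1), ylim=c(0, sqrt(3)/2), pch=19, xlab="", ylab="")
polygon(c(0, 1, 0.5), c(0, 0, sqrt(3)/2))     # triangle outline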
PlotTernary(x, y = NULL, z = NULL, args.grid = NULL, lbl = NULL, main = "",...)
x |
vector of first variable. Will be placed on top of the triangle. |
y |
vector of second variable (the right corner). |
z |
vector of third variable (on the left corner). |
args.grid |
list of additional arguments for the grid. Set this argument to |
main |
overall title for the plot. |
lbl |
the labels for the corner points. Default to the names of x, y, z. |
... |
the dots are sent to |
Andri Signorell <[email protected]>, based on example code by W. N. Venables and B. D. Ripley
J. Aitchison (1986) The Statistical Analysis of Compositional Data. Chapman and Hall, p.360.
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
example in Skye
# some random data in three variables
c1 <- runif(25)
c2 <- runif(25)
c3 <- runif(25)

# basic plot
par(mfrow=c(1, 2))
PlotTernary(c1, c2, c3, args.grid=NA)

## Not run:
# plot with different symbols and a grid using a dataset from MASS
data(Skye, package="MASS")
PlotTernary(Skye[c(1,3,2)], pch=15, col=DescTools::hred, main="Skye",
            lbl=c("A Sodium", "F Iron", "M Magnesium"))
## End(Not run)
Creates a treemap where rectangular regions of different size, color, and groupings visualize the elements.
PlotTreemap(x, grp = NULL, labels = NULL, cex = 1, text.col = "black", col = rainbow(length(x)), labels.grp = NULL, cex.grp = 3, text.col.grp = "black", border.grp = "grey50", lwd.grp = 5, main = "")
x |
a vector storing the values to be used to calculate the areas of rectangles. |
grp |
a vector specifying the group (i.e. country, sector, etc.) to which each element belongs. |
labels |
a vector specifying the labels. |
cex |
the character extension for the area labels. Default is 1. |
text.col |
the text color of the area labels. Default is "black". |
col |
a vector storing the values to be used to calculate the color of rectangles. |
labels.grp |
a character vector specifying the labels for the groups. |
cex.grp |
the character extension for the group labels. Default is 3. |
text.col.grp |
the text color of the group labels. Default is "black". |
border.grp |
the border color for the group rectangles. Default is "grey50". Set this to |
lwd.grp |
the linewidth of the group borders. Default is 5. |
main |
a title for the plot. |
A treemap is a two-dimensional visualization for quickly analyzing large, hierarchical data sets. Treemaps are unique among visualizations because they provide users with the ability to see both a high level overview of data as well as fine-grained details. Users can find outliers, notice trends, and perform comparisons using treemaps. Each data element contained in a treemap is represented with a rectangle, or a cell. Treemap cell arrangement, size, and color are each mapped to an attribute of that element. Treemap cells can be grouped by common attributes. Within a group, larger cells are placed towards the bottom left, and smaller cells are placed at the top right.
returns a list with groupwise organized midpoints in x and y for the rectangles within a group and for the groups themselves.
Andri Signorell <[email protected]>, strongly based on code from Jeff Enos [email protected]
set.seed(1789)
N <- 20
area <- rlnorm(N)

PlotTreemap(x=sort(area, decreasing=TRUE), labels=letters[1:20], col=Pal("RedToBlack", 20))

grp <- sample(x=1:3, size=20, replace=TRUE, prob=c(0.2,0.3,0.5))
z <- Sort(data.frame(area=area, grp=grp), c("grp","area"), decreasing=c(FALSE,TRUE))
z$col <- SetAlpha(c("steelblue","green","yellow")[z$grp],
                  unlist(lapply(split(z$area, z$grp),
                                function(...) LinScale(..., newlow=0.1, newhigh=0.6))))

PlotTreemap(x=z$area, grp=z$grp, labels=letters[1:20], col=z$col)

b <- PlotTreemap(x=z$area, grp=z$grp, labels=letters[1:20], labels.grp=NA,
                 col=z$col, main="Treemap")

# the function returns the midpoints of the areas
# extract the group midpoints from b
mid <- do.call(rbind, lapply(lapply(b, "[", 1), data.frame))

# and draw some visible text
BoxedText(x=mid$grp.x, y=mid$grp.y, labels=LETTERS[1:3], cex=3, border=NA,
          col=SetAlpha("white", 0.7))
This function produces Venn diagrams for up to 5 datasets.
PlotVenn(x, col = "transparent", plotit = TRUE, labels = NULL)
x |
the list with the sets to be analysed. Those can be factors or something coercible to a factor. |
col |
the colors for the sets on the plot. |
plotit |
logical. Should a plot be produced or just the results be calculated. |
labels |
special labels for the plot. By default the names of the list x will be used. If those are missing, the LETTERS A..E will be chosen. Set this argument to NA, if no labels at all should be plotted. |
The function calculates the necessary frequencies and plots the Venn diagram.
a list with 2 elements: the first contains a table with the observed frequencies in the given sets. The second returns a data.frame with the xy coordinates for the labels in the Venn diagram, the specific combination of factors and the frequency in that intersection area. The latter can be 0 as well.
Andri Signorell <[email protected]>
Venn, J. (1880): On the Diagrammatic and Mechanical Representation of Propositions and Reasonings. Dublin Philosophical Magazine and Journal of Science 9 (59): 1-18.
Edwards, A.W.F. (2004): Cogwheels of the mind: the story of Venn diagrams. JHU Press ISBN 978-0-8018-7434-5.
element <- function() paste(sample(LETTERS, 5, replace=TRUE), collapse="")
group <- replicate(1000, element())

GroupA <- sample(group, 400, replace=FALSE)
GroupB <- sample(group, 750, replace=FALSE)
GroupC <- sample(group, 250, replace=FALSE)
GroupD <- sample(group, 300, replace=FALSE)

x <- list(GroupA, GroupB, GroupC, GroupD)
x

PlotVenn(x=list(GroupA, GroupB))
PlotVenn(x=list(Set_1=GroupA, Set_2=GroupB))
PlotVenn(x=list(GroupA, GroupB), labels=c("English","Spanish"))

PlotVenn(x=x[1:3])
PlotVenn(x=x[1:4], col=SetAlpha(c("blue","red","yellow","green","lightblue"), 0.2))

r.venn <- PlotVenn(x=x[1:5], col=SetAlpha(c("blue","red","yellow","green","lightblue"), 0.2))
r.venn
This function serves the same utility as side-by-side boxplots, but it provides more detail about the different distributions. It plots violins instead of boxplots. That is, instead of a box, it uses the density function to plot the density. For skewed distributions, the results look like "violins". Hence the name.
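A hedged sketch of the basic construction (a kernel density drawn twice, mirrored around a vertical axis) is shown below; this is an assumed simplification, not the PlotViolin implementation.

x <- c(rnorm(100), rnorm(50, 5))
d <- density(x, bw="SJ")
w <- d$y / max(d$y) * 0.4                     # half-width of the violin
plot(NA, xlim=c(0.5, 1.5), ylim=range(d$x), xlab="", ylab="x", xaxt="n")
polygon(c(1 - w, rev(1 + w)), c(d$x, rev(d$x)), col="brown")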
PlotViolin(x, ...) ## Default S3 method: PlotViolin(x, ..., horizontal = FALSE, bw = "SJ", na.rm = FALSE, names = NULL, args.boxplot = NULL) ## S3 method for class 'formula' PlotViolin(formula, data, subset, na.action, ...)
x |
Either a sequence of variable names, or a data frame, or a model formula |
horizontal |
logical indicating if the densityplots should be horizontal; default |
bw |
the smoothing bandwidth (method) being used by |
na.rm |
logical, should NAs be omitted? The density function cannot handle missings. Defaults to FALSE. |
names |
a vector of names for the groups. |
formula |
a formula, such as y ~ grp, where y is a numeric vector of data values to be split into groups according to the grouping variable grp (usually a factor). |
data |
a data.frame (or list) from which the variables in formula should be taken. |
subset |
an optional vector specifying a subset of observations to be used for plotting. |
na.action |
a function which indicates what should happen when the data contain NAs. Defaults to |
... |
The dots are passed to |
args.boxplot |
list of arguments for a boxplot to be superposed to the densityplot. By default (NULL) a
black boxplot will be drawn. Set this to |
If a boxplot was drawn then the function returns a list with the following components:
stats |
a matrix, each column contains the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker for one group/plot. If all the inputs have the same class attribute, so will this component. |
n |
a vector with the number of observations in each group. |
conf |
a matrix where each column contains the lower and upper extremes of the notch. |
out |
the values of any data points which lie beyond the extremes of the whiskers. |
group |
a vector of the same length as |
names |
a vector of names for the groups. |
This function is based on violinplot (package UsingR). Some adaptations were made in the interface, such as accepting the same arguments as boxplot does. Moreover, the function was extended by the option to have a boxplot superposed.
John Verzani, Andri Signorell <[email protected]>
The code is based on the boxplot function from R/base.
boxplot, PlotMultiDens, density
# make a "violin"
x <- c(rnorm(100), rnorm(50, 5))
PlotViolin(x, col="brown")

par(mfrow=c(1,2))
f <- factor(rep(1:5, 30))

# make a quintet. Note also choice of bandwidth
PlotViolin(x ~ f, col=SetAlpha("steelblue", 0.3), bw="SJ", main="Vertical")

# and the same, but in horizontal arrangement
PlotViolin(x ~ f, col=SetAlpha("steelblue", 0.3), bw="SJ", horizontal=TRUE,
           las=1, main="Horizontal")

# example taken from boxplot
boxplot(count ~ spray, data=InsectSprays, col="lightgray", main="Boxplot")
PlotViolin(count ~ spray, data=InsectSprays, col="lightgray", main="Violinplot")

# groupwise densityplots defined the same way as in boxplot
boxplot(len ~ supp*dose, data=ToothGrowth,
        main="Guinea Pigs' Tooth Growth",
        xlab="Vitamin C dose mg", ylab="tooth length",
        col=c("yellow", "orange"), lty=c(1,2))

b <- PlotViolin(len ~ supp*dose, data=ToothGrowth,
                main="Guinea Pigs' Tooth Growth",
                xlab="Vitamin C dose mg", ylab="tooth length",
                col=c("yellow", "orange"), lty=c(1,2))

# use points, if the medians deserve special attention
points(x=1:6, y=b$stats[3,], pch=21, bg="white", col="black", cex=1.2)
This plot can be used to graphically display a correlation matrix by using the linewidth between the nodes in proportion to the correlation of two variables. It will place the elements homogeneously around a circle and draw connecting lines between the points.
PlotWeb(m, col = c(DescTools::hred, DescTools::hblue), lty = NULL, lwd = NULL, args.legend=NULL, pch = 21, pt.cex = 2, pt.col = "black", pt.bg = "darkgrey", cex.lab = 1, las = 1, adj = NULL, dist = 0.5, ...)
m |
a symmetric matrix of numeric values |
col |
the color for the connecting lines |
lty |
the line type for the connecting lines, the default will be |
lwd |
the line widths for the connecting lines. If left to |
args.legend |
list of additional arguments to be passed to the |
pch |
the plotting symbols appearing in the plot, as a non-negative numeric vector (see |
pt.cex |
expansion factor(s) for the points. |
pt.col |
the foreground color for the points, corresponding to its argument |
pt.bg |
the background color for the points, corresponding to its argument |
las |
alignment of the labels, 1 means horizontal, 2 radial and 3 vertical. |
adj |
adjustments for the labels. (Left: 0, Right: 1, Mid: 0.5) |
dist |
gives the distance of the labels from the outer circle. Default is 0.5. |
cex.lab |
the character extension for the labels. |
... |
dots are passed to plot. |
The function uses the lower triangular matrix of m, so this is the order in which colors, line widths etc. must be given when the defaults are to be overridden.
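For orientation, the column-wise order of a lower triangle looks like this (a small illustration with base R, not package code):

m <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])
m[lower.tri(m)]   # column-wise order: (2,1), (3,1), (4,1), (3,2), (4,2), (4,3)
# so e.g. lwd = abs(m[lower.tri(m)]) * 10 scales each connecting line by its correlation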
A list of x and y coordinates, giving the coordinates of all the points drawn, useful for adding other elements to the plot.
Andri Signorell <[email protected]>
m <- cor(d.pizza[, which(sapply(d.pizza, IsNumeric, na.rm=TRUE))[-c(1:2)]],
         use="pairwise.complete.obs")

PlotWeb(m=m, col=c(DescTools::hred, DescTools::hblue), main="Pizza Correlation")

# let's describe only the significant correlations and start with a dataset
d.m <- d.pizza[, which(sapply(d.pizza, IsNumeric, na.rm=TRUE))[-c(1:2)]]

# get the correlation matrix
m <- cor(d.m, use="pairwise.complete.obs")

# let's get rid of all non-significant correlations
ctest <- PairApply(d.m, function(x, y) cor.test(x, y)$p.value, symmetric=TRUE)

# ok, got all the p-values, now replace > 0.05 with NAs
m[ctest > 0.05] <- NA

# what does that look like now?
Format(m, na.form=". ", ldigits=0, digits=3, align="right")

PlotWeb(m, las=2, cex=1.2)

# define line widths
PlotWeb(m, lwd=abs(m[lower.tri(m)] * 10))
PMT computes the periodic payment of an annuity. IPMT calculates what portion of a period payment is going towards interest in a particular period and PPMT what portion of a period payment is going towards principal in a particular period. RBAL yields the remaining balance in a particular period.
PMT(rate, nper, pv, fv = 0, type = 0) IPMT(rate, per, nper, pv, fv = 0, type = 0) PPMT(rate, per, nper, pv, fv = 0, type = 0) RBAL(rate, per, nper, pv, fv = 0, type = 0)
rate |
specifies the interest rate. |
per |
specifies the period of the payment to be applied to interest or to principal. |
nper |
specifies the number of payment periods. |
pv |
specifies the present value or the lump-sum amount that a series of future payments is worth currently. |
fv |
specifies the future value or a cash balance that you want to attain after the last payment is made. Default is 0. |
type |
specifies the number 0 or 1 and indicates when payments are due. Default is 0. |
a numeric value
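The payment satisfies the standard annuity equation. The following is a hedged sketch of that relation (an assumption about the underlying arithmetic, not the package source), reproducing the payment in the example below.

pmt_sketch <- function(rate, nper, pv, fv=0, type=0) {
  # solve pv*(1+rate)^nper + pmt*(1+rate*type)*((1+rate)^nper - 1)/rate + fv = 0 for pmt
  -(pv * (1 + rate)^nper + fv) * rate / ((1 + rate*type) * ((1 + rate)^nper - 1))
}
pmt_sketch(rate=0.08, nper=5, pv=20000, fv=-5000)   # approx. -4156.847, as in the example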
Andri Signorell <[email protected]>
# original principal:    20'000
# loan term (years):     5
# annual interest rate:  8%
# annual payment:        -4'156.847

# simple amortization schedule
cbind(
  year      = 1:5,
  payment   = PMT(rate=0.08, nper=5, pv=20000, fv=-5000, type=0),
  interest  = IPMT(rate=0.08, per=1:5, nper=5, pv=20000, fv=-5000, type=0),
  principal = PPMT(rate=0.08, per=1:5, nper=5, pv=20000, fv=-5000, type=0),
  balance   = RBAL(rate=0.08, per=1:5, nper=5, pv=20000, fv=-5000, type=0)
)

#      year   payment   interest principal   balance
# [1,]    1 -4156.847 -1600.0000 -2556.847 17443.153
# [2,]    2 -4156.847 -1395.4523 -2761.395 14681.759
# [3,]    3 -4156.847 -1174.5407 -2982.306 11699.452
# [4,]    4 -4156.847  -935.9562 -3220.891  8478.562
# [5,]    5 -4156.847  -678.2849 -3478.562  5000.000
Computes the confidence intervals of a poisson distributed variable's lambda. Several methods are implemented, see details.
PoissonCI(x, n = 1, conf.level = 0.95, sides = c("two.sided","left","right"), method = c("exact", "score", "wald", "byar"))
x |
number of events. |
n |
time base for event count. |
conf.level |
confidence level, defaults to 0.95. |
sides |
a character string specifying the side of the confidence interval, must be one of "two.sided" (default), "left" or "right". |
method |
character string specifying which method to use; can be one out of "exact" (default), "score", "wald" or "byar".
|
The Wald interval uses the asymptotic normality of the test statistic.
Byar's method is quite a good approximation. Rothman and Boice (1979) mention that these limits were first proposed by Byar (unpublished).
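As a rough illustration of the Wald interval mentioned above (a sketch only, assuming the usual normal approximation with standard error sqrt(x)/n; the package may handle edge cases differently):

x <- 137; n <- 24.19893; conf.level <- 0.95
est <- x / n
z   <- qnorm(1 - (1 - conf.level) / 2)
c(est = est, lwr.ci = est - z * sqrt(x) / n, upr.ci = est + z * sqrt(x) / n)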
A vector with 3 elements for the estimate and the lower and upper bound of the confidence interval.
Andri Signorell <[email protected]>
Agresti, A. and Coull, B.A. (1998) Approximate is better than "exact" for interval estimation of binomial proportions. American Statistician, 52, pp. 119-126.
Rothman KJ, Boice JD, Jr. (1979) Epidemiologic Analysis with a Programmable Calculator (NIH Publication 79-1649). Washington DC: US Government Printing Office.
Garwood, F. (1936) Fiducial Limits for the Poisson distribution. Biometrika 28:437-442.
https://www.ine.pt/revstat/pdf/rs120203.pdf
poisson.test
, BinomCI
, MultinomCI
# the horse kick example
count  <- 0:4
deaths <- c(144, 91, 32, 11, 2)

n <- sum(deaths)
x <- sum(count * deaths)
lambda <- x/n

PoissonCI(x=x, n=n, method = c("exact","score", "wald", "byar"))

exp <- dpois(0:4, lambda) * n

barplot(rbind(deaths, exp * n/sum(exp)), names=0:4, beside=TRUE,
        col=c(DescTools::hred, DescTools::hblue),
        main = "Deaths from Horse Kicks", xlab = "count")
legend("topright", legend=c("observed","expected"),
       fill=c(DescTools::hred, DescTools::hblue), bg="white")

## SMR, Welsh Nickel workers
PoissonCI(x=137, n=24.19893)
PolarGrid
adds a polar grid to an existing plot. The number of radial gridlines is
set by ntheta
and the tangential lines by nr
. Labels for the angular grid and the radial axis can be provided.
PolarGrid(nr = NULL, ntheta = NULL, col = "lightgray", lty = "dotted", lwd = par("lwd"), rlabels = NULL, alabels = NULL, lblradians = FALSE, cex.lab = 1, las = 1, adj = NULL, dist = NULL)
nr |
number of circles. When NULL, as per default, the grid aligns with the tick marks on the corresponding default axis (i.e., tickmarks as computed by axTicks). When NA, no circular grid lines are drawn. |
ntheta |
number of radial grid lines. Defaults to 12 uniformly distributed between 0 and 2*pi (i.e., every pi/6). |
col |
character or (integer) numeric; color of the grid lines. |
lty |
character or (integer) numeric; line type of the grid lines. |
lwd |
non-negative numeric giving line width of the grid lines. |
rlabels |
the radius labels. Use |
alabels |
the labels for the angles, they are printed on a circle outside the plot. Use |
lblradians |
logical, defines if angle labels will be in degrees (default) or in radians. |
cex.lab |
the character extension for the labels. |
las |
alignment of the labels, 1 means horizontal, 2 radial and 3 vertical. |
adj |
adjustments for the labels. (Left: 0, Right: 1, Mid: 0.5) The default is 1 for the levels on the right side of the circle, 0 for labels on the left and 0.5 for labels exactly on north on south. |
dist |
gives the radius for the labels, in user coordinates. Default is par("usr")[2] * 1.07. |
Andri Signorell <[email protected]>
Canvas(xlim=c(-5,5), xpd=TRUE)
PolarGrid()

Canvas(xlim=c(-5,5), xpd=TRUE)
PolarGrid(nr=0:5, ntheta=6)

Canvas(xlim=c(-5,5), xpd=TRUE)
PolarGrid(ntheta=36, rlabels=NA, lblradians=TRUE)
A convenience wrapper for computing post-hoc test after having calculated an ANOVA.
PostHocTest(x, ...)

## S3 method for class 'aov'
PostHocTest(x, which = NULL,
            method = c("hsd", "bonferroni", "lsd", "scheffe", "newmankeuls", "duncan"),
            conf.level = 0.95, ordered = FALSE, ...)

## S3 method for class 'table'
PostHocTest(x, method = c("none", "fdr", "BH", "BY", "bonferroni",
                          "holm", "hochberg", "hommel"),
            conf.level = 0.95, ...)

## S3 method for class 'PostHocTest'
print(x, digits = getOption("digits", 3), ...)

## S3 method for class 'PostHocTest'
plot(x, ...)
x |
an aov object. |
method |
one of "hsd", "bonferroni", "lsd", "scheffe", "newmankeuls" or "duncan" for aov objects, resp. one of the p.adjust methods "none", "fdr", "BH", "BY", "bonferroni", "holm", "hochberg" or "hommel" for tables. |
which |
a character vector listing terms in the fitted model for which the intervals should be calculated. Defaults to all the terms. |
conf.level |
a numeric value between zero and one giving the family-wise confidence level to use. If this is set to NA, just a matrix with the p-values will be returned. |
ordered |
a logical value indicating if the levels of the factor should be ordered according to increasing average in the sample before taking differences. If ordered is |
digits |
controls the number of fixed digits to print. |
... |
further arguments, not used so far. |
The function is designed to consolidate a couple of post-hoc tests with the same interface for input and output.
Choosing Tests.
Different post hoc tests use different methods to control the familywise (FW) and per experiment error rate (PE). Some tests are very conservative. Conservative tests go to great lengths to prevent the user from committing a type 1 error. They use a more stringent criterion for determining significance. Many of these tests become more and more stringent as the number of groups increases (directly limiting the FW and PE error rate). Although these tests buy you protection against type 1 error, it comes at a cost. As the tests become more stringent, you lose power (1-B). More liberal tests buy you power, but the cost is an increased chance of type 1 error. There is no set rule for determining which test to use, but different researchers have offered some guidelines for choosing. Mostly it is an issue of pragmatics and whether the number of comparisons exceeds k-1.
The Fisher's LSD
(Least Significant Difference) sets the alpha level per comparison: alpha = .05 for every comparison. df = df error (i.e. df within).
This test is the most liberal of all post hoc tests. The critical t for significance is unaffected by the number of groups.
This test is appropriate when you have 3 means to compare. In general the alpha is held at .05 because of the criterion that you can't look at LSD's unless the ANOVA is significant.
This test is generally not considered appropriate if you have more than 3 means unless there is reason to believe that there is no more than one true null hypothesis hidden in the means.
Dunn's (Bonferroni) t-test
is sometimes referred to as the Bonferroni t because it uses the Bonferroni PE correction procedure in determining the critical value for significance. In general, this test should be used when the number of comparisons you are making exceeds the number of degrees of freedom you have between groups (e.g. k-1). This test sets alpha per experiment; alpha = (.05)/c for every comparison. df = df error (c = number of comparisons = k(k-1)/2).
This test is extremely conservative and rapidly reduces power as the number of comparisons being made increase.
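The arithmetic of the Bonferroni correction described above can be illustrated with a small sketch (hypothetical numbers, not tied to any particular dataset):

k <- 5                       # number of groups
c_comp <- k * (k - 1) / 2    # number of pairwise comparisons
alpha_per_comparison <- 0.05 / c_comp
c(comparisons = c_comp, alpha = alpha_per_comparison)   # 10 comparisons, alpha = 0.005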
Newman-Keuls
is a step down procedure that is not as conservative as Dunn's t test. First, the means of the groups are ordered (ascending or descending) and then the largest and smallest means are tested for significant differences. If those means are different, then test smallest with next largest, until you reach a test that is not significant. Once you reach that point then you can only test differences between means that exceed the difference between the means that were found to be non-significant.
Newman-Keuls is perhaps one of the most common post hoc tests, but it is a rather controversial test. The major problem with this test is that when there is more than one true null hypothesis in a set of means it will overestimate the FW error rate.
In general we would use this when the number of comparisons we are making is larger than k-1 and we don't want to be as conservative as the Dunn's test is.
Tukey's HSD
(Honestly Significant Difference) is essentially like the Newman-Keuls, but the tests between each mean are compared to the critical value that is set for the test of the means that are furthest apart (rmax e.g. if there are 5 means we use the critical value determined for the test of X1 and X5).
This method corrects for the problem found in the Newman-Keuls where the FW is inflated when there is more than one true null hypothesis in a set of means. It buys protection against type 1 error, but again at the cost of power.
It tends to be the most common and preferred test because it is very conservative with respect to type 1 error when the null hypothesis is true. In general, HSD is preferred when you will make all the possible comparisons between a large set of means (6 or more means).
The Scheffe test
is designed to protect against a type 1 error when all possible complex and simple comparisons are made. That is we are not just looking the possible combinations of comparisons between pairs of means. We are also looking at the possible combinations of comparisons between groups of means. Thus Scheffe is the most conservative of all tests.
Because this test does give us the capacity to look at complex comparisons, it essentially uses the same statistic as the linear contrasts tests. However, Scheffe uses a different critical value (or at least it makes an adjustment to the critical value of F).
This test has less power than the HSD when you are making pairwise (simple) comparisons, but it has more power than HSD when you are making complex comparisons.
In general, only use this when you want to make many post hoc complex comparisons (e.g. more than k-1).
Tables
For tables, pairwise chi-square tests can be performed, either without correction or with correction for multiple testing following the logic in p.adjust
.
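The following sketch illustrates the idea of pairwise chi-square tests with p-value adjustment for a table (a simplified illustration, not necessarily the exact procedure used internally; the example table is made up):

tab <- as.table(rbind(A = c(10, 20), B = c(15, 15), C = c(30, 5)))
pairs <- combn(rownames(tab), 2)
pvals <- apply(pairs, 2, function(i) chisq.test(tab[i, ])$p.value)
names(pvals) <- apply(pairs, 2, paste, collapse = "-")
p.adjust(pvals, method = "fdr")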
an object of type "PostHocTest", which will either be
A) a list of data.frames containing the mean difference, lower ci, upper ci and the p-value,
if a conf.level was defined (something else than NA) or
B) a list of matrices with the p-values, if conf.level has been set to NA.
Andri Signorell <[email protected]>
TukeyHSD
, aov
, pairwise.t.test
,
ScheffeTest
PostHocTest(aov(breaks ~ tension, data = warpbreaks), method = "lsd")
PostHocTest(aov(breaks ~ tension, data = warpbreaks), method = "hsd")
PostHocTest(aov(breaks ~ tension, data = warpbreaks), method = "scheffe")

r.aov <- aov(breaks ~ tension, data = warpbreaks)

# compare p-values:
round(cbind(
    lsd  = PostHocTest(r.aov, method="lsd")$tension[,"pval"],
    bonf = PostHocTest(r.aov, method="bonf")$tension[,"pval"]
  ), 4)

# only p-values by setting conf.level to NA
PostHocTest(aov(breaks ~ tension, data = warpbreaks), method = "hsd",
            conf.level=NA)
Compute power of test or determine parameters to obtain target
power (same as power.anova.test
).
power.chisq.test(n = NULL, w = NULL, df = NULL, sig.level = 0.05, power = NULL)
n |
total number of observations. |
w |
effect size. |
df |
degrees of freedom (depending on the chosen test). |
sig.level |
Significance level (Type I error probability). |
power |
Power of test (1 minus Type II error probability). |
Exactly one of the parameters w
, n
, power
or
sig.level
must be passed as NULL, and this parameter is
determined from the others. Note that the last one has non-NULL
default, so NULL
must be explicitly passed, if you want to compute
it.
Object of class "power.htest", a list of the arguments (including the computed one) augmented with 'method' and 'note' elements.
uniroot
is used to solve the power equation for unknowns, so you may
see errors from it, notably about inability to bracket the root
when invalid arguments are given.
Stephane Champely <[email protected]>
but this is a mere copy of Peter Dalgaard's work on power.t.test
Cohen, J. (1988) Statistical power analysis for the behavioral sciences (2nd ed.) Hillsdale, NJ: Lawrence Erlbaum.
## Exercise 7.1 P. 249 from Cohen (1988)
power.chisq.test(w=0.289, df=(4-1), n=100, sig.level=0.05)

## Exercise 7.3 p. 251
power.chisq.test(w=0.346, df=(2-1)*(3-1), n=140, sig.level=0.01)

## Exercise 7.8 p. 270
power.chisq.test(w=0.1, df=(5-1)*(6-1), power=0.80, sig.level=0.05)
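The power in the first exercise can also be recomputed directly from the noncentral chi-square distribution (a sketch assuming Cohen's formulation with noncentrality parameter n * w^2, not the function's internal code):

w <- 0.289; df <- 3; n <- 100; sig.level <- 0.05
crit <- qchisq(1 - sig.level, df = df)
pchisq(crit, df = df, ncp = n * w^2, lower.tail = FALSE)   # power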
A couple of functions to get R-stuff into MS-Powerpoint.
GetNewPP()
starts a new instance of PowerPoint and returns its handle. A new presentation with one empty slide will be created thereby. The handle is needed for addressing the presentation afterwards.GetCurrPP()
will look for a running PowerPoint instance and return its handle. NULL
is returned if nothing's found.
PpAddSlide()
inserts a new slide into the active
presentation.PpPlot()
inserts the active plot into PowerPoint. The image is transferred by saving the picture to a file in R and
inserting the file in PowerPoint. The format of the plot can be selected, as well as crop options and the
size factor for inserting.PpText()
inserts a new textbox with given text and box properties.
GetNewPP(visible = TRUE, template = "Normal")
GetCurrPP()

PpAddSlide(pos = NULL, pp = DescToolsOptions("lastPP"))

PpPlot(type = "png", crop = c(0, 0, 0, 0), picscale = 100,
       x = 1, y = 1, height = NA, width = NA, res=200, dfact=1.6,
       pp = DescToolsOptions("lastPP"))

PpText(txt, x = 1, y = 1, height = 50, width = 100, fontname = "Calibri",
       fontsize = 18, bold = FALSE, italic = FALSE, col = "black",
       bg = "white", hasFrame = TRUE, pp = DescToolsOptions("lastPP"))
visible |
logical, should PowerPoint be made visible by |
template |
the name of the template to be used for the new presentation. |
pos |
position of the new inserted slide within the presentation. |
type |
the format for the picture file, default is |
crop |
crop options for the picture, defined by a 4-elements-vector. The first element is the bottom side, the second the left and so on. |
picscale |
scale factor of the picture in percent, default is 100. |
x , y
|
left/upper xy-coordinate for the plot or for the textbox. |
height |
height in cm, this overrides the picscale if both are given. |
width |
width in cm, this overrides the picscale if both are given. |
res |
resolution for the png file, defaults to 200. |
dfact |
the size factor for the graphic. |
txt |
text to be placed in the textbox |
fontname |
used font for textbox |
fontsize |
used fontsize for textbox |
bold |
logic. Text is set bold if this is set to |
italic |
logic. Text is set italic if this is to |
col |
font color, defaults to |
bg |
background color for the textbox, defaults to |
hasFrame |
logical. Defines if a textbox is to be framed. Default is TRUE. |
pp |
the pointer to a PowerPoint instance, can be a new one, created by |
See the PowerPoint object model for further information.
The functions return the pointer to the created object.
Andri Signorell <[email protected]>
## Not run: 
# Windows-specific example

# let's have some graphic
plot(1, type="n", axes=FALSE, xlab="", ylab="", xlim=c(0,1), ylim=c(0,1))
rect(0, 0, 1, 1, col="black")
segments(x0=0.5, y0=seq(0.632, 0.67, length.out=100),
         y1=seq(0.5, 0.6, length.out=100), x1=1, col=rev(rainbow(100)))
polygon(x=c(0.35, 0.65, 0.5), y=c(0.5, 0.5, 0.75),
        border="white", col="black", lwd=2)
segments(x0=0, y0=0.52, x1=0.43, y1=0.64, col="white", lwd=2)
x1 <- seq(0.549, 0.578, length.out=50)
segments(x0=0.43, y0=0.64, x1=x1, y1=-tan(pi/3)*x1 + tan(pi/3)*0.93,
         col=rgb(1,1,1,0.35))

# get a handle to a new PowerPoint instance
pp <- GetNewPP()
# insert plot with a specified height
PpPlot(pp=pp, x=150, y=150, height=10, width=10)

PpText("Remember?\n", fontname="Arial", x=200, y=70, height=30, fontsize=14,
       bold=TRUE, pp=pp, bg="lemonchiffon", hasFrame=TRUE)

PpAddSlide(pp=pp)
# crop the picture
pic <- PpPlot(pp=pp, x=1, y=200, height=10, width=10, crop=c(9,9,0,0))
pic

# some more automatic procedure
pp <- GetNewPP()
PpText("Hello to my presentation", x=100, y=100, fontsize=32, bold=TRUE,
       width=300, hasFrame=FALSE, col="blue", pp=pp)

for(i in 1:4){
  barplot(1:4, col=i)
  PpAddSlide(pp=pp)
  PpPlot(height=15, width=21, x=50, y=50, pp=pp)
  PpText(gettextf("This is my barplot nr %s", i), x=100, y=10, width=300, pp=pp)
}

## End(Not run)
Density, distribution function, quantile function and random
generation for the “Reverse” Gumbel distribution with
parameters location
and scale
.
dRevGumbel (x, location = 0, scale = 1)
pRevGumbel (q, location = 0, scale = 1)
qRevGumbel (p, location = 0, scale = 1)
rRevGumbel (n, location = 0, scale = 1)

qRevGumbelExp(p)
x , q
|
numeric vector of abscissa (or quantile) values at which to evaluate the density or distribution function. |
p |
numeric vector of probabilities at which to evaluate the quantile function. |
location |
location of the distribution |
scale |
scale ( |
n |
number of random variates, i.e., |
a numeric vector, of the same length as x
, q
, or
p
for the first three functions, and of length n
for
rRevGumbel()
.
Werner Stahel; partly inspired by package VGAM. Martin Maechler for numeric cosmetic.
the Weibull
distribution functions in R's stats package.
curve(pRevGumbel(x, scale= 1/2), -3, 2, n=1001, col=1, lwd=2,
      main = "RevGumbel(x, scale = 1/2)")
abline(h=0:1, v = 0, lty=3, col = "gray30")
curve(dRevGumbel(x, scale= 1/2), n=1001, add=TRUE,
      col = (col.d <- adjustcolor(2, 0.5)), lwd=3)
legend("left", c("cdf","pdf"), col=c("black", col.d), lwd=2:3, bty="n")

med <- qRevGumbel(0.5, scale=1/2)
cat("The median is:", format(med), "\n")
Find all prime numbers aka ‘primes’ up to n.
Uses an obvious sieve method and some care, working with logical and integers to be quite fast.
Primes(n)
n |
a (typically positive integer) number. |
As the function only uses max(n)
, n
can also be a
vector of numbers.
numeric vector of all prime numbers up to max(n).
This function was previously published in the package sfsmisc as primes
and has been integrated here without logical changes.
Bill Venables (); Martin Maechler gained another 40% speed,
working with logicals and integers.
(p1 <- Primes(100))
system.time(p1k <- Primes(1000))    # still lightning ..
stopifnot(length(p1k) == 168)
Although there's no commonly accepted agreement on how to assess the fit of a logistic regression, there are some approaches. The goodness of fit of the logistic regression model can be expressed by some variants of pseudo R squared statistics, most of which are based on the deviance of the model.
PseudoR2(x, which = NULL)
x |
the |
which |
character, one out of |
Cox and Snell's R-squared is based on the log likelihood for the model compared to the log likelihood for a baseline model. However, with categorical outcomes, it has a theoretical maximum value of less than 1, even for a "perfect" model.
Nagelkerke's R-squared (also sometimes called Cragg-Uhler) is an adjusted version of the Cox and Snell's R-squared
that adjusts the scale of the statistic to cover the full range from 0 to 1.
McFadden's R-squared is another version, based on the log-likelihood kernels for the intercept-only model and the full estimated model.
Veall and Zimmermann concluded that from a set of six widely used measures the measure suggested by McKelvey and Zavoina had the closest correspondence to ordinary least squares R-squared. The Aldrich-Nelson pseudo-R-squared with the Veall-Zimmermann correction is the best approximation of the McKelvey-Zavoina pseudo-R-squared. The Efron, Aldrich-Nelson, McFadden and Nagelkerke approaches severely underestimate the "true R-squared".
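As an illustration of the McFadden measure described above, it can be computed by hand from the log-likelihoods of the fitted and the intercept-only model (a sketch using a made-up model on mtcars, not the package internals):

r.glm  <- glm(am ~ hp + wt, data = mtcars, family = binomial)
r.null <- update(r.glm, . ~ 1)
1 - as.numeric(logLik(r.glm)) / as.numeric(logLik(r.null))   # McFadden pseudo-R-squared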
the value of the specific statistic. AIC
, LogLik
, LogLikNull
and G2
will only be reported with option "all"
.
McFadden |
McFadden pseudo-R-squared |
McFaddenAdj |
McFadden adjusted pseudo-R-squared |
CoxSnell |
Cox and Snell pseudo-R-squared |
Nagelkerke |
Nagelkerke pseudo-R-squared |
AldrichNelson |
AldrichNelson pseudo-R-squared |
VeallZimmermann |
VeallZimmermann pseudo-R-squared |
McKelveyZavoina |
McKelvey and Zavoina pseudo-R-squared |
Efron |
Efron pseudo-R-squared |
Tjur |
Tjur's pseudo-R-squared |
AIC |
Akaike's information criterion |
LogLik |
log-Likelihood for the fitted model (by maximum likelihood) |
LogLikNull |
log-Likelihood for the null model. The null model will include the offset, and an intercept if there is one in the model. |
G2 |
difference of the null deviance and the model deviance |
Andri Signorell <[email protected]> with contributions of Ben Mainwaring <[email protected]> and Daniel Wollschlaeger
Aldrich, J. H. and Nelson, F. D. (1984): Linear Probability, Logit, and probit Models, Sage University Press, Beverly Hills.
Cox D R & Snell E J (1989) The Analysis of Binary Data 2nd ed. London: Chapman and Hall.
Efron, B. (1978). Regression and ANOVA with zero-one data: Measures of residual variation. Journal of the American Statistical Association, 73(361), 113–121.
Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression (2nd ed.). Hoboken, NJ: Wiley.
McFadden D (1979). Quantitative methods for analysing travel behavior of individuals: Some recent developments. In D. A. Hensher & P. R. Stopher (Eds.), Behavioural travel modelling (pp. 279-318). London: Croom Helm.
McKelvey, R. D., & Zavoina, W. (1975). A statistical model for the analysis of ordinal level dependent variables. The Journal of Mathematical Sociology, 4(1), 103–120
Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78(3), 691–692.
Tjur, T. (2009) Coefficients of determination in logistic regression models - a new proposal: The coefficient of discrimination. The American Statistician, 63(4): 366-372
Veall, M.R., & Zimmermann, K.F. (1992) Evaluating Pseudo-R2's for binary probit models. Quality & Quantity, 28, pp. 151-164
r.glm <- glm(Survived ~ ., data=Untable(Titanic), family=binomial)

PseudoR2(r.glm)
PseudoR2(r.glm, c("McFadden", "Nagel"))
PtInPoly
works out, whether XY-points lie within the boundaries of a given polygon.
Note: Points that lie on the boundaries of the polygon or vertices are assumed to lie within the polygon.
PtInPoly(pnts, poly.pnts)
pnts |
a 2-column matrix or dataframe defining locations of the points of interest |
poly.pnts |
a 2-column matrix or dataframe defining the locations of vertices of the polygon of interest |
The algorithm implements a sum of the angles made between the test point and each pair of points making up the polygon. The point is interior if the sum is 2pi, otherwise, the point is exterior if the sum is 0. This works for simple and complex polygons (with holes) given that the hole is defined with a path made up of edges into and out of the hole.
This sum of angles is not able to consistently assign points that fall on vertices or on the boundary of the polygon. The algorithm defined here assumes that points falling on a boundary or polygon vertex are part of the polygon.
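A toy version of the angle-sum test described above might look like this (an illustration for a single point and a simple polygon; the helper angle_sum() is hypothetical and not part of the package):

angle_sum <- function(pt, poly) {
  v <- sweep(as.matrix(poly), 2, as.numeric(pt))   # vectors from the point to the vertices
  a <- atan2(v[, 2], v[, 1])
  d <- diff(c(a, a[1]))                            # turning angles between successive vertices
  d <- ifelse(d > pi, d - 2*pi, ifelse(d < -pi, d + 2*pi, d))
  sum(d)
}
sq <- cbind(x = c(0, 1, 1, 0), y = c(0, 0, 1, 1))
round(angle_sum(c(0.5, 0.5), sq) / (2*pi))   # 1 -> inside
round(angle_sum(c(2.0, 0.5), sq) / (2*pi))   # 0 -> outside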
A 3-column dataframe where the first 2 columns are the original locations of the points. The third column (named pip) is a vector of binary values where 0 represents points not within the polygon and 1 points within the polygon.
Jeremy VanDerWal <[email protected]>
# define the points and polygon
pnts <- expand.grid(x=seq(1,6,0.1), y=seq(1,6,0.1))
polypnts <- cbind(x=c(2,3,3.5,3.5,3,4,5,4,5,5,4,3,3,3,2,2,1,1,1,1,2),
                  y=c(1,2,2.5,2,2,1,2,3,4,5,4,5,4,3,3,4,5,4,3,2,2))

# plot the polygon and all points to be checked
plot(rbind(polypnts, pnts))
polygon(polypnts, col='#99999990')

# check which points fall within the polygon
out <- PtInPoly(pnts, polypnts)
head(out)

# identify points not in the polygon with an X
points(out[which(out$pip==0), 1:2], pch='X')
Compute weighted quantiles (Eurostat definition).
Quantile(x, weights = NULL, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE, type = 7, digits = 7)
x |
a numeric vector. |
weights |
an optional numeric vector giving the sample weights. |
probs |
numeric vector of probabilities with values in |
na.rm |
a logical indicating whether missing values in |
names |
logical; if true, the result has a |
type |
an integer between 1 and 9 selecting one of the nine quantile algorithms detailed below to be used. Currently only |
digits |
used only when |
The implementation strictly follows the Eurostat definition.
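To give an idea of what a weighted quantile is, here is a simple step-function version based on cumulative weights (an illustration only; the exact Eurostat rule, in particular its interpolation, may differ):

wq <- function(x, w, p) {
  o <- order(x); x <- x[o]; w <- w[o]
  cw <- cumsum(w) / sum(w)         # cumulative weight share
  x[which(cw >= p)[1]]
}
set.seed(1)
x <- rnorm(100); w <- runif(100)
wq(x, w, 0.5)                      # weighted median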
A named numeric vector containing the weighted quantiles of values in
x
at probabilities probs
is returned.
Andreas Alfons, Matthias Templ, some tweaks Andri Signorell <[email protected]>
Working group on Statistics on Income and Living Conditions (2004) Common cross-sectional EU indicators based on EU-SILC; the gender pay gap. EU-SILC 131-rev/04, Eurostat.
Quantile(d.pizza$temperature, rep(c(1:3), length.out=nrow(d.pizza)))
Calculates the confidence interval for any quantile. Although bootstrapping might be a good approach for getting sensible confidence intervals, there is sometimes a need for a nonparametric alternative. This function offers one.
QuantileCI(x, probs=seq(0, 1, .25), conf.level = 0.95, sides = c("two.sided", "left", "right"), na.rm = FALSE, method = c("exact", "boot"), R = 999)
x |
a (non-empty) numeric vector of data values. |
probs |
numeric vector of probabilities with values in [0,1]. (Values up to |
conf.level |
confidence level of the interval |
sides |
a character string specifying the side of the confidence interval, must be one of "two.sided" (default), "left" or "right". |
na.rm |
logical. Should missing values be removed? Defaults to |
method |
defining the type of interval that should be calculated (one out of "exact" (default) or "boot"). |
R |
The number of bootstrap replicates. Usually this will be a single positive integer. See
|
The "exact"
method corresponds to the way the confidence interval for the median is calculated in SAS.
The boot confidence interval type is calculated by means of boot.ci
with default type "basic"
.
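A sketch of the order-statistic construction in the spirit of the "exact" method might look as follows (a simplified illustration for the median only; the package and SAS handle the index choice and edge cases more carefully):

x <- sort(rnorm(25))
n <- length(x); p <- 0.5; conf.level <- 0.95
k <- qbinom(c((1 - conf.level)/2, 1 - (1 - conf.level)/2), size = n, prob = p)
c(est = median(x), lwr.ci = x[k[1]], upr.ci = x[min(k[2] + 1, n)])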
if probs was of length 1 a numeric vector with 3 elements:
est |
est |
lwr.ci |
lower bound of the confidence interval |
upr.ci |
upper bound of the confidence interval |
or, if probs was a vector, a matrix with 3 columns (est, lwr.ci, upr.ci) consisting of the estimate, the lower and the upper confidence interval bound.
Andri Signorell <[email protected]> based on code of W Huber on StackExchange
QuantileCI(d.pizza$price, probs=0.25, na.rm=TRUE)

QuantileCI(d.pizza$price, na.rm=TRUE)
QuantileCI(d.pizza$price, conf.level=0.99, na.rm=TRUE)

# multiple probs
QuantileCI(1:100, method="exact", probs = c(0.25, 0.75, .80, 0.95))
QuantileCI(1:100, method="boot",  probs = c(0.25, 0.75, .80, 0.95))
Returns suitably lagged and iterated quotients
Quot(x, lag = 1, quotients = 1, ...)
x |
a numeric vector or matrix containing the values to be used for calculating the quotients. |
lag |
an integer indicating which lag to use. |
quotients |
an integer indicating the order of the quotient. |
... |
further arguments to be passed to or from methods. |
NA
's propagate.
If x
is a vector of length n
and quotients = 1
,
then the computed result is equal to the successive quotients
x[(1+lag):n] / x[1:(n-lag)]
.
If quotients
is larger than one this algorithm is applied
recursively to x
.
Note that the returned value is a vector which is shorter than
x
.
If x
is a matrix then the division operations are carried out
on each column separately.
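A quick check of the definition above (hypothetical values):

x <- c(2, 4, 12, 48)
n <- length(x); lag <- 1
x[(1 + lag):n] / x[1:(n - lag)]   # 2 3 4, the successive quotients Quot(x) should return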
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Quot(1:10, 2)
Quot(1:10, 2, 2)

x <- cumprod(cumprod(1:10))
Quot(x, lag = 2)
Quot(x, quotients = 2)
Determines the range of the data, which can possibly be trimmed before calculating the extreme values. The robust range version is calculated on the basis of the trimmed mean and variance (see Details).
Range(x, trim = NULL, robust = FALSE, na.rm = FALSE, ...)
x |
a numeric vector. |
trim |
the fraction (0 to 0.5) of observations to be trimmed from each
end of |
robust |
logical, determining whether the robust or the conventional range should be returned. |
na.rm |
a logical value indicating whether |
... |
the dots are sent to |
The R base function range returns the minimum and maximum value of a numeric object. Here we return the span of a (possibly trimmed) numeric vector, say the difference of maximum and minimum value.
If robust is set to TRUE
the function determines the trimmed mean m
and then the "upper trimmed mean" s of absolute deviations from m,
multiplied by fac
(fac is 3 by default). The robust minimum is then
defined as m-fac*s or min(x), whichever is larger, and similarly for the
maximum.
If trim
is zero (the default), the arithmetic mean of the
values in x
is computed, as a numeric or complex vector of length
one. If x
is not logical (coerced to numeric), numeric (including
integer) or complex, NA_real_
is returned, with a warning.
If trim
is non-zero, a symmetrically trimmed mean is computed with a
fraction of trim
observations deleted from each end before the mean
is computed.
Werner Stahel, ETH Zurich (robust range)
Andri Signorell
[email protected]
x <- c(0:10, 50)
xm <- Range(x)
c(xm, Range(x, trim = 0.10))

x <- c(rnorm(20), rnorm(3, 5, 20))
Range(x, robust=TRUE)

# compared to
Range(x)
The function base::rank
has various weaknesses. Apart from the fact that it is not very fast, there is no option to calculate dense ranks, there is no argument for specifying the ranking direction (other than ranking the negated variable), and additional columns cannot be used for breaking ties.
The function data.table::frankv
provides a more elaborate interface, is implemented very efficiently and is much faster than the original.
It further accepts vectors, lists, data.frame
s or data.table
s as input. In addition to the ties.method
possibilities provided by base::rank
, it also provides ties.method="dense"
.
The present function Rank
is merely a somewhat customized parameterization of the data.table
function.
Rank(..., decreasing = FALSE, na.last = TRUE, ties.method = c("average", "first", "last", "random", "max", "min", "dense"))
... |
A vector, or list with all its elements identical in length or |
decreasing |
An |
na.last |
Control treatment of |
ties.method |
A character string specifying how ties are treated, see |
To be consistent with other data.table
operations, NA
s are considered identical to other NA
s (and NaN
s to other NaN
s), unlike base::rank
. Therefore, for na.last=TRUE
and na.last=FALSE
, NA
s (and NaN
s) are given identical ranks, unlike rank
.
Rank
is not limited to vectors. It accepts data.table
s (and list
s and data.frame
s) as well. It accepts unquoted column names (with names preceded with a -
sign for descending order, even on character vectors), for e.g., Rank(DT, a, -b, c, ties.method="first")
where a,b,c
are columns in DT
.
In addition to the ties.method
values possible using base's rank
, it also provides another additional argument "dense"
.
Dense ranks are consecutive integers beginning with 1. No ranks are skipped if there are ranks with multiple items. So the largest rank value is the number of unique values of x. See examples.
Like forder
, sorting is done in "C-locale"; in particular, this may affect how capital/lowercase letters are ranked. See Details on forder
for more.
bit64::integer64
type is also supported.
A numeric vector of length equal to NROW(x)
(unless na.last = NA
, when missing values are removed). The vector is of integer type unless ties.method = "average"
when it is of double type (irrespective of ties).
frankv
, data.table
, setkey
, setorder
# on vectors
x <- c(4, 1, 4, NA, 1, NA, 4)

# NAs are considered identical (unlike base R)
# default is average
Rank(x)                        # na.last=TRUE
Rank(x, na.last=FALSE)

# ties.method = min
Rank(x, ties.method="min")
# ties.method = dense
Rank(x, ties.method="dense")

# on data.frame, using both columns
d.set <- data.frame(x, y=c(1, 1, 1, 0, NA, 0, 2))
Rank(d.set, na.last="keep")
Rank(d.set, ties.method="dense", na.last=NA)

# decreasing argument
Rank(d.set, decreasing=c(FALSE, TRUE), ties.method="first")
While haven is a great package, it uses tibbles as its basic data structure. Older R users (like myself) might prefer more archaic structures. This function returns SPSS files in the form of a data.frame, with the nominal variables as factors.
ReadSPSS(fn, encoding = NULL)
fn |
Either a path to a file, a connection, or literal data (either a single string or a raw vector). |
encoding |
The character encoding used for the file. The default, NULL, use the encoding specified in the file, but sometimes this value is incorrect and it is useful to be able to override it. |
A data frame.
Variable labels are stored in the "label" attribute of each variable. It is not printed on the console, but the RStudio viewer will show it.
Andri Signorell <[email protected]>
Combining or rearranging a factor can be tedious if it has many levels. Recode
supports this step by
accepting a direct definition of new levels by enumerating old levelnames as argument and adding an "elselevel"
option. If new levels are given as integer values they will be translated into the corresponding levels.
Recode(x, ..., elselevel = NA, use.empty = FALSE, num = FALSE)
x |
the factor whose levels are to be altered. If x is |
... |
the old levels (combined by |
elselevel |
the value for levels, which are not matched by newlevel list.
If this is set to |
use.empty |
logical. Defines how a new level, which can't be found in x, should be handled. Should it be left in the level's list or be dropped? The default is |
num |
logical. If set to |
the factor having the new levels applied.
Andri Signorell <[email protected]>
factor
, levels
There's another solution in the package car.
set.seed(1984)
x <- factor(sample(1:15, 20, replace=TRUE))
levels(x) <- paste("old", levels(x), sep="_")

y <- Recode(x,
            "new_1" = c("old_1","old_4","old_5"),
            "new_2" = c("old_6","old_10","old_11"),
            "new_3" = c("old_12","old_13"),
            elselevel = "other")
data.frame(x=x, y=y)

# Coding NAs, NA is recoded to new_1
x[5:6] <- NA
x <- x[1:7]

data.frame(
  x,
  RecodeNA   = Recode(x, "new_1" = c("old_4","old_8", NA), elselevel = "other"),
  # NAs remain untouched
  NoRecodeNA = Recode(x, "new_1" = c("old_4","old_8"), elselevel = "other")
)

x <- factor(letters[1:6])
z1 <- Recode(x, AB=c("a","b"), CD=c("c","d"), elselevel="none of these")
z2 <- Recode(x, AB=c("a","b"), CD=c("c","d"), elselevel=NA)
z3 <- Recode(x, AB=c("a","b"), CD=c("c","d"), elselevel=NULL)
z4 <- Recode(x, AB=c("a","b"), GH=c("g","h"), elselevel=NA, use.empty=TRUE)
z5 <- Recode(x, AB=c("a","b"), GH=c("g","h"), elselevel=NA, use.empty=FALSE)

data.frame(z1, z2, z3, z4, z5)
lapply(data.frame(z1, z2, z3, z4, z5), levels)

# empty level GH exists in z4...
table(z4, useNA="ifany")
# and is dropped in z5
table(z5, useNA="ifany")

# use integers to define the groups to collapse
set.seed(1972)
(likert <- factor(sample(1:10, size=15, replace=TRUE),
                  levels=1:10, labels=gettextf("(%s)", 1:10)))
Recode(likert, det=1:6, pas=7:8, pro=9:10)

# or directly turned to numeric
Recode(likert, "1"=1:6, "2"=7:8, "5"=9:10, num=TRUE)
This function recycles all supplied elements to the maximal dimension.
Recycle(...)
... |
a number of vectors of elements. |
a list of the supplied elements. The attribute "maxdim"
contains the maximal dimension of the recycled list.
Andri Signorell <[email protected]>
Recycle(x=1:5, y=1, s=letters[1:2])

z <- Recycle(x=letters[1:5], n=2:3, sep=c("-"," "))
sapply(1:attr(z, "maxdim"),
       function(i) paste(rep(z$x[i], times=z$n[i]), collapse=z$sep[i]))
Computes the relative risk and its confidence intervals.
Confidence intervals are calculated using the normal approximation ("wald"), the score method ("score") or by using the odds ratio ("use.or").
RelRisk(x, y = NULL, conf.level = NA, method = c("score", "wald", "use.or"), delta = 0.5, ...)
x |
a numeric vector or a 2x2 numeric matrix, resp. table. |
y |
|
conf.level |
confidence level. Default is |
method |
method for calculating the relative risk and the confidence intervals. Can be one out of "score" (default), "wald" or "use.or".
|
delta |
small constant to be added to the numerator for calculating the log risk ratio (Wald method). Usual choice is 0.5 although there does not seem to be any theory behind this. (Dewey, M. 2006) |
... |
further arguments are passed to the function |
Best is to always put the outcome variable (disease yes/no) in the columns and the exposure variable in the rows. In other words, put the dependent variable – the one that describes the problem under study – in the columns. And put the independent variable – the factor assumed to cause the problem – in the rows. (Gerritsen, 2010)
According to this, the function expects the following table structure:
                  diseased=1   diseased=0
  exposed=1 (ref)    n00          n01
  exposed=0          n10          n11
The relative risk is then calculated as:
         (exposed & diseased) / exposed
  rr = ------------------------------------
        (unexposed & diseased) / unexposed
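With the table laid out as required, the point estimate can be reproduced by hand (a sketch; the confidence intervals additionally depend on the chosen method):

m <- matrix(c(78, 50, 1422, 950), nrow = 2,
            dimnames = list(water = c("cont", "clean"), diarrhea = c("yes", "no")))
# (exposed & diseased)/exposed divided by (unexposed & diseased)/unexposed
(m["cont", "yes"] / sum(m["cont", ])) / (m["clean", "yes"] / sum(m["clean", ]))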
If the table to be used is not in the
required shape, use the function Rev()
and/or t()
to reverse rows, columns, or both, resp. to transpose the table.
If conf.level
is not NA
then the result will be
a vector with 3 elements for the estimate and the lower and upper bound of the confidence interval.
Else the relative risk will be reported as a single value.
Andri Signorell <[email protected]>, based on code of Yongyi Min and Michael Dewey
Rothman, K. J. and Greenland, S. (1998) Modern Epidemiology. Lippincott-Raven Publishers
Rothman, K. J. (2002) Epidemiology: An Introduction. Oxford University Press
Jewell, N. P. (2004) Statistics for Epidemiology. 1st Edition, 2004, Chapman & Hall, pp. 73-81
Selvin, S. (1998) Modern Applied Biostatistical Methods Using S-Plus. 1st Edition, Oxford University Press
Gerritsen, A (2010) https://www.theanalysisfactor.com/cross-tabulation-in-cohort-and-case-control-studies/
m <- matrix(c(78,50,1422,950), nrow=2,
            dimnames = list(water=c("cont", "clean"), diarrhea=c("yes", "no")))
RelRisk(m, conf.level = 0.95)

mm <- cbind(c(9,20), c(41,29))
mm

RelRisk(t(mm), conf.level=0.95)
RelRisk(t(mm), conf.level=0.95, method="wald")
RelRisk(t(mm), conf.level=0.95, method="use.or")
Rename
changes the names of a named object.
Rename(x, ..., gsub = FALSE, fixed = TRUE, warn = TRUE)
x |
Any named object |
... |
A sequence of named arguments, all of type character |
gsub |
a logical value; if TRUE, |
fixed |
a logical value, passed to |
warn |
a logical value; should a warning be issued if those names to change are not found? |
This function changes the names of x
according to the
remaining arguments.
If gsub
is FALSE, argument tags are the old
names, the values are the new names.
If gsub
is TRUE, arguments are substrings of the names
that are substituted by the argument values.
The object x
with new names defined by the ... arguments.
This function was previously published in the package memisc as rename
and has been integrated here without logical changes.
Martin Elff <[email protected]>
SetNames
, Recode
for recoding of a factor (renaming or combining levels)
x <- c(a=1, b=2)
Rename(x, a="A", b="B")

str(Rename(iris,
           Sepal.Length="Sepal_Length",
           Sepal.Width ="Sepal_Width",
           Petal.Length="Petal_Length",
           Petal.Width ="Petal_Width"))

str(Rename(iris, .="_", gsub=TRUE))
Reorder the levels of a factor
## S3 method for class 'factor'
reorder(x, X, FUN, ..., order = is.ordered(x), new.order, sort = SortMixed)
x |
factor |
X |
auxiliary data vector |
FUN |
function to be applied to subsets of |
... |
optional parameters to |
order |
logical value indicating whether the returned
object should be an |
new.order |
a vector of indexes or a vector of label names giving the order of the new factor levels |
sort |
function to use to sort the factor level names, used only
when |
This function changes the order of the levels of a factor. It can do
so via three different mechanisms, depending on whether, X
and FUN
, new.order
or sort
are provided.
If X
and FUN
are provided: The data in X
is grouped by the levels of x
and FUN
is applied.
The groups are then sorted by this value, and the resulting order is
used for the new factor level names.
If new.order
is provided: For a numeric vector, the new factor
level names are constructed by reordering the factor levels according
to the numeric values. For vectors, new.order
gives the list of
new factor level names. In either case levels omitted from
new.order
will become missing (NA
) values.
If sort
is provided (as it is by default): The new
factor level names are generated by applying the supplied function
to the existing factor level names. With sort=mixedsort
the
factor levels are sorted so that combined numeric and character
strings are sorted according to character rules on the character
sections (including ignoring case), and the numeric rules for the
numeric sections. See mixedsort
for details.
A new factor with reordered levels
Gregory R. Warnes [email protected]
# Create a 4 level example factor
trt <- factor(sample(c("PLACEBO", "300 MG", "600 MG", "1200 MG"),
                     100, replace=TRUE))
summary(trt)
# Note that the levels are not in a meaningful order.

# Change the order to something useful
# default "mixedsort" ordering
trt2 <- reorder(trt)
summary(trt2)

# using indexes:
trt3 <- reorder(trt, new.order=c(4, 2, 3, 1))
summary(trt3)

# using label names:
trt4 <- reorder(trt, new.order=c("PLACEBO", "300 MG", "600 MG", "1200 MG"))
summary(trt4)

# using frequency
trt5 <- reorder(trt, X=as.numeric(trt), FUN=length)
summary(trt5)

# drop out the '300 MG' level
trt6 <- reorder(trt, new.order=c("PLACEBO", "600 MG", "1200 MG"))
summary(trt6)
Rev
provides a reversed version of its argument.
Unlike the base function, for higher-dimensional structures such as matrices it does not reverse the elements, but the order of the rows and/or columns. It further offers additional interfaces for higher dimensional arrays or tables.
Rev(x, ...)

## S3 method for class 'matrix'
Rev(x, margin, ...)

## S3 method for class 'table'
Rev(x, margin, ...)

## S3 method for class 'array'
Rev(x, margin, ...)

## S3 method for class 'data.frame'
Rev(x, margin, ...)
x |
a vector, a matrix or a higher dimensional table to be reversed. |
margin |
vector of dimensions which to be reversed (1 for rows, 2 for columns, etc.). If not defined, all dimensions will be reverted. |
... |
the dots are passed to the array interface. |
Andri Signorell <[email protected]>
tab <- matrix(c(1, 11, 111,
                2, 22, 222,
                3, 33, 333),
              byrow=TRUE, nrow=3,
              dimnames=list(mar1=1:3, mar2=c("a","b","c")))

Rev(tab, margin=1)
Rev(tab, margin=2)

# reverse both dimensions
Rev(tab, margin=c(1, 2))

t(tab)

# reverse 3dimensional array
aa <- Abind(tab, 2 * tab, along=3)
dimnames(aa)[[3]] <- c("A","Z")

# reverse rows
Rev(aa, 1)
# reverse columns
Rev(aa, 2)
# reverse 3rd dimension
Rev(aa, 3)

# reverse all dimensions
Rev(aa)        # same as Rev(aa, margin=(1:3))
In psychology, variables often need to be recoded into reverse order in cases where items are negatively worded, so that it can be ensured that a high value indicates the same type of response on every item. Let's say we have a Likert scale from 1 to 5 and we want to recode the variable so that a 5 becomes a 1, 4 a 2 and so on.
RevCode(x, ...)
x |
a numerical or logical vector, or a factor. |
... |
the dots are sent to |
The function recodes based on:
min(x, na.rm=TRUE) + max(x, na.rm=TRUE) - x
the recoded vector
Andri Signorell <[email protected]>
x <- 1:5
data.frame(x, rev_num=RevCode(x), rev_fac=RevCode(factor(x)))

s <- c(3,4,2,7,4,9,NA,10)
RevCode(factor(s, levels=1:10))

i <- c(1,0,0,0,1,1)
cbind(i, RevCode(i))

k <- as.logical(c(1,0,0,0,1,1))
cbind(k, RevCode(k))

x <- factor(sample(letters[1:5], 10, replace = TRUE))
RevCode(x)

# we want to set the level 5 to NA before reversing
RevCode(factor(NAIf(x, "e")))
Density function, distribution function, quantile function and random generation for the reverse (or negative) Weibull distribution with location, scale and shape parameters.
dRevWeibull(x, loc=0, scale=1, shape=1, log = FALSE)
pRevWeibull(q, loc=0, scale=1, shape=1, lower.tail = TRUE)
qRevWeibull(p, loc=0, scale=1, shape=1, lower.tail = TRUE)
rRevWeibull(n, loc=0, scale=1, shape=1)

dNegWeibull(x, loc=0, scale=1, shape=1, log = FALSE)
pNegWeibull(q, loc=0, scale=1, shape=1, lower.tail = TRUE)
qNegWeibull(p, loc=0, scale=1, shape=1, lower.tail = TRUE)
rNegWeibull(n, loc=0, scale=1, shape=1)
x , q
|
Vector of quantiles. |
p |
Vector of probabilities. |
n |
Number of observations. |
loc , scale , shape
|
Location, scale and shape parameters (can be given as vectors). |
log |
Logical; if |
lower.tail |
Logical; if |
The reverse (or negative) Weibull distribution function with parameters loc = a, scale = b and shape = s is

  G(z) = exp{ -[ -(z - a)/b ]^s }

for z < a and one otherwise, where b > 0 and s > 0.
dRevWeibull
and dNegWeibull
give the density function,
pRevWeibull
and pNegWeibull
give the distribution function,
qRevWeibull
and qNegWeibull
give the quantile function,
rRevWeibull
and rNegWeibull
generate random deviates.
Within extreme value theory the reverse Weibull distribution (also known as the negative Weibull distribution) is often referred to as the Weibull distribution. We make a distinction to avoid confusion with the three-parameter distribution used in survival analysis, which is related by a change of sign to the distribution given above.
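The change-of-sign relation to the ordinary Weibull distribution can be illustrated as follows (a sketch based on the distribution function given in the Details; both columns should agree if that formula holds):

q <- -5:-3; loc <- -1; scale <- 0.5; shape <- 0.8
cbind(revweibull = pRevWeibull(q, loc, scale, shape),
      via_stats  = pweibull(loc - q, shape = shape, scale = scale, lower.tail = FALSE))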
Alec Stephenson <[email protected]>
rFrechet
, rGenExtrVal
, rGumbel
dRevWeibull(-5:-3, -1, 0.5, 0.8)
pRevWeibull(-5:-3, -1, 0.5, 0.8)
qRevWeibull(seq(0.9, 0.6, -0.1), 2, 0.5, 0.8)
rRevWeibull(6, -1, 0.5, 0.8)

p <- (1:9)/10
pRevWeibull(qRevWeibull(p, -1, 2, 0.8), -1, 2, 0.8)
## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
These functions convert colors between the RGB and the CMYK color systems.
RgbToCmy(col, maxColorValue = 1) CmykToRgb(cyan, magenta, yellow, black, maxColorValue=1) CmyToCmyk(col) CmykToCmy(col)
col |
the matrix of the color to be converted |
cyan |
cyan values of the color(s) to be converted |
magenta |
magenta values of the color(s) to be converted |
yellow |
yellow values of the color(s) to be converted |
black |
black values of the color(s) to be converted |
maxColorValue |
the value for the color |
the converted value
Andri Signorell <[email protected]>
CmykToRgb(0.42, 45.23, 85.14, maxColorValue=100)
Converting an RGB color to a named R color means looking for the color in the R palette that is nearest to the given RGB color.
This function uses the minimum squared distance ("euclidean") or the minimum absolute distance ("manhattan") as proximity measure. RgbToLong() converts an RGB color to a long integer in numeric format, and LongToRgb() does it the other way round.
RgbToCol(col, method = "rgb", metric = "euclidean") RgbToLong(col) LongToRgb(col)
col |
the color in rgb code, say a matrix with the red, green and blue code in the rows. |
method |
character string specifying the color space to be used. Can be |
metric |
character string specifying the metric to be used for calculating distances between the colors.
Available options are |
It may not be clear from the start which method, rgb or hsv, yields the more natural results. Trying and comparing is a recommended strategy. Moreover, the shortest numerical distance will not always be the best choice when comparing the colours visually.
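The core of the nearest-colour lookup can be sketched in a few lines. This assumes Euclidean distance in plain RGB space over grDevices::colors(); RgbToCol() additionally offers an HSV-based method:

nearestCol <- function(col) {
  pal <- t(col2rgb(colors()))                  # palette as n x 3 matrix
  d <- colSums((t(pal) - as.vector(col))^2)    # squared RGB distances
  colors()[which.min(d)]
}
nearestCol(c(162, 42, 42))   # "brown" (the RGB of "brown" is 165,42,42)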
the name of the nearest found R color.
Andri Signorell <[email protected]>
ColToRgb
and the other conversion functions
RgbToCol(matrix(c(162,42,42), nrow=3)) RgbToLong(matrix(c(162,42,42), nrow=3))
Create pairs of correlated random numbers.
RndPairs(n, r, rdist1 = rnorm(n = n, mean = 0, sd = 1), rdist2 = rnorm(n = n, mean = 0, sd = 1), prop = NULL) RndWord(size, length, x = LETTERS, replace = TRUE, prob = NULL)
n |
number of pairs. If length(n) > 1, the length is taken to be the number required. |
r |
the correlation between the two sets. |
rdist1 , rdist2
|
the distribution of the random vector X1 and X2. Default is standard normal distribution. |
size |
a non-negative integer giving the number of artificial words to build. |
length |
a non-negative integer giving the length of the words. |
x |
elements to choose from. |
replace |
Should sampling be with replacement? |
prop |
proportions for ordinal variable, must sum to 1. |
prob |
a vector of probability weights for obtaining the elements of the vector being sampled. |
a data.frame with 2 columns, X1 and X2 containing the random numbers
Andri Signorell <[email protected]>
runif
, rnorm
, Random
and friends
# produce 100 pairs of a normal distributed random number with a correlation of 0.7 d.frm <- RndPairs(n=100, r=0.7) plot(d.frm) lines(lm(y ~ x,d.frm)) # change the distribution d.frm <- RndPairs(n=100, r=0.7, rdist2 = rlnorm(n = 100, meanlog = 1, sdlog = .8)) d.frm <- RndPairs(n=100, r=0.7, rdist2 = runif(n = 100, -1, 4)) x <- StrCap(sapply(sample(3:15, 10), function(i) RndWord(1, i, x=letters))) # produce some artificial words with defined probabilities for the letters p <- c(6.51,1.89,3.06,5.08,17.4,1.66,3.01,4.76,7.55,0.27,1.21,3.44,2.53, 9.78,2.51,0.79,0.02,7,7.27,6.15,4.35,0.67,1.89,0.03,0.04,1.13) sapply(sample(3:15, 10), function(i) RndWord(1, i, x=letters, prob=p)) # produce associated ordinal variables d.ord <- RndPairs(500, r=0.8, prop = list(c(.15, .3, .55), c(.3, .5, .2))) levels(d.ord$y) <- levels(d.ord$x) <- LETTERS[1:3] PlotMosaic(table(d.ord$x, d.ord$y), las=1, main="")
RobScale
is a wrapper function for robust standardization, using median
and mad
instead of mean
and sd
.
RobScale(x, center = TRUE, scale = TRUE)
x |
a numeric matrix(like object). |
center |
a logical value defining whether x should be centered by the median. Centering is done by subtracting the column medians (omitting NAs) of x from their corresponding columns. If center is FALSE, no centering is done. |
scale |
a logical value defining whether x should be scaled by the mad. Scaling is done by dividing the (centered) columns of x by their mad. If scale is FALSE, no scaling is done. |
the centered, scaled matrix. The numeric centering and scalings used (if any) are returned as attributes "scaled:center" and "scaled:scale"
Andri Signorell <[email protected]>
scale
, sweep
x <- d.pizza$temperature plot(x=seq_along(x), y=RobScale(x), xlim=c(0,100)) points(x=seq_along(x), y=scale(x), col="red" )
Convert roman numerals to integers
RomanToInt(x)
x |
character vector containing roman numerals |
This function will convert Roman numerals to integers without the upper bound imposed by R (3899), ignoring case.
An integer vector with the same length as x. Character strings which are not valid Roman numerals will be converted to NA.
Gregory R. Warnes <[email protected]>
RomanToInt( c('I', 'V', 'X', 'C', 'L', 'D', 'M' ) )

# works regardless of case
RomanToInt( 'MMXVI' )
RomanToInt( 'mmxvi' )

# works beyond R's limit of 3899
val.3899 <- 'MMMDCCCXCIX'
val.3900 <- 'MMMCM'
val.4000 <- 'MMMM'
as.numeric(as.roman( val.3899 ))
as.numeric(as.roman( val.3900 ))
as.numeric(as.roman( val.4000 ))
RomanToInt(val.3899)
RomanToInt(val.3900)
RomanToInt(val.4000)
Rotate a geometric structure by an angle theta around a centerpoint xy.
Rotate(x, y = NULL, mx = NULL, my = NULL, theta = pi/3, asp = 1)
x , y
|
vectors containing the coordinates of the vertices of the polygon
, which has to be rotated. The coordinates can be passed in a plotting structure (a list with x and y components), a two-column matrix, .... See |
mx , my
|
xy-coordinates of the center of the rotation. If left to NULL, the centroid of the structure will be used. |
theta |
angle of the rotation |
asp |
the aspect ratio for the rotation. Helpful for rotating structures along an ellipse. |
The function invisibly returns a list of the coordinates for the rotated shape(s).
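For a single point the operation is the usual plane rotation around (mx, my); the following minimal sketch ignores the asp argument and is not necessarily the internal implementation:

rotPoint <- function(x, y, mx, my, theta) {
  c(x = mx + cos(theta) * (x - mx) - sin(theta) * (y - my),
    y = my + sin(theta) * (x - mx) + cos(theta) * (y - my))
}
rotPoint(1, 0, mx=0, my=0, theta=pi/2)   # approximately (0, 1)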
Andri Signorell <[email protected]>
polygon
, DrawRegPolygon
, DrawEllipse
, DrawArc
# let's have a triangle Canvas(main="Rotation") x <- DrawRegPolygon(nv=3)[[1]] # and rotate sapply( (0:3) * pi/6, function(theta) { xy <- Rotate( x=x, theta=theta ) polygon(xy, col=SetAlpha("blue", 0.2)) } ) abline(v=0, h=0)
Returns a number rounded to the nearest specified multiple.
RoundTo(x, multiple = 1, FUN = round)
x |
numeric. The value to round. |
multiple |
numeric. The multiple to which the number is to be rounded. Default is 1. |
FUN |
the rounding function as character or as expression. Can be one out of |
There are several functions to convert to integers. round
rounds to the nearest integer or to any number of digits. Using a negative number rounds to a power of ten, so that round (x, -3)
rounds to thousands.
Each of trunc
, floor
and ceiling
round in a fixed direction, towards zero, down and up respectively. round
is documented to round to even, so round(2.5)
is 2
.
RoundTo
uses round(x/multiple)*multiple
to get the result. So if x is equally close to two multiples, the one for which x/multiple rounds to an even integer will be returned (following round's round-half-to-even rule).
If FUN
is set to ceiling
it will always round up, and if set to floor
it will always round down. See the examples for a comparison.
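The stated core computation, written out for one value:

x <- 10; multiple <- 3
round(x / multiple) * multiple    # 9, same as RoundTo(10, 3)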
the rounded value
Andri Signorell <[email protected]>
RoundTo(10, 3) # Rounds 10 to a nearest multiple of 3 (9) RoundTo(-10, -3) # Rounds -10 to a nearest multiple of -3 (-9) RoundTo(1.3, 0.2) # Rounds 1.3 to a nearest multiple of 0.2 (1.2) RoundTo(-1.3, 0.2) # Rounds -1.3 to a nearest multiple of 0.2 (-1.2) RoundTo(5, -2) # Returns an error, because -2 and 5 have different signs # Round down RoundTo(c(1,-1) * 1.2335, 0.05, floor) RoundTo(c(1,-1) * 1233.5, 100, floor) # Round up RoundTo(c(1,-1) * 1.2335, 0.05, ceiling) RoundTo(c(1,-1) * 1233.5, 100, ceiling) # Round towards zero RoundTo(c(1,-1) * 1.2335, 0.05, trunc) RoundTo(c(1,-1) * 1233.5, 100, trunc) x <- c(-1.5,-1.3, 1.3, 1.5) cbind(x = x, round = RoundTo(x, 0.2, FUN=round), trunc = RoundTo(x, 0.2, FUN=trunc), ceiling = RoundTo(x, 0.2, FUN=ceiling), floor = RoundTo(x, 0.2, FUN=floor) ) x <- -10:10 cbind(x = x, round = RoundTo(x, 2, FUN=round), trunc = RoundTo(x, 2, FUN=trunc), ceiling = RoundTo(x, 2, FUN=ceiling), floor = RoundTo(x, 2, FUN=floor) )
RSessionAlive()
returns the time the R session has been running in hours. The function uses PowerShell on Windows and is thus restricted to Windows only. RTempdirAlive()
does the same for temporary directories, but runs on all systems.
RSessionAlive() RTempdirAlive()
time in hours
Markus Napflin <markus.naepfl@in>, Andri Signorell <[email protected]>
Calculate bootstrap intervals for the R squared of a linear model as returned
by lm
.
RSqCI( object, conf.level = 0.95, sides = c("two.sided", "left", "right"), adjusted = TRUE, ... )
object |
the model object as returned by glm. |
conf.level |
confidence level of the interval. |
sides |
a character string specifying the side of the confidence
interval, must be one of |
adjusted |
logical, defining if the R squared or the adjusted R squared
should be used. Default is |
... |
further arguments are passed to the |
a numeric vector with 3 elements:
mean |
mean |
lwr.ci |
lower bound of the confidence interval |
upr.ci |
upper bound of the confidence interval |
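The idea can be reproduced by hand with a simple percentile bootstrap over rows (a sketch only, not necessarily the internal implementation of RSqCI):

set.seed(42)
r2 <- replicate(999, {
  i <- sample(nrow(swiss), replace = TRUE)
  summary(lm(Fertility ~ ., data = swiss[i, ]))$r.squared
})
quantile(r2, c(0.025, 0.975))   # simple percentile interval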
Andri Signorell [email protected]
# get linear model
r.lm <- lm(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality,
           data=swiss)

# calculate confidence intervals for the R2
summary(r.lm)$r.squared
RSqCI(r.lm, R=99)   # use higher R in real life!
Generates a vector of uniformly distributed random numbers which sum to 1.
rSum21(size, digits = NULL)
size |
a non-negative integer giving the number of numbers to generate. |
digits |
integer indicating the number of decimal places to be used. |
a vector of length size with elements drawn
Andri Signorell <[email protected]>
runif
, (Dirichlet distribution)
# generate 5 numbers
x <- rSum21(5)
sum(x)
Performs a test whether the elements of x
are serially independent - say, whether
they occur in a random order - by counting how many runs there are above and below a threshold.
If y
is supplied, a two-sample Wald-Wolfowitz test for the equality of two distributions against general alternatives will be computed.
RunsTest(x, ...)

## Default S3 method:
RunsTest(x, y = NULL, alternative = c("two.sided", "less", "greater"),
         exact = NULL, correct = TRUE, na.rm = FALSE, ...)

## S3 method for class 'formula'
RunsTest(formula, data, subset, na.action, ...)
x |
a dichotomous vector of data values or a (non-empty) numeric vector of data values. |
y |
an optional (non-empty) numeric vector of data values. |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain NAs. Defaults to |
alternative |
a character string specifying the alternative hypothesis, must be one of |
exact |
a logical indicating whether an exact p-value should be computed. By default exact values will be calculated for small vectors with a total length <= 30 and the normal approximation for longer ones. |
correct |
a logical indicating whether to apply continuity correction when computing the test statistic. Default is |
na.rm |
defines if |
... |
further arguments to be passed to or from methods. |
The runs test for randomness
is used to test the hypothesis
that a series of numbers is random.
For a categorical variable, the number of runs corresponds to the number of times the category changes, that is, the positions where $x_i$ belongs to one category and $x_{i+1}$ belongs to the other. The number of runs is the number of sign changes plus one.
For a numeric variable x containing more than two values, a run is a set of sequential values that are either all above or all below a specified cutpoint, typically the median. This is not necessarily the best choice. If another threshold should be used, use code like: RunsTest(x > mean(x)).
The exact distribution of runs and the p-value based on it are described in the manual of SPSS "Exact tests" https://www.sussex.ac.uk/its/pdfs/SPSS_Exact_Tests_21.pdf.
The normal approximation of the runs test is calculated with the expected number of runs under the null
$$\mu_r = \frac{2 n_0 n_1}{n_0 + n_1} + 1$$
and its variance
$$\sigma_r^2 = \frac{2 n_0 n_1 (2 n_0 n_1 - n_0 - n_1)}{(n_0 + n_1)^2 (n_0 + n_1 - 1)}$$
as
$$\hat z = \frac{r - \mu_r + c}{\sigma_r}$$
where $n_0$ and $n_1$ are the numbers of values below/above the threshold and $r$ is the number of runs.
Setting the continuity correction correct = TRUE will yield the normal approximation as SAS (and SPSS if n < 50) does it, see http://support.sas.com/kb/33/092.html.
The correction term $c$ is set to $0.5$ if $r < \mu_r$ and to $-0.5$ if $r > \mu_r$.
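The approximation can be written out directly from these formulas; a minimal sketch without the continuity correction:

x  <- sample(0:1, 100, replace = TRUE)
n0 <- sum(x == 0); n1 <- sum(x == 1)
r  <- sum(diff(x) != 0) + 1                  # number of runs
mu <- 2 * n0 * n1 / (n0 + n1) + 1
s2 <- 2 * n0 * n1 * (2 * n0 * n1 - n0 - n1) / ((n0 + n1)^2 * (n0 + n1 - 1))
2 * pnorm(-abs((r - mu) / sqrt(s2)))         # two-sided p-value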
The Wald-Wolfowitz test
is a 2-sample nonparametric test to evaluate if two continuous cumulative distributions are significantly different or not. Ideally there should be no ties in the data. In practice there is no problem with ties within a group, but if ties occur between members of the different groups then there is no unique sequence of observations. For example the data sets A: 10,14,17,19,34 and B: 12,13,17,19,22 can give four possible sequences, with two possible values for r (7 or 9). The "solution" to this is to list every possible combination, and calculate the test statistic for each one. If all test statistics are significant at the chosen level, then one can reject the null hypothesis. If only some are significant, then Siegel (1956) suggests that the average of the P-values is taken.
Help for finding all permutations of ties can be found at: https://stackoverflow.com/questions/47565066/all-possible-permutations-in-factor-variable-when-ties-exist-in-r
However, this solution seems quite coarse and in general the test should not be used if there are more than one or two ties. We have better tests to distinguish between two samples!
A list with the following components.
statistic |
z, the value of the standardized runs statistic, if exact p-values are not computed. |
parameter |
the number of runs, the total number of zeros (m) and ones (n) |
p.value |
the p-value for the test. |
data.name |
a character string giving the names of the data. |
alternative |
a character string describing the alternative hypothesis. |
Andri Signorell <[email protected]>, exact p-values by Detlew Labes <[email protected]>
Wackerly, D., Mendenhall, W. Scheaffer, R. L. (1986) Mathematical Statistics with Applications, 3rd Ed., Duxbury Press, CA.
Wald, A. and Wolfowitz, J. (1940): On a test whether two samples are from the same population, Ann. Math Statist. 11, 147-162.
Siegel, S. (1956) Nonparametric Statistics for the Behavioural Sciences, McGraw-Hill Kogakusha, Tokyo.
Run Length Encoding rle
# x will be coerced to a dichotomous variable x <- c("S","S", "T", "S", "T","T","T", "S", "T") RunsTest(x) x <- c(13, 3, 14, 14, 1, 14, 3, 8, 14, 17, 9, 14, 13, 2, 16, 1, 3, 12, 13, 14) RunsTest(x) # this will be treated as RunsTest(x > median(x)) plot( (x < median(x)) - 0.5, type="s", ylim=c(-1,1) ) abline(h=0) set.seed(123) x <- sample(0:1, size=100, replace=TRUE) RunsTest(x) # As you would expect of values from a random number generator, the test fails to reject # the null hypothesis that the data are random. # SPSS example x <- c(31,23,36,43,51,44,12,26,43,75,2,3,15,18,78,24,13,27,86,61,13,7,6,8) RunsTest(x, exact=TRUE) # exact probability RunsTest(x, exact=FALSE) # normal approximation # SPSS example small dataset x <- c(1, 1, 1, 1, 0, 0, 0, 0, 1, 1) RunsTest(x) RunsTest(x, exact=FALSE) # if y is not NULL, the Wald-Wolfowitz-Test will be performed A <- c(35,44,39,50,48,29,60,75,49,66) B <- c(17,23,13,24,33,21,18,16,32) RunsTest(A, B, exact=TRUE) RunsTest(A, B, exact=FALSE)
Sample
takes a sample of the specified size from the elements of x, either with or without replacement. The function does the same as base::sample() and additionally offers an interface for data frames.
Sample(x, size, replace = FALSE, prob = NULL)
x |
either a vector of one or more elements from which to choose, or a positive integer. |
size |
a non-negative integer giving the number of items to choose. |
replace |
should sampling be with replacement? |
prob |
a vector of probability weights for obtaining the elements of the vector being sampled. |
sampled elements in the same structure as x
Andri Signorell <[email protected]>
Sample(d.pizza, size=5)
Draw a twin sample out of a population for a given recordset, by matching some strata criteria.
SampleTwins(x, stratanames = NULL, twins, method = c("srswor", "srswr", "poisson", "systematic"), pik, description = FALSE)
x |
the data to draw the sample from |
stratanames |
the stratanames to use |
twins |
the twin sample |
method |
method to select units; the following methods are implemented: simple random
sampling without replacement (srswor), simple random sampling with replacement (srswr),
Poisson sampling (poisson), systematic sampling (systematic); if "method" is missing,
the default method is "srswor". See |
pik |
vector of inclusion probabilities or auxiliary information used to compute them; this argument is only used for unequal probability sampling (Poisson and systematic). If an auxiliary information is provided, the function uses the inclusionprobabilities function for computing these probabilities. If the method is "srswr" and the sample size is larger than the population size, this vector is normalized to one. |
description |
a message is printed if its value is TRUE; the message gives the number of selected units and the number of the units in the population. By default, the value is FALSE. |
The function produces an object, which contains the following information:
id |
the identifier of the selected units. |
stratum |
the unit stratum. |
prob |
the final unit inclusion probability. |
Andri Signorell <[email protected]>
m <- rbind(matrix(rep("nc",165), 165, 1, byrow=TRUE), matrix(rep("sc", 70), 70, 1, byrow=TRUE)) m <- cbind.data.frame(m, c(rep(1, 100), rep(2,50), rep(3,15), rep(1,30), rep(2,40)), 1000*runif(235)) names(m) <- c("state","region","income") # this would be our sample to be reproduced by a twin sample d.smp <- m[sample(nrow(m), size=10, replace=TRUE),] # draw the sample s <- SampleTwins(x = m, stratanames=c("state","region"), twins = d.smp, method="srswor") d.twin <- m[s$id,] d.twin
An R object cannot be saved in binary mode under a different name using the default save()
function. SaveAs()
extends the save function for this option.
SaveAs(x, objectname, file, ...)
x |
the object to save |
objectname |
the new name for the object. |
file |
a (writable binary-mode) connection or the name of the file where the data will be saved (when tilde expansion is done). |
... |
the dots are passed to the save function. |
Andri Signorell <[email protected]>
x <- stats::runif(20) SaveAs(x=x, objectname="NewX", file = "NewXFile.rda") unlink("NewXFile.rda")
Scheffe's method applies to the set of estimates of all possible contrasts among the factor level means, not just the pairwise differences considered by Tukey's method.
ScheffeTest(x, ...)

## S3 method for class 'formula'
ScheffeTest(formula, data, subset, na.action, ...)

## S3 method for class 'aov'
ScheffeTest(x, which = NULL, contrasts = NULL, conf.level = 0.95, ...)

## Default S3 method:
ScheffeTest(x, g = NULL, which = NULL, contrasts = NULL, conf.level = 0.95, ...)
x |
either a fitted model object, usually an |
g |
the grouping variable. |
which |
character vector listing terms in the fitted model for which the intervals should be calculated. Defaults to all the terms. |
contrasts |
a |
conf.level |
numeric value between zero and one giving the confidence level to use. If this is set to NA, just a matrix with the p-values will be returned. |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see
|
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when
the data contain |
... |
further arguments, currently not used. |
A list of classes c("PostHocTest")
, with one component for each term requested in which
. Each component is a matrix with columns diff
giving the difference in the observed means, lwr.ci
giving the lower end point of the interval, upr.ci
giving the upper end point and pval
giving the p-value after adjustment for the multiple comparisons.
There are print and plot methods for class "PostHocTest"
. The plot method does not accept xlab
, ylab
or main
arguments and creates its own values for each plot.
Andri Signorell <[email protected]>
Robert O. Kuehl, Steel R. (2000) Design of experiments. Duxbury
Steel R.G.D., Torrie J.H., Dickey, D.A. (1997) Principles and Procedures of Statistics, A Biometrical Approach. McGraw-Hill
fm1 <- aov(breaks ~ wool + tension, data = warpbreaks) ScheffeTest(x=fm1) ScheffeTest(x=fm1, which="tension") TukeyHSD(fm1) # some special contrasts y <- c(7,33,26,27,21,6,14,19,6,11,11,18,14,18,19,14,9,12,6, 24,7,10,1,10,42,25,8,28,30,22,17,32,28,6,1,15,9,15, 2,37,13,18,23,1,3,4,6,2) group <- factor(c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3, 3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6)) r.aov <- aov(y ~ group) ScheffeTest(r.aov, contrasts=matrix( c(1,-0.5,-0.5,0,0,0, 0,0,0,1,-0.5,-0.5), ncol=2) ) # just p-values: ScheffeTest(r.aov, conf.level=NA)
This function computes the standard deviation of the values in
x
.
If na.rm
is TRUE
then missing values are removed before
computation proceeds. SDN returns the uncorrected sample standard deviation (which is a biased estimator of the sample standard deviation).
SD(x, weights = NULL, na.rm = FALSE, ...) SDN(x, na.rm = FALSE)
x |
a numeric vector or an R object which is coercible to one
by |
weights |
a numerical vector of weights the same length as |
na.rm |
logical. Should missing values be removed? |
... |
further arguments passed to or from other methods. |
Like var this uses denominator $n - 1$.
The standard deviation of a zero-length vector (after removal of
NA
s if na.rm = TRUE
) is not defined and gives an error.
The standard deviation of a length-one vector is NA
.
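The difference between the two denominators in a tiny worked example (SD() and SDN() additionally handle weights and NA removal):

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
sqrt(sum((x - mean(x))^2) / (n - 1))   # denominator n - 1, as sd()
sqrt(sum((x - mean(x))^2) / n)         # denominator n, the uncorrected estimator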
var
for its square, and mad
, the most
robust alternative.
SD(1:2)^2
Sending emails from R can be required in some reporting tasks. As RDCOMClient is already available, we wrap the sending code in a function.
SendOutlookMail(to, cc = NULL, bcc = NULL, subject, body, attachment = NULL)
to |
a vector of recipients |
cc |
a vector of recipients receiving a carbon copy |
bcc |
a vector of recipients receiving a blind carbon copy |
subject |
the subject of the mail |
body |
the body text of the mail |
attachment |
a vector of paths to attachments |
Nothing is returned
Andri Signorell <[email protected]> strongly based on code of Franziska Mueller
## Not run:
SendOutlookMail(to=c("[email protected]", "[email protected]"),
                subject = "Some Info",
                body = "Hi all\r Find the files attached\r Regards, Andri",
                attachment = c("C:/temp/fileA.txt",
                               "C:/temp/fileB.txt"))
## End(Not run)
Add transparency to a color defined by its name or number. The function first converts the
color to RGB and then appends the alpha channel. Fade()
combines ColToOpaque(SetAlpha(col))
.
SetAlpha(col, alpha = 0.5) Fade(col, ...)
col |
vector of two kind of R colors, i.e., either a color name (an element of |
alpha |
the alpha value to be added. This can be any value from 0 (fully transparent) to 1 (opaque). |
... |
the dots in |
All arguments are recycled as necessary.
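The conversion described above can be done by hand with grDevices; a minimal sketch for a single colour name and an alpha value in [0, 1]:

handAlpha <- function(col, alpha = 0.5) {
  m <- col2rgb(col) / 255
  rgb(m[1, ], m[2, ], m[3, ], alpha = alpha)   # "#rrggbbaa"
}
handAlpha("yellow", 0.2)   # compare with SetAlpha("yellow", 0.2)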
Vector with the same length as col
, giving the rgb-values extended by the alpha channel as hex-number (#rrggbbaa).
Andri Signorell <[email protected]>
ColToHex
, col2rgb
, adjustcolor
, ColToOpaque
SetAlpha("yellow", 0.2) SetAlpha(2, 0.5) # red Canvas(3) DrawCircle(x=c(-1,0,1), y=c(1,-1,1), r.out=2, col=SetAlpha(2:4, 0.4)) x <- rnorm(15000) par(mfrow=c(1,2)) plot(x, type="p", col="blue" ) plot(x, type="p", col=SetAlpha("blue", .2), main="Better insight with alpha channel" )
SetAlpha("yellow", 0.2) SetAlpha(2, 0.5) # red Canvas(3) DrawCircle(x=c(-1,0,1), y=c(1,-1,1), r.out=2, col=SetAlpha(2:4, 0.4)) x <- rnorm(15000) par(mfrow=c(1,2)) plot(x, type="p", col="blue" ) plot(x, type="p", col=SetAlpha("blue", .2), main="Better insight with alpha channel" )
This is a convenience function that sets the names of an object and returns it including the new names. It is most useful at the end of a function definition where one is creating the object to be returned and would prefer not to store it under a name just so that the names can be assigned. In addition to the function setNames in base R, the user can decide whether rownames, colnames or simply the names are to be set. Names are recycled.
SetNames(x, ...)
x |
an object for which a names attribute will be meaningful |
... |
the names to be assigned to the object. This should be a character vector of names named |
An object of the same sort as object with the new names assigned.
Andri Signorell <[email protected]>
SetNames(1:5, names=letters[1:5]) # the default, if no argument names are provided, is "names" SetNames(1:5, letters[1:5]) tab <- table(d.pizza$driver, d.pizza$wine_delivered) # rownames and columnnames can be set at the same time SetNames(BinomCI(tab[,1], rowSums(tab)), rownames=rownames(tab), colnames=c("perc", "lci", "uci")) # can also be used to set the names to an empty string SetNames(diag(6), rownames="", colnames="") # setting dimnames works as well tab <- SetNames( as.table(rbind(c(84,43), c(10,92))), dimnames= list( dipstick=c("positive","negative"), culture=c("positive","negative")))
Sometimes the area under a density curve has to be color shaded, for instance to illustrate a p-value or a specific region under the normal curve.
This function draws a curve corresponding to a function over the interval [from, to]
. It can plot also an expression in the variable xname
, default x
.
Shade(expr, col = par("fg"), breaks, density = 10, n = 101, xname = "x", ...)
expr |
the name of a function, or a |
col |
color to fill or shade the shape with. The default is taken from |
breaks |
numeric, a vector giving the breakpoints between the distinct areas to be shaded differently. Should be finite as there are no plots with infinite limits. |
density |
the density of the lines as needed in polygon. |
n |
integer; the number of x values at which to evaluate. Default is 101. |
xname |
character string giving the name to be used for the x axis. |
... |
the dots are passed on to |
Useful for shading the area under a curve as often needed for explaining significance tests.
A list with components x
and y
of the points that were drawn is returned invisibly.
Andri Signorell <[email protected]>
curve(dt(x, df=5), xlim=c(-6,6), main=paste("Student t-Distribution Probability Density Function, df = ", 5, ")", sep=""), type="n", las=1, ylab="probability", xlab="t") Shade(dt(x, df=5), breaks=c(-6, qt(0.025, df=5), qt(0.975, df=5), 6), col=c(DescTools::hred, DescTools::hblue), density=c(20, 7))
Performs the Shapiro-Francia test for the composite hypothesis of normality.
ShapiroFranciaTest(x)
x |
a numeric vector of data values, the number of which must be between 5 and 5000. Missing values are allowed. |
The test statistic of the Shapiro-Francia test is simply the squared correlation between the ordered sample values and the (approximated) expected ordered quantiles from the standard normal distribution. The p-value is computed from the formula given by Royston (1993).
A list of class htest
, containing the following components:
statistic |
the value of the Shapiro-Francia statistic. |
p.value |
the p-value for the test. |
method |
the character string “Shapiro-Francia normality test”. |
data.name |
a character string giving the name(s) of the data. |
The Shapiro-Francia test is known to perform well,
see also the comments by Royston (1993). The expected ordered quantiles
from the standard normal distribution are approximated by
qnorm(ppoints(x, a = 3/8))
, being slightly different from the approximation
qnorm(ppoints(x, a = 1/2))
used for the normal quantile-quantile plot by
qqnorm
for sample sizes greater than 10.
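The statistic itself can be computed in one line from this description (the p-value, via Royston's approximation, is what the function adds on top):

x <- rnorm(50)
cor(sort(x), qnorm(ppoints(x, a = 3/8)))^2   # compare with ShapiroFranciaTest(x)$statistic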
Juergen Gross <[email protected]>
Royston, P. (1993): A pocket-calculator algorithm for the Shapiro-Francia test for non-normality: an application to medicine. Statistics in Medicine, 12, 181–184.
Thode Jr., H.C. (2002): Testing for Normality. Marcel Dekker, New York. (2002, Sec. 2.3.2)
shapiro.test
for performing the Shapiro-Wilk test for normality.
AndersonDarlingTest
, CramerVonMisesTest
,
LillieTest
, PearsonTest
for performing further tests for normality.
qqnorm
for producing a normal quantile-quantile plot.
ShapiroFranciaTest(rnorm(100, mean = 5, sd = 3)) ShapiroFranciaTest(runif(100, min = 2, max = 4))
Non-parametric Siegel-Tukey test for equality in variability.
The null hypothesis is that the variability of x is equal between two
groups. A rejection of the null hypothesis indicates that variability differs between
the two groups. SiegelTukeyRank returns the ranks, calculated following the Siegel-Tukey logic.
SiegelTukeyTest(x, ...)

## Default S3 method:
SiegelTukeyTest(x, y, adjust.median = FALSE,
                alternative = c("two.sided", "less", "greater"),
                mu = 0, exact = NULL, correct = TRUE, conf.int = FALSE,
                conf.level = 0.95, ...)

## S3 method for class 'formula'
SiegelTukeyTest(formula, data, subset, na.action, ...)

SiegelTukeyRank(x, g, drop.median = TRUE)
x , y
|
numeric vector of data values. Non-finite (e.g. infinite or missing) values will be omitted. |
g |
a vector or factor object giving the group for the corresponding elements of x. |
adjust.median |
Should between-group differences in medians be leveled
before performing the test? In certain cases, the Siegel-Tukey test is
susceptible to median differences and may indicate significant differences
in variability that, in reality, stem from differences in medians. Default is |
alternative |
a character string specifying the alternative hypothesis, must be one of |
mu |
a number specifying an optional parameter used to form the null hypothesis. See Details. |
exact |
a logical indicating whether an exact p-value should be computed. This is passed directly to |
correct |
a logical indicating whether to apply continuity correction in the normal approximation for the p-value. |
conf.int |
a logical indicating whether a confidence interval should be computed. |
conf.level |
confidence level of the interval. |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain NAs. Defaults to |
drop.median |
logical, defining whether the median of the combined samples should be left out, ensuring that there's an even number of elements (which is a requirement of the Siegel-Tukey test). Defaults to |
... |
further arguments to be passed to or from methods. |
The Siegel-Tukey test has relatively low power and may, under certain
conditions, indicate significance due to differences in medians rather than
differences in variabilities (consider using the argument adjust.median
).
Consider also using mood.test
or ansari.test
.
A list of class htest
, containing the following components:
statistic |
Siegel-Tukey test (Wilcoxon test on tie-adjusted Siegel-Tukey ranks, after the median adjustment if specified). |
p.value |
the p-value for the test |
null.value |
is the value of the median specified by the null hypothesis. This
equals the input argument |
alternative |
a character string describing the alternative hypothesis. |
method |
the type of test applied |
data.name |
a character string giving the names of the data. |
Daniel Malter, Tal Galili <[email protected]>, Andri Signorell <[email protected]>
published on: https://www.r-statistics.com/2010/02/siegel-tukey-a-non-parametric-test-for-equality-in-variability-r-code/
Siegel, S., Tukey, J. W. (1960): A nonparametric sum of ranks procedure for relative spread in unpaired samples. Journal of the American Statistical Association.
Sheskin, D. J. (2004): Handbook of parametric and nonparametric statistical procedures 3rd edition. Chapman and Hall/CRC. Boca Raton, FL.
mood.test
, ansari.test
, wilcox.test
, LeveneTest
# Duller, S. 183 x <- c(12, 13, 29, 30) y <- c(15, 17, 18, 24, 25, 26) SiegelTukeyTest(x, y) SiegelTukeyTest(x, y, alternative="greater") # Duller, S. 323 old <- c(870,930,935,1045,1050,1052,1055) new <- c(932,970,980,1001,1009,1030,1032,1040,1046) SiegelTukeyTest(old, new, alternative = "greater") # compare to the recommended alternatives mood.test(old, new, alternative="greater") ansari.test(old, new, alternative="greater") # Bortz, S. 250 x <- c(26.3,26.5,26.8,27.0,27.0,27.2,27.3,27.3,27.4,27.5,27.6,27.8,27.9) id <- c(2,2,2,1,2,2,1,2,2,1,1,1,2)-1 SiegelTukeyTest(x ~ id) # Sachs, Angewandte Statistik, 12. Auflage, 2007, S. 314 A <- c(10.1,7.3,12.6,2.4,6.1,8.5,8.8,9.4,10.1,9.8) B <- c(15.3,3.6,16.5,2.9,3.3,4.2,4.9,7.3,11.7,13.1) SiegelTukeyTest(A, B) ### 1 x <- c(4,4,5,5,6,6) y <- c(0,0,1,9,10,10) SiegelTukeyTest(x, y) ### 2 # example for a non equal number of cases: x <- c(4,4,5,5,6,6) y <- c(0,0,1,9,10) SiegelTukeyTest(x, y) ### 3 x <- c(33, 62, 84, 85, 88, 93, 97, 4, 16, 48, 51, 66, 98) id <- c(0,0,0,0,0,0,0,1,1,1,1,1,1) SiegelTukeyTest(x ~ id) ### 4 x <- c(177,200,227,230,232,268,272,297,47,105,126,142,158,172,197,220,225,230,262,270) id <- c(rep(0,8),rep(1,12)) SiegelTukeyTest(x ~ id, adjust.median=TRUE) ### 5 x <- c(33,62,84,85,88,93,97) y <- c(4,16,48,51,66,98) SiegelTukeyTest(x, y) ### 6 x <- c(0,0,1,4,4,5,5,6,6,9,10,10) id <- c(0,0,0,1,1,1,1,1,1,0,0,0) SiegelTukeyTest(x ~ id) ### 7 x <- c(85,106,96, 105, 104, 108, 86) id <- c(0,0,1,1,1,1,1) SiegelTukeyTest(x ~ id)
Performs one- and two-sample sign tests on vectors of data.
SignTest(x, ...)

## Default S3 method:
SignTest(x, y = NULL, alternative = c("two.sided", "less", "greater"),
         mu = 0, conf.level = 0.95, ...)

## S3 method for class 'formula'
SignTest(formula, data, subset, na.action, ...)
x |
numeric vector of data values. Non-finite (e.g. infinite or missing) values will be omitted. |
y |
an optional numeric vector of data values: as with x non-finite values will be omitted. |
mu |
a number specifying an optional parameter used to form the null hypothesis. See Details. |
alternative |
is a character string, one of |
conf.level |
confidence level for the returned confidence interval, restricted to lie between zero and one. |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain NAs. Defaults to |
... |
further arguments to be passed to or from methods. |
The formula interface is only applicable for the 2-sample test.
SignTest
computes a “Dependent-samples Sign-Test” if both
x
and y
are provided. If only x
is provided,
the “One-sample Sign-Test” will be computed.
For the one-sample sign-test, the null hypothesis is
that the median of the population from which x
is drawn is mu
.
For the two-sample dependent case, the null hypothesis is
that the median for the differences of the populations from which x
and y
are drawn is mu
.
The alternative hypothesis indicates the direction of divergence of the
population median for x
from mu
(i.e., "greater"
,
"less"
, "two.sided"
.)
The confidence levels are exact.
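For the one-sample case the classical sign test logic reduces to a binomial test on the number of positive differences; a minimal sketch (compare with SignTest(x, mu = 1)):

x <- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30)
d <- x - 1            # differences from the hypothesized median mu = 1
d <- d[d != 0]        # zero differences carry no sign information
binom.test(sum(d > 0), length(d))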
A list of class htest
, containing the following components:
statistic |
the S-statistic (the number of positive differences between the data and the hypothesized median), with names attribute “S”. |
parameter |
the total number of valid differences. |
p.value |
the p-value for the test. |
null.value |
is the value of the median specified by the null hypothesis. This
equals the input argument |
alternative |
a character string describing the alternative hypothesis. |
method |
the type of test applied. |
data.name |
a character string giving the names of the data. |
conf.int |
a confidence interval for the median. |
estimate |
the sample median. |
Andri Signorell <[email protected]>
Gibbons, J.D. and Chakraborti, S. (1992): Nonparametric Statistical Inference. Marcel Dekker Inc., New York.
Kitchens, L. J. (2003): Basic Statistics and Data Analysis. Duxbury.
Conover, W. J. (1980): Practical Nonparametric Statistics, 2nd ed. Wiley, New York.
t.test
, wilcox.test
, ZTest
, binom.test
,
SIGN.test
in the package BSDA (reporting approximative confidence intervals).
x <- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30) y <- c(0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29) SignTest(x, y) wilcox.test(x, y, paired = TRUE) d.light <- data.frame( black = c(25.85,28.84,32.05,25.74,20.89,41.05,25.01,24.96,27.47), white <- c(18.23,20.84,22.96,19.68,19.5,24.98,16.61,16.07,24.59), d <- c(7.62,8,9.09,6.06,1.39,16.07,8.4,8.89,2.88) ) d <- d.light$d SignTest(x=d, mu = 4) wilcox.test(x=d, mu = 4, conf.int = TRUE) SignTest(x=d, mu = 4, alternative="less") wilcox.test(x=d, mu = 4, conf.int = TRUE, alternative="less") SignTest(x=d, mu = 4, alternative="greater") wilcox.test(x=d, mu = 4, conf.int = TRUE, alternative="greater") # test die interfaces x <- runif(10) y <- runif(10) g <- rep(1:2, each=10) xx <- c(x, y) SignTest(x ~ group, data=data.frame(x=xx, group=g )) SignTest(xx ~ g) SignTest(x, y) SignTest(x - y)
Formula interface for smooth.spline: smooth.spline has no formula interface, which is sometimes inconvenient if one simply wants to copy the formula of a linear model or of a plot over to the spline.
SmoothSpline(x, ...)

## Default S3 method:
SmoothSpline(x, y = NULL, w = NULL, df, spar = NULL, cv = FALSE,
             all.knots = FALSE, nknots = .nknots.smspl, keep.data = TRUE,
             df.offset = 0, penalty = 1, control.spar = list(),
             tol = 0.000001 * IQR(x), ...)

## S3 method for class 'formula'
SmoothSpline(formula, data, subset, na.action, ...)
x |
a vector giving the values of the predictor variable, or a list or a two-column matrix specifying x and y. |
y |
responses. If |
w |
optional vector of weights of the same length as |
df |
the desired equivalent number of degrees of freedom (trace of the smoother matrix). |
spar |
smoothing parameter, typically (but not necessarily) in
|
cv |
ordinary ( |
all.knots |
if |
nknots |
integer or |
keep.data |
logical specifying if the input data should be kept
in the result. If |
df.offset |
allows the degrees of freedom to be increased by
|
penalty |
the coefficient of the penalty for degrees of freedom in the GCV criterion. |
control.spar |
optional list with named components controlling the
root finding when the smoothing parameter Note that this is partly experimental and may change with general spar computation improvements!
Note that |
tol |
a tolerance for same-ness or uniqueness of the |
formula |
a formula of the form |
data |
The data frame from which the formula should be evaluated. |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain NAs. Defaults to
|
... |
Other arguments to be passed to
|
Andri Signorell <[email protected]>
smooth.spline
, lines.smooth.spline
plot(temperature ~ delivery_min, data=d.pizza) lines(SmoothSpline(temperature ~ delivery_min, data=d.pizza))
For displaying the first and last elements of an object there are the functions head
and tail
. Sometimes one might want to see more randomly scattered elements. This function returns some random parts of a vector, matrix or a data frame. The order of the elements within the object will be preserved.
Some(x, n = 6L, ...)

## Default S3 method:
Some(x, n = 6L, ...)

## S3 method for class 'data.frame'
Some(x, n = 6L, ...)

## S3 method for class 'matrix'
Some(x, n = 6L, addrownums = TRUE, ...)
x |
an object |
n |
a single integer. If positive, size for the resulting
object: number of elements for a vector (including lists), rows for
a matrix or data frame or lines for a function. If negative, all but
the |
addrownums |
if there are no row names, create them from the row numbers. |
... |
arguments to be passed to or from other methods. |
For matrices, 2-dim tables and data frames, Some()
returns
some n
rows when n > 0
or all but some n rows when n < 0. Some.matrix()
. Some.matrix()
is not exported (unlike head.matrix
).
If a matrix has no row names, then Some()
will add row names of
the form "[n,]"
to the result, so that it looks similar to the
last lines of x
when printed. Setting addrownums =
FALSE
suppresses this behaviour.
I desisted from implementing interfaces for tables, ftables and functions, as this would not make much sense.
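For a plain vector the idea boils down to sampling positions and keeping them in their original order; a minimal sketch, not necessarily the internal implementation:

x <- letters
x[sort(sample(seq_along(x), 6))]   # compare with Some(letters)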
An object (usually) like x
but generally smaller.
Andri Signorell, basically copying and just slightly modifying Patrick Burns and R-Core code.
Some(letters) Some(letters, n = -6L) Some(freeny.x, n = 10L) Some(freeny.y)
Test if x contains only integer numbers, if it is numeric, or if it is zero.
IsWhole(x, all = FALSE, tol = sqrt(.Machine$double.eps), na.rm = FALSE) IsZero(x, tol = sqrt(.Machine$double.eps), na.rm = FALSE) IsNumeric(x, length.arg = Inf, integer.valued = FALSE, positive = FALSE, na.rm = FALSE)
x |
a (non-empty) numeric vector of data values. |
all |
logical, specifying if the whole vector should be checked. If set to |
tol |
tolerance to be used |
length.arg |
integer, the length of the vector to be checked for. |
integer.valued |
logical, should x be checked as integer? |
positive |
logical, is x supposed to be positive? |
na.rm |
logical, indicating whether |
IsWhole is the suggested solution for checking for an integer value, as is.integer
tests for class(x) == "integer"
and does NOT test whether x (which might be of class "numeric") contains only integer numbers.
(Why not simply implement it in base?)
IsZero tests float numeric values for being zero.
IsNumeric combines a test for numeric and integers.
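The tolerance-based comparisons behind these checks can be sketched as follows (a sketch of the idea only, assuming the default tolerance; the rounding-based whole-number check is an assumption, not taken from the package source):

tol <- sqrt(.Machine$double.eps)
is_zero  <- function(x) abs(x) < tol
is_whole <- function(x) abs(x - round(x)) < tol
is_zero(.1 - .3 / 3)          # TRUE
is_whole(seq(1, 5, by=0.5))   # TRUE FALSE TRUE ...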
logical vector of the same dimension as x.
R-Core, Andri Signorell <[email protected]>, Thomas W. Yee
(x <- seq(1,5, by=0.5))
IsWhole( x )   #--> TRUE FALSE TRUE ...

# "... These are people who live in ignorance of the Floating Point Gods.
# These pagans expect ..." (Burns, 2011) the following to be TRUE:
(.1 - .3 / 3) == 0

# they might be helped by
IsZero(.1 - .3 / 3)
Calculate Somers' Delta statistic, a measure of association for ordinal factors in a two-way table. The function has interfaces for a table (matrix) and for single vectors.
SomersDelta(x, y = NULL, direction = c("row", "column"), conf.level = NA, ...)
x |
a numeric vector or a table. A matrix will be treated as table. |
y |
NULL (default) or a vector with compatible dimensions to |
direction |
direction of the calculation. Can be |
conf.level |
confidence level of the interval. If set to |
... |
further arguments are passed to the function |
Somers' D(C|R) and Somers' D(R|C) are asymmetric modifications of $\tau_b$ and Goodman-Kruskal's Gamma. C|R indicates that
the row variable x is regarded as the independent variable and the column variable y is regarded
as dependent. Similarly, R|C indicates that the column variable y is regarded as the independent
variable and the row variable x is regarded as dependent. It is logically very similar to Gamma, but differs in that it uses a correction only for pairs that are tied on the dependent variable. As Gamma and the Taus, D is appropriate only when both variables lie on an ordinal scale.
Somers' D is computed as
$$D(C|R) = \frac{P - Q}{n^2 - \sum_i n_{i.}^2}$$
where $P$ equals twice the number of concordances, $Q$ twice the number of discordances and $n_{i.}$ the row sums rowSums(tab). Its range lies within [-1, 1]. The interpretation of D is analogous to Gamma.
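A direct implementation of this formula for a two-way table, as a sketch assuming ordered row and column categories (the package function additionally provides confidence intervals):

somersCR <- function(tab) {
  tab <- as.matrix(tab); n <- sum(tab)
  con <- dis <- 0
  for (i in seq_len(nrow(tab))) for (j in seq_len(ncol(tab))) {
    con <- con + tab[i, j] * sum(tab[row(tab) > i & col(tab) > j])  # concordant pairs
    dis <- dis + tab[i, j] * sum(tab[row(tab) > i & col(tab) < j])  # discordant pairs
  }
  (2 * con - 2 * dis) / (n^2 - sum(rowSums(tab)^2))
}
tab <- as.table(rbind(c(26,26,23,18,9), c(6,7,9,14,23)))
somersCR(tab)   # compare with SomersDelta(tab, direction="column")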
a single numeric value if no confidence intervals are requested
and otherwise a numeric vector with 3 elements for the estimate, the lower and the upper confidence interval
Andri Signorell <[email protected]>
Agresti, A. (2002) Categorical Data Analysis. John Wiley & Sons, pp. 57–59.
Brown, M.B., Benedetti, J.K.(1977) Sampling Behavior of Tests for Correlation in Two-Way Contingency Tables, Journal of the American Statistical Association, 72, 309-315.
Goodman, L. A., & Kruskal, W. H. (1954) Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732-764.
Somers, R. H. (1962) A New Asymmetric Measure of Association for Ordinal Variables, American Sociological Review, 27, 799–811.
Goodman, L. A., & Kruskal, W. H. (1963) Measures of association for cross classifications III: Approximate sampling theory. Journal of the American Statistical Association, 58, 310–364.
There's an implementation of Somers' D in Frank Harrell's Hmisc (somers2), which is quite fast for large sample sizes. However it is restricted to computing Somers' Dxy rank correlation between a variable x and a binary (0-1) variable y.
ConDisPairs yields concordant and discordant pairs.
Other association measures: KendallTauA (tau-a), KendallTauB (tau-b), cor (method="kendall") for tau-b, StuartTauC (tau-c), GoodmanKruskalGamma, Lambda, GoodmanKruskalTau, UncertCoef, MutInf
# example in:
# http://support.sas.com/documentation/cdl/en/statugfreq/63124/PDF/default/statugfreq.pdf
# pp. S. 1821
tab <- as.table(rbind(c(26,26,23,18,9), c(6,7,9,14,23)))
# Somers' D C|R
SomersDelta(tab, direction="column", conf.level=0.95)
# Somers' D R|C
SomersDelta(tab, direction="row", conf.level=0.95)
Sort a vector, a matrix, a table or a data.frame. The base sort function does not have an interface for classes other than vectors and coerces the whole world to a vector. This means you get a sorted vector as result while passing a matrix to sort. Sort wraps the base sort function and adds an interface for sorting the rows of the named 2-dimensional data structures by the order of one or more of its columns.
Sort(x, ...)
## Default S3 method:
Sort(x, ...)
## S3 method for class 'matrix'
Sort(x, ord = NULL, decreasing = FALSE, na.last = TRUE, ...)
## S3 method for class 'table'
Sort(x, ord = NULL, decreasing = FALSE, na.last = TRUE, ...)
## S3 method for class 'data.frame'
Sort(x, ord = NULL, decreasing = FALSE, factorsAsCharacter = TRUE, na.last = TRUE, ...)
x |
a numeric, complex, character or logical vector, a factor, a table or a data.frame to be sorted. |
decreasing |
logical. Should the sort be increasing or decreasing? |
factorsAsCharacter |
logical. Should factors be sorted by the alphabetical order of their labels or by the order of their levels?
Default is |
ord |
vector of integers or column names. Defines the columns in a table, in a matrix or in a data.frame to sort by. |
na.last |
for controlling the treatment of |
... |
further arguments to be passed to or from methods. |
The sort order for factors is the order of their levels (which is particularly appropriate for ordered factors), and usually confusing for unordered factors, whose levels may be defined in the sequence in which they appear in the data (which normally is unordered).
the sorted object.
Andri Signorell <[email protected]>
d.frm <- d.pizza[1:10, c("driver","temperature","delivery_min")]
Sort(d.frm[,1])
# Sort follows the levels by default
levels(d.frm[,1])

Sort(x=d.frm, ord="driver", decreasing=FALSE)
# set factorsAsCharacter = TRUE, if alphabetical order is required
Sort(x=d.frm, ord="driver", decreasing=FALSE, factorsAsCharacter=TRUE)

Sort(x=d.frm, ord=c("driver","delivery_min"), factorsAsCharacter = TRUE)
Sort(x=d.frm, ord=c("driver","delivery_min"), factorsAsCharacter = FALSE)
Sort(x=d.frm, ord=c("driver","delivery_min"), decreasing=c(FALSE, TRUE),
     factorsAsCharacter = FALSE)

# Sorting tables
tab <- table(d.pizza$driver, d.pizza$area)
Sort(x=tab, ord=c(0,2), decreasing=c(TRUE, FALSE))
Sort(x=tab, ord=2, decreasing=TRUE)

# partial matching ok:
Sort(tab, o=1, d=TRUE)
These functions sort or order character strings containing embedded numbers so that the numbers are numerically sorted rather than sorted by character value. I.e. "Asprin 50mg" will come before "Asprin 100mg". In addition, case of character strings is ignored so that "a" will come before "B" and "C".
SortMixed(x, decreasing=FALSE, na.last=TRUE, blank.last=FALSE,
          numeric.type=c("decimal", "roman"),
          roman.case=c("upper","lower","both"))
OrderMixed(x, decreasing=FALSE, na.last=TRUE, blank.last=FALSE,
           numeric.type=c("decimal", "roman"),
           roman.case=c("upper","lower","both"))
x |
vector to be sorted. |
decreasing |
logical. Should the sort be increasing or
decreasing? Note that |
na.last |
logical, controlling the treatment of |
blank.last |
logical, controlling the treatment of blank values. If |
numeric.type |
either |
roman.case |
one of |
I often have character vectors (e.g. factor labels), such as compound and dose, that contain both text and numeric data. This function is useful for sorting these character vectors into a logical order.
It does so by splitting each character vector into a sequence of character and numeric sections, and then sorting along these sections, with numbers being sorted by numeric value (e.g. "50" comes before "100"), followed by character strings sorted by character value (e.g. "A" comes before "B"), ignoring case (e.g. 'A' has the same sort order as 'a').
By default, sort order is ascending, empty strings are sorted to the front, and NA values to the end. Setting decreasing=TRUE changes the sort order to descending and reverses the meanings of na.last and blank.last.
Parsing looks for decimal numbers unless numeric.type="roman", in which case parsing looks for roman numerals, with character case specified by roman.case.
OrderMixed
returns a vector giving the sort order of the input
elements. SortMixed
returns the sorted vector.
Gregory R. Warnes [email protected]
## compound & dose labels
Treatment <- c("Control", "Asprin 10mg/day", "Asprin 50mg/day", "Asprin 100mg/day",
               "Acetomycin 100mg/day", "Acetomycin 1000mg/day")
## ordinary sort puts the dosages in the wrong order
sort(Treatment)
## but SortMixed does the 'right' thing
SortMixed(Treatment)

## Here is a more complex example
x <- rev(c("AA 0.50 ml", "AA 1.5 ml", "AA 500 ml", "AA 1500 ml", "EXP 1",
           "AA 1e3 ml", "A A A", "1 2 3 A", "NA", NA, "1e2", "", "-", "1A",
           "1 A", "100", "100A", "Inf"))
OrderMixed(x)
SortMixed(x)  # Notice that plain numbers, including 'Inf' show up
              # before strings, NAs at the end, and blanks at the beginning.

SortMixed(x, na.last=TRUE)      # default
SortMixed(x, na.last=FALSE)     # push NAs to the front
SortMixed(x, blank.last=FALSE)  # default
SortMixed(x, blank.last=TRUE)   # push blanks to the end
SortMixed(x, decreasing=FALSE)  # default
SortMixed(x, decreasing=TRUE)   # reverse sort order

## Roman numerals
chapters <- c("V. Non Sequiturs", "II. More Nonsense", "I. Nonsense",
              "IV. Nonesensical Citations", "III. Utter Nonsense")
SortMixed(chapters, numeric.type="roman")

## Lower-case Roman numerals
vals <- c("xix", "xii", "mcv", "iii", "iv", "dcclxxii", "cdxcii",
          "dcxcviii", "dcvi", "cci")
(ordered <- SortMixed(vals, numeric.type="roman", roman.case="lower"))
RomanToInt(ordered)
Calculate Spearman correlation coefficient and its confidence interval. In addition to the base R function cor()
, frequency tables are also accepted as arguments (i.e. actually weights are used).
SpearmanRho(x, y = NULL,
            use = c("everything", "all.obs", "complete.obs",
                    "na.or.complete", "pairwise.complete.obs"),
            conf.level = NA)
x |
a numeric vector, an ordered factor, matrix or data frame. An ordered factor will be coerced to numeric. |
y |
|
use |
an optional character string giving a method for computing covariances in the presence of missing values.
This must be (an abbreviation of) one of the strings |
conf.level |
confidence level of the interval. If set to |
The function calculates Spearman's rho statistic by means of cor(..., method="spearman")
when two variables x
and y
are supplied. If a frequency table is provided an implementation based on SAS documentation is used.
The confidence intervals are calculated via z-Transformation.
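As an illustration of such a z-transformation, a common Fisher-z construction can be sketched as follows (the helper name is hypothetical and the standard error term is an assumption; the package may use a different variance approximation):
rho_ci_sketch <- function(x, y, conf.level = 0.95) {
  r <- cor(x, y, method = "spearman", use = "pairwise.complete.obs")
  n <- sum(complete.cases(x, y))
  z <- atanh(r)                      # Fisher z-transformation of the correlation
  se <- 1 / sqrt(n - 3)              # approximate standard error on the z scale
  q <- qnorm(1 - (1 - conf.level) / 2)
  tanh(c(rho = z, lwr.ci = z - q * se, upr.ci = z + q * se))   # back-transform
}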
Either a single numeric value, if no confidence interval is required,
or a vector with 3 elements for estimate, lower and upper confidence interval.
Andri Signorell <[email protected]>
Conover W. J. (1999) Practical Nonparametric Statistics (3rd edition). Wiley
pain <- as.table(matrix(c(26, 6, 26, 7, 23, 9, 18, 14, 9, 23), ncol=5, byrow=TRUE,
                        dimnames=list(adverse=c("no", "yes"), dose=1:5)))
SpearmanRho(pain)
SpearmanRho(pain, conf.level=0.95)

# must be the same as
with(Untable(pain), SpearmanRho(adverse, dose, conf.level=0.95))
Implementation of a simple formula interface for the split
function.
## S3 method for class 'formula' split(x, f, drop = FALSE, data = NULL, ...)
x |
a formula of the form |
f |
a 'factor' in the sense that |
drop |
logical indicating if levels that do not occur should be dropped (if |
data |
the data frame from which the formula should be evaluated. |
... |
other arguments to be passed to
|
Andri Signorell <[email protected]>
split(extra ~ group, data = sleep)
Split a vector into several pieces at given positions.
SplitAt(x, pos)
x |
the vector to be split. |
pos |
integer vector, giving the positions at which the vector should be split. |
a list with the split parts of x.
flodel (on StackOverflow)
https://stackoverflow.com/questions/16357962/r-split-numeric-vector-at-position
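The underlying idea from the answer linked above can be sketched in a single expression (illustrative only, not necessarily the exact implementation; the helper name is hypothetical):
split_at_sketch <- function(x, pos) {
  # start a new group at every position listed in 'pos'
  unname(split(x, cumsum(seq_along(x) %in% pos)))
}
split_at_sketch(1:10, pos = c(3, 8))   # pieces 1:2, 3:7 and 8:10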
x <- 1:10
SplitAt(x, pos=c(3, 8))
Split a full path in its components. This is specifically an issue in Windows and not really interesting for other OSs.
SplitPath(path, last.is.file = NULL)
path |
a path |
last.is.file |
logical, determining if the basename should be interpreted as filename or as last directory. If set to |
A list, containing the following components:
normpath |
the normalized path as returned by |
drive |
the drive if the OS is Windows, |
dirname |
the path without drive and without filename |
fullfilename |
the filename including extension |
filename |
the filename without extension |
extension |
the file extension |
Andri Signorell <[email protected]>
## Not run:
# Windows-specific example
path <- "C:/Documents/Projects/Import/eyestudy.dta"
SplitPath(path)

path <- "C:/Documents/Projects/Import/"
SplitPath(path)

path <- "C:/Documents/Projects/Import"
SplitPath(path)   # last entry will be taken as filename
SplitPath(path, last.is.file=FALSE)
## End(Not run)
Splitting the string columns of a data frame into multiple columns requires a considerable number of code lines, which are condensed in this function for convenience.
SplitToCol(x, split = " ", fixed = TRUE, na.form = "", colnames = NULL)
x |
a data frame containing the string columns to be split. |
split |
character vector (or object which can be coerced to such)
containing regular expression(s) (unless |
fixed |
logical. If |
na.form |
character, string specifying how |
colnames |
columnnames for the resulting data.frame. Will be recycled. Can easily be set to |
A data.frame with all the columns split.
A vector with the length of the number of columns of the data.frame, containing the number of found columns, is returned as attribute named "ncols".
Andri Signorell <[email protected]>
d.frm <- data.frame(res1=c("2 [-3,5] **", "5 [-2,6] ***", "9 [-3,1]"),
                    res2=c("5 [6,8] **", "7 [-2,9]", "4 [3,5] **"),
                    stringsAsFactors=FALSE)
SplitToCol(d.frm, na.form="-", colnames=c("coef", "ci", "pval"))
Split the strings of a character vector, put together all the unique pieces and return a matrix of dummy vectors for each single value.
SplitToDummy(x, split = ",", ...)
x |
character vector, each element of which is to be split. Other inputs, including a factor, will give an error. |
split |
character vector (or object which can be coerced to such) containing regular expression(s) (unless |
... |
the dots are passed on to |
a data.frame containing x
and all the found dummy vectors
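The core idea can be sketched with strsplit() and a membership test (the helper name is hypothetical and the sketch ignores the further arguments passed on by SplitToDummy):
split_to_dummy_sketch <- function(x, split = ",") {
  parts <- strsplit(x, split = split, fixed = TRUE)
  lvl <- sort(unique(unlist(parts)))
  # one 0/1 column per unique piece, indicating whether it occurs in x[i]
  dummies <- sapply(lvl, function(g) as.integer(sapply(parts, function(p) g %in% p)))
  data.frame(x = x, dummies, check.names = FALSE)
}
split_to_dummy_sketch(c("A,C,D", "A", "B,C", "D", "D,E"))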
Andri Signorell <[email protected]>
d.frm <- data.frame(id=1:5, txt=c("A,C,D", "A", "B,C", "D", "D,E"))
SplitToDummy(d.frm$txt)
Spread the numbers of a vector so that there is a minimum interval between any two numbers (in ascending or descending order). This is helpful when we want to place text boxes on a plot and ensure that they do not mutually overlap.
SpreadOut(x, mindist = NULL, cex = 1.0)
x |
a numeric vector which may contain |
mindist |
the minimum interval between any two values. If this is left to |
cex |
numeric character expansion factor; multiplied by |
SpreadOut()
starts at or near the middle of the vector and increases the
intervals between the ordered values. NA
s are preserved. SpreadOut()
first tries to spread groups of values with intervals less than mindist
out neatly away from the mean of the group. If this doesn't entirely succeed,
a second pass that forces values away from the middle is performed.
SpreadOut()
can also be used to avoid overplotting of axis tick labels
where they may be close together.
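A much cruder one-pass variant of the same idea can be sketched as follows: sort the values and push each one up until it keeps at least mindist to its predecessor. Unlike SpreadOut(), this sketch does not recentre groups around their mean; it only illustrates the concept (helper name hypothetical):
spread_out_sketch <- function(x, mindist) {
  ord <- order(x, na.last = NA)             # positions of the non-NA values, sorted
  y <- x[ord]
  for (i in seq_along(y)[-1])
    y[i] <- max(y[i], y[i - 1] + mindist)   # enforce the minimum gap
  x[ord] <- y                               # write back, NAs stay where they were
  x
}
spread_out_sketch(c(1, 3, 3, 3, 3, 5), 0.2)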
On success, the spread out values. If there are less than two valid values, the original vector is returned.
This function is based on plotrix::spreadout()
and has been
integrated here with some minor changes.
Jim Lemon <[email protected]>
some extensions Andri Signorell <[email protected]>
SpreadOut(c(1, 3, 3, 3, 3, 5), 0.2)
SpreadOut(c(1, 2.5, 2.5, 3.5, 3.5, 5), 0.2)
SpreadOut(c(5, 2.5, 2.5, NA, 3.5, 1, 3.5, NA), 0.2)

# this will almost always invoke the brute force second pass
SpreadOut(rnorm(10), 0.5)
Stamp the current plot in the extreme lower right corner. A free text or an expression can be supplied as the stamp text.
Stamp(txt = NULL, las = par("las"), cex = 0.6)
txt |
an optional single text string. If it is not given, the function will look for a defined option named |
las |
numeric in |
cex |
numeric character expansion factor; multiplied by |
The text can be freely defined as option. If user and date should be included by default, the following option using an expression will help:
DescToolsOptions(stamp=expression(gettextf('%s/%s', Sys.getenv('USERNAME'), Format(Today(), fmt='yyyy-mm-dd') )))
Results may not be satisfactory if par(mfrow=) is in effect.
Frank E Harrell Jr <[email protected]>
with some amendments by Andri Signorell <[email protected]>
plot(1:20)
Stamp()
Standardize model coefficients by Standard Deviation or Partial Standard Deviation.
StdCoef(x, partial.sd = FALSE, ...)
PartialSD(x)
x |
a fitted model object. |
partial.sd |
logical, if set to |
... |
additional arguments passed to |
The standardized coefficients are meant to allow for a comparison of the importance of explanatory variables that have different variances. Each of them shows the effect on the response of increasing its predictor X(j) by one standard deviation, as a multiple of the response's standard deviation. This is often a more meaningful comparison of the relevance of the input variables.
Note, however, that increasing one X(j) without also changing others may not be possible in a given application, and therefore, interpretation of coefficients can always be tricky. Furthermore, for binary input variables, increasing the variable by one standard deviation is impossible, since an increase can only occur from 0 to 1, and therefore, the standardized coeffient is somewhat counterintuitive in this case.
Standardizing model coefficients has the same effect as centring and scaling the input variables.
"Classical" standardized coefficients are calculated as
beta_j* = beta_j * s_Xj / s_Y,
where beta_j is the unstandardized coefficient, s_Xj is the standard deviation of the associated predictor variable X_j and s_Y is the standard deviation of the response variable.
If the variables are intercorrelated, the standard deviation of X_j used in computing the standardized coefficients should be replaced by a partial standard deviation of X_j, which is adjusted for the multiple correlation of X_j with the other X variables included in the regression equation. The partial standard deviation is calculated as
s*_Xj = s_Xj * sqrt(1 / VIF_j) * sqrt((n - 1) / (n - p)),
where VIF is the variance inflation factor, n is the number of observations and p the number of predictors in the model. The coefficient is then transformed as
beta_j* = beta_j * s*_Xj / s_Y.
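For the "classical" case the transformation can be reproduced by hand from a fitted lm object; a minimal sketch under the assumption of a purely numeric model matrix (up to details such as how the standard deviations are computed, this should agree with StdCoef(fm)):
fm <- lm(Fertility ~ Agriculture + Examination + Education + Catholic, data = swiss)
b    <- coef(fm)[-1]                                        # drop the intercept
sd_x <- apply(model.matrix(fm)[, -1, drop = FALSE], 2, sd)  # sd of each predictor
sd_y <- sd(model.response(model.frame(fm)))                 # sd of the response
b * sd_x / sd_y                                             # beta_j * s_Xj / s_Y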
A matrix with at least two columns for standardized coefficient estimate and its standard error. Optionally, third column holds degrees of freedom associated with the coefficients.
Kamil Bartoń
Cade, B.S. (2015) Model averaging and muddled multimodel inferences. Ecology 96, 2370-2382.
Afifi A., May S., Clark V.A. (2011) Practical Multivariate Analysis, Fifth Edition. CRC Press.
Bring, J. (1994). How to standardize regression coefficients. The American Statistician 48, 209-213.
# Fit model to original data:
fm <- lm(Fertility ~ Agriculture + Examination + Education + Catholic, data = swiss)

# Partial SD for the default formula:
psd <- PartialSD(lm(data = swiss))[-1]   # remove first element for intercept

# Standardize data:
zswiss <- scale(swiss, scale = c(NA, psd), center = TRUE)
# Note: first element of 'scale' is set to NA to ignore the first column 'y'

# Coefficients of a model fitted to standardized data:
# zapsmall(coefTable(stdizeFit(fm, data = zGPA)))
# Standardized coefficients of a model fitted to original data:
# zapsmall(StdCoef(fm, partial = TRUE))

# Standardizing nonlinear models:
fam <- Gamma("inverse")
fmg <- glm(log(Fertility) ~ Agriculture + Examination + Education + Catholic,
           data = swiss, family = fam)
psdg <- PartialSD(fmg)
# zGPA <- stdize(GPA, scale = c(NA, psdg[-1]), center = FALSE)
# fmgz <- glm(log(y) ~ z.x1 + z.x2 + z.x3 + z.x4, zGPA, family = fam)

# Coefficients using standardized data:
# coef(fmgz)   # (intercept is unchanged because the variables haven't been centred)
# Standardized coefficients:
# coef(fmg) * psdg
Basically a wrapper for str()
, extended with an enumeration for the variables of a data.frame.
Str(x, ...)
x |
any |
... |
dots are passed to |
Andri Signorell <[email protected]>
Str(d.pizza)
Abbreviate a character vector. Starting from the first character, the function includes as many characters as are needed to result in a vector of unique values.
StrAbbr(x, minchar = 1, method = c("left", "fix"))
x |
character vector to be abbreviated |
minchar |
integer, minimal number of characters for the abbreviations. |
method |
one out of |
The abbreviated strings.
Andri Signorell <[email protected]>
StrAbbr(x=levels(d.pizza$driver), minchar=2)
StrAbbr(x=levels(d.pizza$driver), minchar=2, method="left")
StrAbbr(x=levels(d.pizza$driver), minchar=2, method="fix")

x <- c("Aaron", "Aaramis", "Berta", "Bello", "Claudia", "Cardinale", "Doretta", "Emilia")
StrAbbr(x, minchar=2, method="left")
StrAbbr(x, minchar=2, method="fix")
Align a vector of strings to the left, to the right, to the center or to the first occurrence of a specified character, e.g. to the decimal separator. Alignment is achieved by padding the strings with empty spaces (which evidently only will have an alignment effect if the text is displayed with a monospaced font).
StrAlign(x, sep = "\\r")
x |
a character vector to be aligned. |
sep |
the character on whose position the strings will be aligned. Left alignment can be requested by setting |
Alignment to the left or right leaves no room for misinterpretation. The function will determine the maximum string size in the vector and resize all the strings to this size by padding empty spaces either at the beginning or at the end.
cbind(StrAlign(c("here", "there", "everywhere"), sep = "\\r"))
[1,] "      here"
[2,] "     there"
[3,] "everywhere"
When it comes to center strings, it's not clear where to place strings with an even length in case the maximum length is odd (or vice versa). We will put the shorter distance of an uneven string to the left (note the second term, that has 2 spaces on the left and 3 spaces on the right).
cbind(StrAlign(c("here", "there", "everywhere"), sep = "\c")) [1,] " here " [2,] " there " [3,] "everywhere"
Any specific length of the strings can be created by StrPad
if required.
In case of a given character as separator the strings will be aligned towards this separator. Frequently this might be the decimal separator. If a string does not contain the separator, the affected string will be aligned as if it had a separator as last character. This seems to be a good default when integer numbers are to be aligned with numerical values. Note that the character length of the resulting strings can exceed the maximum length of the supplied strings.
z <- c(" 6.0", "6.00 ", " 45.12 ", "784", NA) cbind(StrAlign(z, sep=".")) [,1] [1,] " 6.0 " [2,] " 6.00" [3,] " 45.12" [4,] "784 " [5,] NA
The character strings will not be pruned of whitespaces, if the requested alignment does not explicitly require it. StrTrim
can be used for that.
a character vector containing the aligned strings
Andri Signorell <[email protected]>
# align on (the first occurring) B
x <- c("ABCDMNB", "CDGHEBK", "BCI")
cbind(StrAlign(x, sep="B"))

# align to decimal separator (here point)
z <- c(" 6.0", "6.00 ", " 45.12 ", "784", NA)
cbind(StrAlign(z, sep="."))

# right align, the width will be the max number of characters in x
cbind(StrAlign(x, sep="\\r"))
# left align
cbind(StrAlign(x, sep="\\l"))
# center
cbind(StrAlign(x, sep="\\c"))
Stratified sampling with equal/unequal probabilities.
Strata(x, stratanames = NULL, size,
       method = c("srswor", "srswr", "poisson", "systematic"),
       pik, description = FALSE)
x |
a data frame or a matrix; its number of rows is n, the population size. |
stratanames |
vector of stratification variables. |
size |
vector of stratum sample sizes (in the order in which the strata are given in the input data set). |
method |
method to select units; implemented are: a) simple random
sampling without replacement ( |
pik |
vector of inclusion probabilities or auxiliary information used to compute them; this argument is only used for unequal probability sampling (Poisson and systematic). If an auxiliary information is provided, the function uses the inclusionprobabilities function for computing these probabilities. If the method is "srswr" and the sample size is larger than the population size, this vector is normalized to one. |
description |
a message is printed if its value is TRUE; the message gives the number of selected units and the number of the units in the population. By default, the value is FALSE. |
The function produces an object, which contains the following information:
id |
the identifier of the selected units. |
stratum |
the unit stratum. |
prob |
the final unit inclusion probability. |
Andri Signorell <[email protected]>
rewritten based on the ideas of Yves Tille <[email protected]> and Alina Matei <[email protected]>
# Example from An and Watts (New SAS procedures for Analysis of Sample Survey Data)
# generates artificial data (a 235X3 matrix with 3 columns: state, region, income).
# the variable "state" has 2 categories ('nc' and 'sc').
# the variable "region" has 3 categories (1, 2 and 3).
# the sampling frame is stratified by region within state.
# the income variable is randomly generated

m <- rbind(matrix(rep("nc",165), 165, 1, byrow=TRUE),
           matrix(rep("sc", 70), 70, 1, byrow=TRUE))
m <- cbind.data.frame(m, c(rep(1, 100), rep(2,50), rep(3,15),
                           rep(1, 30), rep(2, 40)),
                      1000 * runif(235))
names(m) <- c("state", "region", "income")

# computes the population stratum sizes
table(m$region, m$state)
# not run
#     nc  sc
#  1 100  30
#  2  50  40
#  3  15   0

# there are 5 cells with non-zero values
# one draws 5 samples (1 sample in each stratum)
# the sample stratum sizes are 10,5,10,4,6, respectively
# the method is 'srswor' (equal probability, without replacement)
s <- Strata(m, c("region", "state"), size=c(10, 5, 10, 4, 6), method="srswor")

# extracts the observed data
data.frame(income=m[s$id, "income"], s)
# see the result using a contingency table
table(s$region, s$state)

# The same data as in Example 1
# the method is 'systematic' (unequal probability, without replacement)
# the selection probabilities are computed using the variable 'income'
s <- Strata(m, c("region", "state"), size=c(10, 5, 10, 4, 6),
            method="systematic", pik=m$income)

# extracts the observed data
data.frame(income=m[s$id, "income"], s)
# see the result using a contingency table
table(s$region, s$state)
Capitalize the first letter of each element of the string vector.
StrCap(x, method=c("first", "word", "title"))
x |
string to be capitalized. |
method |
one out of |
Returns a vector of characters with the first letter capitalized.
Charles Dupont <[email protected]>, Andri Signorell <[email protected]> (methods word and title)
# capitalize first character
StrCap(c("Hello", "bob", "daN"))
# but not all...
StrCap(c("Hello bob, how are you?", "And you, DANIEL?"))

# wordwise
StrCap(c("Capitalize all words in titles of publications and documents",
         "but Up and UP, not all and all", NA), method="word")
# wordwise omitting the ones listed above
StrCap(c("Capitalize all words in titles of publications and documents",
         "but Up and UP, not all and all", NA), method="title")

# do not touch non alphabetic characters
z <- c("Lorem ipsum dolor", "-- sit amet", "consectetur --", " adipiscing elit ",
       "sed,.--(do) / +-*eiusmod")
StrCap(z, method="title")
Splitting a string into a number of sections of defined length is needed when we want to split a table given as a number of lines without separator into columns. The cutting points can either be defined by the lengths of the sections or directly by position.
StrChop(x, len, pos)
x |
the string to be cut in pieces. |
len |
a vector with the lengths of the pieces. |
pos |
a vector of cutting positions. Will be ignored when |
If the lengths run over the end of the string, the last part will be returned, so if the rest of the string is needed, it's possible to simply enter a big number as the last part length.
len
and pos
can't be defined simultaneously, only alternatively.
Typical usages are
StrChop(x, len)
StrChop(x, pos)
a vector with the parts of the string.
Andri Signorell <[email protected]>
x <- paste(letters, collapse="")
StrChop(x=x, len = c(3,5,2))
# and with the rest integrated
StrChop(x=x, len = c(3, 5, 2, nchar(x)))
# cutpoints at 5th and 10th position
StrChop(x=x, pos=c(5, 10))
Count the number of words that appear within a character string.
StrCountW(x)
x |
a vector of strings to be parsed. |
This is just a wrapper for a fine regexpr. It uses the expression \b\W+\b to separate the words. The code \W is equivalent to [^[:alnum:]_], whereas [:alnum:] contains [:alpha:] and [:digit:].
So everything that is not an alphanumeric character, a digit or a _ (underscore) is used as separator for the words to be counted.
an integer defining the number of word in the string
Andri Signorell <[email protected]>, based on code from Adam Bradley <[email protected]>
http://stackoverflow.com/questions/8920145/count-the-number-of-words-in-a-string-in-r
StrCountW("This is a true story!") StrCountW("Just_one_word") StrCountW("Not-just.one/word") StrCountW("And what about numbers 8899 or special characters $$$/*?") StrCountW(" Starting'n ending with some whitespace ") StrCountW(c("This is a", "text in more", "than one line."))
StrCountW("This is a true story!") StrCountW("Just_one_word") StrCountW("Not-just.one/word") StrCountW("And what about numbers 8899 or special characters $$$/*?") StrCountW(" Starting'n ending with some whitespace ") StrCountW(c("This is a", "text in more", "than one line."))
StrDist
computes distances between strings following the Levenshtein or Hamming method.
StrDist(x, y, method = "levenshtein", mismatch = 1, gap = 1, ignore.case = FALSE)
x |
character vector, first string. |
y |
character vector, second string. |
method |
character, name of the distance method. This must be
|
mismatch |
numeric, distance value for a mismatch between symbols. |
gap |
numeric, distance value for inserting a gap. |
ignore.case |
if |
The function computes the Hamming and the Levenshtein (edit) distance of two given strings (sequences). The Hamming distance between two vectors is the number of mismatches between corresponding entries.
In case of the Hamming distance the two strings must have the same length.
In case of the Levenshtein (edit) distance a scoring and a trace-back matrix are computed
and are saved as attributes "ScoringMatrix"
and "TraceBackMatrix"
.
The numbers in the trace-back matrix reflect insertion of a gap in string y (1), match/mismatch (2), and insertion of a gap in string x (3).
The edit distance is useful, but normalizing the distance to fall within the interval [0, 1] is preferred because it is somewhat difficult to judge whether an LD of, for example, 4 suggests a high or low degree of similarity.
The method "normlevenshtein"
for normalizing the LD is sensitive to this scenario.
In this implementation, the Levenshtein distance is transformed to fall in this interval as follows:
lnd = 1 - ld / max(length(x), length(y)),
where ld is the edit distance and max(length(x), length(y)) denotes that we divide by the length of the larger of the two character strings. This normalization, referred to as the Levenshtein normalized distance (lnd), yields a statistic where 1 indicates perfect agreement between the two strings, and a 0 denotes imperfect agreement. The closer a value is to 1, the more certain we can be that the character strings are the same; the closer to 0, the less certain.
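A hand-rolled version of this normalization, for illustration (assuming the plain edit distance returned by the default method and using nchar() as the string length):
x <- "GACGGATTATG"
y <- "GATCGGAATAG"
ld <- as.numeric(StrDist(x, y))      # plain Levenshtein distance
1 - ld / max(nchar(x), nchar(y))     # normalized: 1 = identical, towards 0 = dissimilar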
StrDist
returns an object of class "dist"
; cf. dist
.
For distances between strings and for string alignments see also Bioconductor package Biostrings
Matthias Kohl <[email protected]>
R. Merkl and S. Waack (2009) Bioinformatik Interaktiv. Wiley.
Harold C. Doran (2010) MiscPsycho. An R Package for Miscellaneous Psychometric Analyses
x <- "GACGGATTATG" y <- "GATCGGAATAG" ## Levenshtein distance d <- StrDist(x, y) d attr(d, "ScoringMatrix") attr(d, "TraceBackMatrix") ## Hamming distance StrDist(x, y, method="hamming")
x <- "GACGGATTATG" y <- "GATCGGAATAG" ## Levenshtein distance d <- StrDist(x, y) d attr(d, "ScoringMatrix") attr(d, "TraceBackMatrix") ## Hamming distance StrDist(x, y, method="hamming")
Extract a part of a string, defined as regular expression. StrExtractBetween()
is a convenience function used to extract parts between a left and right delimiter.
StrExtract(x, pattern, ...)
StrExtractBetween(x, left, right, greedy = FALSE)
x |
a character vector where matches are sought, or an object which can be coerced by |
pattern |
character string containing a regular expression (or character string for |
left |
left character(s) limiting the string to be extracted |
right |
right character(s) limiting the string to be extracted |
greedy |
logical, determines whether the first found match for |
... |
the dots are passed to the internally used function |
The function wraps regexpr
and regmatches
.
A character vector.
Andri Signorell <[email protected]>
txt <- c("G1:E001", "No points here", "G2:E002", "G3:E003", NA) # extract everything after the : StrExtract(x=txt, pattern=":.*") # extract everything between "left" and "right" z <- c("yBS (23A) 890", "l 89Z) 890.?/", "WS (55X) 8(90)", "123 abc", "none", NA) # everything enclosed by spaces StrExtractBetween(z, " ", " ") # note to escape special characters StrExtractBetween(z, "\\(", "\\)")
txt <- c("G1:E001", "No points here", "G2:E002", "G3:E003", NA) # extract everything after the : StrExtract(x=txt, pattern=":.*") # extract everything between "left" and "right" z <- c("yBS (23A) 890", "l 89Z) 890.?/", "WS (55X) 8(90)", "123 abc", "none", NA) # everything enclosed by spaces StrExtractBetween(z, " ", " ") # note to escape special characters StrExtractBetween(z, "\\(", "\\)")
For convenience we sometimes want to set or strip some or all attributes in a one-liner.
SetAttr(x, attr, attr_val)
StripAttr(x, attr_names = NULL)
x |
the object whose attributes should be removed or to which an attribute should be added. |
attr |
name of a new attribute |
attr_val |
value for the new attribute |
attr_names |
a vector with attribute names, which will be removed. Leaving the default to |
the object x
without the attributes contained in attr_names
Andri Signorell <[email protected]>
x <- runif(10)
x <- SetAttr(x, attr=c("some_attr", "other_attr"),
             attr_val=c("First attribute", "Second attribute"))

# strip only single
StripAttr(x, "other_attr")
# strip all attributes
StripAttr(x)
Check whether a string does only contain numeric data.
StrIsNumeric(x)
x |
a character vector |
a logical vector with the same dimension as x
Andri Signorell <[email protected]>
Other string functions, e.g. StrTrunc
x <- c("123", "-3.141", "foobar123") StrIsNumeric(x)
x <- c("123", "-3.141", "foobar123") StrIsNumeric(x)
Returns the left part or the right part of a string. The number of characters is defined by the argument n.
If n
is negative, this number of characters will be cut off from the other side.
StrLeft(x, n)
StrRight(x, n)
x |
a vector of strings. |
n |
a positive or a negative integer, the number of characters to cut. If n is negative, this number of characters will be cut off
from the right with |
The functions StrLeft
and StrRight
are simple wrappers to substr
.
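For a single string and a positive n, the wrappers behave essentially like the following substr() calls (a sketch for illustration; the actual functions additionally handle negative n and recycle n over x):
x <- "Hello world!"
substr(x, 1, 5)                          # ~ StrLeft(x, n=5)  -> "Hello"
substr(x, nchar(x) - 6 + 1, nchar(x))    # ~ StrRight(x, n=6) -> "world!"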
the left (right) n characters of x
Andri Signorell <[email protected]>
StrLeft("Hello world!", n=5) StrLeft("Hello world!", n=-5) StrRight("Hello world!", n=6) StrRight("Hello world!", n=-6) StrLeft(c("Lorem", "ipsum", "dolor","sit","amet"), n=2) StrRight(c("Lorem", "ipsum", "dolor","sit","amet"), n=c(2,3))
StrLeft("Hello world!", n=5) StrLeft("Hello world!", n=-5) StrRight("Hello world!", n=6) StrRight("Hello world!", n=-6) StrLeft(c("Lorem", "ipsum", "dolor","sit","amet"), n=2) StrRight(c("Lorem", "ipsum", "dolor","sit","amet"), n=c(2,3))
StrPad
will fill a string x with defined characters to fit a given length.
StrPad(x, width = NULL, pad = " ", adj = "left")
x |
a vector of strings to be padded. |
width |
resulting width of padded string. If x is a vector and width is left to NULL, it will be set to the length of the largest string in x. |
pad |
string to pad with. Will be repeated as often as necessary. Default is " ". |
adj |
adjustment of the old string, one of |
If a string x has more characters than width, it will be chopped to the length of width.
the string
Christian W. Hoffmann <[email protected]>
some extensions Andri Signorell <[email protected]>
StrPad("My string", 25, "XoX", "center") # [1] "XoXXoXXoMy stringXXoXXoXX"
StrPad("My string", 25, "XoX", "center") # [1] "XoXXoXXoMy stringXXoXXoXX"
Returns the numeric position of the first occurrence of a substring within a string. If the search string is not found, the result will be NA
.
StrPos(x, pattern, pos = 1, ...)
x |
a character vector in which to search for the pattern, or an object which can be coerced by as.character to a character vector. |
pattern |
character string (search string) containing the pattern to be matched in the given character vector. This can be a character string or a regular expression. |
pos |
integer, defining the start position for the search within x. The result will then be relative to the beginning of the truncated string. Will be recycled. |
... |
the dots are passed to the function |
This is just a wrapper for the function regexpr
.
a vector of the first position of pattern in x
Andri Signorell <[email protected]>
StrPos(x = levels(d.pizza$driver), pattern = "t")
Returns a string in reverse order.
StrRev(x)
x |
a string to be processed. |
string
Andri Signorell <[email protected]> solely copying R core code from strsplit example
String functions: nchar, match, grep, regexpr, substr, sub, gsub, StrTrunc, StrDist
StrRev("home") StrRev("Anna")
StrRev("home") StrRev("Anna")
The function splits a string into single characters and returns their representation in either the NATO phonetic alphabet or the Morse alphabet. The 26 code words in the NATO phonetic alphabet are assigned to the 26 letters of the English alphabet in alphabetical order as follows: Alfa, Bravo, Charlie, Delta, Echo, Foxtrot, Golf, Hotel, India, Juliett, Kilo, Lima, Mike, November, Oscar, Papa, Quebec, Romeo, Sierra, Tango, Uniform, Victor, Whiskey, X-ray, Yankee, Zulu. Digits 0-9 are also supported.
StrSpell(x, upr = "CAP", type = c("NATO", "Morse"))
x |
character, the string to be encoded. |
upr |
character, a shortcut to be used to characterise capital letters. Ignored if |
type |
the type of phonetic alphabet, either |
a character vector containing the code words
Andri Signorell <[email protected]>
https://en.wikipedia.org/wiki/NATO_phonetic_alphabet
# ... ever had to communicate a password by phone? ;-)
StrSpell("Yailov9teb6i")

paste(StrSpell("Andri", type="Morse"), collapse="|")
Split the elements of a character vector x into substrings according to the matches to substring split within them.
This is a verbatim copy of the base R function strsplit, but with a split default of "" and returning a vector instead of a list when x has length 1.
StrSplit(x, split = "", fixed = FALSE, perl = FALSE, useBytes = FALSE)
x |
character vector, each element of which is to be split. Other inputs, including a factor, will give an error. |
split |
character vector (or object which can be coerced to such)
containing regular expression(s) (unless |
fixed |
logical. If |
perl |
logical. Should Perl-compatible regexps be used? |
useBytes |
logical. If |
See strsplit
for the details.
A list of the same length as x
, the i
-th element of which
contains the vector of splits of x[i]
.
If x has length 1, a vector with the splits will be returned.
paste for the reverse, grep and sub for string search and manipulation; also nchar, substr.
‘regular expression’ for the details of the pattern specification.
noquote(StrSplit("A text I want to display with spaces"))
# the same as ...
noquote(strsplit("A text I want to display with spaces", NULL)[[1]])
The function removes whitespace characters as spaces, tabs and newlines from the beginning and end of the supplied string.
Whitespace characters occurring in the middle of the string are retained.
Trimming with method "left"
deletes only leading whitespaces, "right"
only trailing.
Designed for users who were socialized by SQL.
StrTrim(x, pattern = " \t\n", method = "both")
x |
the string to be trimmed. |
pattern |
the pattern of the whitespaces to be deleted, defaults to space, tab and newline: |
method |
one out of |
The functions are defined depending on method as
both: gsub(pattern=gettextf("^[%s]+|[%s]+$", pattern, pattern), replacement="", x=x)
left: gsub(pattern=gettextf("^[%s]+", pattern), replacement="", x=x)
right: gsub(pattern=gettextf("[%s]+$", pattern), replacement="", x=x)
the string x without whitespaces
Andri Signorell <[email protected]>
String functions: trimws, nchar, match, grep, regexpr, substr, sub, gsub, StrTrunc, StrDist
StrTrim(" Hello world! ") StrTrim(" Hello world! ", method="left") StrTrim(" Hello world! ", method="right") # user defined pattern StrTrim(" ..Hello ... world! ", pattern=" \\.")
StrTrim(" Hello world! ") StrTrim(" Hello world! ", method="left") StrTrim(" Hello world! ", method="right") # user defined pattern StrTrim(" ..Hello ... world! ", pattern=" \\.")
Truncates one or more strings to a specified length, adding an ellipsis (...)
to those strings that have been truncated. The truncation can also be performed using word boundaries.
Use StrAlign()
to justify the strings if needed.
StrTrunc(x, maxlen = 20, ellipsis = "...", wbound = FALSE)
x |
a vector of strings. |
maxlen |
the maximum length of the returned strings (NOT counting the appended ellipsis). |
ellipsis |
the string to be appended, if the string is longer than the given maximal length. The default is |
wbound |
logical. Determines if the maximal length should be reduced to the next smaller word boundary and so words are not chopped. Default is |
The string(s) passed as ‘x’ now with a maximum length of ‘maxlen’ + 3 (for the ellipsis).
Andri Signorell,
once following an idea of Jim Lemon in truncString()
String functions: nchar, match, grep, regexpr, substr, sub, gsub, StrTrim, StrDist
x <- c("this is short", "and this is a longer text", "whereas this is a much longer story, which could not be told shorter") # simple truncation on 10 characters StrTrunc(x, maxlen=10) # NAs remain NA StrTrunc(c(x, NA_character_), maxlen=15, wbound=TRUE) # using word boundaries for(i in -5:20) print(StrTrunc(x, maxlen=i, wbound=TRUE)) # compare for(i in -5:20) print(StrTrunc(x, maxlen=i, wbound=FALSE))
x <- c("this is short", "and this is a longer text", "whereas this is a much longer story, which could not be told shorter") # simple truncation on 10 characters StrTrunc(x, maxlen=10) # NAs remain NA StrTrunc(c(x, NA_character_), maxlen=15, wbound=TRUE) # using word boundaries for(i in -5:20) print(StrTrunc(x, maxlen=i, wbound=TRUE)) # compare for(i in -5:20) print(StrTrunc(x, maxlen=i, wbound=FALSE))
Extract all numeric values from a string using a regular expression and return a list of all found values. If there are several, the values can be either pasted and/or casted from characters to numeric values.
StrVal(x, paste = FALSE, as.numeric = FALSE, dec = getOption("OutDec"))
x |
a character vector |
paste |
should separately extracted numbers be pasted together? This can be useful to reverse a prior format action. Default is |
as.numeric |
logical value, determining if the extracted values should be converted to a number or be returned as characters. Default is |
dec |
character string containing a single character. The preferred character to be used as the decimal point. Defaults |
If there are multiple numbers in the same string to paste and cast to numeric, pasting will be done first and after pasting the conversion will be performed. So if for example the numbers in x = "34 way 066"
should be extracted StrVal(x, paste = TRUE, as.numeric = TRUE)
will lead to 34066
. This is a useful choice for converting formatted numbers having some kind of bigmark.
Depending on the results, the function will return either a character vector, in the case every element of x contains only one number, or a list of character vectors containing the found numbers.
Andri Signorell <[email protected]>, Markus Naepflin <[email protected]> provided an optimized regex
other string functions in DescTools-package
, section String functions
# a simple vector with only one number per element
StrVal(x=c("week 1", "week 3", "week 4", "week 5"))

# several numbers per element, extract each part, do not paste and return characters
StrVal(x=c("This is 1. place: 45.2", "none", "12.1 but -2.7 follow, 10.2e23 "),
       paste = FALSE, as.numeric = FALSE)

# critical are numbers combined with signs, where we sequentially extract valid numbers
StrVal(x=c("78-23-99", "1e-15-34*789+9", "- 34values"),
       paste = FALSE, as.numeric = FALSE)

# a typical use case for this function is to reverse a previously
# applied number format
x <- c(100000, 4564654632, -456463)
xf <- Format(x, big.mark="'")

StrVal(xf, paste = TRUE, as.numeric = TRUE)
StrVal(xf, paste = TRUE, as.numeric = FALSE)
StrVal(xf, paste = FALSE, as.numeric = TRUE)
StrVal(xf, paste = FALSE, as.numeric = FALSE)

# use an alternative decimal point
StrVal("8 452,12", dec=",")
This function computes the marginal homogeneity test for a matrix of assignments of objects to
k
categories or two vectors x
, y
of category scores for n
data objects by two raters. The statistic is distributed as chi-squared with k-1 degrees of freedom.
It can be viewed as an extension of the McNemar test to a k x k table.
StuartMaxwellTest(x, y = NULL)
x |
either a 2-way |
y |
a factor with the same levels as x; ignored if x is a matrix. |
The null is that the probabilities of being classified into cells [i, j] and [j, i] are the same.
If x
is a matrix, it is taken as a two-dimensional contingency table, and hence its entries should be nonnegative integers. Otherwise, both x and y must be vectors or factors of the same length and with the same levels.
Incomplete cases are removed, vectors are coerced into factors, and the contingency table is computed from these.
If there is perfect agreement for any category k, that category must be omitted in order to invert matrix S.
If for any category k
, all frequencies in row k
and column k
are 0, except possibly for the main diagonal element (e.g., for perfect agreement for category k
, in such cases also the corresponding row and column marginal frequencies would be equal), then the category is not included in the test and should be ignored, say the Stuart-Maxwell test is performed with respect to the remaining categories only. The degree of freedom df
in this case can still be considered k - 1
, where k
is the number of original categories; this treats omitted categories as if they were included but contributed 0 to the value of chi-squared - a reasonable view since such categories have equal row and column marginals. (See: https://www.john-uebersax.com/stat/mcnemar.htm#stuart)
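For a k x k table the statistic can also be computed by hand following the standard textbook construction (the helper name is hypothetical; the function itself may differ in details such as the handling of perfect-agreement categories):
stuart_maxwell_sketch <- function(tab) {
  tab <- as.matrix(tab)
  k <- nrow(tab)
  d <- (rowSums(tab) - colSums(tab))[-k]           # marginal differences, last category dropped
  S <- -(tab + t(tab))[-k, -k, drop = FALSE]       # off-diagonal terms: -(n_ij + n_ji)
  diag(S) <- (rowSums(tab) + colSums(tab) - 2 * diag(tab))[-k]
  stat <- as.numeric(t(d) %*% solve(S) %*% d)      # chi-squared with k - 1 df
  c(statistic = stat, p.value = pchisq(stat, df = k - 1, lower.tail = FALSE))
}
hyp <- as.table(matrix(c(20,3,0,10,30,5,5,15,40), nrow=3))
stuart_maxwell_sketch(hyp)    # compare with StuartMaxwellTest(hyp)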
A list with class "htest"
containing the following components:
statistic |
the value of the test statistic. |
parameter |
the degrees of freedom. |
p.value |
the p-value of the test. |
method |
a character string indicating what type of test was performed. |
data.name |
a character string giving the name of the data. |
Andri Signorell <[email protected]>, based on Code from Jim Lemon
Stuart, A (1955) A test for homogeneity of the marginal distributions in a two-way classification. Biometrika, 42, 412-416.
Maxwell, A.E. (1970) Comparing the classification of subjects by two independent judges. British Journal of Psychiatry, 116, 651-655.
Agresti, A. (2002) Categorical Data Analysis. John Wiley & Sons, pp 86 ff.
BhapkarTest
for a more powerful alternative to the Stuart-Maxwell test
mcnemar.test
, chisq.test
, MHChisqTest
,
BreslowDayTest
# Source: https://www.john-uebersax.com/stat/mcnemar.htm#stuart
hyp <- as.table(matrix(c(20,3,0,10,30,5,5,15,40), nrow=3))
StuartMaxwellTest(hyp)

# same as defined with two vectors
d.hyp <- Untable(hyp)
StuartMaxwellTest(x=d.hyp[,1], y=d.hyp[,2])

mc <- as.table(matrix(c(
         732, 1524, 1575, 1577, 1602, 837, 1554, 1437,
         1672, 1600, 841, 1363, 1385, 1484, 1524, 791), nrow=4))
StuartMaxwellTest(mc)
Calculate Stuart's tau-c statistic, a measure of association for ordinal factors in a two-way table.
The function has interfaces for a table (matrix) and for single vectors.
StuartTauC(x, y = NULL, conf.level = NA, ...)
x |
a numeric vector or a table. A matrix will be treated as table. |
y |
NULL (default) or a vector with compatible dimensions to |
conf.level |
confidence level of the interval. If set to |
... |
further arguments are passed to the function |
Stuart's tau-c makes an adjustment for table size in addition to a correction for ties. Tau-c is appropriate only when both variables lie on an ordinal scale.
It is estimated by
tau_c = 2 * m * (P - Q) / (n^2 * (m - 1))
where P equals the number of concordances and Q the number of discordances, n is the total number of observations and m = min(R, C), the smaller of the number of rows and columns. The range of tau-c is [-1, 1].
See http://support.sas.com/documentation/cdl/en/statugfreq/63124/PDF/default/statugfreq.pdf, pp. 1739 for the estimation of the asymptotic variance.
The use of Stuart's Tau-c versus Kendall's Tau-b is recommended when the two ordinal variables under consideration have different numbers of values, e.g. good, medium, bad versus high, low.
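As a cross-check of the estimator above, tau-c can be computed by hand from the concordant and discordant pairs; a sketch assuming ConDisPairs() returns the pair counts in its components C and D (using the SAS table from the example below):

tab <- as.table(rbind(c(26, 26, 23, 18, 9), c(6, 7, 9, 14, 23)))

cd <- ConDisPairs(tab)     # concordant (C) and discordant (D) pairs
n  <- sum(tab)             # total number of observations
m  <- min(dim(tab))        # the smaller of the number of rows and columns

2 * m * (cd$C - cd$D) / (n^2 * (m - 1))

# compare with
StuartTauC(tab)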
a single numeric value if no confidence intervals are requested,
and otherwise a numeric vector with 3 elements for the estimate, the lower and the upper confidence interval
Andri Signorell <[email protected]>
Agresti, A. (2002) Categorical Data Analysis. John Wiley & Sons, pp. 57–59.
Brown, M.B., Benedetti, J.K.(1977) Sampling Behavior of Tests for Correlation in Two-Way Contingency Tables, Journal of the American Statistical Association, 72, 309-315.
Goodman, L. A., & Kruskal, W. H. (1954) Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732-764.
Goodman, L. A., & Kruskal, W. H. (1963) Measures of association for cross classifications III: Approximate sampling theory. Journal of the American Statistical Association, 58, 310-364.
ConDisPairs
yields concordant and discordant pairs
Other association measures: GoodmanKruskalGamma, KendallTauA, cor(method="kendall") for Kendall's tau-b, SomersDelta, Lambda, GoodmanKruskalTau, UncertCoef, MutInf
# example in:
# http://support.sas.com/documentation/cdl/en/statugfreq/63124/PDF/default/statugfreq.pdf
# pp. S. 1821

tab <- as.table(rbind(c(26,26,23,18,9), c(6,7,9,14,23)))
StuartTauC(tab, conf.level=0.95)
SysInfo is a convenience function to compile some information about the computing system and environment used.
SysInfo() FindRProfile()
The function SysInfo
is mainly used to save the system environment information
in ncdf files containing the results of some calculations. FindRProfile
returns path candidates where the profile could be found.
character string with all version and system information of the current R system
Jannis v. Buttlar <[email protected]>, Andri Signorell <[email protected]>
Text of a certain color when viewed against certain backgrounds can be hard to see.
TextContrastColor
returns either black or white depending on which has the better contrast.
TextContrastColor(col, white = "white", black = "black", method = c("glynn", "sonego"))
col |
vector of any of the three kind of R colors, i.e., either a color name (an element of |
white |
the color for the dark backgrounds, default is |
black |
the color for the bright backgrounds, default is |
method |
defines the algorithm to be used. Can be one out of |
A simple heuristic in defining a text color for a given background color, is to pick the one that
is "farthest" away from "black" or "white".
The way Glynn chooses to do this is to compute the color intensity,
defined as the mean of the RGB triple, and pick "black" (intensity 0) for text color if the background
intensity is greater than 127, or "white" (intensity 255) when the background intensity is less than or equal to 127.
Sonego calculates L <- c(0.2, 0.6, 0) %*% col2rgb(color)/255
and returns "black" if L >= 0.2 and "white" otherwise.
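The Glynn rule can be checked by hand with col2rgb(); a minimal sketch (the two background colours are arbitrary choices):

# mean RGB intensity of some background colours (0 = black, 255 = white)
bg <- c("yellow", "navy")
colMeans(col2rgb(bg))

# Glynn's rule: black text on bright backgrounds, white text on dark ones
ifelse(colMeans(col2rgb(bg)) > 127, "black", "white")

# should match the default method
TextContrastColor(bg)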
a vector containing the contrast color (either black or white)
Andri Signorell <[email protected]> based on code of Earl F. Glynn, Stowers Institute for Medical Research, 2004
# works fine for grays PlotArea( y=matrix(rep(1, times=3, each=8), ncol=8), x=1:3, col=gray(1:8 / 8), ylab="", xlab="", axes=FALSE ) text( x=2, y=1:8-0.5, levels(d.pizza$driver), col=TextContrastColor(gray(1:8 / 8))) # and not so fine, but still ok, for colors par(mfrow=c(1,2)) PlotArea( y=matrix(rep(1, times=3, each=12), ncol=12), x=1:3, col=rainbow(12), ylab="", xlab="", axes=FALSE, main="method = Glynn" ) text( x=2, y=1:12-0.5, levels(d.pizza$driver), col=TextContrastColor(rainbow(12))) PlotArea( y=matrix(rep(1, times=3, each=12), ncol=12), x=1:3, col=rainbow(12), ylab="", xlab="", axes=FALSE, main="method = Sonego" ) text( x=2, y=1:12-0.5, levels(d.pizza$driver), col=TextContrastColor(rainbow(12), method="sonego"))
Try to convert a string to a table, by first creating a data frame using read.table
. This can then be coerced to a matrix first, and subsequently to a table. The names of the dimensions can be specified.
TextToTable(x, dimnames = NULL, check.names = FALSE, ...)
x |
the string to be interpreted as table. |
dimnames |
the names of the dimensions. |
check.names |
passed on to |
... |
the dots will be passed to the function |
a table
Andri Signorell <[email protected]>
read.table
, as.table
, as.matrix
txt <- "    Democrat, Independent, Republican
  M,      762,         327,        468
  F,      484,         239,        477"

(tab <- TextToTable(txt, header=TRUE, sep=",", dimnames=c("gender", "party")))
Calculate Theil's U index of inequality.
TheilU(a, p, type = c(2, 1), na.rm = FALSE)
a |
a numeric vector with the actual observed values. |
p |
a numeric vector containing the predictions. |
type |
defining the type of Theil's two U measures, see Details. Default is 2. |
na.rm |
logical, indicating whether |
Theil proposed two error measures, but at different times and under the same symbol U, which has caused some confusion.
U type = 1 is taken from Theil (1958, pp. 31-42). The argument a represents the actual observations and p the corresponding predictions. He left it open whether a and p should be used as absolute values or as observed and predicted changes.
Theil (1966, chapter 2) proposed U type = 2 as a measure of forecast quality: "...where P_i and A_i stand for a pair of predicted and observed changes. ..."
As U type = 1 has some serious disadvantages (see Bliemel 1973) it is recommended to use U type = 2.
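A hedged sketch writing out the two coefficients by hand, assuming the common textbook definitions (U1 as in Theil 1958, U2 as in Theil 1966); if the implementation follows these definitions, each pair of numbers agrees:

a <- 1:10     # actual values
p <- 2:11     # predictions
n <- length(a)

# U type = 1 (Theil 1958), bounded in [0, 1]
sqrt(sum((a - p)^2) / n) / (sqrt(sum(a^2) / n) + sqrt(sum(p^2) / n))
TheilU(a, p, type=1)

# U type = 2 (Theil 1966), RMSE relative to the root mean square of the actual values
sqrt(sum((a - p)^2) / sum(a^2))
TheilU(a, p, type=2)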
Andri Signorell <[email protected]>
Theil, H. (1958): Economic Forecasts and Policy. Amsterdam: North Holland.
Theil, H. (1966): Applied Economic Forecasting. Chicago: Rand McNally.
Bliemel, F. (1973): Theil's Forecast Accuracy Coefficient: A Clarification, Journal of Marketing Research Vol. 10, No. 4 (Nov., 1973), pp. 444-446
TheilU(1:10, 2:11, type=1)
TheilU(1:10, 2:11, type=2)
The function can be used to add a title to a plot surrounded by a rectangular box. This is useful when several plots are placed close to each other.
TitleRect(label, bg = "grey", border = 1, col = "black", xjust = 0.5, line = 2, ...)
label |
the main title |
bg |
the background color of the box. |
border |
the border color of the box |
col |
the font color of the title |
xjust |
the x-justification of the text. This can be |
line |
on which MARgin line, starting at 0 counting outwards |
... |
the dots are passed to the |
nothing is returned
Andri Signorell <[email protected]>
plot(pressure)
TitleRect("pressure")
Collect the coefficients and some qualifying statistics of linear models and organize them in a table for comparison and reporting. The function supports linear and general linear models.
TMod(..., FUN = NULL, order = NA) ModSummary(x, ...) ## S3 method for class 'lm' ModSummary(x, conf.level = 0.95, ...) ## S3 method for class 'glm' ModSummary(x, conf.level = 0.95, use.profile = TRUE, ...) ## S3 method for class 'TMod' plot(x, terms = NULL, intercept = FALSE, ...) ## S3 method for class 'TMod' print(x, digits = 3, na.form = "-", ...)
x |
a (general) linear model object. |
... |
a list of (general) linear models. |
conf.level |
the level for the confidence intervals. |
FUN |
function with arguments est, se, tval, pval, lci, uci, defining the cell contents of the coefficient table (see the examples). |
order |
row of the results table to be used as order for the models (as typically "AIC"). Can be any label in the first column of the results table. Default is |
terms |
a vector with the terms of the model formula to be plotted. By default this will be all of them. |
use.profile |
logical. Defines if profile approach should be used, which normally is a good choice for small datasets. Calculating profile can however take ages for large datasets and not be necessary there. So we can fallback to normal confidence intervals. |
intercept |
logical, defining whether the intercept should be plotted (default is |
digits |
integer, the desired (fixed) number of digits after the decimal point. Unlike |
na.form |
character, string specifying how |
In order to compare the coefficients of linear models, the user is left to his own devices. R offers no support in this respect. TMod()
jumps into the breach and displays the coefficients of several models in tabular form. For this purpose, different quality indicators for the models are displayed, so that a comprehensive comparison of the models is possible. In particular, it is easy to see the effect that adding or omitting variables has on forecast quality.
A plot function for a TMod
object will produce a dotchart with the coefficients and their confidence intervals.
character table
Andri Signorell <[email protected]>
r.full <- lm(Fertility ~ . , swiss) r.nox <- lm(Fertility ~ . -Examination - Catholic, swiss) r.grp <- lm(Fertility ~ . -Education - Catholic + CutQ(Catholic), swiss) r.gam <- glm(Fertility ~ . , swiss, family=Gamma(link="identity")) r.gama <- glm(Fertility ~ .- Agriculture , swiss, family=Gamma(link="identity")) r.gaml <- glm(Fertility ~ . , swiss, family=Gamma(link="log")) TMod(r.full, r.nox, r.grp, r.gam, r.gama, r.gaml) # display confidence intervals TMod(r.full, r.nox, r.gam, FUN = function(est, se, tval, pval, lci, uci){ gettextf("%s [%s, %s]", Format(est, fmt=Fmt("num")), Format(lci, digits=3), Format(uci, digits=2) ) }) # cbind interface is not supported!! # d.titanic <- reshape(as.data.frame(Titanic), # idvar = c("Class","Sex","Age"), # timevar="Survived", # direction = "wide") # # r.glm0 <- glm(cbind(Freq.Yes, Freq.No) ~ 1, data=d.titanic, family="binomial") # r.glm1 <- glm(cbind(Freq.Yes, Freq.No) ~ Class, data=d.titanic, family="binomial") # r.glm2 <- glm(cbind(Freq.Yes, Freq.No) ~ ., data=d.titanic, family="binomial") d.titanic <- Untable(Titanic) r.glm0 <- glm(Survived ~ 1, data=d.titanic, family="binomial") r.glm1 <- glm(Survived ~ Class, data=d.titanic, family="binomial") r.glm2 <- glm(Survived ~ ., data=d.titanic, family="binomial") TMod(r.glm0, r.glm1, r.glm2) # plot OddsRatios d.pima <- MASS::Pima.tr2 r.a <- glm(type ~ npreg + bp + skin + bmi + ped + age, data=d.pima, family=binomial) r.b <- glm(type ~ npreg + glu + bp + skin, data=d.pima, family=binomial) r.c <- glm(type ~ npreg + age, data=d.pima, family=binomial) or.a <- OddsRatio(r.a) or.b <- OddsRatio(r.b) or.c <- OddsRatio(r.c) # create the model table tm <- TMod(m_A=or.a, m_B=or.b, m_C=or.c) # .. and plotit plot(tm, main="ORs for Models A, B, C", intercept=FALSE, pch=15, col=c(DescTools::hred, DescTools::hblue, DescTools::horange), panel.first=abline(v=1, col="grey30"))
Simple reshaping a vector from long to wide or from wide to long shape by means of a single factor.
ToLong(x, varnames = NULL) ToWide(x, g, by = NULL, varnames = NULL)
x |
the vector to be reshaped |
g |
the grouping vector to be used for the new columns. The resulting |
by |
a vector to be used to merge the pieces of |
varnames |
the variable names if not the grouping levels should be used. |
ToLong
expects x as a matrix or a data.frame and reshapes it to a (long) factor representation.
ToWide
expects the vectors x, g and by, where x is the variable, g the splitting factor and by a vector used for rowwise merging.
the reshaped object
Andri Signorell <[email protected]>
d.x <- read.table(header=TRUE, text="
  AA   BB   CC   DD   EE   FF   GG
 7.9 18.1 13.3  6.2  9.3  8.3 10.6
 9.8 14.0 13.6  7.9  2.9  9.1 13.0
 6.4 17.4 16.0 10.9  8.6 11.7 17.5
")

ToLong(d.x)

# ToWide by row numbers (by = NULL)
ToWide(PlantGrowth$weight, PlantGrowth$group)

# ToWide aligned by key
set.seed(41)
PlantGrowth$nr <- c(sample(12, 10), sample(12, 10), sample(12, 10))
head(PlantGrowth)

ToWide(PlantGrowth$weight, PlantGrowth$group, by=PlantGrowth$nr)
Create a table summarizing continuous, categorical and dichotomous variables, optionally stratified by one or more variables, while performing adequate statistical tests.
TOne( x, grp = NA, add.length = TRUE, colnames = NULL, vnames = NULL, total = TRUE, align = "\\l", FUN = NULL, TEST = NULL, intref = "high", fmt = list(abs = Fmt("abs"), num = Fmt("num"), per = Fmt("per"), pval = as.fmt(fmt = "*", na.form = " ")) )
x |
a data.frame containing all the variables to be included in the table. |
grp |
the grouping variable. |
add.length |
logical. If set to |
colnames |
a vector of column names for the result table. |
vnames |
a vector of variable names to be placed in the first column instead of the real names. |
total |
logical (default |
align |
the character on whose position the strings will be aligned.
Left alignment can be requested by setting |
FUN |
the function to be used as location and dispersion measure for
numeric (including integer) variables ( |
TEST |
a list of functions to be used to test the variables. Must be
named as |
intref |
one out of |
fmt |
format codes for absolute, numeric and percentage values, and for the p-values of the tests. |
In research, the characteristics of study populations are often described by some kind of "Table 1" containing descriptives of the variables used, such as mean/standard deviation for continuous variables and proportions for categorical variables. In many cases a comparison is made between groups within the framework of the scientific question.
(Example output: a character matrix with one row per variable and one column per group ("Brent", "Camden", "Westminster"), reporting counts with percentages in brackets, a column with the test results, and a legend naming the tests used (e.g. Fisher's exact test, Chi-Square test) together with the significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1)
Creating such a table can be very time consuming and there's a need for a
flexible function that helps us to solve the task. TOne()
is designed
to be easily used with sensible defaults, and yet flexible enough to allow
free definition of the essential design elements.
This is done by breaking down the descriptive task to three types of variables: quantitative (numeric, integer), qualitative (factor, characters) and dichotomous variables (the latter having exactly two values or levels). Depending on the variable type, the descriptives and the according sensible tests are chosen. By default mean/sd are chosen to describe numeric variables.
FUN = function(x) gettextf("%s (%s)",
                           Format(mean(x, na.rm = TRUE), digits = 1),
                           Format(sd(x, na.rm = TRUE), digits = 3))
Their difference is tested with the Kruskal-Wallis test. For categorical
variables the absolute and relative frequencies are calculated and tested
with a chi-square test.
The tests can be changed with the argument
TEST
. These must be organised as list containing elements named
"num"
, "cat"
and "dich"
. Each of them must be a
function with arguments (x, g)
, returning something similar to a
p-value.
TEST = list( num = list(fun = function(x, g){summary(aov(x ~ g))[[1]][1, "Pr(>F)"]}, lbl = "ANOVA"), cat = list(fun = function(x, g){chisq.test(table(x, g))$p.val}, lbl = "Chi-Square test"), dich = list(fun = function(x, g){fisher.test(table(x, g))$p.val}, lbl = "Fisher exact test"))
The legend text of the test, which is appended to
the table together with the significance codes, can be set with the variable
lbl
.
Great importance was attached to the free definition of the number formats.
By default, the optionally definable format templates of DescTools
are used. Deviations from this can be freely passed as arguments to the
function. Formats can be defined for integers, floating point numbers,
percentages and for the p-values of statistical tests. All options of the
function Format()
are available and can be provided as a list.
See examples which show several different implementations.
fmt = list(abs = Fmt("abs"), num = Fmt("num"), per = Fmt("per"), pval = as.fmt(fmt = "*", na.form = " "))
The function returns a character matrix as result, which can easily be
subset or combined with other matrices. An interface for
ToWrd()
is available such that the matrix can be transferred
to MS-Word. Both font and alignment are freely selectable in the Word table.
a character matrix
Andri Signorell [email protected]
WrdTable()
, ToWrd.TOne()
options(scipen = 8) opt <- DescToolsOptions() # define some special formats for count data, percentages and numeric results # (those will be supported by TOne) Fmt(abs = as.fmt(digits = 0, big.mark = "'")) # counts Fmt(per = as.fmt(digits = 1, fmt = "%")) # percentages Fmt(num = as.fmt(digits = 1, big.mark = "'")) # numeric TOne(x = d.pizza[, c("temperature", "delivery_min", "driver", "wine_ordered")], grp = d.pizza$quality) # the same but no groups now... TOne(x = d.pizza[, c("temperature", "delivery_min", "driver", "wine_ordered")]) # define median/IQR as describing functions for the numeric variables TOne(iris[, -5], iris[, 5], FUN = function(x) { gettextf("%s / %s", Format(median(x, na.rm = TRUE), digits = 1), Format(IQR(x, na.rm = TRUE), digits = 3)) } ) # replace kruskal.test by ANOVA and report the p.value # Change tests for all the types TOne(x = iris[, -5], grp = iris[, 5], FUN = function(x) gettextf("%s / %s", Format(mean(x, na.rm = TRUE), digits = 1), Format(sd(x, na.rm = TRUE), digits = 3)), TEST = list( num = list(fun = function(x, g){summary(aov(x ~ g))[[1]][1, "Pr(>F)"]}, lbl = "ANOVA"), cat = list(fun = function(x, g){chisq.test(table(x, g))$p.val}, lbl = "Chi-Square test"), dich = list(fun = function(x, g){fisher.test(table(x, g))$p.val}, lbl = "Fisher exact test")), fmt = list(abs = Fmt("abs"), num = Fmt("num"), per = Fmt("per"), pval = as.fmt(fmt = "*", na.form = " ")) ) t1 <- TOne(x = d.pizza[,c("temperature", "driver", "rabate")], grp = d.pizza$area, align = " ", total = FALSE, FUN = function(x) gettextf("%s / %s (%s)", Format(mean(x, na.rm = TRUE), digits = 1), Format(sd(x, na.rm = TRUE), digits = 3), Format(median(x, na.rm = TRUE), digits = 1)), TEST = NA, fmt = list(abs = as.fmt(big.mark = " ", digits=0), num = as.fmt(big.mark = " ", digits=1), per = as.fmt(fmt=function(x) StrPad(Format(x, fmt="%", d=1), width=5, adj = "r")), pval = as.fmt(fmt = "*", na.form = " ")) ) # add a userdefined legend attr(t1, "legend") <- "numeric: mean / sd (median)), factor: n (n%)" t1 # dichotomous integer or logical values can be reported by the high or low value x <- sample(x = c(0, 1), size = 100, prob = c(0.3, 0.7), replace = TRUE) y <- sample(x = c(0, 1), size = 100, prob = c(0.3, 0.7), replace = TRUE) == 1 z <- factor(sample(x = c(0, 1), size = 100, prob = c(0.3, 0.7), replace = TRUE)) g <- sample(x = letters[1:4], size = 100, replace = TRUE) d.set <- data.frame(x = x, y = y, z = z, g = g) TOne(d.set[1:3], d.set$g, intref = "low") TOne(d.set[1:3], d.set$g, intref = "high") # intref would not control factors, use relevel to change reported value TOne(data.frame(z = relevel(z, "1")), g) TOne(data.frame(z = z), g) options(opt) ## Not run: # Send the whole stuff to Word wrd <- GetNewWrd() ToWrd( TOne(x = d.pizza[, c("temperature", "delivery_min", "driver", "wine_ordered")], grp = d.pizza$quality, fmt = list(num=Fmt("num", digits=1)) ), font = list(name="Arial narrow", size=8), align = c("l","r") # this will be recycled: left-right-left-right ... ) ## End(Not run)
Send objects like tables, ftables, lm tables, TOnes or just simple texts to a MS-Word document.
ToWrd(x, font = NULL, ..., wrd = DescToolsOptions("lastWord")) ## S3 method for class 'Freq' ToWrd(x, font = NULL, main = NULL, ..., wrd = DescToolsOptions("lastWord")) ## S3 method for class 'table' ToWrd(x, font = NULL, main = NULL, align = NULL, tablestyle = NULL, autofit = TRUE, row.names = TRUE, col.names = TRUE, ..., wrd = DescToolsOptions("lastWord")) ## S3 method for class 'data.frame' ToWrd(x, font = NULL, main = NULL, row.names = NULL, ..., wrd = DescToolsOptions("lastWord")) ## S3 method for class 'ftable' ToWrd(x, font = NULL, main = NULL, align = NULL, method = "compact", ..., wrd = DescToolsOptions("lastWord")) ## S3 method for class 'TOne' ToWrd(x, font = NULL, para = NULL, main = NULL, align = NULL, autofit = TRUE, ..., wrd = DescToolsOptions("lastWord")) ## S3 method for class 'TMod' ToWrd(x, font = NULL, para = NULL, main = NULL, align = NULL, split = " ", fixed=TRUE, autofit = TRUE, digits = 3, na.form = "-", ..., wrd = DescToolsOptions("lastWord")) ## S3 method for class 'lm' ToWrd(x, font = NULL, ..., wrd = DescToolsOptions("lastWord")) ## S3 method for class 'character' ToWrd(x, font = NULL, para = NULL, style = NULL, bullet = FALSE, ..., wrd = DescToolsOptions("lastWord")) ## Default S3 method: ToWrd(x, font = NULL, ..., wrd = DescToolsOptions("lastWord"))
x |
the object to be transferred to Word. |
font |
the font to be used to the output. This should be defined as a list containing fontname, fontsize, bold and italic flags: |
para |
list containing paragraph format properties to be applied to the inserted text. For right align the paragraph one can set: |
main |
a caption for a table. This will be inserted by |
align |
character vector giving the alignment of the table columns. |
method |
string specifying how the
|
autofit |
logical, defining if the columns of table should be fitted to the length of their content. |
row.names |
logical, defining whether the row.names should be included in the output. Default is |
col.names |
logical, defining whether the col.names should be included in the output. Default is |
tablestyle |
either the name of a defined Word tablestyle or its index. |
style |
character, name of a style to be applied to the inserted text. |
... |
further arguments to be passed to or from methods. |
bullet |
logical, defines if the text should be formatted as bullet points. |
split |
character vector (or object which can be coerced to such) containing regular expression(s) (unless |
fixed |
logical. If TRUE match split exactly, otherwise use regular expressions. Has priority over perl. |
digits |
integer, the desired (fixed) number of digits after the decimal point. Unlike |
na.form |
character, string specifying how |
wrd |
the pointer to a word instance. Can be a new one, created by |
The paragraph format can be defined by means of these properties:
LeftIndent, RightIndent, SpaceBefore, SpaceBeforeAuto, SpaceAfter, SpaceAfterAuto, LineSpacingRule, Alignment, WidowControl, KeepWithNext, KeepTogether, PageBreakBefore, NoLineNumber, Hyphenation, FirstLineIndent, OutlineLevel, CharacterUnitLeftIndent, CharacterUnitRightIndent, CharacterUnitFirstLineIndent, LineUnitBefore, LineUnitAfter, MirrorIndents.
if x
is a table a pointer to the table will be returned
Andri Signorell <[email protected]>
## Not run: # we can't get this through the CRAN test - run it with copy/paste to console wrd <- GetNewWrd() ToWrd("This is centered Text in Arial Black\n", para=list(Alignment=wdConst$wdAlignParagraphCenter, SpaceBefore=3, SpaceAfter=6), font=list(name="Arial Black", size=14), wrd=wrd) sel <- wrd$Selection()$Borders(wdConst$wdBorderBottom) sel[["LineStyle"]] <- wdConst$wdLineStyleSingle t1 <- TOne(x = d.pizza[, c("temperature","delivery_min","driver","wine_ordered")], grp=d.pizza$wine_delivered) ToWrd(t1, font=list(name="Algerian"), wrd=wrd) tab <- table(d.pizza$driver, d.pizza$area) tab <- table(d.pizza$driver, d.pizza$area) ToWrd(tab, font = list(size=15, name="Arial"), row.names = TRUE, col.names = TRUE, main= "my Title", wrd=wrd) ToWrd(tab, font = list(size=10, name="Arial narrow"), row.names = TRUE, col.names=FALSE, wrd=wrd) ToWrd(tab, font = list(size=15, name="Arial"), align="r", row.names = FALSE, col.names=TRUE, wrd=wrd) ToWrd(tab, font = list(size=15, name="Arial"), row.names = FALSE, col.names=FALSE, wrd=wrd) ToWrd(tab, tablestyle = "Mittlere Schattierung 2 - Akzent 4", row.names=TRUE, col.names=TRUE, wrd=wrd) ToWrd(Format(tab, big.mark = "'", digits=0), wrd=wrd) zz <- ToWrd(Format(tab, big.mark = "'", digits=0), wrd=wrd) zz$Rows(1)$Select() WrdFont(wrd = wrd) <- list(name="Algerian", size=14, bold=TRUE) # Send a TMod table to Word using a split to separate columns r.ols <- lm(Fertility ~ . , swiss) r.gam <- glm(Fertility ~ . , swiss, family=Gamma(link="identity")) # Build the model table for some two models, creating a user defined # reporting function (FUN) with | as column splitter tm <- TMod(OLS=r.ols, Gamma=r.gam, FUN=function(est, se, tval, pval, lci, uci){ gettextf("%s|[%s, %s]|%s", Format(est, fmt=Fmt("num"), digits=2), Format(lci, fmt=Fmt("num"), digits=2), Format(uci, fmt=Fmt("num"), digits=2), Format(pval, fmt="*") )}) # send it to Word, where we get a table with 3 columns per model # coef | confint | p-val wrd <- GetNewWrd() ToWrd(tm, split="|", align=StrSplit("lrclrcl")) ) ## End(Not run)
Send objects like tables, ftables, lm tables, TOnes or just simple texts to a MS-Word document and place a bookmark on them. This has the advantage, that objects in a Word document can be updated later, provided the bookmark name has been stored.
ToWrdB(x, font = NULL, ..., wrd = DescToolsOptions("lastWord"), bookmark = gettextf("bmt%s", round(runif(1, min = 0.1) * 1e+09)))
x |
the object to be transferred to Word. |
font |
the font to be used to the output. This should be defined as a list containing fontname, fontsize, bold and italic flags: |
... |
further arguments to be passed to or from methods. |
wrd |
the pointer to a word instance. Can be a new one, created by |
bookmark |
the name of the bookmark. |
This function encapsulates ToWrd
, by placing a bookmark over the complete inserted results.
The name of the bookmark can later be queried with bm$name().
a handle to the set bookmark
Andri Signorell <[email protected]>
## Not run: # we can't get this through the CRAN test - run it with copy/paste to console wrd <- GetNewWrd() bm <- ToWrdB("This is text to be possibly replaced later.") # get the automatically created name of the bookmark bm$name() WrdGoto(bm$name()) UpdateBookmark(...) ## End(Not run)
Evaluate given plot code to a tiff()
device and imports the created plot in the currently open MS-Word document. The imported plot is marked with a bookmark that can later be used for a potential update (provided the bookmark name has been stored).
ToWrdPlot(plotcode, width = NULL, height = NULL, scale = 100, pointsize = 12, res = 300, crop = 0, title = NULL, wrd = DescToolsOptions("lastWord"), bookmark = gettextf("bmp%s", round(runif(1, min = 0.1) * 1e+09)))
plotcode |
code chunk needed for producing the plot |
bookmark |
character, the name of the bookmark |
width |
the width in cm of the plot in the Word document (default 15) |
height |
the height in cm of the plot in the Word document (default 9.3) |
scale |
the scale of the plot (default 100) |
pointsize |
the default pointsize of plotted text, interpreted as big points (1/72 inch) at |
res |
the resolution for the graphic (default 300) |
crop |
a vector of 4 elements, the crop factor for all 4 sides of a picture in cm (default all 0) |
title |
character, the title of the plot to be inserted in the word document |
wrd |
the pointer to a word instance. Can be a new one, created by |
A long-standing problem is that, once results have been transferred to a Word document, the connection to R is lost and the results cannot be updated any more. Only recently did I realize that bookmarks in Word could be a solution for this. The present function evaluates a given plot code chunk using a tiff device and imports the created plot into a Word document. The imported plot is given a bookmark that can be used afterwards for changing or updating the plot.
This function is designed for use with the DescToolsAddIns functions ToWrdPlotWithBookmark()
and ToWrdWithBookmark()
allowing to assign keyboard shortcuts. The two functions will also insert the newly defined bookmark in the source file in a format, which can be interpreted by the function UpdateBookmark()
.
a list
plot_hwnd |
a windows handle to the inserted plot |
bookmark |
a windows handle to the bookmark |
Andri Signorell <[email protected]>
## Not run: # we can't get this through the CRAN test - run it with copy/paste to console wrd <- GetNewWrd() bm <- ToWrdB("This is text to be possibly replaced later.") # get the automatically created name of the bookmark bm$name() WrdGoto(bm$name()) UpdateBookmark(...) ## End(Not run)
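The example above mirrors the ToWrdB() help page and does not call ToWrdPlot() itself. The following hedged sketch is based solely on the Usage signature shown here; the plot code, dimensions and title are made up for illustration.

## Not run:
# requires a running MS Word instance on Windows
wrd <- GetNewWrd()

# evaluate the plot code in a tiff() device and place the result,
# marked with a bookmark, into the current Word document
res <- ToWrdPlot(plotcode = "plot(AirPassengers)",
                 title = "Monthly airline passengers 1949-1960",
                 width = 12, height = 8, wrd = wrd)

# keep the bookmark handle if the plot should be updated later
res$bookmark

## End(Not run)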
Density, distribution function, quantile function, and random generation
for the triangular distribution with parameters min
, max
,
and mode
.
dTri(x, min = 0, max = 1, mode = 1/2) pTri(q, min = 0, max = 1, mode = 1/2) qTri(p, min = 0, max = 1, mode = 1/2) rTri(n, min = 0, max = 1, mode = 1/2)
x |
vector of quantiles. Missing values ( |
q |
vector of quantiles. Missing values ( |
p |
vector of probabilities between 0 and 1. Missing values ( |
n |
sample size. If |
min |
vector of minimum values of the distribution of the random variable.
The default value is |
max |
vector of maximum values of the random variable.
The default value is |
mode |
vector of modes of the random variable.
The default value is |
Let X be a triangular random variable with parameters min=a, max=b, and mode=c.
Probability Density and Cumulative Distribution Function
The density function of X is given by:
f(x; a, b, c) = 2(x - a) / [(b - a)(c - a)]    for a <= x <= c
f(x; a, b, c) = 2(b - x) / [(b - a)(b - c)]    for c <= x <= b
where a < c < b.
The cumulative distribution function of X is given by:
F(x; a, b, c) = (x - a)^2 / [(b - a)(c - a)]        for a <= x <= c
F(x; a, b, c) = 1 - (b - x)^2 / [(b - a)(b - c)]    for c <= x <= b
where a < c < b.
Quantiles
The p'th quantile of X is given by:
x_p = a + sqrt((b - a)(c - a) p)          for 0 <= p <= (c - a)/(b - a)
x_p = b - sqrt((b - a)(b - c)(1 - p))     for (c - a)/(b - a) <= p <= 1
where 0 <= p <= 1.
Random Numbers
Random numbers are generated using the inverse transformation method:
x = F^(-1)(u)
where u is a random deviate from a uniform [0, 1] distribution.
Mean and Variance
The mean and variance of X are given by:
E(X) = (a + b + c) / 3
Var(X) = (a^2 + b^2 + c^2 - ab - ac - bc) / 18
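A quick numerical sanity check of the mean and variance formulas above (a sketch; the parameter values are the ones of the rTri() example below):

a <- 3; b <- 20; m <- 12      # min, max, mode

set.seed(1)
x <- rTri(1e5, min=a, max=b, mode=m)

# simulated moments versus the closed-form expressions
c(mean(x), (a + b + m) / 3)
c(var(x),  (a^2 + b^2 + m^2 - a*b - a*m - b*m) / 18)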
dTri
gives the density, pTri
gives the distribution function,
qTri
gives the quantile function, and rTri
generates random
deviates.
The triangular distribution is so named because of the shape of its probability density function. The average of two independent identically distributed uniform random variables with parameters min=a and max=b has a triangular distribution with parameters min=a, max=b, and mode=(a + b)/2.
The triangular distribution is sometimes used as an input distribution in probability risk assessment.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Uniform, Probability Distributions and Random Numbers.
# Density of a triangular distribution with parameters
# min=10, max=15, and mode=12, evaluated at 12, 13 and 14:
dTri(12:14, 10, 15, 12)
#[1] 0.4000000 0.2666667 0.1333333

#----------

# The cdf of a triangular distribution with parameters
# min=2, max=7, and mode=5, evaluated at 3, 4, and 5:
pTri(3:5, 2, 7, 5)
#[1] 0.06666667 0.26666667 0.60000000

#----------

# The 25'th percentile of a triangular distribution with parameters
# min=1, max=4, and mode=3:
qTri(0.25, 1, 4, 3)
#[1] 2.224745

#----------

# A random sample of 4 numbers from a triangular distribution with
# parameters min=3, max=20, and mode=12.
# (Note: the call to set.seed simply allows you to reproduce this example.)
set.seed(10)
rTri(4, 3, 20, 12)
#[1] 11.811593  9.850955 11.081885 13.539496
Clean data by means of trimming, i.e., by omitting outlying observations.
Trim(x, trim = 0.1, na.rm = FALSE)
x |
a numeric vector to be trimmed. |
trim |
the fraction (0 to 0.5) of observations to be trimmed from each end of x. Values of trim outside that range (and < 1) are taken as the nearest endpoint.
If |
na.rm |
a logical value indicating whether |
A symmetrically trimmed vector x
with a fraction of trim observations (resp. the given number) deleted from each end will be returned. If trim
is set to a value >0.5 or to an integer value > n/2 then the result will be NA
.
The trimmed vector x
. The indices of the trimmed values will be attached as attribute named "trim"
.
This function is basically an excerpt from the base function mean
, which allows the vector x
to be trimmed before calculating the mean. But what if a trimmed standard deviation is needed?
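Since Trim() simply returns the shortened vector, any statistic can be applied to its result; a short sketch answering the question above:

set.seed(1234)
x <- c(rnorm(10), 25)        # one gross outlier

sd(x)                        # inflated by the outlier
sd(Trim(x, trim=0.1))        # a "trimmed" standard deviation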
R-Core (function mean), Andri Signorell <[email protected]>
## generate data
set.seed(1234)     # for reproducibility
x <- rnorm(10)     # standard normal
x[1] <- x[1] * 10  # introduce outlier

## Trim data
x
Trim(x, trim=0.1)

## Trim fixed number, say cut the 3 extreme elements from each end
Trim(x, trim=3)

## check function
s <- sample(10:20)
s.tr <- Trim(s, trim = 2)
setequal(c(s[attr(s.tr, "trim")], s.tr), s)
Performs one and two sample t-tests based on user supplied summary information instead of data as in t.test()
.
TTestA(mx, sx, nx, my = NULL, sy = NULL, ny = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
mx |
a single number representing the sample mean of x. |
my |
an optional single number representing the sample mean of y. |
sx |
a single number representing the sample standard deviation of x. |
sy |
an optional single number representing the sample standard deviation of y. |
nx |
a single number representing the sample size of x. |
ny |
an optional single number representing the sample size of y. |
alternative |
a character string specifying the alternative
hypothesis, must be one of |
mu |
a number indicating the true value of the mean (or difference in means if you are performing a two sample test). |
paired |
paired = |
var.equal |
a logical variable indicating whether to treat the
two variances as being equal. If |
conf.level |
confidence level of the interval. |
... |
further arguments to be passed to or from methods. |
alternative = "greater"
is the alternative that x
has a
larger mean than y
.
The option paired
is not supported here, as the variance of the differences can't be calculated on the base of the variances of the two samples. However, for calculating the paired test we can simply supply the mean and standard deviation of the differences and use the one-sample test with mu = 0
.
If
var.equal
is TRUE
then the pooled estimate of the
variance is used. By default, if var.equal
is FALSE
then the variance is estimated separately for both groups and the
Welch modification to the degrees of freedom is used.
If the input data are effectively constant (compared to the larger of the two means) an error is generated.
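Under the default var.equal = FALSE the statistic is the Welch t and the degrees of freedom follow the Welch-Satterthwaite approximation; a sketch computing both directly from the summary values of the sleep data used in the example below (the result should agree with TTestA()):

mx <- 0.75; sx <- 1.789010; nx <- 10
my <- 2.33; sy <- 2.002249; ny <- 10

se  <- sqrt(sx^2/nx + sy^2/ny)
tst <- (mx - my) / se                    # Welch t statistic

df  <- (sx^2/nx + sy^2/ny)^2 /
       ((sx^2/nx)^2/(nx - 1) + (sy^2/ny)^2/(ny - 1))

c(statistic = tst, df = df, p.value = 2 * pt(-abs(tst), df))

# should agree with
TTestA(mx=mx, sx=sx, nx=nx, my=my, sy=sy, ny=ny)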
A list with class "htest"
containing the following components:
statistic |
the value of the t-statistic. |
parameter |
the degrees of freedom for the t-statistic. |
p.value |
the p-value for the test. |
conf.int |
a confidence interval for the mean appropriate to the specified alternative hypothesis. |
estimate |
the estimated mean or difference in means depending on whether it was a one-sample test or a two-sample test. |
null.value |
the specified hypothesized value of the mean or mean difference depending on whether it was a one-sample test or a two-sample test. |
alternative |
a character string describing the alternative hypothesis. |
method |
a character string indicating what type of t-test was performed. |
data.name |
a character string giving the name(s) of the data. |
## Classical example: Student's sleep data
mx <- 0.75
my <- 2.33
sx <- 1.789010
sy <- 2.002249
nx <- ny <- 10

TTestA(mx=mx, my=my, sx=sx, sy=sy, nx=nx, ny=ny)

# compare to
with(sleep, t.test(extra[group == 1], extra[group == 2]))

# use the one sample test for the differences instead of paired=TRUE option
x <- with(sleep, extra[group == 1])
y <- with(sleep, extra[group == 2])

TTestA(mx=mean(x-y), sx=sd(x-y), nx=length(x-y))

# compared to
t.test(x, y, paired = TRUE)
This calculates a robust average that is unaffected by outliers.
TukeyBiweight(x, const = 9, na.rm = FALSE, conf.level = NA, ci.type = "bca", R=1000, ...)
x |
a |
const |
a constant. |
na.rm |
logical, indicating whether |
conf.level |
confidence level of the interval. If set to |
ci.type |
The type of confidence interval required. The value should be any subset
of the values |
R |
The number of bootstrap replicates. Usually this will be a single positive integer. For importance resampling,
some resamples may use one set of weights and others use a different set of weights. In this case |
... |
the dots are passed to the function |
This is a one step computation that follows the Affy whitepaper below,
see page 22. const
determines the point at which
outliers are given a weight of 0 and therefore do not contribute to
the calculation of the mean. const = 9
sets values roughly
+/-6 standard deviations to 0. const = 6
is also used in
tree-ring chronology development. Cook and Kairiukstis (1990) have
further details.
An exact summation algorithm (Shewchuk 1997) is used. When some assumptions about the rounding of floating point numbers and conservative compiler optimizations hold, summation error is completely avoided. Whether the assumptions hold depends on the platform, i.e. compiler and CPU.
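A short sketch of the robustness property described above: one gross outlier shifts the arithmetic mean noticeably, while the biweight location barely moves because the outlier receives a weight of 0:

set.seed(7)
x <- rnorm(100)
x_out <- c(x, 50)            # add one gross outlier

c(mean(x), mean(x_out))                       # the mean is pulled towards the outlier
c(TukeyBiweight(x), TukeyBiweight(x_out))     # the biweight mean is essentially unchanged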
A numeric
mean.
Mikko Korpela <[email protected]>
Statistical Algorithms Description Document, 2002, Affymetrix.
Cook, E. R. and Kairiukstis, L. A. (1990) Methods of Dendrochronology: Applications in the Environmental Sciences. Springer. ISBN-13: 978-0792305866.
Mosteller, F. and Tukey, J. W. (1977) Data Analysis and Regression: a second course in statistics. Addison-Wesley. ISBN-13: 978-0201048544.
Shewchuk, J. R. (1997) Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates. Discrete and Computational Geometry, 18(3):305-363. Springer.
TukeyBiweight(rnorm(100))
This function describes a numeric variable by a grouping factor with two levels. First, a descriptive text listing the frequencies and means of the two groups and the results of the significance test is generated. The results of Desc(x~g)
are reported as they are provided by the function, followed by a plot consisting of a density plot and a box plot.
This description makes sense, for example, if the age distribution of a collective is to be represented for both sexes.
TwoGroups(x, ..., plotit = TRUE) ## Default S3 method: TwoGroups(x, g, main = NULL, vname = NULL, ..., plotit = TRUE) ## S3 method for class 'formula' TwoGroups(formula, data, subset, na.action, ...) ## S3 method for class 'TwoGroups' ToWrd(x, font = NULL, ..., wrd = DescToolsOptions("lastWord"))
x |
the numeric variable to describe. |
g |
the grouping factor (preferably with two levels.) |
main |
the main title. |
vname |
the variable names used in the description text. |
plotit |
boolean. Should a plot be created? Default can be defined by |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain NAs. Defaults to |
font |
the first font will be chosen for the introducing text, when sending the output to Word, the second for the description. |
wrd |
the pointer to a running MS Word instance, as created by GetNewWrd() (for a new one) or by GetCurrWrd() for an existing one. Default is |
... |
the dots are sent to the internally used function |
list with the results calculated by the used functions
Andri Signorell <[email protected]>
x <- d.pizza$temperature
g <- factor(d.pizza$rabate)

# we can change the colors for the plot by setting the DescToolsOptions
DescToolsOptions(col=c(DescTools::horange, DescTools::hgreen))
TwoGroups(x, g, main="Temperature ~ Rebate")

# for an output to Word simply define the wrd argument
# wrd <- GetNewWrd()
# TwoGroups(x, g, font.desc=list(name="Consolas", size=8),
#           main="Temperature ~ Rebate", wrd=wrd)
The uncertainty coefficient U(C|R) measures the proportion of uncertainty (entropy) in the column variable Y that is explained by the row variable X. The function has interfaces for a table, a matrix, a data.frame and for single vectors.
UncertCoef(x, y = NULL, direction = c("symmetric", "row", "column"), conf.level = NA, p.zero.correction = 1/sum(x)^2, ...)
x |
a numeric vector, a factor, matrix or data frame. |
y |
|
direction |
direction of the calculation. Can be |
conf.level |
confidence level of the interval. If set to |
p.zero.correction |
slightly nudge zero values so that their logarithm can be calculated |
... |
further arguments are passed to the function |
The uncertainty coefficient is computed as
U(C|R) = [H(X) + H(Y) - H(XY)] / H(Y)
where H() denotes the (Shannon) entropy, and ranges from [0, 1].
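The directed coefficients can be reproduced from entropies; a sketch using the Goodman-Kruskal table from the example below and assuming that Entropy() (also in this package) returns the Shannon entropy of a vector or table of counts:

m <- as.table(cbind(c(1768, 946, 115), c(807, 1387, 438),
                    c(189, 746, 288), c(47, 53, 16)))

hx  <- Entropy(rowSums(m))   # H(X), entropy of the row variable
hy  <- Entropy(colSums(m))   # H(Y), entropy of the column variable
hxy <- Entropy(m)            # H(XY), joint entropy

# uncertainty coefficients with the row resp. the column variable as dependent
c((hx + hy - hxy) / hx, (hx + hy - hxy) / hy)

# compare with
UncertCoef(m, direction="row")
UncertCoef(m, direction="column")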
Either a single numeric value, if no confidence interval is required,
or a vector with 3 elements for the estimate, the lower and the upper confidence interval.
Andri Signorell <[email protected]> strongly based on code from Antti Arppe <[email protected]>
Theil, H. (1972), Statistical Decomposition Analysis, Amsterdam: North-Holland Publishing Company.
# example from Goodman Kruskal (1954)
m <- as.table(cbind(c(1768,946,115), c(807,1387,438), c(189,746,288), c(47,53,16)))
dimnames(m) <- list(paste("A", 1:3), paste("B", 1:4))
m

# direction default is "symmetric"
UncertCoef(m)
UncertCoef(m, conf.level=0.95)

UncertCoef(m, direction="row")
UncertCoef(m, direction="column")
The function UnirootAll
searches the interval from lower to upper
for several roots (i.e., zeros) of a function f
with respect to
its first argument.
UnirootAll(f, interval, lower = min(interval), upper = max(interval), tol = .Machine$double.eps^0.5, maxiter = 1000, n = 100, ...)
f |
the function for which the root is sought. |
interval |
a vector containing the end-points of the interval to be searched for the root. |
lower |
the lower end point of the interval to be searched. |
upper |
the upper end point of the interval to be searched. |
tol |
the desired accuracy (convergence tolerance). |
maxiter |
the maximum number of iterations. |
n |
number of subintervals in which the root is sought. |
... |
additional named or unnamed arguments to be passed to |
f
will be called as f(x, ...)
for a numeric value of x
.
Run demo(Jacobandroots)
for an example of the use of UnirootAll
for steady-state analysis.
See also second example of gradient
This example is discussed in the book by Soetaert and Herman (2009).
a vector with the roots found in the interval
This is a verbatim copy from rootSolve::uniroot.all (v. 1.7).
The function calls uniroot
, the basic R-function.
It is not guaranteed that all roots will be recovered.
This will depend on n
, the number of subintervals in which the
interval is divided.
If the function "touches" the X-axis (i.e. the root is a saddle point), then this root will generally not be retrieved. (but chances of this are pretty small).
Whereas uniroot
passes values one at a time to the function,
UnirootAll
passes a vector of values to the function.
Therefore f
should be written such that it can handle a vector of values.
See last example.
Karline Soetaert <[email protected]>
uniroot
for more information about input.
## ======================================================================= ## Mathematical examples ## ======================================================================= # a well-behaved case... fun <- function (x) cos(2*x)^3 curve(fun(x), 0, 10,main = "UnirootAll") All <- UnirootAll(fun, c(0, 10)) points(All, y = rep(0, length(All)), pch = 16, cex = 2) # a difficult case... f <- function (x) 1/cos(1+x^2) AA <- UnirootAll(f, c(-5, 5)) curve(f(x), -5, 5, n = 500, main = "UnirootAll") points(AA, rep(0, length(AA)), col = "red", pch = 16) f(AA) # !!! ## ======================================================================= ## Vectorisation: ## ======================================================================= # from R-help Digest, Vol 130, Issue 27 # https://stat.ethz.ch/pipermail/r-help/2013-December/364799.html integrand1 <- function(x) 1/x*dnorm(x) integrand2 <- function(x) 1/(2*x-50)*dnorm(x) integrand3 <- function(x, C) 1/(x+C) res <- function(C) { integrate(integrand1, lower = 1, upper = 50)$value + integrate(integrand2, lower = 50, upper = 100)$value - integrate(integrand3, C = C, lower = 1, upper = 100)$value } # uniroot passes one value at a time to the function, so res can be used as such uniroot(res, c(1, 1000)) # Need to vectorise the function to use UnirootAll: res <- Vectorize(res) UnirootAll(res, c(1,1000))
Recreates the data.frame out of a contingency table x.
Untable(x, ...) ## S3 method for class 'data.frame' Untable(x, freq = "Freq", rownames = NULL, ...) ## Default S3 method: Untable(x, dimnames = NULL, type = NULL, rownames = NULL, colnames = NULL, ...)
x |
a numeric vector, a matrix, a table or a data.frame. If x is a vector, a matrix or a table it is interpreted as frequencies which are to be inflated to the original list. |
dimnames |
the dimension names of x to be used for expanding. Can be used to expand a weight vector to its original values. If set to |
type |
defines the data type generated. This allows to directly define factors or ordered factors, but also numeric values. See examples. |
rownames |
A names vector for the rownames of the resulting data.frame. If set to |
colnames |
A names vector for the colnames of the resulting data.frame. If set to |
freq |
character, the name of the frequency variable in case x is a data.frame. |
... |
further arguments passed to or from functions (not used here). |
For x being a vector this reduces to rep(..., n)
with n as vector (which is not supported by rep()
). NA
s in the table will be treated as 0 without raising an error.
a data.frame with the detailed data (even if x was a 1-dimensional table)
Andri Signorell <[email protected]>
expand.grid
, rep
, gl
, xtabs
d.titanic <- Untable(Titanic) str(d.titanic) # ... not the same as: data.frame(Titanic) tab <- table(set1=sample(letters[1:5], size=40, replace=TRUE), set2=sample(letters[11:15], size=40, replace=TRUE)) Untable(tab) # return a numeric vector by setting type and coerce to a vector by [,] Untable(c(6,2,2), type="as.numeric")[,] # how to produce the original list based on frequencies, given as a data.frame d.freq <- data.frame(xtabs(Freq ~ Sex + Survived, data=Titanic)) # a data list with each individual d.data <- Untable( xtabs(c(1364, 126, 367, 344) ~ ., expand.grid(levels(d.freq$Sex),levels(d.freq$Survived)))) head(d.data) # expand a weights vector Untable(c(1,4,5), dimnames=list(c("Zurich","Berlin","London"))) # and the same with a numeric vector Untable(c(1,4,5), dimnames=list(c(5,10,15)), type="as.numeric")[,] # ... which again is nothing else than rep(times=c(1,4,5), x=c(5,10,15)) # the data.frame interface d.freq <- data.frame(f1=c("A","A","B","B"), f2=c("C","D","C","D"), Freq=c(1,2,3,4)) Untable(d.freq)
The inverse function to which
creates a logical
vector/matrix from indices.
Unwhich(idx, n = max(idx), useNames = TRUE)
idx |
the indices as returned by |
n |
integer, the length of the original vector. This must not be less than |
useNames |
logical, determining if the names of the indices should be preserved. |
a logical vector of length n, with TRUE
on the positions i
.
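A minimal sketch of the idea (illustrative only; the hypothetical unwhich_sketch() ignores the name handling of the real function):

unwhich_sketch <- function(idx, n = max(idx)) {
  out <- rep(FALSE, n)   # start with all FALSE
  out[idx] <- TRUE       # switch the given positions on
  out
}
unwhich_sketch(which(c(TRUE, FALSE, TRUE)), n = 3)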
Nick Sabbe
https://stackoverflow.com/questions/7659833/inverse-of-which
ll <- c(TRUE, FALSE, TRUE, NA, FALSE, FALSE, TRUE) names(ll) <- letters[seq(ll)] i <- which(ll) # back again (losing the names of the FALSEs) Unwhich(i, length(ll))
Performs a van der Waerden normal scores test.
VanWaerdenTest(x, ...) ## Default S3 method: VanWaerdenTest(x, g, ...) ## S3 method for class 'formula' VanWaerdenTest(formula, data, subset, na.action, ...)
x |
a numeric vector of data values, or a list of numeric data vectors. Non-numeric elements of a list will be coerced, with a warning. |
g |
a vector or factor object giving the group for the
corresponding elements of |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see
|
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when
the data contain |
... |
further arguments to be passed to or from methods. |
VanWaerdenTest
performs a van der Waerden test of the
null that the location parameters of the distribution of x
are the same in each group (sample). The alternative is that they
differ in at least one.
The van der Waerden rank scores are defined as the ranks of the data, r_i, divided by n + 1 and
transformed to a normal score by applying the inverse of the normal distribution function, i.e.,
Phi^(-1)(r_i / (n + 1)). The ranks of the data are obtained by ordering the observations from all groups (the same way as
kruskal.test
does it).
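A minimal sketch of these normal scores (illustrative only, not the package's internal code):

x <- c(2.9, 3.0, 2.5, 2.6, 3.2, 3.8, 2.7, 4.0, 2.4)   # pooled sample
n <- length(x)
qnorm(rank(x) / (n + 1))   # van der Waerden scores of the pooled observations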
If x
is a list, its elements are taken as the samples to be
compared, and hence have to be numeric data vectors. In this case,
g
is ignored, and one can simply use VanWaerdenTest(x)
to perform the test. If the samples are not yet contained in a
list, use VanWaerdenTest(list(x, ...))
.
Otherwise, x
must be a numeric data vector, and g
must
be a vector or factor object of the same length as x
giving
the group for the corresponding elements of x
.
A list with class "htest"
containing the following components:
statistic |
the van der Waerden statistic. |
parameter |
the degrees of freedom of the approximate chi-squared distribution of the test statistic. |
p.value |
the p-value of the test. |
method |
the character string |
data.name |
a character string giving the names of the data. |
Conover, W. J., Iman, R. L. (1979). On multiple-comparisons procedures, Tech. Rep. LA-7677-MS, Los Alamos Scientific Laboratory.
Conover, W. J. (1999). Practical Nonparametric Statistics (Third Edition). Wiley. pp. 396-406.
normal_test
in package
coin, where the test is implemented in a more general context (but has a rather impractical interface).
## Hollander & Wolfe (1973), 116. ## Mucociliary efficiency from the rate of removal of dust in normal ## subjects, subjects with obstructive airway disease, and subjects ## with asbestosis. x <- c(2.9, 3.0, 2.5, 2.6, 3.2) # normal subjects y <- c(3.8, 2.7, 4.0, 2.4) # with obstructive airway disease z <- c(2.8, 3.4, 3.7, 2.2, 2.0) # with asbestosis VanWaerdenTest(list(x, y, z)) ## Equivalently, x <- c(x, y, z) g <- factor(rep(1:3, c(5, 4, 5)), labels = c("Normal subjects", "Subjects with obstructive airway disease", "Subjects with asbestosis")) VanWaerdenTest(x, g) ## Formula interface. require(graphics) boxplot(Ozone ~ Month, data = airquality) VanWaerdenTest(Ozone ~ Month, data = airquality)
Var()
computes the variance of x
. If x
is a matrix, the variances of the columns of x
are computed. VarN
returns the uncorrected sample variance (which is a biased estimator of the variance).
Var(x, ...) ## S3 method for class 'Freq' Var(x, breaks, ...) ## Default S3 method: Var(x, weights = NULL, na.rm = FALSE, method = c("unbiased", "ML"), ...) VarN(x, na.rm = FALSE)
x |
a numeric vector, matrix or data frame. |
weights |
a numerical vector of weights the same length as |
na.rm |
logical. Should missing values be removed? |
method |
determines the estimator type; if |
breaks |
breaks for calculating the variance for classified data as composed by |
... |
further arguments passed to or from other methods. |
Var
is just another interface to Cov
.
The denominator n - 1 is used, which gives an unbiased estimator
of the (co)variance for i.i.d. observations.
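A minimal sketch of the relation between the two denominators (illustrative only):

x <- c(2, 4, 4, 7)
n <- length(x)
Var(x)                                     # denominator n - 1
VarN(x)                                    # denominator n
all.equal(VarN(x), Var(x) * (n - 1) / n)   # the two only differ by this factor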
These functions return
NA
when there is only one
observation (whereas S-PLUS has been returning NaN
), and
fail if x
has length zero.
For r <- Cor(*, use = "all.obs")
, it is now guaranteed that
all(abs(r) <= 1)
.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
cor
, cov
for covariance and correlation matrices
cor.test
for confidence intervals (and tests).
cov.wt
for weighted covariance computation.
sd
for standard deviation (vectors).
Var(1:10) # 9.166667 Var(1:5, 1:5) # 2.5 # weighted Variance set.seed(45) (z <- as.numeric(names(w <- table(x <- sample(-10:20, size=50, replace=TRUE))))) Var(z, w=w) # check! all.equal(Var(x), Var(z, w=w)) # Variance for frequency tables Var(Freq(as.table(c(6,16,24,25,17))), breaks=c(0, 10, 20, 30, 40, 50))
Calculates confidence intervals for the variance. Available approaches are the classical one using the ChiSquare distribution, a more robust version proposed by Bonett and the bootstrap options available in the package boot
.
VarCI(x, method = c("classic", "bonett", "norm", "basic", "stud", "perc", "bca"), conf.level = 0.95, sides = c("two.sided", "left", "right"), na.rm = FALSE, R = 999)
x |
a (non-empty) numeric vector of data values. |
method |
vector of character strings representing the type of intervals required. The value should be any subset of the values |
conf.level |
confidence level of the interval. |
sides |
a character string specifying the side of the confidence interval, must be one of |
na.rm |
logical. Should missing values be removed? Defaults to FALSE. |
R |
number of bootstrap replicates. Usually this will be a single positive integer. For importance resampling, some resamples may use one set of weights and others use a different set of weights. In this case R would be a vector of integers where each component gives the number of resamples from each of the rows of weights.
See |
The confidence interval for the variance is very sensitive to non-normality in the data. Bonett (2006) has proposed an interval that is nearly exact when the data is normally distributed and provides good performance for moderately non-normal data. See the references for the details.
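A minimal sketch of the classical chi-squared interval (illustrative only, two-sided with alpha = 1 - conf.level; not necessarily the package's exact code):

x <- c(15.83, 16.01, 16.24, 16.42, 15.33, 15.44, 16.88, 16.31)
n <- length(x); alpha <- 0.05
c(var    = var(x),
  lwr.ci = (n - 1) * var(x) / qchisq(1 - alpha/2, df = n - 1),
  upr.ci = (n - 1) * var(x) / qchisq(alpha/2,     df = n - 1))
# compare with VarCI(x, method="classic")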
a numeric vector with 3 elements:
var |
variance |
lwr.ci |
lower bound of the confidence interval |
upr.ci |
upper bound of the confidence interval |
Andri Signorell <[email protected]>
Bonett (2006) Approximate Confidence Interval for Standard Deviation of Nonnormal Distributions, Computational Statistics and Data Analysis, Vol. 50, pp. 775 - 782.
https://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/sdconfli.htm (might be outdated)
MeanCI
, MedianCI
, VarTest
, Var
VarCI(d.pizza$price, na.rm=TRUE) VarCI(d.pizza$price, conf.level=0.99, na.rm=TRUE) x <- c(14.816, 14.863, 14.814, 14.998, 14.965, 14.824, 14.884, 14.838, 14.916, 15.021, 14.874, 14.856, 14.860, 14.772, 14.980, 14.919) VarCI(x, conf.level=0.9) # and for the standard deviation sqrt(VarCI(x, conf.level=0.9)) # from Bonett's paper # expected results: # ------------------------------------ # conf.lvl sd lci uci # ------------------------------------ # 90.0 0.5168 0.3592 0.9359 # 95.0 0.5168 0.3263 1.0841 # 99.0 0.5168 0.2607 1.5109 p <- c(15.83, 16.01, 16.24, 16.42, 15.33, 15.44, 16.88, 16.31) sqrt(VarCI(p, method="bonett", conf.level=0.9)) sqrt(VarCI(p, method="bonett")) sqrt(VarCI(p, method="bonett", conf.level=0.99)) # some bootstrap intervals VarCI(x, method="norm") VarCI(x, method="perc") VarCI(x, method="bca")
Performs either a one sample chi-squared test to compare the variance of a vector with a given value or an F test to compare the variances of two samples from normal populations.
VarTest(x, ...) ## Default S3 method: VarTest(x, y, alternative = c("two.sided", "less", "greater"), ratio = 1, sigma.squared = 1, conf.level = 0.95, ...) ## S3 method for class 'formula' VarTest(formula, data, subset, na.action, ...)
x , y
|
numeric vectors of data values. |
alternative |
a character string specifying the alternative
hypothesis, must be one of |
ratio |
the hypothesized ratio of the population variances of
|
sigma.squared |
a number indicating the true value of the variance, if one sample test is requested. |
conf.level |
confidence level for the returned confidence interval. |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see
|
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when
the data contain |
... |
further arguments to be passed to or from methods. |
The formula interface is only applicable for the 2-sample tests.
The null hypothesis is that the ratio of the variances of the
populations from which x
and y
were drawn, or in the
data to which the linear models x
and y
were fitted, is
equal to ratio
.
A list with class "htest"
containing the following components:
statistic |
the value of the F test statistic. |
parameter |
the degrees of the freedom of the F distribution of the test statistic. |
p.value |
the p-value of the test. |
conf.int |
a confidence interval for the ratio of the population variances. |
estimate |
the ratio of the sample variances of |
null.value |
the ratio of population variances under the null. |
alternative |
a character string describing the alternative hypothesis. |
method |
the character string
|
data.name |
a character string giving the names of the data. |
Andri Signorell <[email protected]> (One sample test)
Two Sample test and help text from R-Core.
var.test
, bartlett.test
for testing homogeneity of variances in
more than two samples from normal distributions;
ansari.test
and mood.test
for two rank
based (nonparametric) two-sample tests for difference in scale.
x <- rnorm(50, mean = 0, sd = 2) # One sample test VarTest(x, sigma.squared = 2.5) # two samples y <- rnorm(30, mean = 1, sd = 1) VarTest(x, y) # Do x and y have the same variance? VarTest(lm(x ~ 1), lm(y ~ 1)) # The same.
Shift the elements of a vector in circular mode by k
elements to the right (for positive k) or to the left (for negative k), such that the first element is at the (k+1)th position of the new vector and the last k elements are appended to the beginning. VecShift
does not reattach the elements shifted off one end to the other end, but fills the resulting gaps with NA
s.
VecRot(x, k = 1) VecShift(x, k = 1)
x |
a vector of any type. |
k |
the number of elements to shift. |
The function replicates the vector twice and selects the required number of elements, starting at the position given by the shift.
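A minimal sketch of this doubling idea (illustrative only):

x <- letters[1:5]; k <- 2
rep(x, 2)[(length(x) - k + 1):(2 * length(x) - k)]   # same result as VecRot(x, 2)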
the shifted vector in the same dimensions as x.
Andri Signorell <[email protected]>
VecRot(c(1,1,0,0,3,4,8), 3) VecRot(letters[1:10], 3) VecRot(letters[1:10], -3) VecShift(letters[1:10], 3) VecShift(letters[1:10], -3)
Calculates variance-inflation and generalized variance-inflation factors for linear and generalized linear models. It's a measure describing how much the variance of an estimated coefficient is increased because of collinearity.
VIF(mod)
mod |
an object that responds to |
If all terms in an unweighted linear model have 1 df, then the usual variance-inflation factors are calculated.
The vif are defined as
vif_j = 1 / (1 - R_j^2),
where R_j^2 equals the coefficient of determination for regressing the explanatory variable
j
in question on the other terms in the model. This is one of the well-known collinearity diagnostics.
If any terms in an unweighted linear model have more than 1 df, then generalized variance-inflation factors (Fox and Monette, 1992) are calculated. These are interpretable as the inflation in size of the confidence ellipse or ellipsoid for the coefficients of the term in comparison with what would be obtained for orthogonal data.
The generalized vifs
are invariant with respect to the coding of the terms in the model (as long as
the subspace of the columns of the model matrix pertaining to each term is
invariant). To adjust for the dimension of the confidence ellipsoid, the function
also prints GVIF^(1/(2*df)), where df
is the degrees of freedom
associated with the term.
Through a further generalization, the implementation here is applicable as well to other sorts of models, in particular weighted linear models and generalized linear models.
Values of vif up to 5 are usually interpreted as uncritical; values above 5 indicate considerable multicollinearity.
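A minimal sketch of the 1/(1 - R^2) relation for a model in which all terms have 1 df (illustrative only):

# R^2 of one predictor regressed on the other predictor
r2 <- summary(lm(Agriculture ~ Education, data = swiss))$r.squared
1 / (1 - r2)
# should match the entry for Agriculture in
VIF(lm(Fertility ~ Agriculture + Education, data = swiss))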
A vector of vifs, or a matrix containing one row for each term in the model, and
columns for the GVIF, df, and GVIF^(1/(2*df)).
This is a verbatim copy from the function car::vif
.
Henric Nilsson and John Fox [email protected]
Fox, J. and Monette, G. (1992) Generalized collinearity diagnostics. JASA, 87, 178–183.
Fox, J. (2008) Applied Regression Analysis and Generalized Linear Models, Second Edition. Sage.
Fox, J. and Weisberg, S. (2011) An R Companion to Applied Regression, Second Edition, Sage.
VIF(lm(Fertility ~ Agriculture + Education, data=swiss)) VIF(lm(Fertility ~ ., data=swiss))
Implements a Vigenere cypher, both encryption and decryption. The function handles keys and text of unequal length and discards non-alphabetic characters.
Vigenere(x, key = NULL, decrypt = FALSE)
x |
the text to be encrypted |
key |
the key to be used. If this remains to |
decrypt |
boolean defining if the text should be encrypted or decrypted. |
All characters other than those in charlist = c(LETTERS, letters, 0:9) will be discarded from the text and from the key.
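A minimal sketch of this filtering step (illustrative only, not the package's internal code):

txt <- "Beware the Jabberwock, my son!"
gsub("[^A-Za-z0-9]", "", txt)   # keeps only the characters in charlist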
the encrypted or decrypted text, respectively
Andri Signorell <[email protected]>
strongly based on code found at https://rosettacode.org/wiki/Vigen%C3%A8re_cipher#R (credits to the unknown soldier)
key <- "My FavoriteKey452" (xenc <- Vigenere("Beware the Jabberwock, my son! The jaws that bite, the claws that catch!", key)) Vigenere(xenc, key, decrypt = TRUE) # note that everything besides the characters in the list will be discarded
key <- "My FavoriteKey452" (xenc <- Vigenere("Beware the Jabberwock, my son! The jaws that bite, the claws that catch!", key)) Vigenere(xenc, key, decrypt = TRUE) # note that everything besides the characters in the list will be discarded
A popular statistic to test for independence is the von Neumann ratio.
VonNeumannTest(x, alternative = c("two.sided", "less", "greater"), unbiased = TRUE)
x |
a numeric vector containing the observations |
alternative |
a character string specifying the alternative hypothesis, must be one of |
unbiased |
logical. In order for VN to be an unbiased estimate of the true population value, the calculated value is multiplied by |
The VN test statistic is, in the unbiased case,
VN = n * sum_{i=1}^{n-1} (x_i - x_{i+1})^2 / ((n - 1) * sum_{i=1}^{n} (x_i - mean(x))^2).
It is known that (VN - mu) / sigma is asymptotically standard normal, where mu = 2n / (n - 1)
and
sigma^2 = 4 * n^2 * (n - 2) / ((n + 1) * (n - 1)^3).
The VN test statistic is, in the original (biased) case,
VN = sum_{i=1}^{n-1} (x_i - x_{i+1})^2 / sum_{i=1}^{n} (x_i - mean(x))^2.
The test statistic (VN - 2) / sigma is asymptotically standard normal, where
sigma^2 = 4 * (n - 2) / ((n + 1) * (n - 1)).
Missing values are silently removed.
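A minimal sketch of the (biased) ratio given above (illustrative only, assuming complete data; not the package's internal code):

x <- rnorm(50)
VN <- sum(diff(x)^2) / sum((x - mean(x))^2)
VN
# compare with the VN statistic reported by VonNeumannTest(x, unbiased = FALSE)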
A list with class "htest" containing the components:
statistic |
the value of the VN statistic and the normalized statistic test. |
parameter , n
|
the size of the data, after removal of consecutive duplicate values. |
p.value |
the p-value of the test. |
alternative |
a character string describing the alternative hypothesis. |
method |
a character string indicating the test performed. |
data.name |
a character string giving the name of the data. |
Andri Signorell <[email protected]>
von Neumann, J. (1941) Distribution of the ratio of the mean square successive difference to the variance. Annals of Mathematical Statistics 12, 367-395.
VonNeumannTest(d.pizza$temperature)
This is a list with all VBA constants for MS Word 2010, which is useful for writing R functions based on recorded macros in Word.
This way the constants need not be replaced by their numeric values, but only have to be prefixed with the list's name:
the VBA constant wd10Percent,
for example, can be written as wdConst$wd10Percent
.
A handful of constants for Excel are consolidated in xlConst
.
data(wdConst) data(xlConst)
The format is:
List of 2755
$ wd100Words: num -4
$ wd10Percent: num -6
$ wd10Sentences: num -2
...
Microsoft
Winsorizing a vector means that a predefined proportion of the smallest and/or largest values is replaced by less extreme values, the substitutes being the most extreme values that are retained.
Winsorize(x, val = quantile(x, probs = c(0.05, 0.95), na.rm = FALSE))
x |
a numeric vector to be winsorized. |
val |
a numeric vector of length 2 defining the lower and the upper border; all values lower than the first element are replaced by it, and all values higher than the second element are replaced by it. The default is set to the 5%- and the 95%-quantile of x. |
The winsorized vector is obtained by replacing every value below val[1] with val[1] and every value above val[2] with val[2], i.e. g(x) = val[1] if x < val[1], x if val[1] <= x <= val[2], and val[2] if x > val[2].
You may also want to consider standardizing (possibly robustly) the data before you perform a winsorization.
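A minimal sketch of this replacement rule (illustrative only, with hypothetical x and val; not the package's internal code):

x <- c(1, 5, 9, 40, 80, 99)
val <- c(5, 80)
pmin(pmax(x, val[1]), val[2])   # same substitution as Winsorize(x, val = c(5, 80))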
A vector of the same length as the original data x
containing
the winsorized data.
Andri Signorell [email protected]
winsorize
from the package robustHD
contains
an option to winsorize multivariate data
library(DescTools) ## generate data set.seed(9128) x <- round(runif(100) * 100, 1) (d.frm <- DescTools::Sort(data.frame( x, default = Winsorize(x), quantile = Winsorize(x, quantile(x, probs=c(0.1, 0.8), na.rm = FALSE)), fixed_val = Winsorize(x, val=c(15, 85)), fixed_n = Winsorize(x, val=c(Small(x, k=3)[3], Large(x, k=3)[1])), closest = Winsorize(x, val=unlist(Closest(x, c(30, 70)))) )))[c(1:10, 90:100), ] # use Large and Small, if a fix number of values should be winsorized (here k=3) PlotLinesA(SetNames(d.frm, rownames=NULL), lwd=2, col=Pal("Tibco"), main="Winsorized Vector") z <- 0:10 # twosided (default): Winsorize(z, val=c(2,8)) # onesided: # ... replace all values > 8 with 8 Winsorize(z, val=c(min(z), 8)) # ... replace all values < 4 with 4 Winsorize(z, val=c(4, max(z)))
Setting and resetting options is lengthy in command mode. WithOptions()
allows evaluating an expression with temporarily set options.
WithOptions(optlist, expr)
optlist |
a list with new option settings. |
expr |
the expression to be evaluated |
the function result
Thomas Lumley <[email protected]>
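A minimal sketch of the mechanism (illustrative only; the hypothetical with_options_sketch() is not the package's actual implementation):

with_options_sketch <- function(optlist, expr) {
  old <- options(optlist)   # set the new options, remembering the previous values
  on.exit(options(old))     # restore them when leaving the function
  expr                      # the promise is evaluated here, under the new options
}
with_options_sketch(list(digits = 3), print((1:10)^-1))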
# original: print((1:10)^-1) # with new options WithOptions(list(digits=3), print((1:10)^-1))
Test for homogeneity on 2 x 2 x k tables
over strata (i.e., whether the log odds ratios are the same in all
strata).
WoolfTest(x)
x |
a 2 x 2 x k contingency table, where the last dimension refers to the strata. |
A list of class "htest"
containing the following
components:
statistic |
the chi-squared test statistic. |
parameter |
degrees of freedom of the approximate chi-squared distribution of the test statistic. |
p.value |
the p-value of the test. |
method |
a character string indicating the type of test performed. |
data.name |
a character string giving the name(s) of the data. |
observed |
the observed counts. |
expected |
the expected counts under the null hypothesis. |
This function was previously published as woolf_test()
in the vcd package and has been integrated here without logical changes.
David Meyer, Achim Zeileis, Kurt Hornik, Michael Friendly
Woolf, B. 1955: On estimating the relation between blood group and disease. Ann. Human Genet. (London) 19, 251-253.
mantelhaen.test
, BreslowDayTest
migraine <- xtabs(freq ~ ., cbind(expand.grid(treatment=c("active","placebo"), response=c("better","same"), gender=c("female","male")), freq=c(16,5,11,20,12,7,16,19)) ) WoolfTest(migraine)
Accessing bookmarks by name is only possible by browsing the bookmark names. WrdBookmark
returns a handle to a bookmark by taking its name as argument. WrdInsertBookmark
, WrdDeleteBookmark
inserts/deletes a bookmark in a Word document. WrdGotoBookmark
allows placing the cursor on the bookmark and WrdUpdateBookmark
replaces the content within the range of the bookmark in a Word document with the given text.
WrdBookmark(name, wrd = DescToolsOptions("lastWord")) WrdInsertBookmark(name, wrd = DescToolsOptions("lastWord")) WrdDeleteBookmark(name, wrd = DescToolsOptions("lastWord")) WrdGoto(name, what = wdConst$wdGoToBookmark, wrd = DescToolsOptions("lastWord")) WrdUpdateBookmark(name, text, what = wdConst$wdGoToBookmark, wrd = DescToolsOptions("lastWord"))
name |
the name of the bookmark. |
text |
the text of the bookmark. |
what |
a word constant, defining the type of object to be used to place the cursor. |
wrd |
the pointer to a word instance. Can be a new one, created by |
Bookmarks are useful to build structured documents, which can be updated later.
Andri Signorell <[email protected]>
WrdFont
, WrdPlot
, GetNewWrd
, GetCurrWrd
## Not run: # we can't get this through the CRAN test - run it with copy/paste to console wrd <- GetNewWrd() WrdText("a)\n\n\nb)", fontname=WrdGetFont()$name, fontsize=WrdGetFont()$size) WrdInsertBookmark("chap_b") WrdText("\n\n\nc)\n\n\n", fontname=WrdGetFont()$name, fontsize=WrdGetFont()$size) WrdGoto("chap_b") WrdUpdateBookmark("chap_b", "Goto chapter B and set text") WrdInsertBookmark("mybookmark") ToWrd("A longer text\n\n\n") # Now returning the bookmark bm <- WrdBookmark("mybookmark") # get the automatically created name of the bookmark bm$name() ## End(Not run)
Insert a caption in a given level to a Word document. The caption is inserted at the current cursor position.
WrdCaption(x, index = 1, wrd = DescToolsOptions("lastWord"))
x |
the text of the caption. |
index |
integer from 1 to 9, defining the number of the heading style. |
wrd |
the pointer to a word instance. Can be a new one, created by |
Andri Signorell <[email protected]>
ToWrd
, WrdPlot
, GetNewWrd
, GetCurrWrd
## Not run: # Windows-specific example wrd <- GetNewWrd() # insert a title in level 1 WrdCaption("My First Caption level 1", index=1, wrd=wrd) # works as well for several levels sapply(1:5, function(i) WrdCaption(gettextf("My First Caption level %s",i), index=i, wrd=wrd) ) ## End(Not run)
Returns a handle to a cell range of a Word table. This is useful for formatting the cell range.
WrdCellRange(wtab, from, to)
wtab |
a handle to the Word table as returned e.g. by |
from |
a vector containing row- and column number of the left/upper cell of the cell range. |
to |
a vector containing row- and column number of the right/lower cell of the cell range. |
Cell range selecting might be complicated. This function makes it easy.
a handle to the range.
Andri Signorell <[email protected]>
## Not run: # Windows-specific example wrd <- GetNewWrd() WrdTable(nrow=3, ncol=3, wrd=wrd) crng <- WrdCellRange(from=c(1,2), to=c(2,3)) crng$Select() ## End(Not run)
WrdFont
can be used to get and set the font in Word for the text to be inserted. WrdFont
returns the font at the current cursor position.
WrdFont(wrd = DescToolsOptions("lastWord")) WrdFont(wrd) <- value
value |
the font to be used to the output. This should be defined as a list containing fontname, fontsize, bold and italic flags: |
wrd |
the pointer to a word instance. Can be a new one, created by |
The font color can be defined by a Word constant beginning with wdConst$wdColor
.
The defined colors can be listed with grep("wdColor", names(wdConst), val=TRUE)
.
a list of the attributes of the font in the current cursor position:
name |
the fontname |
size |
the fontsize |
bold |
bold |
italic |
italic |
color |
the fontcolor |
Andri Signorell <[email protected]>
ToWrd
, WrdPlot
, GetNewWrd
, GetCurrWrd
## Not run: # Windows-specific example wrd <- GetNewWrd() for(i in seq(10, 24, 2)) ToWrd(gettextf("This is Arial size %s \n", i), font=list(name="Arial", size=i)) for(i in seq(10, 24, 2)) ToWrd(gettextf("This is Times size %s \n", i), font=list(name="Times", size=i)) ## End(Not run)
Format cells of a Word table.
WrdFormatCells(wtab, rstart, rend, col = NULL, bg = NULL, font = NULL, border = NULL, align = NULL)
wtab |
a handle to the Word table as returned e.g. by |
rstart |
the left/upper cell of the cell range |
rend |
the right/lower cell of the cell range |
col |
the foreground colour |
bg |
the background colour |
font |
the font to be used to the output. This should be defined as a list containing fontname, fontsize, bold and italic flags: |
border |
the border of the cell range, defined as a list containing arguments for
border, linestyle, linewidth and color. |
align |
a character out of |
Cell range selecting might be complicated. This function makes it easy.
a handle to the range.
Andri Signorell <[email protected]>
## Not run: # Windows-specific example m <- matrix(rnorm(12)*100, nrow=4, dimnames=list(LETTERS[1:4], c("Variable","Value","Remark"))) wrd <- GetNewWrd() wt <- ToWrd(m) WrdFormatCells(wt, rstart=c(3,1), rend=c(4,3), bg=wdConst$wdColorGold, font=list(name="Arial Narrow", bold=TRUE), align="c", border=list(color=wdConst$wdColorTeal, linewidth=wdConst$wdLineWidth300pt)) ## End(Not run)
Merges a cell range of a word table.
WrdMergeCells(wtab, rstart, rend)
wtab |
a handle to the Word table as returned e.g. by |
rstart |
the left/upper cell of the cell range. |
rend |
the right/lower cell of the cell range. |
nothing
Andri Signorell <[email protected]>
## Not run: # Windows-specific example wrd <- GetNewWrd() wtab <- WrdTable(nrow=3, ncol=3, wrd=wrd) WrdMergeCells(wtab, rstart=c(1,2), rend=c(2,3)) ## End(Not run)
Insert a page break in a MS-Word (R) document at the position of the cursor.
WrdPageBreak(wrd = DescToolsOptions("lastWord"))
wrd |
the pointer to a word instance. Can be a new one, created by |
Andri Signorell <[email protected]>
WrdFont
, WrdPlot
, GetNewWrd
, GetCurrWrd
## Not run: # Windows-specific example wrd <- GetNewWrd() WrdText("This is text on page 1.\n\n") WrdPageBreak() WrdText("This is text on another page.\n\n") ## End(Not run)
WrdParagraphFormat
can be used to get and set the paragraph format in Word for the text to be inserted.
WrdParagraphFormat(wrd = DescToolsOptions("lastWord")) WrdParagraphFormat(wrd) <- value
value |
a list defining the paragraph format.
This can contain any combination of:
|
wrd |
the pointer to a word instance. Can be a new one, created by |
an object with the class paragraph
, basically a list with the attributes of the paragraph in the current cursor position:
LeftIndent |
left indentation in (in points) for the specified paragraphs. |
RightIndent |
right indent (in points) for the specified paragraphs. |
SpaceBefore |
spacing (in points) before the specified paragraphs. |
SpaceBeforeAuto |
|
SpaceAfter |
amount of spacing (in points) after the specified paragraph or text column. |
SpaceAfterAuto |
|
LineSpacingRule |
line spacing for the specified paragraph formatting. Use |
Alignment |
|
WidowControl |
|
KeepWithNext |
|
KeepTogether |
|
PageBreakBefore |
|
NoLineNumber |
|
Hyphenation |
|
FirstLineIndent |
value (in points) for a first line or hanging indent. Use a positive value to set a first-line indent, and use a negative value to set a hanging indent. |
OutlineLevel |
outline level for the specified paragraphs. |
CharacterUnitLeftIndent |
left indent value (in characters) for the specified paragraphs. |
CharacterUnitRightIndent |
right indent value (in characters) for the specified paragraphs. |
LineUnitBefore |
amount of spacing (in gridlines) before the specified paragraphs. |
LineUnitAfter |
amount of spacing (in gridlines) after the specified paragraphs. |
MirrorIndents |
Long that represents whether left and right indents are the same width. Can be |
Andri Signorell <[email protected]>
ToWrd
, WrdPlot
, GetNewWrd
, GetCurrWrd
## Not run: # Windows-specific example wrd <- GetNewWrd() # get the handle to a new word instance WrdParagraphFormat(wrd=wrd) <- list(Alignment=wdConst$wdAlignParagraphLeft, LeftIndent=42.55) ToWrd("Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est.\n", wrd=wrd) # reset WrdParagraphFormat(wrd=wrd) <- list(LeftIndent=0) ## End(Not run)
This function inserts the plot from the active plot device into Word. The image is transferred by saving the picture to a file in R and inserting the file in Word. The format of the plot can be selected, as well as crop options and the size factor for inserting.
WrdPlot(type = "png", append.cr = TRUE, crop = c(0, 0, 0, 0), main = NULL, picscale = 100, height = NA, width = NA, res = 300, dfact = 1.6, wrd = DescToolsOptions("lastWord"))
WrdPlot(type = "png", append.cr = TRUE, crop = c(0, 0, 0, 0), main = NULL, picscale = 100, height = NA, width = NA, res = 300, dfact = 1.6, wrd = DescToolsOptions("lastWord"))
type |
the format for the picture file, default is |
append.cr |
should a carriage return be appended? Default is TRUE. |
crop |
crop options for the picture, defined by a 4-elements-vector. The first element is the bottom side, the second the left and so on. |
main |
a caption for the plot. This will be inserted by InsertCaption in Word. Default is NULL, which will insert nothing. |
picscale |
scale factor of the picture in percent, default is 100. |
height |
height in cm, this overrides the picscale if both are given. |
width |
width in cm, this overrides the picscale if both are given. |
res |
resolution for the png file, defaults to 300. |
dfact |
the size factor for the graphic. |
wrd |
the pointer to a word instance. Can be a new one, created by |
Returns a pointer to the inserted picture.
Andri Signorell <[email protected]>
## Not run: # Windows-specific example # let's have some graphics plot(1,type="n", axes=FALSE, xlab="", ylab="", xlim=c(0,1), ylim=c(0,1), asp=1) rect(0,0,1,1,col="black") segments(x0=0.5, y0=seq(0.632,0.67, length.out=100), y1=seq(0.5,0.6, length.out=100), x1=1, col=rev(rainbow(100))) polygon(x=c(0.35,0.65,0.5), y=c(0.5,0.5,0.75), border="white", col="black", lwd=2) segments(x0=0,y0=0.52, x1=0.43, y1=0.64, col="white", lwd=2) x1 <- seq(0.549,0.578, length.out=50) segments(x0=0.43, y0=0.64, x1=x1, y1=-tan(pi/3)* x1 + tan(pi/3) * 0.93, col=rgb(1,1,1,0.35)) # get a handle to a new word instance wrd <- GetNewWrd() # insert plot with a specified height WrdPlot(wrd=wrd, height=5) ToWrd("Remember?\n", fontname="Arial", fontsize=14, bold=TRUE, wrd=wrd) # crop the picture WrdPlot(wrd=wrd, height=5, crop=c(9,9,0,0)) wpic <- WrdPlot(wrd=wrd, height=5, crop=c(9,9,0,0)) wpic ## End(Not run)
Open and save MS-Word documents.
WrdOpenFile(fn, wrd = DescToolsOptions("lastWord")) WrdSaveAs(fn, fileformat = "docx", wrd = DescToolsOptions("lastWord"))
fn |
filename and -path for the document. |
fileformat |
file format, one out of |
wrd |
the pointer to a word instance. Can be a new one, created by |
nothing returned
Andri Signorell <[email protected]>
## Not run: # Windows-specific example wrd <- GetNewWrd() WrdCaption("A Report") WrdSaveAs(fn="report", fileformat="htm") ## End(Not run)
WrdStyle
can be used to get and set the style in Word for the text to be inserted. WrdStyle
returns the style at the current cursor position.
WrdStyle(wrd = DescToolsOptions("lastWord")) WrdStyle(wrd) <- value
value |
the name of the style to be used to the output. This should be defined an existing name. |
wrd |
the pointer to a word instance. Can be a new one, created by |
character, name of the style
Andri Signorell <[email protected]>
ToWrd
, WrdPlot
, GetNewWrd
, GetCurrWrd
## Not run: # Windows-specific example wrd <- GetNewWrd() # the current style WrdStyle(wrd) ## End(Not run)
Create a table with a specified number of rows and columns in a Word document at the current position of the cursor.
WrdTable(nrow = 1, ncol = 1, heights = NULL, widths = NULL, main = NULL, wrd = DescToolsOptions("lastWord"))
nrow |
number of rows. |
ncol |
number of columns. |
heights |
a vector of the row heights (in [cm]). If set to |
widths |
a vector of the column widths (in [cm]). If set to |
main |
a caption for the plot. This will be inserted by InsertCaption in Word. Default is NULL, which will insert nothing. |
wrd |
the pointer to a word instance. Can be a new one, created by |
A pointer to the inserted table.
Andri Signorell <[email protected]>
## Not run: # Windows-specific example wrd <- GetNewWrd() WrdTable(nrow=3, ncol=3, wrd=wrd) ## End(Not run)
Drawing borders in a Word table is quite tedious. This function allows selecting any range and drawing border lines around it.
WrdTableBorders(wtab, from = NULL, to = NULL, border = NULL, lty = wdConst$wdLineStyleSingle, col = wdConst$wdColorBlack, lwd = wdConst$wdLineWidth050pt)
wtab |
|
from |
integer, a vector with two elements specifying the left upper bound of the cellrange. |
to |
integer, a vector with two elements specifying the right bottom of the cellrange. |
border |
a Word constant ( |
lty |
a Word constant ( |
col |
a Word constant ( |
lwd |
a Word constant ( |
nothing
Andri Signorell <[email protected]>
## Not run: # create table tab <- table(op=d.pizza$operator, area=d.pizza$area) # send it to Word wrd <- GetNewWrd() wtab <- ToWrd(tab, wrd=wrd, tablestyle = NA) # draw borders WrdTableBorders(wtab, from=c(2,2), to=c(3,3), border=wdConst$wdBorderBottom, wrd=wrd) WrdTableBorders(wtab, from=c(2,2), to=c(3,3), border=wdConst$wdBorderDiagonalUp, wrd=wrd) # demonstrate linewidth and color wtab <- ToWrd(tab, wrd=wrd, tablestyle = NA) WrdTableBorders(wtab, col=RgbToLong(ColToRgb("olivedrab")), lwd=wdConst$wdLineWidth150pt, wrd=wrd) WrdTableBorders(wtab, border=wdConst$wdBorderBottom, col=RgbToLong(ColToRgb("dodgerblue")), lwd=wdConst$wdLineWidth300pt, wrd=wrd) # use an R color in Word RgbToLong(ColToRgb("olivedrab")) # find a similar R-color for a Word color ColToRgb(RgbToCol(LongToRgb(wdConst$wdColorAqua))) ## End(Not run)
Inserting headings in a table can be hard when column headings should span several columns. This function helps insert such headings and merge the corresponding cells.
WrdTableHeading(wtab, text, bold = TRUE, alignment = wdConst$wdAlignParagraphCenter, merge_cols = NULL, wrd = DescToolsOptions("lastWord"))
wtab |
the handle to a table in a word document |
text |
the text for the headings |
bold |
logical, for setting bold column headings, will be recycled. Default is |
alignment |
the alignment in the column headings, must be one out of the Word constant list |
merge_cols |
a vector consisting of entries to merge cells in the form |
wrd |
the pointer to a word instance. Can be a new one, created by |
Nothing returned.
Andri Signorell <[email protected]>
## Not run: # Windows-specific example wrd <- GetNewWrd() wtab <- WrdTable(nrow=3, ncol=5, wrd=wrd) # insert headings and merge 1:2 and 4:5, such as there are # only 3 headings WrdTableHeading(wtab, text=c("First","Second", "Third"), alignment=c(wdConst$wdAlignParagraphLeft, rep(wdConst$wdAlignParagraphCenter, 2)), merge_cols = c("1:2", "4:5")) ## End(Not run)
As I repeatedly forgot how to convert Excel dates to POSIX, here's the specific function.
XLDateToPOSIXct(x, tz = "GMT", xl1904 = FALSE)
x |
the integer vector to be converted. |
tz |
a time zone specification to be used for the conversion, if one is required. See |
xl1904 |
logical, defining if the unspeakable 1904-system should be used. Default is FALSE. |
XLGetRange
will return dates as integer values, because XL stores them as integers. An Excel date can be converted with the (unusual) origin of
as.Date(myDate, origin="1899-12-30")
, which is implemented here.
Microsoft Excel supports two different date systems, the 1900 date system and the 1904 date system. In the 1900 date system, the first day that is supported is January 1, 1900. A date is converted into a serial number that represents the number of elapsed days since January 1, 1900. In the 1904 date system, the first day that is supported is January 1, 1904. By default, Microsoft Excel for the Macintosh uses the 1904 date system, Excel for Windows the 1900 system. See also: https://support.microsoft.com/en-us/kb/214330.
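A minimal sketch of the conversion with this origin (illustrative only; the package function additionally handles time zones and the 1904 system):

as.Date(41025, origin = "1899-12-30")                            # date only
as.POSIXct(41025 * 86400, origin = "1899-12-30", tz = "GMT")     # the same, as POSIXct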
return an object of the class POSIXct. Date-times known to be invalid will be returned as NA.
Andri Signorell <[email protected]>
XLDateToPOSIXct(41025) XLDateToPOSIXct(c(41025.23, 41035.52))
The package RDCOMClient
is used to open an Excel workbook and return the content (value) of one (or several) given range(s)
in a specified sheet. This is helpful whenever pathologically scattered data on an Excel sheet, which can't simply be saved as a CSV file, has to be imported in R. XLGetWorkbook()
does the same for all the sheets in an Excel workbook.
XLGetRange(file = NULL, sheet = NULL, range = NULL, as.data.frame = TRUE, header = FALSE, stringsAsFactors = FALSE, echo = FALSE, na.strings = NULL, skip = 0) XLGetWorkbook(file, compactareas = TRUE) XLCurrReg(cell) XLNamedReg(x)
file |
the fully specified path and filename of the workbook. If it is left as |
sheet |
the name of the sheet containing the range(s) of interest. |
range |
a scalar or a vector with the address(es) of the range(s) to be returned (characters).
Use "A1"-address mode to specify the ranges, for example |
as.data.frame |
logical. Determines if the cellranges should be coerced into data.frames. Defaults to |
header |
a logical value indicating whether the range contains the names of the variables as its first line. Default is |
stringsAsFactors |
logical. Should character columns be coerced to factors? The default is |
echo |
logical. If set to |
na.strings |
a character vector of strings which are to be interpreted as |
compactareas |
logical, defining if areas should be returned by |
cell |
range of the left upper cell, when the current region should be used. |
x |
the name or the index of the XL-name to be used. |
skip |
the number of lines of the data file to skip before beginning to read data. |
If as.data.frame is set to FALSE, the result consists of a list of lists. Be prepared to encounter NULL values in that case; these prevent the square data structure from being easily coerced to a data.frame. The following code will replace the NULL values by NA and coerce the data to a data.frame.
# get the range D1:J69 from an excel file
xlrng <- XLGetRange(file="myfile.xlsx", sheet="Tabelle1",
                    range="D1:J69", as.data.frame=FALSE)

# replace NULL values by NA
xlrng[unlist(lapply(xlrng, is.null))] <- NA

# coerce the square data structure to a data.frame
d.lka <- data.frame(lapply(data.frame(xlrng), unlist))
This can of course be avoided by setting as.data.frame = TRUE.
The function will return dates as integers, because MS-Excel stores them internally as integers. Such a date can subsequently be converted with the (unusual) origin of as.Date(myDate, origin="1899-12-30"). See also XLDateToPOSIXct, which does the job. The conversion can directly be performed by XLGetRange() if datecols is used and contains the date columns in the sheet data.
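A minimal sketch of the post-hoc conversion; note that the file name, range and the column name "birthday" are only placeholders for illustration:

## Not run: 
# Windows-specific sketch; file, range and column name are placeholders
d.imp <- XLGetRange(file="myfile.xlsx", sheet="Tabelle1",
                    range="A1:D100", header=TRUE)
# date columns arrive as Excel serial integers ...
# ... and can be converted afterwards:
d.imp$birthday <- XLDateToPOSIXct(d.imp$birthday)
## End(Not run)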
If as.data.frame is set to TRUE, a single data.frame or a list of data.frames will be returned. If set to FALSE, a list of the cell values in the specified Excel range (or, respectively, a list of such lists) will be returned. XLGetWorkbook() returns a list of lists of the values in the given workbook.
Andri Signorell <[email protected]>
## Not run: 
# Windows-specific example
XLGetRange(file="C:/My Documents/data.xls",
           sheet="Sheet1",
           range=c("A2:B5", "M6:X23", "C4:D40"))

# if the current region has to be read (incl. a header), place the cursor
# in the interesting region and run:
d.set <- XLGetRange(header=TRUE)

# Get XL nameslist
nm <- xl$ActiveWorkbook()$names()

lst <- list()
for(i in 1:nm$count())
  lst[[i]] <- c(name=nm[[i]]$name(),
                address=nm[[i]]$refersToRange()$Address())

# the defined names
as.data.frame(do.call(rbind, lst), stringsAsFactors = FALSE)
## End(Not run)
Save the current workbook under the given name and format.
XLSaveAs(fn, file_format = xlConst$XlFileFormat$xlWorkbookNormal, xl = DescToolsOptions("lastXL"))
fn |
the filename |
file_format |
the file format, specified using one of the xl constants (xlConst$XlFileFormat). Default is xlWorkbookNormal. |
xl |
the pointer to a MS-Excel instance. A new instance can be created with GetNewXL(). |
Returns TRUE if the save operation has been successful.
Andri Signorell <[email protected]>
## Not run: 
# Windows-specific example
XLView(d.diamonds)
XLSaveAs("Diamonds")
xl$quit()
## End(Not run)
XLView can be used to view and edit a data.frame directly in MS-Excel, or to create a new data.frame in MS-Excel.
XLView(x, col.names = TRUE, row.names = FALSE, na = "",
       preserveStrings = FALSE, sep = ";")

ToXL(x, at, ..., xl=DescToolsOptions("lastXL"))

## S3 method for class 'data.frame'
ToXL(x, at, ..., xl=DescToolsOptions("lastXL"))

## S3 method for class 'matrix'
ToXL(x, at, ..., xl=DescToolsOptions("lastXL"))

## Default S3 method:
ToXL(x, at, byrow = FALSE, ..., xl=DescToolsOptions("lastXL"))

XLKill()
x |
a data.frame to be transferred to MS-Excel. If x is missing, a new file will be created. |
row.names |
either a logical value indicating whether the row names of x are to be written along with x, or a character vector of row names to be written. |
col.names |
either a logical value indicating whether the column names of x are to be written along with x, or a character vector of column names to be written. See the section on 'CSV files' in write.table. |
na |
the string to use for missing values in the data. |
preserveStrings |
logical, prevents strings from being converted to numerics when imported in MS-Excel. See Details. Default is FALSE. |
sep |
the field separator string used for export of the object. Values within each row of x are separated by this string. |
at |
can be a range address as character (e.g. "A1") or a cell object as returned by xl$Cells(1, 1), defining the upper left cell where the object is to be placed. |
byrow |
logical, defines if the vector should be inserted by row or by column (default). |
xl |
the pointer to a MS-Excel instance. A new instance can be created with GetNewXL(). |
... |
further arguments are not used. |
The data.frame will be exported in CSV format and then imported into MS-Excel. When importing data, MS-Excel will potentially change characters to numeric values. If this seems undesirable (maybe we're losing leading zeros), then you should enclose the text in quotes and prepend a "=".
x <- gettextf('="%s"', x)
would do the trick.
Take care: Changes made to the data in MS-Excel will NOT automatically be updated in the original data.frame. The user will have to read the csv-file into R again. See the examples for how to get this done.
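As a small sketch of this trick (the data.frame and its zip column are invented for illustration), wrapping a character column before handing it to XLView() keeps the leading zeros intact:

## Not run: 
# Windows-specific sketch with a made-up data.frame
d.frm <- data.frame(id=1:3, zip=c("01234", "00987", "08150"),
                    stringsAsFactors=FALSE)
# enclose the strings in quotes and prepend "=" so that
# MS-Excel does not coerce them to numbers
d.frm$zip <- gettextf('="%s"', d.frm$zip)
XLView(d.frm)
## End(Not run)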
ToXL() is used to export data.frames or vectors directly to MS-Excel, without first exporting the data to a csv-file and importing it on the MS-Excel side. This makes it possible to export several data.frames into one workbook and edit the tables as needed.
XLKill will kill a running MS-Excel instance (which might be invisible). The background is that the simple xl$quit() command does not terminate a running MS-Excel task, but only sets it invisible (observe the Task Manager). This ghost instance may sometimes confuse XLView and prevent a new instance from being created. In such cases you have to do the garbage collection yourself...
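A minimal sketch of this cleanup (Windows only):

## Not run: 
# Windows-specific sketch
xl <- GetNewXL()
# ... work with the instance ...
xl$quit()    # may leave an invisible EXCEL.EXE task behind
XLKill()     # terminates the ghost instance for good
## End(Not run)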
the name/path of the temporary file edited in MS-Excel.
The function works only on Windows and requires RDCOMClient to be installed (see: Additional_repositories in the DESCRIPTION of the package).
Andri Signorell <[email protected]>; ToXL() is based on code of Duncan Temple Lang <[email protected]>
GetNewXL, XLGetRange, XLGetWorkbook
## Not run: 
# Windows-specific example
XLView(d.diamonds)

# edit an existing data.frame in MS-Excel, make changes and save there,
# return the filename
fn <- XLView(d.diamonds)
# read the changed file and store in new data.frame
d.frm <- read.table(fn, header=TRUE, quote="", sep=";")

# Create a new file, edit it in MS-Excel...
fn <- XLView()
# ... and read it into a data.frame when in R again
d.set <- read.table(fn, header=TRUE, quote="", sep=";")

# Export a ftable object, quite elegant...
XLView(format(ftable(Titanic), quote=FALSE), row.names = FALSE, col.names = FALSE)

# Export a data.frame directly to XL, combined with subsequent formatting
xl <- GetNewXL()
owb <- xl[["Workbooks"]]$Add()
sheet <- xl$Sheets()$Add()
sheet[["name"]] <- "pizza"

ToXL(d.pizza[1:10, 1:10], xl$Cells(1,1))

obj <- xl$Cells()$CurrentRegion()
obj[["VerticalAlignment"]] <- xlConst$xlTop

row <- xl$Cells()$CurrentRegion()$rows(1)
# does not work: row$font()[["bold"]] <- TRUE
# works:
obj <- row$font()
obj[["bold"]] <- TRUE

obj <- row$borders(xlConst$xlEdgeBottom)
obj[["linestyle"]] <- xlConst$xlContinuous

cols <- xl$Cells()$CurrentRegion()$columns(1)
cols[["HorizontalAlignment"]] <- xlConst$xlLeft

xl$Cells()$CurrentRegion()[["EntireColumn"]]$AutoFit()

cols <- xl$Cells()$CurrentRegion()$columns(4)
cols[["WrapText"]] <- TRUE
cols[["ColumnWidth"]] <- 80

xl$Cells()$CurrentRegion()[["EntireRow"]]$AutoFit()

sheet <- xl$Sheets()$Add()
sheet[["name"]] <- "whisky"
ToXL(d.whisky[1:10, 1:10], xl$Cells(1,1))
## End(Not run)
Performs one and two sample Yuen t-tests for trimmed means on vectors of data.
YuenTTest(x, ...)

## Default S3 method:
YuenTTest(x, y = NULL, alternative = c("two.sided", "less", "greater"),
          mu = 0, paired = FALSE, conf.level = 0.95, trim = 0.2, ...)

## S3 method for class 'formula'
YuenTTest(formula, data, subset, na.action, ...)
x |
numeric vector of data values. Non-finite (e.g. infinite or missing) values will be omitted. |
y |
an optional numeric vector of data values: as with x non-finite values will be omitted. |
alternative |
a character string, one of "two.sided" (default), "less" or "greater", specifying the alternative hypothesis. |
paired |
a logical indicating whether you want a paired test. |
mu |
a number specifying the hypothesized mean of the population. |
conf.level |
confidence level for the interval computation. |
trim |
the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint. |
formula |
a formula of the form lhs ~ rhs, where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups. |
data |
an optional matrix or data frame (or similar: see model.frame) containing the variables in the formula. By default the variables are taken from environment(formula). |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain NAs. Defaults to getOption("na.action"). |
... |
further arguments to be passed to or from methods. |
An object of class htest containing the following components:
statistic |
the value of the t-statistic. |
parameter |
the degrees of freedom for the t-statistic and the trim percentage used. |
p.value |
the p-value for the test. |
conf.int |
a confidence interval for the trimmed mean appropriate to the specified alternative hypothesis. |
estimate |
the estimated trimmed mean or difference in trimmed means depending on whether it was a one-sample test or a two-sample test. |
null.value |
the specified hypothesized value of the trimmed mean or trimmed mean difference depending on whether it was a one-sample test or a two-sample test. |
alternative |
a character string describing the alternative hypothesis. |
method |
a character string indicating what type of test was performed. |
data.name |
a character string giving the name(s) of the data. |
Andri Signorell <[email protected]>, based on R-Core code of t.test
Wilcox, R. R. (2005) Introduction to robust estimation and hypothesis testing. Academic Press.
Yuen, K. K. (1974) The two-sample trimmed t for unequal population variances. Biometrika, 61, 165-170.
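As a plausibility check (this sketch is not part of the original examples): with trim = 0 nothing is trimmed, so the result should essentially coincide with the Welch two-sample t-test.

# with trim = 0 no observations are trimmed and the result
# should agree with the (Welch) two sample t-test
set.seed(7)
x <- rnorm(30); y <- rnorm(30, mean=0.5)
YuenTTest(x, y, trim=0)
t.test(x, y)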
x <- rnorm(25, 100, 5)
YuenTTest(x, mu=99)

# the classic interface
with(sleep, YuenTTest(extra[group == 1], extra[group == 2]))

# the formula interface
YuenTTest(extra ~ group, data = sleep)

# Stahel (2002), pp. 186, 196
d.tyres <- data.frame(A=c(44.5,55,52.5,50.2,45.3,46.1,52.1,50.5,50.6,49.2),
                      B=c(44.9,54.8,55.6,55.2,55.6,47.7,53,49.1,52.3,50.7))
with(d.tyres, YuenTTest(A, B, paired=TRUE))

d.oxen <- data.frame(ext=c(2.7,2.7,1.1,3.0,1.9,3.0,3.8,3.8,0.3,1.9,1.9),
                     int=c(6.5,5.4,8.1,3.5,0.5,3.8,6.8,4.9,9.5,6.2,4.1))
with(d.oxen, YuenTTest(int, ext, paired=FALSE))
Replace NAs in a numeric vector x with 0. This function has the same logic as the zeroifnull function in SQL. NAIfZero() replaces zeros with NA. BlankIfNA() and NAIfBlank() do the same, but for character vectors.
ZeroIfNA(x)
NAIfZero(x)
NAIf(x, what)

BlankIfNA(x, blank="")
NAIfBlank(x)

Impute(x, FUN = function(x) median(x, na.rm = TRUE))
x |
the vector x, whose NAs should be overwritten with 0s. |
blank |
a character to be used for "blank". Default is an empty string (""). |
what |
a vector of elements to be set to NA. |
FUN |
the name of a function to be used for the imputation. It can also be a self-defined function or a constant value. Default is the median, function(x) median(x, na.rm = TRUE). |
the edited vector x
Andri Signorell <[email protected]>
z <- c(8, NA, 9, NA, 3)
ZeroIfNA(z)
# [1] 8 0 9 0 3

# set 8 and 9 to NA
NAIf(ZeroIfNA(z), what=c(8, 9))

Impute(z)
# [1] 8 8 9 8 3

z <- c("a", NA, "", NA, "k")
BlankIfNA(z)
# [1] "a" "" "" "" "k"
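Beyond the default median, FUN accepts any summary function or a constant value; a short additional sketch:

z <- c(8, NA, 9, NA, 3)
# impute with the mean instead of the default median
Impute(z, FUN=function(x) mean(x, na.rm=TRUE))
# or with a constant value
Impute(z, FUN=0)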
Calculate the zodiac sign of a date.
Zodiac(x, lang = c("engl", "deu"), stringsAsFactors = TRUE)
x |
the date to be transformed. |
lang |
the language of the zodiac names, can be English ("engl", default) or German ("deu"). |
stringsAsFactors |
logical. If set to TRUE (default), the result will be returned as a factor, otherwise as a character vector. |
The really relevant things can sometimes hardly be found. You just discovered such a function... ;-)
The following rule to determine zodiac symbols is implemented:
Dec. 22 - Jan. 19 : Capricorn
Jan. 20 - Feb. 17 : Aquarius
Feb. 18 - Mar. 19 : Pisces
March 20 - April 19 : Aries
April 20 - May 19 : Taurus
May 20 - June 20 : Gemini
June 21 - July 21 : Cancer
July 22 - Aug. 22 : Leo
Aug 23 - Sept. 21 : Virgo
Sept. 22 - Oct. 22 : Libra
Oct. 23 - Nov. 21 : Scorpio
Nov. 22 - Dec. 21 : Sagittarius
character vector or factor with the zodiac.
Andri Signorell <[email protected]>, based on code from Markus Naepflin
Year and other date functions
Zodiac(as.Date(c("1937-07-28", "1936-06-01", "1966-02-25",
                 "1964-11-17", "1972-04-25")), lang="deu")

d <- sample(seq(as.Date("2015-01-01"), as.Date("2015-12-31"), 1), 120)
z <- Zodiac(d)
Desc(z)
Compute the hypothesis test and the confidence interval for the mean of a population when the standard deviation of the population is known.
ZTest(x, ...) ## Default S3 method: ZTest(x, y = NULL, alternative = c("two.sided", "less", "greater"), paired = FALSE, mu = 0, sd_pop, conf.level = 0.95, ... ) ## S3 method for class 'formula' ZTest(formula, data, subset, na.action, ...)
x |
numeric vector of data values. Non-finite (e.g. infinite or missing) values will be omitted. |
y |
an optional numeric vector of data values: as with x non-finite values will be omitted. |
mu |
a number specifying the hypothesized mean of the population. |
sd_pop |
a number specifying the known standard deviation of the population. |
alternative |
a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "less" or "greater". |
paired |
a logical indicating whether you want a paired z-test. |
conf.level |
confidence level for the interval computation. |
formula |
a formula of the form lhs ~ rhs, where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups. |
data |
an optional matrix or data frame (or similar: see model.frame) containing the variables in the formula. By default the variables are taken from environment(formula). |
subset |
an optional vector specifying a subset of observations to be used. |
na.action |
a function which indicates what should happen when the data contain NAs. Defaults to getOption("na.action"). |
... |
further arguments to be passed to or from methods. |
Most introductory statistical texts introduce inference by using the z-test and z-based confidence intervals based on knowing the population standard deviation. However, statistical packages often do not include functions to do z-tests, since the t-test is usually more appropriate for real world situations. This function is meant to be used during that short period of learning when the student is learning about inference using z-procedures, but has not learned the t-based procedures yet. Once the student has learned about the t-distribution, the t.test() function should be used instead of this one (but the syntax is very similar, so this function should be an appropriate introductory step to learning t.test()).
The formula interface is only applicable for the 2-sample tests.
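For the one-sample case the statistic is simply z = (mean(x) - mu) / (sd_pop / sqrt(n)); a small sketch (with made-up data, mirroring the examples below) verifying this by hand:

set.seed(7)
x <- rnorm(25, 100, 5)
n <- length(x)

# the one-sample z statistic by hand ...
z <- (mean(x) - 99) / (5 / sqrt(n))
c(z=z, p.value=2 * pnorm(-abs(z)))

# ... should match the test object
ZTest(x, mu=99, sd_pop=5)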
A list with class "htest" containing the following components:
statistic |
the value of the z-statistic. |
p.value |
the p-value for the test |
conf.int |
a confidence interval for the mean appropriate to the specified alternative hypothesis. |
estimate |
the estimated mean or difference in means depending on whether it was a one-sample test or a two-sample test. |
null.value |
the specified hypothesized value of the mean or mean difference depending on whether it was a one-sample test or a two-sample test. |
alternative |
a character string describing the alternative hypothesis. |
method |
a character string indicating what type of test was performed. |
data.name |
a character string giving the name(s) of the data. |
Andri Signorell <[email protected]>, based on R-Core code of t.test; documentation partly from Greg Snow <[email protected]>
Stahel, W. (2002) Statistische Datenanalyse, 4th ed, vieweg
x <- rnorm(25, 100, 5)
ZTest(x, mu=99, sd_pop=5)

# the classic interface
with(sleep, ZTest(extra[group == 1], extra[group == 2], sd_pop=2))

# the formula interface
ZTest(extra ~ group, data = sleep, sd_pop=2)

# Stahel (2002), pp. 186, 196
d.tyres <- data.frame(A=c(44.5,55,52.5,50.2,45.3,46.1,52.1,50.5,50.6,49.2),
                      B=c(44.9,54.8,55.6,55.2,55.6,47.7,53,49.1,52.3,50.7))
with(d.tyres, ZTest(A, B, sd_pop=3, paired=TRUE))

d.oxen <- data.frame(ext=c(2.7,2.7,1.1,3.0,1.9,3.0,3.8,3.8,0.3,1.9,1.9),
                     int=c(6.5,5.4,8.1,3.5,0.5,3.8,6.8,4.9,9.5,6.2,4.1))
with(d.oxen, ZTest(int, ext, sd_pop=1.8, paired=FALSE))