| Title: | Descriptive Statistics Functions for Numeric Data |
|---|---|
| Description: | Provides fundamental functions for descriptive statistics, including MODE(), estimate_mode(), center_stats(), position_stats(), pct(), spread_stats(), kurt(), skew(), and shape_stats(), which assist in summarizing the center, spread, and shape of numeric data. For more details, see McCurdy (2025), "Introduction to Data Science with R" <https://jonmccurdy.github.io/Introduction-to-Data-Science/>. |
| Authors: | Luke Papayoanou [aut], Jon McCurdy [aut, cre] |
| Maintainer: | Jon McCurdy <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.2 |
| Built: | 2026-05-28 07:46:59 UTC |
| Source: | https://github.com/cran/MSMU |
This dataset contains historical performance and statistics for professional baseball teams across multiple seasons from 2000-2020.
baseball_teams
A data frame with 630 rows and 12 columns:
Year (integer)
Team (character)
Number of games played (integer)
Number of wins (integer)
Number of losses (integer)
World Series winner that year (character)
Total runs scored during the season (integer)
Total hits during the season (integer)
Total home runs during the season (integer)
Team earned run average per 9 innings (numeric)
Team fielding percentage (numeric)
Average home game attendance (integer)
Data retrieved from Lahman's Baseball Database, with alterations made for educational purposes.
This dataset contains performance statistics for 363 men’s college basketball teams from the 2022-23 season.
basketball
A data frame with 363 rows and 18 columns:
School (character)
State (character)
Wins (integer)
Losses (integer)
Win-loss percentage (numeric)
Simple Rating System (numeric)
Strength of Schedule (numeric)
Points scored (integer)
Points allowed (integer)
Team field goal percentage (numeric)
Three point percentage (numeric)
Free throw percentage (numeric)
Number of rebounds (integer)
Number of assists (integer)
Number of steals (integer)
Number of blocks (integer)
Number of turnovers (integer)
Number of fouls (integer)
Data retrieved from Sports Reference with alterations made for educational purposes.
Computes several measures of center for a numeric vector: the mean, the median, trimmed means (10% and 25%), and an estimated mode (the peak of the estimated probability density function, via estimate_mode()).
center_stats(x)
| x | A numeric vector. |
A named numeric vector with values for:
Arithmetic mean
Median
25% trimmed mean
10% trimmed mean
Estimated mode from estimate_mode()
    # Center stats of continuous random data
    set.seed(123)
    x <- rnorm(1000, mean = 50, sd = 10)
    center_stats(x)

    # Center stats of Sepal Length in the iris data set
    data("iris")
    center_stats(iris$Sepal.Length)
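To make the trimmed means concrete, the following base R sketch computes one by hand and compares it with `mean(x, trim = ...)`. The helper `trimmed_mean_by_hand()` and the sample vector are hypothetical illustrations, not part of MSMU, and the sketch does not claim to reproduce how `center_stats()` is implemented.

```r
# Hypothetical helper (not part of MSMU): trimmed mean computed by hand.
trimmed_mean_by_hand <- function(x, trim) {
  x <- sort(x[!is.na(x)])       # drop missing values, then order the data
  n <- length(x)
  k <- floor(n * trim)          # number of observations cut from each end
  mean(x[(k + 1):(n - k)])      # average what is left in the middle
}

# Made-up data with one large outlier
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 100)

plain_mean <- mean(x)                        # pulled upward by the outlier
trim_10    <- trimmed_mean_by_hand(x, 0.10)  # drops 1 value from each end
trim_25    <- trimmed_mean_by_hand(x, 0.25)  # drops 2 values from each end

print(c(mean = plain_mean, trim_10 = trim_10, trim_25 = trim_25,
        median = median(x)))
```

In this example the trimmed means land close to the median, while the plain mean is dominated by the single outlier.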
Santa's dataset, exploring whether Santa gives children presents based on a variety of variables.
christmas
A data frame with 1000 rows and 15 columns:
Gender (character)
Number of toys (integer)
Number of chores completed (numeric)
Child's favorite color (character)
Child's helping-hand number/score (integer)
Number of complaints the child makes (numeric)
Number of tantrums the child has (integer)
Number of rules the child breaks (numeric)
Child's willingness to share (numeric)
Child's average hours of sleep per night (numeric)
Child's average hours of screen time (numeric)
Child's school grade (numeric)
Child's parent presence (numeric)
Santa's numeric system for labeling children's greed (numeric)
Whether the child gets a present or coal (character)
Santa
A sample dataset representing demographic and academic information for 50 college students.
class_demographics
A data frame with 50 rows and 6 columns:
Person's name (character)
Person's age (integer)
Person's state (character)
Person's year in college (character)
Person's major (character)
Binary sport indicator: 1 (yes) or 0 (no) (integer)
Synthetic Data
This dataset provides detailed information on 777 U.S. colleges and universities from 1995, covering aspects of admissions, academics, finances, and student demographics.
college_data
A data frame with 777 rows and 16 columns:
College name (character)
US region (character)
Acceptance (integer)
Enrollment (integer)
Percent of students who were in the top 10% of their high school class (integer)
Percent of students who were in the top 25% of their high school class (integer)
Full-time undergraduates (integer)
Part-time undergraduates (integer)
Number of out-of-state students (integer)
Annual room and board price (integer)
Percentage of faculty with a PhD (integer)
Percentage of faculty with a terminal degree (integer)
Student-faculty ratio (numeric)
Percent of alumni who donate to the college (integer)
Instructional expenditure per student (integer)
Graduation rate (integer)
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. Adapted from the College data set in the ISLR library with alterations made for educational purposes.
Data for 3142 counties in the United States containing demographic, educational, economic, and technological statistics.
county_data
A data frame with 3142 rows and 17 columns:
State (character)
County name (character)
County level FIPS code (integer)
County population (integer)
Number of households (integer)
Median age of people in county (numeric)
Percentage of people over 18 (numeric)
Percentage of people over 65 (numeric)
Percent of high school graduates (numeric)
Percent of people with bachelor's degrees (numeric)
Percent of population that is White (numeric)
Percent of population that is Black (numeric)
Percent of population that is Hispanic (numeric)
Percent of households that have a smartphone (numeric)
Average household income (integer)
Median household income (integer)
Unemployment rate (numeric)
Adapted from the county_complete data set in the usdata library with alterations made for educational purposes.
This dataset contains academic performance records for 200 students across four years of high school, with scores or letter grades in English and Math.
course_scores
A data frame with 200 rows and 10 columns:
Student ID (integer)
Grade type (character)
Freshman English Score/letter grade (character)
Freshman Math Score/letter grade (character)
Sophomore English Score/letter grade (character)
Sophomore Math Score/letter grade (character)
Junior English Score/letter grade (character)
Junior Math Score/letter grade (character)
Senior English Score/letter grade (character)
Senior Math Score/letter grade (character)
Synthetic Data
A synthetic dataset containing demographic and socioeconomic information for 1,000 individuals.
data_210_census
A data frame with 1000 rows and 5 columns:
Person's age (integer)
Person's gender (character)
Person's level of education (character)
Person's yearly salary (integer)
Person's height in inches (integer)
Synthetic Data
Dataset providing detailed results from the 2020 U.S. presidential election at the county level.
election_2020
A data frame with 32177 rows and 7 columns:
State (character)
State electoral votes (integer)
County name (character)
Candidate name (character)
Candidate party (character)
Total number of votes (integer)
Whether the candidate won the county (logical)
Data retrieved from MIT Election Data and Science Lab, 2018, "County Presidential Election Returns 2000-2020", with alterations made for educational purposes.
Estimates the mode of a numeric vector by identifying the value corresponding to the peak of its estimated probability density function.
estimate_mode(x)
x |
A numeric vector. Missing values ( |
A single numeric value representing the estimated mode.
    # Estimate the mode of continuous random data
    set.seed(123)
    x <- rnorm(1000, mean = 5, sd = 2)
    estimate_mode(x)

    # Estimate the mode of miles per gallon (mpg) in the mtcars data set
    data("mtcars")
    estimate_mode(mtcars$mpg)
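The idea of taking the mode as the peak of an estimated density can be sketched in base R. This is a minimal illustration under assumptions: the helper `density_peak()` is hypothetical, and it uses `stats::density()` with its default Gaussian kernel and bandwidth, which may differ from the settings `estimate_mode()` actually uses.

```r
# Hypothetical helper (not part of MSMU): mode as the location of the
# highest point of a kernel density estimate.
density_peak <- function(x) {
  x <- x[!is.na(x)]          # assumed handling of missing values
  d <- density(x)            # default kernel and bandwidth (an assumption)
  d$x[which.max(d$y)]        # grid point where the estimated density is largest
}

set.seed(123)
x <- rnorm(1000, mean = 5, sd = 2)
peak <- density_peak(x)
print(peak)                  # expected to fall near the true center of 5
```

Because the density is evaluated on a finite grid, the result is an approximation whose precision depends on the grid size and the bandwidth.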
Synthetic dataset containing academic performance and background information for 1,000 students.
exam_data
A data frame with 1000 rows and 8 columns:
Student's gender (character)
Student's race/ethnicity (character)
Parents' level of education (character)
Student's lunch plan (character)
Student's test prep level (character)
Student's math score (integer)
Student's reading score (integer)
Student's writing score (integer)
Data retrieved from roycekimmons generated data
Dataset containing performance statistics for 106 football players who attempted a pass in the NFL for the 2022 season.
football
A data frame with 106 rows and 17 columns:
Player's name (character)
Player's team (character)
Player's age (integer)
Player's position (character)
Number of games (integer)
Number of games starting (integer)
Number of wins (integer)
Number of completions (integer)
Number of throwing attempts (integer)
Completion percentage (numeric)
Number of yards thrown (integer)
Number of touchdowns (integer)
Number of interceptions thrown (integer)
Yards per Attempt (numeric)
Yards per Game (numeric)
Passer rating (numeric)
Total Quarterback Rating (numeric)
Data retrieved from Pro Football Reference with alterations made for educational purposes.
Dataset containing medical and diagnostic information for 303 patients, used to study the presence of Atherosclerotic Heart Disease (AHD).
heart
A data frame with 303 rows and 14 columns:
Patient's age (integer)
Patient's sex (1 = male, 0 = female) (integer)
Chest pain type (character)
Resting blood pressure (in mm Hg on admission to the hospital) (integer)
Serum cholesterol in mg/dl (integer)
Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) (integer)
Resting electrocardiographic results (integer)
Maximum heart rate achieved (integer)
Exercise induced angina (1 = yes; 0 = no) (integer)
ST depression induced by exercise relative to rest (numeric)
The slope of the peak exercise ST segment (integer)
Number of major vessels (0-3) colored by fluoroscopy (integer)
Thal condition (character)
Atherosclerotic Heart Disease (AHD) condition (character)
Data retrieved from UC Irvine Machine Learning Repository
Data on houses that were recently sold in the Duke Forest neighborhood of Durham, NC in November 2020.
housing_data
A data frame with 98 rows and 6 columns:
Home price (numeric)
Number of bedrooms (integer)
Number of bathrooms (numeric)
Square footage (integer)
Year the house was built (integer)
Lot size (numeric)
Adapted from the duke_forest dataset in the openintro library with alterations made for educational purposes.
Dataset containing basic demographic and financial information for 20 individuals.
income_data
A data frame with 20 rows and 5 columns:
ID (integer)
age (integer)
Years until retirement at 65 (integer)
Salary (integer)
Birth weight (integer)
Synthetic Data
Calculates the kurtosis of a numeric vector. A value near 0 suggests normal kurtosis (mesokurtic), positive values indicate heavier tails (leptokurtic), and negative values indicate lighter tails (platykurtic).
kurt(x)
| x | A numeric vector. |
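The z-scores are computed as:
$$z_i = \frac{x_i - \bar{x}}{s}$$
The kurtosis is then calculated as:
$$\text{kurtosis} = \frac{1}{n} \sum_{i=1}^{n} z_i^4 - 3$$
Where:
$\bar{x}$ is the mean of $x$,
$s$ is the standard deviation of $x$,
and $n$ is the number of observations.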
A single numeric value representing the kurtosis
    # Kurtosis of mpg in mtcars
    data("mtcars")
    kurt(mtcars$mpg)
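The z-score approach described above can be sketched in base R. The helper `excess_kurtosis()` is hypothetical, and it assumes the statistic is the mean of the fourth powers of the z-scores minus 3, with `sd()` (which divides by n - 1) as the standard deviation; the exact formula inside `kurt()` may differ.

```r
# Hypothetical helper (not part of MSMU): excess kurtosis from z-scores.
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sd(x)   # sd() uses the n - 1 denominator
  mean(z^4) - 3                # subtract 3 so a normal distribution scores near 0
}

set.seed(123)
k_normal  <- excess_kurtosis(rnorm(10000))        # mesokurtic: close to 0
k_uniform <- excess_kurtosis(runif(10000))        # platykurtic: negative
k_heavy   <- excess_kurtosis(rt(10000, df = 5))   # leptokurtic: positive

print(round(c(normal = k_normal, uniform = k_uniform, t5 = k_heavy), 2))
```

The three simulated samples line up with the interpretation given in the description: near zero for normal data, negative for the light-tailed uniform, and positive for the heavy-tailed t distribution.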
Dataset mimicking a ledger showing the price an item was bought and sold for, the date it occurred, and the color of the product.
ledger_data
A data frame with 4 rows and 104 columns:
colors (character)
age (integer)
Price on date (numeric); one such entry for each of the remaining 102 columns
Synthetic Data
Batter statistics for the 2018 Major League Baseball season.
mlb_eda
A data frame with 1270 rows and 13 columns:
Player's name (character)
Player's team (character)
Player's position (character)
Number of games (integer)
Number of at bats (integer)
Number of runs (integer)
Number of hits (integer)
Number of doubles (integer)
Number of home runs (integer)
Number of runs batted in (integer)
Player's batting average (numeric)
Player's slugging percentage (numeric)
Player's on-base plus slugging (numeric)
Data retrieved from MLB, with alterations made for educational purposes.
Calculates the mode (most frequent value) of a numeric vector. If there is a tie, returns all values that share the highest frequency.
MODE(x)
| x | A numeric vector. |
A numeric value (or vector) representing the mode(s) of x.
    # Mode of a numeric vector
    MODE(c(1, 2, 3, 3, 3, 4, 5, 5, 3, 8))

    # Mode of the number of cylinders in the mtcars data set
    data("mtcars")
    MODE(mtcars$cyl)
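The tie-handling behavior described above (all values sharing the highest frequency are returned) can be illustrated with a small base R sketch. The helper `freq_mode()` is hypothetical and is only one possible way to count frequencies; it is not taken from the MSMU source.

```r
# Hypothetical helper (not part of MSMU): most frequent value(s) of a vector.
freq_mode <- function(x) {
  x <- x[!is.na(x)]                    # assumed handling of missing values
  vals   <- unique(x)                  # distinct values, in order of appearance
  counts <- tabulate(match(x, vals))   # how often each distinct value occurs
  vals[counts == max(counts)]          # keep every value tied for the top count
}

single_mode <- freq_mode(c(1, 2, 3, 3, 3, 4, 5, 5, 3, 8))  # 3 occurs four times
tied_modes  <- freq_mode(c(1, 1, 2, 2, 3))                 # 1 and 2 both occur twice

print(single_mode)
print(tied_modes)
```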
Dataset summarizing the distribution of male and female students across various dormitories at Mount College, categorized by academic year.
mount_dorms
A data frame with 4 rows and 11 columns:
Student's year (character)
Males living in Pangborn (integer)
Males living in Sheridan (integer)
Males living in Terrace (integer)
Males living in Powell (integer)
Males living in the Towers (integer)
Females living in Pangborn (integer)
Females living in Sheridan (integer)
Females living in Terrace (integer)
Females living in Powell (integer)
Females living in the Towers (integer)
Synthetic Data
The MSMU package provides core functions for descriptive statistics and exploratory data analysis. It includes functions for computing central tendency, spread, shape, and position statistics, along with utility functions for estimating modes and standardized ranges. The package contains
Luke Papayoanou, Jon McCurdy
Calculates the percentage of values in a numeric vector that fall within n standard deviations of the mean.
pct(x, n)
| x | A numeric vector. |
| n | A positive numeric value indicating how many standard deviations from the mean to use as bounds. |
A single numeric value representing the percentage (0–100) of values within the specified range.
    # Percentage of values that fall within 2 SDs of the mean in random normal data
    set.seed(123)
    x <- rnorm(1000)
    pct(x, 2)

    # Percentage of values that fall within 2 SDs of the mean in iris Sepal Lengths
    data("iris")
    pct(iris$Sepal.Length, 2)
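A rough base R equivalent of this calculation is sketched below. The helper `within_sd()` is hypothetical; treating the bounds as inclusive and using the sample standard deviation from `sd()` are assumptions, not confirmed details of `pct()`.

```r
# Hypothetical helper (not part of MSMU): percentage of values within
# n standard deviations of the mean.
within_sd <- function(x, n) {
  x <- x[!is.na(x)]
  lower <- mean(x) - n * sd(x)
  upper <- mean(x) + n * sd(x)
  100 * mean(x >= lower & x <= upper)   # share of TRUE values, as a percentage
}

set.seed(123)
x  <- rnorm(1000)
p1 <- within_sd(x, 1)
p2 <- within_sd(x, 2)
p3 <- within_sd(x, 3)

# For normal data these should sit near the 68-95-99.7 empirical rule
print(c(within_1 = p1, within_2 = p2, within_3 = p3))
```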
Calculates quantiles of a numeric vector, namely quartiles (the data are split into 4 equal parts) and quintiles (the data are split into 5 equal parts), using the 'quantile()' function. NAs are removed.
position_stats(x)
| x | A numeric vector. |
Percentiles are values that divide a dataset into 100 equal parts, each representing 1% of the distribution. For example, the 25th percentile is the value below which 25% of the data fall.
Quartiles are special percentiles that divide the data into four equal groups: Q1 (25th percentile), Q2 (50th percentile, or median), and Q3 (75th percentile).
Quintiles divide the data into five equal groups, each representing 20% of the distribution; the 20th, 40th, 60th, and 80th percentiles are the cut points.
A list with two elements:
Numeric vector of quintiles (0%, 20%, 40%, ..., 100%)
Numeric vector of quartiles (0%, 25%, 50%, 75%, 100%)
    # Position stats of random data
    set.seed(123)
    x <- rnorm(1000)
    position_stats(x)

    # Position stats of mpg in the mtcars data set
    data("mtcars")
    position_stats(mtcars$mpg)
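Since the description names `quantile()`, the two vectors in the returned list can be approximated directly in base R. In this sketch the list element names `quintiles` and `quartiles` are hypothetical, and `quantile()` is called with its default interpolation method (type 7), which is assumed rather than confirmed for `position_stats()`.

```r
# Base R sketch of the two quantile vectors (element names are hypothetical).
x <- mtcars$mpg

quintiles <- quantile(x, probs = seq(0, 1, by = 0.20), na.rm = TRUE)
quartiles <- quantile(x, probs = seq(0, 1, by = 0.25), na.rm = TRUE)

result <- list(quintiles = quintiles, quartiles = quartiles)
print(result)
```

The 50% quartile is the median, and the 0% and 100% entries are the minimum and maximum of the data.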
This dataset contains synthetic reaction time measurements for 100 individuals under different conditions.
reaction_time
A data frame with 100 rows and 6 columns:
Person id (integer)
color (character)
left (numeric)
right (numeric)
Person age (numeric)
Person gender (character)
Synthetic Data
Calculates the skewness and kurtosis of a numeric vector.
Skewness (via skew()): a positive value indicates right skew (long right tail), a negative value indicates left skew (long left tail), and a value of zero represents symmetry.
Kurtosis (via kurt()): a value near 0 suggests normal kurtosis (mesokurtic), positive values indicate heavier tails (leptokurtic), and negative values indicate lighter tails (platykurtic).
shape_stats(x)
| x | A numeric vector. |
A list with two elements:
Skew of Data from skew()
Kurtosis of Data from kurt()
    # Shape stats of mpg in mtcars
    data("mtcars")
    shape_stats(mtcars$mpg)
Calculates the skewness of a numeric vector. A positive value indicates right skew (long right tail), while a negative value indicates left skew (long left tail). A zero value represents symmetry.
skew(x)
| x | A numeric vector. |
A single numeric value representing the skewness of the distribution.
    # Skew of Sepal Lengths in iris
    data("iris")
    skew(iris$Sepal.Length)
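The sign convention described above can be checked with a short base R sketch. The helper `skew_sketch()` is hypothetical and assumes skewness is computed as the mean of the cubed z-scores; the source does not state the exact formula used by `skew()`.

```r
# Hypothetical helper (not part of MSMU): skewness as the mean cubed z-score.
skew_sketch <- function(x) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sd(x)   # standardize using the sample standard deviation
  mean(z^3)
}

set.seed(123)
s_sym   <- skew_sketch(rnorm(10000))    # symmetric data: close to 0
s_right <- skew_sketch(rexp(10000))     # long right tail: positive
s_left  <- skew_sketch(-rexp(10000))    # long left tail: negative

print(round(c(symmetric = s_sym, right = s_right, left = s_left), 2))
```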
This dataset contains historical match results from various international soccer games between different countries for the years 1872-2024.
soccer
A data frame with 13750 rows and 5 columns:
Date of match (character)
Home team name (character)
Away team name (character)
Home team's goal count (integer)
Away team's goal count (integer)
Data retrieved from Kaggle International football results dataset with alterations made for educational purposes.
Computes a variety of spread statistics for a numeric vector, including the standard deviation, the interquartile range (IQR), the normalized minimum, maximum, and range, and the percentage of data within 1, 2, and 3 standard deviations of the mean (via pct()).
spread_stats(x)
| x | A numeric vector. |
Standard deviation
Interquartile range
Normalized minimum
Normalized maximum
Normalized range
Percent of data within 1 standard deviation from pct()
Percent of data within 2 standard deviations from pct()
Percent of data within 3 standard deviations from pct()
    # Spread stats of random normal data
    set.seed(123)
    x <- rnorm(1000)
    spread_stats(x)

    # Spread stats of mpg in mtcars
    data("mtcars")
    spread_stats(mtcars$mpg)
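Several of these quantities can be approximated in base R. The helper `spread_sketch()` below is hypothetical, and it assumes that "normalized" means expressed in standard deviation units (z-scores); the source does not define the term, so the package may normalize differently.

```r
# Hypothetical helper (not part of MSMU): basic spread statistics.
# Assumption: "normalized" values are z-scores, (value - mean) / sd.
spread_sketch <- function(x) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sd(x)
  c(sd      = sd(x),
    iqr     = IQR(x),               # Q3 minus Q1
    z_min   = min(z),               # minimum, in standard deviation units
    z_max   = max(z),               # maximum, in standard deviation units
    z_range = max(z) - min(z))      # range, in standard deviation units
}

res <- spread_sketch(mtcars$mpg)
print(round(res, 3))
```

Under this reading, the normalized range is simply the ordinary range divided by the standard deviation.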