This vignette shows you how to import and prepare any dataset for use
with finalfit. The demonstration will use the boot::melanoma dataset.
Use ?boot::melanoma to see the help page with the data description.
I will use library(tidyverse) methods. First I’ll write_csv() the data
just to demonstrate reading it. Note the various options in read_csv(),
including providing column names, variable types, the missing data
identifier etc.
library(readr)
# Save example
write_csv(boot::melanoma, "boot.csv")
# Read data
melanoma = read_csv("boot.csv")
#> Rows: 205 Columns: 7
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (7): time, status, sex, age, year, thickness, ulcer
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note that the output shows how the columns/variables have been parsed.
For full details see ?readr::cols(). The column specifications that can
be supplied include:
col_integer()
col_double()
col_factor()
col_character()
col_logical()
col_date()
col_time()
col_datetime()
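As a minimal sketch of how these specifications can be passed to read_csv(), the example below sets the column types explicitly rather than relying on guessing. The object names tmp and melanoma_typed are hypothetical, and reading year as an integer is only an illustrative choice, not something the data description requires.

```r
library(readr)

# Write the example data to a temporary file so the sketch is self-contained
tmp <- tempfile(fileext = ".csv")
write_csv(boot::melanoma, tmp)

# Read it back with explicit column types:
# - .default applies to every column not named explicitly
# - year is read as an integer (illustrative assumption)
# - na lists the strings to treat as missing (these are the readr defaults)
melanoma_typed <- read_csv(
  tmp,
  col_types = cols(
    .default = col_double(),
    year = col_integer()
  ),
  na = c("", "NA")
)

# Inspect the specification that was actually used
spec(melanoma_typed)
```

Supplying col_types also silences the column specification message, which can be convenient in scripts.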
ff_glimpse() provides a convenient overview of all data in a tibble or
data frame. It is particularly important that factors are correctly
specified. Hence, ff_glimpse() separates variables into continuous and
categorical. As expected, no factors are yet specified in the melanoma
dataset.
library(finalfit)
ff_glimpse(melanoma)
#> $Continuous
#> label var_type n missing_n missing_percent mean sd min
#> time time <dbl> 205 0 0.0 2152.8 1122.1 10.0
#> status status <dbl> 205 0 0.0 1.8 0.6 1.0
#> sex sex <dbl> 205 0 0.0 0.4 0.5 0.0
#> age age <dbl> 205 0 0.0 52.5 16.7 4.0
#> year year <dbl> 205 0 0.0 1969.9 2.6 1962.0
#> thickness thickness <dbl> 205 0 0.0 2.9 3.0 0.1
#> ulcer ulcer <dbl> 205 0 0.0 0.4 0.5 0.0
#> quartile_25 median quartile_75 max
#> time 1525.0 2005.0 3042.0 5565.0
#> status 1.0 2.0 2.0 3.0
#> sex 0.0 0.0 1.0 1.0
#> age 42.0 54.0 65.0 95.0
#> year 1968.0 1970.0 1972.0 1977.0
#> thickness 1.0 1.9 3.6 17.4
#> ulcer 0.0 0.0 1.0 1.0
#>
#> $Categorical
#> data frame with 0 columns and 205 rows
If you wish to see the variables in the order in which they appear in
the data frame or tibble, missing_glimpse() or tibble::glimpse() are
useful.
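A short sketch of both calls, assuming the data are loaded directly from the boot package rather than from the CSV file; the object names melanoma_raw and mg are hypothetical.

```r
library(finalfit)

# Load the example data directly from the boot package
melanoma_raw <- boot::melanoma

# One row per variable, in data frame order, with missing-value counts
mg <- missing_glimpse(melanoma_raw)
mg

# Compact column-wise preview of types and the first few values
tibble::glimpse(melanoma_raw)
```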
Use an original description of the data (often called a data dictionary) to correctly assign and label any factor variables. This can be done in a single pipe.
library(dplyr)
melanoma = melanoma %>%
  mutate(
    status.factor = factor(status, levels = c(1, 2, 3),
      labels = c("Died from melanoma", "Alive", "Died from other causes")) %>%
      ff_label("Status"),
    sex.factor = factor(sex, levels = c(1, 0),
      labels = c("Male", "Female")) %>%
      ff_label("Sex"),
    ulcer.factor = factor(ulcer, levels = c(1, 0),
      labels = c("Present", "Absent")) %>%
      ff_label("Ulcer")
  )
ff_glimpse(melanoma)
#> $Continuous
#> label var_type n missing_n missing_percent mean sd min
#> time time <dbl> 205 0 0.0 2152.8 1122.1 10.0
#> status status <dbl> 205 0 0.0 1.8 0.6 1.0
#> sex sex <dbl> 205 0 0.0 0.4 0.5 0.0
#> age age <dbl> 205 0 0.0 52.5 16.7 4.0
#> year year <dbl> 205 0 0.0 1969.9 2.6 1962.0
#> thickness thickness <dbl> 205 0 0.0 2.9 3.0 0.1
#> ulcer ulcer <dbl> 205 0 0.0 0.4 0.5 0.0
#> quartile_25 median quartile_75 max
#> time 1525.0 2005.0 3042.0 5565.0
#> status 1.0 2.0 2.0 3.0
#> sex 0.0 0.0 1.0 1.0
#> age 42.0 54.0 65.0 95.0
#> year 1968.0 1970.0 1972.0 1977.0
#> thickness 1.0 1.9 3.6 17.4
#> ulcer 0.0 0.0 1.0 1.0
#>
#> $Categorical
#> label var_type n missing_n missing_percent levels_n
#> status.factor Status <fct> 205 0 0.0 3
#> sex.factor Sex <fct> 205 0 0.0 2
#> ulcer.factor Ulcer <fct> 205 0 0.0 2
#> levels
#> status.factor "Died from melanoma", "Alive", "Died from other causes", "(Missing)"
#> sex.factor "Male", "Female", "(Missing)"
#> ulcer.factor "Present", "Absent", "(Missing)"
#> levels_count levels_percent
#> status.factor 57, 134, 14 27.8, 65.4, 6.8
#> sex.factor 79, 126 39, 61
#> ulcer.factor 90, 115 44, 56
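One way to double-check a recoding like this is to cross-tabulate each original code against its new factor. The sketch below rebuilds two of the factors on a separate, hypothetically named object (melanoma_check) so that it runs on its own; it is an optional sanity check rather than part of the finalfit workflow.

```r
library(dplyr)

# Rebuild two of the factors on a fresh copy of the data
melanoma_check <- boot::melanoma %>%
  mutate(
    status.factor = factor(status, levels = c(1, 2, 3),
      labels = c("Died from melanoma", "Alive", "Died from other causes")),
    sex.factor = factor(sex, levels = c(1, 0),
      labels = c("Male", "Female"))
  )

# Each numeric code should pair with exactly one label
status_check <- melanoma_check %>% count(status, status.factor)
sex_check <- melanoma_check %>% count(sex, sex.factor)
status_check
sex_check

# A code missing from `levels` would silently become NA, so test for that
stopifnot(!anyNA(melanoma_check$status.factor))
stopifnot(!anyNA(melanoma_check$sex.factor))
```

If a code in the data is absent from levels, factor() returns NA for it without warning, which is why the explicit anyNA() checks are worth having.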
Everything looks good and you are ready to start analysis.