---
title: "Preparing data for finalfit"
author: "Ewen Harrison"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Preparing data for finalfit}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
	
```{r setup, include = FALSE}
knitr::opts_chunk$set(
	collapse = TRUE,
	comment = "#>"
)
```

This vignette shows you how to read in and prepare any dataset for use with finalfit. The demonstration uses the `boot::melanoma` dataset. Use `?boot::melanoma` to see the help page with a description of the data. I will use `library(tidyverse)` methods. First I'll `write_csv()` the data to disk, just to demonstrate reading it back in.

## Read data

Note the various options in `read_csv()`, including column names, variable types, and the missing-data identifier.

```{r}
library(readr)

# Save example
write_csv(boot::melanoma, "boot.csv")

# Read data
melanoma = read_csv("boot.csv")
```
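
As a minimal sketch of the `read_csv()` options noted above, the call below sets the column names, column types and missing-data identifiers explicitly rather than relying on guessing. The temporary file and the object name `melanoma_typed` are assumptions made for illustration; the column names are those of `boot::melanoma`.

```r
library(readr)

# Write the example data to a temporary file so the sketch is self-contained
tmp <- tempfile(fileext = ".csv")
write_csv(boot::melanoma, tmp)

melanoma_typed <- read_csv(
  tmp,
  col_names = TRUE,            # first row holds the column names (the default)
  col_types = cols(
    time      = col_integer(),
    status    = col_integer(),
    sex       = col_integer(),
    age       = col_integer(),
    year      = col_integer(),
    thickness = col_double(),
    ulcer     = col_integer()
  ),
  na = c("", "NA")             # strings to be treated as missing
)

# Show the column specification that was actually used
spec(melanoma_typed)
```

Stating the types up front means a malformed value is reported as a parsing problem instead of silently changing how the whole column is read.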

## Column types

Note the output shows how the columns/variables have been parsed. For full details see `?readr::cols`.

### Continuous data

* Integer (whole numbers) - `col_integer()`
* Double or numeric (real numbers; the name comes from "double-precision floating point") - `col_double()`

### Categorical data

* Factor (a fixed set of names/strings or numbers) - `col_factor()`
* Character (sequences of letters, numbers, and symbols) - `col_character()`
* Logical (containing only TRUE or FALSE) - `col_logical()`

### Dates and times

* Date - `col_date()`
* Time - `col_time()`
* Date-time - `col_datetime()`
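
The column types listed above can be combined in a single `cols()` specification. The sketch below uses a small made-up file (the columns `id`, `group`, `admitted` and `smoker` are hypothetical and not part of the melanoma data) to show a factor, a date and a logical being parsed at read time.

```r
library(readr)

# Hypothetical toy data, written to a temporary file for illustration only
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id,group,admitted,smoker",
  "1,control,2020-01-15,TRUE",
  "2,treatment,2020-02-03,FALSE",
  "3,control,2020-03-21,TRUE"
), tmp)

toy <- read_csv(
  tmp,
  col_types = cols(
    id       = col_integer(),
    group    = col_factor(levels = c("control", "treatment")),
    admitted = col_date(format = "%Y-%m-%d"),
    smoker   = col_logical()
  )
)

# Confirm the class of each column
sapply(toy, class)
```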

## Check data

`ff_glimpse()` provides a convenient overview of all data in a tibble or data frame. It is particularly important that factors are correctly specified. Hence, `ff_glimpse()` separates variables into continuous and categorical. As expected, no factors are yet specified in the melanoma dataset.

```{r}
library(finalfit)
ff_glimpse(melanoma)
```

If you wish to see the variables in the order in which they appear in the data frame or tibble, `missing_glimpse()` or `tibble::glimpse()` are useful. 

```{r}
missing_glimpse(melanoma)
```
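
As a quick cross-check that does not depend on finalfit, missing values can also be counted per column with base R. This is only a sketch, run here directly on `boot::melanoma`; the name `na_counts` is an assumption for illustration.

```r
# Number of missing values in each column, in data frame order
na_counts <- colSums(is.na(boot::melanoma))
na_counts
```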

## Specify factors

Use an original description of the data (often called a data dictionary) to correctly assign and label any factor variables. This can be done in a single pipe. 

```{r}
library(dplyr)
melanoma %>% 
  mutate(
    status.factor = factor(status, levels = c(1, 2, 3), 
      labels = c("Died from melanoma", "Alive", "Died from other causes")) %>% 
    ff_label("Status"),
    sex.factor = factor(sex, levels = c(1, 0),
      labels = c("Male", "Female")) %>% 
    ff_label("Sex"),
    ulcer.factor = factor(ulcer, levels = c(1, 0),
      labels = c("Present", "Absent")) %>% 
    ff_label("Ulcer")
  ) -> melanoma

ff_glimpse(melanoma)
```
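
One way to confirm that a recoding like the one above matches the data dictionary is to cross-tabulate the original codes against the new labels. The sketch below does this for `status` only, working directly on `boot::melanoma` and omitting `ff_label()` so it depends on dplyr alone; the names `melanoma_check` and `status_check` are assumptions for illustration.

```r
library(dplyr)

melanoma_check <- boot::melanoma %>%
  mutate(
    status.factor = factor(status, levels = c(1, 2, 3),
      labels = c("Died from melanoma", "Alive", "Died from other causes"))
  )

# One row per code/label pair; an NA label would flag a code
# that was not listed in `levels`
status_check <- melanoma_check %>%
  count(status, status.factor)

status_check
```

Each original code should map to exactly one label, and the counts should sum to the number of rows in the data.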

Everything looks good and you are ready to start analysis.