The package automatically detects the type of variables in data that have not been quality controlled. The prediction is based on a pre-trained random forest model, trained on over 5000 medical variables, with an out-of-bag (OOB) accuracy of 99%. The accuracy depends heavily on the type and coding style of the data. For example, categorical variables are often coded as integers 1 to x; if the number of categories is very large, such a variable cannot be distinguished from a continuous integer variable. Some types are by definition very sensitive to errors in the data, such as ID, missing, or constant: a single deviating non-missing value means the variable is no longer constant or missing. The data are assumed to be cross-sectional, with a unique ID per observation (no multiple entries per ID).
It can be used as a first step in data quality control to help sort the variables in advance and get some information about their possible formats.
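
The sensitivity of the ID, missing, and constant types to single erroneous values can be illustrated with a minimal base R sketch. The three helper functions are hypothetical and written only for this illustration; they are not part of the package and do not reflect how the random forest model decides.

```r
# Hypothetical helpers for illustration only; not the package's method.
is_missing_var  <- function(x) all(is.na(x))
is_constant_var <- function(x) length(unique(x[!is.na(x)])) == 1
is_id_var       <- function(x) !anyNA(x) && anyDuplicated(x) == 0

# A constant variable stops being constant after one deviating entry.
visit <- rep(1, 100)
constant_before <- is_constant_var(visit)   # TRUE
visit[17] <- 2
constant_after <- is_constant_var(visit)    # FALSE

# A fully missing variable stops being missing after one stray value.
loct <- rep(NA, 100)
missing_before <- is_missing_var(loct)      # TRUE
loct[3] <- "x"
missing_after <- is_missing_var(loct)       # FALSE

# An ID stops being unique after one duplicated entry.
id <- 1:100
id_before <- is_id_var(id)                  # TRUE
id[2] <- 1L
id_after <- is_id_var(id)                   # FALSE
```

In each case a single value out of 100 flips a strict rule-based check, which is why these types are particularly error-prone in uncleaned data.
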
The data set ‘sim_nqc_data’ contains 100 observations and 14 artificial variables with some badly formatted or missing values. The data are completely artificial and were not used for training or validation of the random forest model.
id | visit | sex | age | byear | decades | bmi | med | date | loct | group | crp | bnp | comms |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | Men | 55 | 1966 | 55-64 | 28 | 0 | 2016/12/03 | NA | 4 | 1.78 | >300 | |
2 | NULL | Women | NULL | 1979 | NULL | NULL | 0 | 2016-07-06 | NA | NULL | kM | kM | no material available |
3 | 1 | Men | 49 | 1972 | 45-54 | 29 | 0 | 2015-09-16 | NA | 2 | 2.002 | <1.5 | |
4 | 1 | Women | 73 | 1948 | 65-75 | 25.5 | 1 | 2016-xx-xx | NA | 1 | 1.332 | 3.4 | |
5 | 1 | Women | 49 | 1972 | 45-54 | 24 | 0 | 2017-08-03 | NA | 2 | <0.2 | 2.1 | . |
6 | 1 | Men | 70 | 9999 | 65-75 | 28 | 0 | 2018-03-30 | NA | 2 | 3.157 | 5.6 | |
7 | 1 | | 71 | 1950 | 65-75 | 32 | 0 | | NA | 4 | 4.203 | <1.5 | nd |
8 | 1 | Women | 55 | 1966 | 55-64 | 27 | 0 | 2016/12/04 | NA | 2 | 0.619 | 3.1 | |
9 | 1 | Men | 46 | 1975 | 45-54 | 31 | 0 | 2016-06-26 | NA | 2 | 9.866 | 7.2 | |
10 | 1 | Men | 32 | 1989 | 25-34 | 24 | NA | 2018-05-08 | NA | 3 | NA | <1.5 | Lab problems |
11 | 1 | Women | 72 | 1949 | 65-75 | 29 | 0 | 2017-08-12 | NA | 4 | 0.352 | <1.5 | |
12 | 1 | Men | 40 | 1981 | 35-44 | 31 | 1 | 2016-11-27 | NA | 2 | 1.28 | <1.5 | |
13 | 1 | Men | 28 | 1993 | 25-34 | 0 | 0 | 2017-12-28 | NA | 3 | 1.073 | <1.5 | |
14 | 1 | Men | 72 | 1949 | 65-75 | 29 | 0 | 2015-09-18 | NA | 1 | 0.227 | <1.5 | |
17 | 1 | Men | 61 | 1960 | 55-64 | 27 | 0 | 2017-10-06 | NA | 2 | 5.113 | 5.5 | |
18 | 1 | Women | NA | NA | NA | 26 | 0 | | NA | 5 | 0.508 | <1.5 | na |
19 | 1 | Men | 52 | 1969 | 45-54 | 26 | 1 | 2018-01-10 | NA | 3 | 0.231 | 1.5 | |
20 | 1 | Men | 73 | 1948 | 65-75 | 29 | 1 | 2017-03-02 | NA | 4 | 0.975 | <1.5 | |
21 | 1 | Men | 54 | 1967 | 45-54 | 26 | 0 | 2016-05-27 | NA | 2 | 3.38 | <1.5 | |
22 | 1 | Women | NA | 9999 | NULL | 28 | 0 | 2017-12-11 | NA | 3 | 0.437 | <1.5 | no birth date |
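
The table above contains several ad hoc codes for missing values, such as NULL, 9999, and empty cells. As a rough sketch of what treating such codes as missing means, the hypothetical helper `to_na()` below maps them to `NA` in base R. It is an illustration only and not the package's implementation; its `miss_values` parameter is named to echo user-supplied codes.

```r
# Hypothetical helper, not part of the package: map standard and
# user-supplied missing-value codes to NA in a character vector.
to_na <- function(x, miss_values = character(0)) {
  x <- trimws(as.character(x))                     # blank strings become ""
  standard <- c("", "NA", "NaN", "Inf", "NULL")    # codes assumed standard here
  x[x %in% c(standard, miss_values)] <- NA
  x
}

# Values modelled on the byear column; the blank entry is made up.
byear <- c("1966", "1979", "9999", " ", "NULL", "1950")
cleaned <- to_na(byear, miss_values = "9999")
cleaned
# "1966" "1979" NA NA NA "1950"
```

Codes that are not declared, such as `kM` in the `crp` column, would remain ordinary values under this sketch.
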
Usage is straightforward: the function requires the data as a data.frame. It is important that all nonstandard codes for missing values in the data, e.g. 9999, are declared via the `miss_values` argument. The values NA, NaN, Inf, NULL and spaces are automatically treated as invalid (missing) values. In the result, the second column, `type`, is the estimated type of the variable, and the column `probability` indicates how certain the estimate is. The column `format` gives additional information about the possible format of the variable, which is especially useful for date variables. The column `class` is just a translation of `type` into broader categories.

```r
library(vtype)
# Estimate the variable types, treating the code 9999 as missing
tab <- vtype(sim_nqc_data, miss_values = '9999')
knitr::kable(tab, caption = 'Application example of vtype')
```

variable | type | probability | format | class | alternative | n | missings |
---|---|---|---|---|---|---|---|
id | ID | 0.906 | | supportive | continuous (4.8%) | 100 | 0 |
visit | constant | 0.989 | 1 | uninformative | – | 98 | 2 |
sex | binary | 0.961 | men/women | qualitative | text (3.2%) | 98 | 2 |
age | continuous | 1.000 | integer | quantitative | – | 92 | 8 |
byear | date | 0.974 | %Y | supportive | continuous (2.6%) | 93 | 7 |
decades | categorical | 0.565 | labels | qualitative | date (39.8%) | 92 | 8 |
bmi | continuous | 0.999 | integer | quantitative | – | 92 | 8 |
med | binary | 0.999 | 0/1 | qualitative | – | 95 | 5 |
date | date | 1.000 | %Y-%m-%d | supportive | – | 97 | 3 |
loct | missing | 1.000 | | uninformative | – | 0 | 100 |
group | categorical | 0.978 | 1-7 | qualitative | continuous (2.1%) | 99 | 1 |
crp | continuous | 1.000 | floating | quantitative | – | 96 | 4 |
bnp | continuous | 0.997 | floating | quantitative | – | 95 | 5 |
comms | text | 0.952 | | supportive | categorical (4.3%) | 10 | 90 |
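
One possible way to use such a result in quality control is to flag variables whose estimated type is uncertain, so they can be reviewed manually. The sketch below rebuilds a few rows of the table above by hand; the cut-off of 0.95 is an arbitrary choice for illustration, and treating `probability` as a numeric column is an assumption.

```r
# Rows copied from the result table above; column types are assumed.
tab_example <- data.frame(
  variable    = c("id", "byear", "decades", "group", "comms"),
  type        = c("ID", "date", "categorical", "categorical", "text"),
  probability = c(0.906, 0.974, 0.565, 0.978, 0.952),
  stringsAsFactors = FALSE
)

# Arbitrary cut-off chosen for illustration.
threshold <- 0.95
review <- tab_example[as.numeric(tab_example$probability) < threshold, ]
review$variable
# "id" "decades"
```

For `decades`, the listed alternative (date, 39.8%) shows that the age-group labels are easily confused with a date format.
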
A very small sample size can reduce the prediction performance significantly. With only 10 observations, the `id` variable is now detected as a continuous integer variable, `age` as categorical, and `decades` as a date variable.
variable | type | probability | format | class | alternative | n | missings |
---|---|---|---|---|---|---|---|
id | continuous | 0.517 | integer | quantitative | date (16.6%) | 10 | 0 |
visit | constant | 0.962 | 1 | uninformative | continuous (3.8%) | 9 | 1 |
sex | binary | 0.754 | men/women | qualitative | text (21.8%) | 9 | 1 |
age | categorical | 0.441 | 32-73 | qualitative | continuous (42.3%) | 9 | 1 |
byear | date | 0.444 | ? | supportive | continuous (32.4%) | 10 | 0 |
decades | date | 0.493 | ? | supportive | categorical (40.9%) | 9 | 1 |
bmi | continuous | 0.595 | integer | quantitative | categorical (27.8%) | 9 | 1 |
med | binary | 0.965 | 0/1 | qualitative | continuous (3.5%) | 9 | 1 |
date | date | 0.999 | %Y-%m-%d | supportive | – | 9 | 1 |
loct | missing | 1.000 | | uninformative | – | 0 | 10 |
group | categorical | 0.909 | 1-4 | qualitative | continuous (7.2%) | 9 | 1 |
crp | continuous | 0.640 | floating | quantitative | text (17%) | 9 | 1 |
bnp | continuous | 0.479 | floating | quantitative | text (31.3%) | 10 | 0 |
comms | text | 0.999 | | supportive | – | 4 | 6 |
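
The effect of the reduced sample can be checked by comparing the estimated types from the two result tables. The sketch below copies the `type` columns from both tables by hand and lists the variables whose type changed; the object names are chosen for this illustration only.

```r
# Types copied from the two result tables above.
vars <- c("id", "visit", "sex", "age", "byear", "decades", "bmi",
          "med", "date", "loct", "group", "crp", "bnp", "comms")
type_full <- c("ID", "constant", "binary", "continuous", "date",
               "categorical", "continuous", "binary", "date", "missing",
               "categorical", "continuous", "continuous", "text")
type_small <- c("continuous", "constant", "binary", "categorical", "date",
                "date", "continuous", "binary", "date", "missing",
                "categorical", "continuous", "continuous", "text")

comparison <- data.frame(variable = vars, full = type_full,
                         small = type_small, stringsAsFactors = FALSE)

# Keep only the variables whose estimated type differs.
changed <- comparison[comparison$full != comparison$small, ]
changed$variable
# "id" "age" "decades"
```

Even where the type is unchanged, the probabilities in the second table are often much lower, for example 0.444 instead of 0.974 for `byear`.
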
Results for R's built-in `mtcars` data set (32 observations, no missing values):

variable | type | probability | format | class | alternative | n | missings |
---|---|---|---|---|---|---|---|
mpg | continuous | 1.000 | floating | quantitative | – | 32 | 0 |
cyl | categorical | 0.966 | 4-8 | qualitative | continuous (3%) | 32 | 0 |
disp | continuous | 0.999 | floating | quantitative | – | 32 | 0 |
hp | continuous | 1.000 | integer | quantitative | – | 32 | 0 |
drat | continuous | 1.000 | floating | quantitative | – | 32 | 0 |
wt | continuous | 0.998 | floating | quantitative | – | 32 | 0 |
qsec | continuous | 0.999 | floating | quantitative | – | 32 | 0 |
vs | binary | 0.974 | 0/1 | qualitative | continuous (2.6%) | 32 | 0 |
am | binary | 0.974 | 0/1 | qualitative | continuous (2.6%) | 32 | 0 |
gear | categorical | 0.966 | 3-5 | qualitative | continuous (3%) | 32 | 0 |
carb | categorical | 0.967 | 1-8 | qualitative | continuous (3.1%) | 32 | 0 |
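
Estimated types like these could be used to prepare a data set for analysis, for instance by converting qualitative variables to factors. The following sketch assumes the table above describes R's built-in `mtcars` data (the column names and the 32 observations match) and copies the `type` column by hand; it is one possible follow-up step, not a feature of the package.

```r
# Types copied from the result table above.
types <- c(mpg = "continuous", cyl = "categorical", disp = "continuous",
           hp = "continuous", drat = "continuous", wt = "continuous",
           qsec = "continuous", vs = "binary", am = "binary",
           gear = "categorical", carb = "categorical")

# Work on a copy so the built-in data set stays untouched.
cars <- mtcars
qualitative <- names(types)[types %in% c("categorical", "binary")]
cars[qualitative] <- lapply(cars[qualitative], factor)

# Inspect the resulting column classes.
sapply(cars, class)
```

After the conversion, `cyl`, `vs`, `am`, `gear`, and `carb` are factors, while the remaining columns stay numeric.
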