---
title: "Using the functions of the ExPanDaR package"
author: "Joachim Gassen"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Using the functions of the ExPanDaR package}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
echo = TRUE,
fig.align = "center",
fig.width = 6,
warnings = FALSE
)
library(ExPanDaR)
library(knitr)
library(kableExtra)
```
While the main purpose of the ExPanDaR package is to provide the infrastructure for the ExPanD app, the auxiliary functions of the package can also be used to support your exploratory data analysis workflow in your own code. All functions are relatively thin wrappers around established R packages for graphics and HTML table presentation (ggplot2, kableExtra, stargazer).
While I developed them to support analysis steps that are common with empirical archival research projects in the area of accounting and finance (which happens to be my field),
I hope that they are generally useful for exploratory data analysis.
To see what ExPanDaR has to offer, let's take a quick tour. For more detailed guidance on how to use a specific function presented below, take a look at the respective function's help page.
## Data Preparation
ExPanDaR is designed for exploratory panel data analysis (hence the name).
Thus, while you can also use some functions on cross-sectional data, for most functions you will need a data frame containing your panel data.
ExPanDaR expects the data to be organized in long format.
This implies that each observation (a row) is identified by cross-sectional and time series identifiers and that variables are organized by columns.
While you can have a vector of variables jointly determining the cross-section, the time-series needs to be identified by a unique variable.
The ExPanDaR functions treat cross-sectional identifiers as factors and expect the time-series identifier to be coercible into an ordered factor.
For this walk-through I will use the data set russell_3000, which comes with the package.
It contains some financial reporting and stock return data of Russell 3000 firms from Google Finance and Yahoo Finance and has been collected using the tidyquant package in the summer of 2017.
A word of caution: While the data appears to be relatively decent quality I would advise against using this data for scientific work without verifying its integrity first.
These are the variables included in the data.
``` {r variables}
kable(data.frame(Variable=russell_3000_data_def$var_name,
Definition=sub('$', '\\$', russell_3000_data_def$var_def, fixed = TRUE)),
row.names = FALSE)
```
You can infer from the variable definition that `coid` seems to identify the cross-section (a Russell 3000 firm) while `period` identifies the time-series (a fiscal year).
In addition, `coname` also sounds like it mighty identify a firm but we cannot be sure whether there are duplicate company names.
In addition, we want to verify that there are no duplicate `coid`/`period` pairs.
Let's check.
``` {r cross-sectional_ids}
cs_ids <- unique(russell_3000[,c("coid", "coname")])
identical(cs_ids$coid, unique(russell_3000$coid))
identical(cs_ids$coname, unique(russell_3000$coname))
```
The first test verifies that there are no two observations that share the same `coid` but a different `coname`.
The second makes sure that there are firms with the same `coname` but a different `coid`. Thus, we can use both, `coname` and `coid`, or either as cross-sectional identifier.
The following test establishes whether in combination `coid` and `period` identify a panel observation.
``` {r duplicates}
any(duplicated(russell_3000[,c("coid", "period")]))
```
This seems to be the case.
As a next step, let's use ExPanDaR's function `prepare_missing_values_graph()` to eyeball how frequently observations are missing in the data set.
```{r missing_obs}
prepare_missing_values_graph(russell_3000, ts_id = "period")
```
OK. This does not look too bad. Only FY2013 seems odd, as some variables are completely missing. Guess why? They are calculated using lagged values of total assets. So, in the following, let's focus on the variables that we care about and on the fiscal years 2014 to 2016 (a short panel, I know). Time to check the descriptive statistics using the `prepare_descriptive_table()` function.
```{r descriptive_statistics_table}
r3 <- droplevels(russell_3000[russell_3000$period > "FY2013",
c("coid", "coname", "period", "sector", "toas",
"nioa", "cfoa", "accoa", "return")])
t <- prepare_descriptive_table(r3)
t$kable_ret %>%
kable_styling("condensed", full_width = F, position = "center")
```
Take a look at the minima and the maxima of some of the variables (e.g., net income over assets (`nioa`)). Normally, it should be around -50 % to + 50%. Our measure has a minimum way below -50 %. One thing that comes very handy when dealing with outliers is a quick way to observe extreme values. `prepare_ext_obs_table()` might be helpful here.
```{r extreme_observations}
t <- prepare_ext_obs_table(na.omit(r3[c("coname", "period", "nioa")]))
t$kable_ret %>%
kable_styling("condensed", full_width = F, position = "center")
```
In a real life research situation, you might want to take a break and check your data as well as the actual financial statements to see what is going on.
In most cases, you will see that the outliers are caused by very small denominators (average total assets in this case). To reduce the effect of these outliers on your analysis, you can winsorize (or truncate) them by using the `treat_outliers()` function.
```{r winsorizing}
r3win <- treat_outliers(r3, percentile = 0.01)
t <- prepare_ext_obs_table(na.omit(r3win[c("coname", "period", "nioa")]))
t$kable_ret %>%
kable_styling("condensed", full_width = F, position = "center")
```
## Descriptive Statistics
This looks better. Let's look at the winsorized descriptive statistics.
```{r descriptive_statistics_table_winsorized}
t <- prepare_descriptive_table(r3win)
t$kable_ret %>%
kable_styling("condensed", full_width = F, position = "center")
```
I am sure that you won't care but I am a big fan of correlation tables. `prepare_correlation_table()` prepares a table reporting Pearson correlations above and Spearman correlations below the diagonal.
```{r correlation_table}
t<- prepare_correlation_table(r3win, bold = 0.01, format="html")
t$kable_ret %>%
kable_styling("condensed", full_width = F, position = "center")
```
In fact, I like correlations so much that especially for samples containing many variables I use `prepare_correlation_graph()` to display a graphic variant based on the corrplot package. See for yourself.
``` {r correlation_graph, fig.width = 4, fig.height= 4}
ret <- prepare_correlation_graph(r3win)
```
## Visuals
Additional visuals are available for exploring time trends. `prepare_trend_graph()` can be used for comparing variables...
```{r time_trend_plot}
graph <- prepare_trend_graph(r3win[c("period", "nioa", "cfoa", "accoa")], "period")
graph$plot
```
... and for eyeballing the distributional properties of a single variable over time you have `prepare_quantile_trend_graph()`.
```{r quantile_plot}
graph <- prepare_quantile_trend_graph(r3win[c("period", "return")], "period", c(0.05, 0.25, 0.5, 0.75, 0.95))
graph$plot
```
Nothing special going on here (not really surprising, given the short time span that the sample covers). Let's see how profitability varies across sectors by using the
`prepare_by_group_trend_graph()` function.
```{r bgtg_plot}
graph <- prepare_by_group_trend_graph(r3win, "period", "sector", "nioa")
graph$plot
```
The health sector is clearly less profitable compared to the others, which can be explained by small growing firms. Finally, `prepare_scatter_plot()` produces the mother of all plots, the scatter plot.
```{r scatter_plot, fig.width = 7, fig.height= 6}
prepare_scatter_plot(r3win, x="nioa", y="return", color="sector", size="toas", loess = 1)
```
Do you see the structural break around nioa == 0? Researchers in the area of accounting tend to like that kind of stuff.
## Regression Tables
Finally, if you happen to be a fan of starred numbers, you can also quickly produce regression tables by using the function `prepare_regression_table()` that calls `lfe::felm()` for OLS and `glm()` for binary logit models. The tables are then constructed by calling `stargazer::stargazer()`, allowing for plain text, html and latex output.
You can construct tables by mixing different models...
```{r regressions}
dvs <- c("return", "return", "return", "return", "return", "return")
idvs <- list(c("nioa"),
c("cfoa"),
c("accoa"),
c("cfoa", "accoa"),
c("nioa", "accoa"),
c("nioa", "accoa"))
feffects <- list("period", "period", "period",
c("coid", "period"), c("coid", "period"), c("coid", "period"))
clusters <- list("", "", "", "coid", "coid", c("coid", "period"))
t <- prepare_regression_table(r3win, dvs, idvs, feffects, clusters)
htmltools::HTML(t$table)
```
... or by applying one model on different sub-samples.
```{r sub-sample_regressions}
t <- prepare_regression_table(r3win, "return", c("nioa", "accoa"), byvar="period")
htmltools::HTML(t$table)
```
## Conclusion
This is all there is (currently).
All these functions are rather simple wrappers around established R functions.
They can easily be modified to fit your needs and taste.
Take look at the [github repository of the ExPanDaR package](https://github.com/joachim-gassen/ExPanDaR) for the code.
Have fun!