---
title: "Part 2: Turning Zoom Downloads into Datasets"
author: "Andrew P. Knight"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Part 2: Turning Zoom Downloads into Datasets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

In the second part of this guide, you will learn how to turn downloadable files from Zoom recordings into datasets that you can analyze in `R` using `zoomGroupStats`. Again, because `zoomGroupStats` relies on [Zoom Cloud recording](https://support.zoom.us/hc/en-us/articles/203741855) features, this guide will focus specifically on the files that you download from the Zoom Cloud. However, the functions in `zoomGroupStats` can also be used on meetings that you record locally. 

This guide will progress from instructions on which files to download from Zoom to how to convert those files into usable datasets. In its most basic form, a research dataset using virtual meetings can be structured as individual meeting participants who are nested within virtual meetings. Individual participants can attend multiple virtual meetings; indeed, individual participants *could* attend multiple virtual meetings at the same time. Individuals can further be members of multiple other organizational units (e.g., groups, teams, departments, organizations). To build a basic, clean dataset, though, what you must know is which individual human being was logged into which virtual meeting.  

## How to Download Files from Zoom

Depending on your research objectives, you will need different output files from Zoom. The components that are currently supported and used in this package are:

1. A *participants* file. This is a .csv file that contains meta-data about your meeting. This is an essential file that should be downloaded for each meeting that you wish to process. To download it:  
    * Log into your Zoom account through a web browser
    * Navigate to the *Reports*
    * Click *Usage Reports* 
    * Scroll to the *Participants* column for your focal meeting
    * Click the linked number of participants
    * Check "export with meeting data" and "show unique users"
    * Click *Export*
1. A *transcript* file. This is a .vtt file that contains Zoom's cloud transcription of the spoken audio during the recorded meeting. 
    * Navigate to the *Recordings* page
    * Click the link indicating the number of items
    * Download the Audio Transcript file
1. A *chat* file. This is a.txt file that contains the record of public chat messages posted during the recorded meeting. Download it from the same page where you accessed the *transcript*. 
1. An *audio* file. This is an .mp4 file that contains the recorded audio during the session. Download it from the same age as the items above.
1. Several different *video* file options. [Zoom's Help Center describes The nature of these different formats](https://support.zoom.us/hc/en-us/articles/360025561091-Recording-layouts). For analyzing facial expressions, one of the most useful is the gallery style video. 

## Naming the Downloaded Files from Zoom 

`zoomGroupStats` is designed to batch proces virtual meetings. Batch processing simply means that you are combining several meetings into a single dataset. Even if you are only analyzing a single meeting, though, it is useful to treat that single meeting as a batch of one. This will help in creating an extensible dataset that you could add to in the future. 

To organize your data for a batch run, you must rename the files in a systematic way and save them in a single directory. To name the files, imagine that each file will have a "prefix" and a "suffix". The prefix will be an identifier for what meeting it came from and the suffix will be an identifier for what data source is in the file. Whereas you can use whatever prefix you choose, the suffix must be standardized as follows: 

| Suffix | Description |
|:---------------|:------------------------------------------------|
| _chat.txt | Used for the chat file for meetings |
| _transcript.vtt | Used for the transcript file for meetings  |
| _participants.csv | Used for the participants file for meetings |
| _video.mp4 | Used for a video file for meetings |

Imagine a sample batch run, then, which includes all four elements for three meetings. The prefixes for the meetings could be "meeting001", "meeting002", and "meeting003". This would yield a total of 12 files, named as follows: 

* meeting001_chat.txt
* meeting001_transcript.vtt
* meeting001_participants.csv
* meeting001_video.mp4
* meeting002_chat.txt
* meeting002_transcript.vtt
* meeting002_participants.csv
* meeting002_video.mp4
* meeting003_chat.txt
* meeting003_transcript.vtt
* meeting003_participants.csv
* meeting003_video.mp4

All 8 files should be saved in a single directory. 

## Prepare a Batch Spreadsheet

Next, you will prepare a spreadsheet in .xlsx format and save it within the same directory where your Zoom downloads are. This spreadsheet tells `zoomGroupStats` how to find your files and what information to process. Practically, this spreadsheet is also helpful for organizing your data and keeping track of what you have downloaded. 

You must use the following structure and column headers for your spreadsheet:

| batchMeetingId | fileRoot | participants | transcript | chat | video | sessionStartTime | recordingStartDateTime |
|:---------------|:---------|:-------------|:-----------|:-----|:------|:--------|:--------|
|   00000000001  |  /myMeetings/meeting001  |    1   |    1  | 1 | 0 | 2020-09-04 15:00:00 | 2020-09-04 15:03:30 |
|   00000000002  |  /myMeetings/meeting002  |    1   |    1  | 1 | 0 | 2020-09-05 15:00:00 | 2020-09-04 15:03:04 |
|   00000000003  |  /myMeetings/meeting003  |    1   |    1  | 1 | 0 | 2020-09-06 15:00:00 | 2020-09-04 15:03:07 |

The first row in the file is the header, which should mirror the example above. Each subsequent row provides the information for a single meeting. [Here is a sample that you could use](https://github.com/andrewpknight/zoomGroupStats/blob/main/inst/extdata/myMeetingsBatch.xlsx), replacing any rows after the header with your own information.

| Column Name | Description |
|:---------------|:------------------------------------------------|
| batchMeetingId | A string identifier for this particular meeting |
| fileRoot | A string that gives the full path and prefix where the files from this meeting can be found. The final part of this string (e.g., meeting001 above) is how you have named the files downloaded for this meeting.  |
| participants | Binary - 0 if you did not download the participants file, 1 if you did |
| transcript | Binary - 0 if you did not download the transcript file, 1 if you did |
| chat | Binary - 0 if you did not download the chat file, 1 if you did |
| video | Binary - 0 if you did not download a video file, 1 if you did |
| sessionStartDateTime | A string giving the timestamp for when the meeting began as YYYY-MM-DD HH:MM:SS |
| recordingStartDateTime | A string giving the timestamp for when the recording of the meeting began as YYYY-MM-DD HH:MM:SS |

## Process your Batch

If you have followed the steps above, turning your Zoom sessions into data should be straightforward using the `zoomGroupStats` package.You should begin by installing the latest version of `zoomGroupStats`. Currently this is best done through the `install_github` function from the `devtools` package: 

```{r, eval=FALSE}
# Install zoomGroupStats from CRAN
install.packages("zoomGroupStats")

# Install the development version of zoomGroupStats
devtools::install_github("andrewpknight/zoomGroupStats")
```

After you have installed the package, you can then load it into your environment as usual: 

```{r setup}
library(zoomGroupStats)
```

The first step in turning downloads into data is to process your batch using the `batchProcessZoomOutput` function. 

```{r, eval=FALSE}
batchOut = batchProcessZoomOutput(batchInput="./myMeetingsBatch.xlsx")
```

```{r error=FALSE, message=FALSE, warning=FALSE, include=FALSE, results='hide'}
batchOut = invisible(batchProcessZoomOutput(batchInput=system.file('extdata', 'myMeetingsBatch.xlsx', package = 'zoomGroupStats')))
```

This function will iterate through the meetings listed in the `batchInput` file that you created above. For each meeting, the function will detect any of the named files described above (except the video file) and do an initial processing of them. The function currently ignores the video file because video processing can take a considerable amount of time. [Processing video data will be covered in Part 4](http://zoomgroupstats.org/articles/part04-analyze-zoom-video-data.html). 

The output, stored in this example in `batchOut`, will be a multi-item list that contains any data that were available according to the batch instructions. Currently, the following items can be included, if raw files are available: 

### batchInfo

This is information about the batch, drawn from the input batch file that you have supplied. It is helpful to have this information stored for later analysis of the different components of a Zoom recording. 

```{r}
str(batchOut$batchInfo)
```

### meetInfo & partInfo

These are based on information extracted from the participants file downloaded from Zoom. If a meeting has a *_participants.csv file, these will be included. They provide meta-data about the full set of virtual meetings. These are useful files for nailing down the structure of your full dataset (in terms of individuals nested within meetings) and for creating a unique individual identifier for the people in your dataset. 

`meetInfo` is a data.frame that gives information pulled about the meetings from the participants file downloaded from Zoom. Each row in this file is a meeting: 

```{r}
str(batchOut$meetInfo)
```

`partInfo` is a data.frame that gives the participants in each meeting and any information available in the participants file downloaded from Zoom

```{r}
str(batchOut$partInfo)
```

### transcript & chat

These two items provide parsed text data from the transcribed audio spoken during the meeting and the text-based chat in the meeting. These files are the basis of any further text-based analysis. One important thing to note in processing transcripts is that Zoom timestamps the transcript file anchored on the start of the recording--not at the start of the session itself. The issue here is that a recording may not begin until a period of time after the launch of the session. This means that you could have a transcript file out of sync with the chat file, which begins its timestamp at the start of the session. Resolving this issue requires careful records of when the recording was started or setting up your session to automatically record. 

`transcript` is a data.frame that is the parsed audio transcript file. Each row represents a single marked "utterance" using Zoom's cloud-based transcription algorithm. Utterances are marked as a function of pauses in speech and/or speaker changes. 

```{r}
str(batchOut$transcript)
```


`chat` is a data.frame that is the parsed text-based chat file. Each row represents a single chat message submitted by a user. Non-ASCII characters will not be correctly rendered in the message text.

```{r}
str(batchOut$chat)
```

### rosetta
The final element is called "rosetta" because it will help you deal with one of the most vexing challenges in analyzing virtual meetings with a large number of participants: The lack of a persistent and unique individual identifier for participants. Because the transcript and chat files rely on people's Zoom display names--and because people can change their display names--you will frequently encounter duplicate names and/or multiple identifiers for the same person. 

From a data structure perspective, what is most important to know is which individual person logged into which virtual meeting. Unfortunately, complications in how Zoom identifies individuals require careful human attention to this issue. To illustrate, consider the following exaggerated example. Arun Jah is an attendee in Meeting A. Arun logs in on his laptop through his corporate account. But, he also logs in on his iPad through his family account to the same meeting. After a few minutes, Arun realizes that his iPad account has his child's name displayed (the child was using it for virtual school). So, he changes his display name. Halfway through the meeting, Arun needs to get in his car to go pick up his child. So, he logs into the meeting on his smartphone. In a raw Zoom dataset, the person "Arun" will show up as four different names--(1) his corporate name that was not changed; (2) his child's name on his iPad; (3) the name he changed his iPad to; and, (4) his mobile phone. Clearly, to properly study human behavior, all of the actions associated with these four names should be attached to "Arun". 

To address this problem, the `rosetta` file compiles every unique display name (by meeting) encountered across the `participants`, `chat`, and `transcript` files. 

```{r}
str(batchOut$rosetta)
```

## Add a Unique Individual Identifier to All Elements

I have found that by exporting the `rosetta` file and manually attaching a unique individual identifier (e.g., number) is a necessary process to ensure that the right data are attached to the right people. In terms of process, here is what I do: 

* Export the `rosetta` file. You can do this in your initial `batchProcessZoomOutput` call as follows: 
```{r, eval=FALSE}
batchOut = batchProcessZoomOutput(batchInput="./myMeetingsBatch.xlsx", exportZoomRosetta="./myMeetings_rosetta_original.xlsx")
```
* Copy the file that you have exported and save it with a new name (e.g., myMeetings_rosetta_edited.xlsx). This will prevent you from losing your manual work if you re-run the command above and overwrite the file. 

* Add a new column to the new file that contains a unique user identifier. Name the column something like 'indivId'. In this column you will enter a persistent individual identifier for individuals in your dataset. 

* Manually review the file, entering the unique identifier that is correct for each Zoom display name. I tend to sort the file by the Zoom display name column first. This increases the chances of catching the same person who shows up in multiple meetings or slightly different names (e.g., Ringo, Ringo Starr). For unique identifier, I recommend either using a master numeric identifier or email address that connects to other data streams that you might have. 

* Import the reviewed and edited rosetta file. The following command will import the file and add your new identifier to each of the elements that are included in `batchOut`. Going forward, you should use the new individual identifier (e.g., `indivId`) in your analyses. Together with the meeting identifier (e.g., `batchMeetingId`), you will be able to link records to other relevant data that you have collected (e.g., survey data). 

```{r, eval=FALSE}
batchOutIds = importZoomRosetta(zoomOutput=batchOut, zoomRosetta="./myEditedRosetta.xlsx", 
meetingId="batchMeetingId")
```

Running the importZoomRosetta command will attach the new unique identifier that you have created to the datasets that you created before for any of the available files in your directory. 

## Next Steps

Following the process above, you should have a single R object with a tremendous amount of information. [Part 3 of this guide will cover how to analyze the conversational data in this R object](http://zoomgroupstats.org/articles/part03-analyze-zoom-conversation-data.html).