---
title: "Harmonizing Product Codes with R"
author:
- Christoph Baumgartner^[Faculty of Economics and Statistics, University of Innsbruck, Christoph.Baumgartner@uibk.ac.at]
- Stjepan Srhoj^[Faculty of Economics, Business and Tourism, University of Split, Stjepan.Srhoj@efst.hr]
- Janette Walde^[Faculty of Economics and Statistics, University of Innsbruck, Janette.Walde@uibk.ac.at]
package: harmonizer
output: rmarkdown::html_vignette
abstract: Innovation is a major engine of economic growth. To compare products over time, harmonization of product codes is mandatory. This package provides an _easy-to-use_ approach to harmonize product codes. Moreover, it offers an application that allows finding all new and dropped products for given firm-level data based on harmonized product codes.
vignette: >
%\VignetteIndexEntry{Harmonizing Product Codes with R}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = ""
)
```
```{r, include=FALSE}
# get the file path
file_path <- system.file("extdata", package = "harmonizer")
# get some examples
example_cn8 <- readRDS(paste0(file_path, "/CN8/CN8_2000.rds"))
example_pc8 <- readRDS(paste0(file_path, "/PC8/PC8_2010.rds"))
example_pc8_changes <- readRDS(paste0(file_path, "/PC8/PC8_2010_2011.rds"))
example_pc8_cn8_concordance <- readRDS(paste0(file_path, "/PC8/PC8_CN8_2010.rds"))
```
This package provides several functions to harmonize _CN8_ product codes ([Combined Nomenclature 8 digits](https://taxation-customs.ec.europa.eu/customs-4/calculation-customs-duties/customs-tariff/combined-nomenclature_en)) as well as _PC8_ product codes ([Production Communautaire 8 digits](https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Industrial_production_statistics_introduced_-_PRODCOM)), _HS6_ ([Harmonized System 6 digits](https://taxation-customs.ec.europa.eu/customs-4/calculation-customs-duties/customs-tariff/harmonized-system-general-information_en)) and _BEC_ ([Broad Economic Categories](https://unstats.un.org/unsd/trade/classifications/bec.asp)). All functions are listed below:
## Overview
- [Idea behind harmonization](#ididea)
- [Main Functions](#idmainfunctions)
- [harmonize_cn8()](#idharmonize_cn8)
- [harmonize_pc8()](#idharmonize_pc8)
- [Support Functions](#idsupportfunctions)
- [history_cn8()](#idhistory_cn8)
- [history_pc8()](#idhistory_pc8)
- [cn8_to_bec()](#idcn8_to_bec)
- [pc8_to_bec()](#idpc8_to_bec)
- [get_data_directory()](#idget_data_directory)
- [Additional Functions](#idadditionalfunctions)
- [utilize_cn8()](#idutilize_cn8)
- [utilize_pc8()](#idutilize_pc8)
- [Data Sets](#iddatasets)
- [CN8 data](#idCN8data)
- [PC8 data](#idPC8data)
- [HS6 data](#idHS6data)
- [BEC data](#idBECdata)
- [Custom Data](#idcustomdata)
## Idea behind harmonization
The basic idea behind the harmonization is to keep track of every single code over a given time period. In the simplest case, a code does not change during the examined period, i.e. no harmonization is needed. In any other case, all changes associated with a specific code need to be documented. There are different kinds of changes: 1 to 1, 1 to many, many to 1, many to many, 1 to none, and none to 1. The last two simply mean that a code was dropped or that a new code was created, respectively. These two kinds of changes are only possible in the PC8 classification, not in the CN8 classification. 1 to many and many to 1 changes occur when a code is split into several codes or when several codes are merged into one. A 'mixture' of changes is also possible, e.g. a code can merge with another code and keep its notation at the same time.
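The kinds of changes can be told apart by counting successors and predecessors in a concordance list. The following is a minimal sketch in base R, not the package's internal code; the data frame `changes` and its codes are made up for illustration, and the classification works row by row.

```r
# Hypothetical concordance between two consecutive years (not package data).
changes <- data.frame(
  obsolete = c("11111111", "22222222", "22222222", "33333333", "44444444"),
  new      = c("11111112", "22222223", "22222224", "55555555", "55555555"),
  stringsAsFactors = FALSE
)

# Number of successors of each old code and of predecessors of each new code.
n_successors   <- as.integer(table(changes$obsolete)[changes$obsolete])
n_predecessors <- as.integer(table(changes$new)[changes$new])

# Row-wise classification of the change.
changes$type <- ifelse(
  n_successors == 1 & n_predecessors == 1, "1 to 1",
  ifelse(n_successors > 1 & n_predecessors == 1, "1 to many",
    ifelse(n_successors == 1 & n_predecessors > 1, "many to 1", "many to many"))
)

print(changes)
```

Here the first row is a simple 1 to 1 change, the code "22222222" is split (1 to many), and the codes "33333333" and "44444444" are merged (many to 1).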
More technically, an _(n + m)_ x _k_ matrix, referred to below as the 'history matrix', is designed to store all the information, where _n_ is the number of observations, _m_ is the number of rows added due to changes, and _k_ is the number of years. Every row of this matrix represents the history of a particular code. For every code that has multiple replacements in one year, a new row has to be added to the matrix. This is necessary because the changed codes have to be tracked as well. History matrices are built for both the CN8 and the PC8 classification and serve as input for the final harmonization.
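To illustrate how rows are added, the sketch below extends a tiny history matrix by one year. It is an assumption-laden toy example in base R (the codes "A", "B", "A1", "A2" and the use of `merge()` are illustrative only) and not the implementation used by the package.

```r
# Hypothetical starting year with n = 2 codes.
history <- data.frame(CN8_2000 = c("A", "B"), stringsAsFactors = FALSE)
n <- nrow(history)

# Hypothetical change list: "A" is split into "A1" and "A2" in 2001.
changes <- data.frame(
  obsolete = c("A", "A"),
  new      = c("A1", "A2"),
  stringsAsFactors = FALSE
)

# A left join duplicates the row of every code with several replacements.
history <- merge(history, changes,
                 by.x = "CN8_2000", by.y = "obsolete", all.x = TRUE)

# Codes without an entry in the change list keep their notation.
history$CN8_2001 <- ifelse(is.na(history$new), history$CN8_2000, history$new)
history$new <- NULL

m <- nrow(history) - n  # rows added due to changes
k <- ncol(history)      # number of years
print(history)
```

In this toy case the result has _n + m = 3_ rows and _k = 2_ columns: the split of "A" adds one row.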
In the further process, the history matrices are extended by new variables, such as additional classification systems (HS6, BEC), binary indicator variables and the harmonized code itself.
The goal of the harmonization is to make codes comparable over the given time period. Therefore, a baseline is created, defined in the last year of the period. This information is stored in the variable _CN8plus_ (or _PC8plus_, respectively). It is derived in several steps. Firstly, one has to check whether a code did not change, i.e. the code is the same in every year between the first and the last year of interest and no additional change occurred. If this is the case, the code needs no further harmonization and is used as _CN8plus_ directly. Secondly, all connections among codes need to be documented. If, for example, a code was split into or merged with several codes, these codes are grouped together from then on (referred to below as a _family_). If a mixture of changes occurred, e.g. a code was split and kept its notation in the same year, this is recorded in the variable _flag_. Codes with _flag_ equal to 1 are nevertheless part of a family. Taking all this information into account yields _CN8plus_, which therefore contains all codes that did not change and all families. Furthermore, the variable _flag_ can be 2 or 3. A _flag_ of 2 means that the code is either new or was dropped during the period of interest. A _flag_ of 3 means that the code had at least one simple change but is not associated with a family.
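Grouping connected codes can be thought of as finding connected components in the graph of code changes. The sketch below shows one simple way to do this in base R by repeatedly propagating the smallest group label along each link. The link table and its codes are hypothetical, and the package may derive families differently.

```r
# Hypothetical links between obsolete and new codes over the whole period.
links <- data.frame(
  obsolete = c("A", "A", "C", "D", "E"),
  new      = c("A1", "A2", "F", "F", "E1"),
  stringsAsFactors = FALSE
)

# Start with one group label per code.
codes  <- unique(c(links$obsolete, links$new))
family <- setNames(seq_along(codes), codes)

# Propagate the smallest label along every link until nothing changes.
repeat {
  changed <- FALSE
  for (i in seq_len(nrow(links))) {
    ends   <- c(links$obsolete[i], links$new[i])
    lowest <- min(family[ends])
    if (any(family[ends] != lowest)) {
      family[ends] <- lowest
      changed <- TRUE
    }
  }
  if (!changed) break
}

# Codes that share a label are connected.
groups <- split(names(family), family)
print(groups)
```

The split of "A" and the merge of "C" and "D" each form one group. The group of "E" and "E1" consists of a single 1 to 1 link, which corresponds to a simple change rather than a family in the terminology above.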
The additional classification systems are based on _CN8plus_ (or _PC8plus_, respectively). Since CN8 and HS6 are closely connected (an HS6 code consists of the first six digits of a CN8 code), this transformation is mostly straightforward, and its result is stored in the variable _HS6_. However, the HS6 classification itself changes over time, which is documented in separate change lists. Taking these lists into account yields the variable _HS6plus_. For PC8 codes, the procedure is more involved: first, all PC8 codes are translated into CN8 codes (if possible), and afterwards the same procedure as for CN8 codes is applied. Note that not all PC8 codes have a corresponding CN8 code, so some codes may be lost.
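Because an HS6 code consists of the first six digits of a CN8 code, the basic step can be sketched with `substr()`. The codes below are illustrative values stored as character strings, so that leading zeros are preserved; the adjustment for HS6 change lists is not shown.

```r
# Illustrative CN8 codes, stored as character strings.
cn8 <- c("01012100", "84713000", "22042111")

# The HS6 code is the first six digits of the CN8 code.
hs6 <- substr(cn8, 1, 6)
print(data.frame(CN8 = cn8, HS6 = hs6))
```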
The BEC classification system is also based on _CN8plus_. Concordance lists between CN8 and BEC are used to derive this classification. This information is stored in several ways: higher (one digit) and lower (up to 3 digits) aggregation as well as the basic class of the good. If _CN8plus_ contains a family, neither an HS6 nor a BEC classification can be assigned, because a family can include several different codes, which in turn may relate to several different HS6 and BEC codes.
The resulting matrix, called the _harmonization matrix_ from now on, is therefore an _(n + m)_ x _(k + l)_ matrix, where the additional parameter _l_ is the number of added columns.
Concordance files between HS6 and BEC Rev. 4 exist for 1996, 2002, and 2007. For 2012 and 2017, there exists a concordance between HS6 and BEC Rev. 5. Therefore, we provide BEC codes from Rev. 4 until 2011 and BEC codes from Rev. 5 thereafter. Moreover, BEC codes can be classified into three classes defined by the [System of National Accounts](https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Building_the_System_of_National_Accounts_-_basic_concepts&oldid=452465) (SNA), which focus on the end-use of the product.
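As a rough illustration of this rule, the sketch below maps a year to a concordance year and a BEC revision. It assumes that the most recent concordance at or before a given year applies, which is an assumption made for this example and not taken from the package; years before 1996 are not handled.

```r
# Years for which HS6-to-BEC concordance files exist (see text above).
concordance_years <- c(1996, 2002, 2007, 2012, 2017)

# Illustrative years of interest (all >= 1996).
year <- c(1999, 2011, 2012, 2020)

# Assumption: use the latest concordance at or before the given year.
used_concordance <- concordance_years[findInterval(year, concordance_years)]

# BEC Rev. 4 until 2011, BEC Rev. 5 thereafter.
bec_revision <- ifelse(year <= 2011, "Rev. 4", "Rev. 5")

print(data.frame(year, used_concordance, bec_revision))
```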
## Main Functions
harmonize_cn8()
provides for a given time period a data frame that contains all _CN8_ product codes and their history, harmonized _CN8plus_ codes, harmonized _HS6plus_ codes, and _BEC_ classification. The "plus-codes" are the main outcome of the function. They provide harmonized information of the product codes, i.e. comparable codes. Every harmonization refers to the last/first year of interest. The following table offers an overview of all provided variables.
| Variable | Explanation|
|:----|:--------------|
|CN8\_xxxx | a specific CN8 code in a given year|
|CN8plus | the harmonization code for CN8, which refers to the last/first year of the time period|
|HS6plus | the harmonization code of HS6, which refers to the last/first year of the time period|
|BEC | provides the BEC classification at a high aggregation level (1 digit) |
|BEC\_agr | provides the BEC classification at a lower aggregation level (up to 3 digits) |
|SNA | provides information if the code is classified as consumption, capital or intermediate good in SNA |
|flag | integer from 0 to 3; 1 indicates that this code remained the same in notation over the whole time period but was split or merged in addition; 2 indicates that this code is either new or was dropped during the period of interest; 3 indicates the code had at least one simple change, but is not associated with a family |
|flagyear | indicates the first year in which the flag was set |
For more application details, see _?harmonize\_cn8_.
harmonize_pc8()
provides for a given time period a data frame that contains all _PC8_ product codes and their history, harmonized _PC8plus_ codes, harmonized _HS6plus_ codes, and _BEC_ classification. The "plus-codes" are the main outcome of the function. They provide harmonized information of the product codes, i.e. comparable codes. Every harmonization refers to the last/first year of interest. The following table offers an overview of all provided variables.
| Variable | Explanation|
|:----|:--------------|
|PC8\_xxxx | a specific PC8 code in a given year|
|PC8plus | the harmonization code for PC8, which refers to the last/first year of the time period|
|HS6plus | the harmonization code of HS6, which refers to the last/first year of the time period|
|BEC | provides the BEC classification at a high aggregation level (1 digit) |
|BEC\_agr | provides the BEC classification at a lower aggregation level (up to 3 digits) |
|SNA | provides information if the code is classified as consumption, capital or intermediate good in SNA |
|flag | integer from 0 to 3; 1 indicates that this code remained the same in notation over the whole time period but was split or merged in addition; 2 indicates that this code is either new or was dropped during the period of interest; 3 indicates the code had at least one simple change, but is not associated with a family |
|flagyear | indicates the first year in which the flag was set |
For more application details, see _?harmonize\_pc8_.
## Support Functions
All support functions are used within the main functions. They provide intermediate steps to harmonize the data. However, they can be used as stand-alone functions as well.
history_cn8()
provides a data frame that contains all _CN8_ product codes and their history over time for the requested time period. This dataset is the basis for the main function `harmonize_cn8()` and can also be obtained from it. The following table offers an overview of all provided variables.
| Variable | Explanation|
|:----|:--------------|
|CN8\_xxxx | a specific CN8 code in a given year|
|flag | integer from 0 to 2; 1 indicates that this code remained the same in notation over the whole time period but was split or merged in addition; 2 indicates that this code is either new or was dropped during the period of interest |
|flagyear | indicates the first year in which the flag was set |
For more application details, see _?history\_cn8_.
history_pc8()
provides a data frame that contains all _PC8_ product codes and their history over time for the requested time period. This dataset is the basis for the main function `harmonize_pc8()` and can also be obtained from it. The following table offers an overview of all provided variables.
| Variable | Explanation|
|:----|:--------------|
|PC8\_xxxx | a specific PC8 code in a given year|
|flag | integer from 0 to 3; 1 indicates that this code remained the same in notation over the whole time period but was split or merged in addition; 2 indicates that this code is either new or was dropped during the period of interest |
|flagyear | indicates the first year in which the flag was set |
For more application details, see _?history\_pc8_.
cn8_to_bec()
provides a data frame that contains all _CN8_ product codes and related _BEC_ and _HS6_ codes in a given time period. Therefore, this data serves as a connection between _CN8_ and _BEC_ classification and between _CN8_ and _HS6_ classification. It forms the basis of some output of the main function, namely: _BEC_, _BEC\_agr_, _SNA_ and _HS6plus_. The following table offers an overview of all provided variables.
| Variable | Explanation|
|:----|:--------------|
|CN8 | a specific CN8 code|
|HS6 | provides the HS6 classification of the CN8plus code |
|BEC | provides the BEC classification on a high aggregation level (1 digit) |
|BEC\_agr | provides the BEC classification on a lower aggregation level (up to 3 digits) |
For more application details, see _?cn8\_to\_bec_.
pc8_to_bec()
provides a data frame that contains all _PC8_ product codes and related _BEC_ and _HS6_ codes in a given time period. Therefore, this data serves as a connection between _PC8_ and _BEC_ classification and between _PC8_ and _HS6_ classification. It forms the basis of some output of the main function, namely: _BEC_, _BEC\_agr_, _SNA_ and _HS6plus_. The following table offers an overview of all provided variables.
| Variable | Explanation|
|:----|:--------------|
|PC8\_xxxx | a specific PC8 code|
|HS6 | provides the HS6 classification of the PC8plus code |
|BEC | provides the BEC classification on a high aggregation level (1 digit) |
|BEC\_agr | provides the BEC classification on a lower aggregation level (up to 3 digits) |
For more application details, see _?pc8\_to\_bec_.
get_data_directory()
provides the directory where custom data must be stored and where the data in use (e.g. concordance lists, lists of codes) can be edited. However, before editing this data or adding further concordance lists, it is highly recommended to read the instructions in this vignette carefully (see also the sections [Data Sets](#iddatasets) and [Custom Data](#idcustomdata)). The directory is printed in the R console. Further features (such as opening a file explorer or printing the available data in the console) are only available if the directory path does not contain any blanks.
For more application details, see _?get\_data\_directory_.
## Additional Functions
These functions go beyond the primary purpose of this package. They provide an application of the data frames obtained by the main functions. To use them, firm-level data is required, which is not provided by the package. The firm-level data must contain columns with the following names: _ID_, _year_ and _CN8_ or _PC8_. Other columns may exist; however, they will not be used by the function. The following table summarizes the variables that need to be included in the firm-level data.
| Variable | Explanation|
|:----|:--------------|
|ID | specific code that describes a firm over the years (this code does not change over time)|
|year | year in which the firm produced a product |
|CN8 | CN8 code of firm product|
|PC8 | PC8 code of firm product|
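A minimal sketch of firm-level data with the required column names is shown below, together with a basic sanity check. The firms, years and codes are invented for illustration, and the check is not part of the package.

```r
# Hypothetical firm-level data with the required columns ID, year and CN8.
firm_data <- data.frame(
  ID   = c("f01", "f01", "f01", "f02"),
  year = c(2008L, 2008L, 2009L, 2008L),
  CN8  = c("84713000", "01012100", "84713000", "22042111"),
  stringsAsFactors = FALSE
)

# Check that all required columns exist and that codes have eight digits.
required     <- c("ID", "year", "CN8")
missing_cols <- setdiff(required, names(firm_data))
codes_ok     <- all(nchar(firm_data$CN8) == 8)

stopifnot(length(missing_cols) == 0, codes_ok)
```

Storing the product codes as character strings keeps leading zeros intact.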
utilize_cn8()
may provide two data frames:
- A data frame that contains all changed _CN8_ product codes per firm per year. In more detail, this means how many products remained the same, were added, were dropped, how many products were produced by a certain firm in a given year, and how many products were produced in the year after. The computation can be based on either CN8plus or HS6plus codes.
- A data frame that is based on the entered firm data. The entered firm data is extended by harmonized data (that is _CN8plus_, _flag_, _flagyear_, _HS6plus_, _BEC_, _BEC\_agr_, _SNA_).
The tables at the end of this section offer an overview of all provided variables.
utilize_pc8()
may provide two data frames:
- A data frame that contains all changed _PC8_ product codes per firm per year. In more detail, this means how many products remained the same, were added or were dropped, the value of the same, added and dropped products, how many products were produced by a certain firm in a given year, and how many products were produced in the year after. The computation can be based on either PC8plus or HS6plus codes.
- A data frame that is based on the entered firm data. The entered firm data is extended by harmonized data (that is _PC8plus_, _flag_, _flagyear_, _HS6plus_, _BEC_, _BEC\_agr_, _SNA_).
The tables at the end of this section offer an overview of all provided variables.
Since the provided data frames do not differ between `utilize_cn8()` and `utilize_pc8()`, in terms of notation, the tables are only provided once here.
_Table that summarizes the first data frame described above (product changes per firm and year):_
| Variable | Explanation|
|:----|:--------------|
|firmID | specific code that describes a firm over the years (this code does not change over time)|
|period\_UL | lower limit of the time period |
|period | time period in which the product was produced |
|gap | indicating if the time period is greater than one (i.e. upper limit - lower limit > 1) |
|same\_products | number of products that were produced in both years (i.e. remained in the product portfolio of this firm) |
|value\_same\_products | value of products that were produced in both years (i.e. remained in the product portfolio of this firm); the value is calculated in the upper limit of the time period |
|new\_products | number of added products in the upper limit of the time period (i.e. added to the product portfolio of this firm) |
|value\_new\_products | value of added products in the upper limit of the time period (i.e. added to the product portfolio of this firm) |
|dropped\_products | number of dropped products in the upper limit of the time period (i.e. removed from the product portfolio of this firm) |
|value\_dropped\_products | value of dropped products in the upper limit of the time period (i.e. removed from the product portfolio of this firm); the value is calculated in the lower limit of the time period |
|nbr\_of\_products\_period\_LL | number of all products produced in the lower limit of the time period (i.e. entire product portfolio of this firm) |
|nbr\_of\_products\_period\_UL | number of all products produced in the upper limit of the time period (i.e. entire product portfolio of this firm) |
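The counts in this table can be understood as set operations on the harmonized codes of two years. The sketch below shows the idea for a single hypothetical firm and period in base R; the data and the placeholder codes "A", "B", "C" are made up, and the package functions may compute these numbers differently.

```r
# Hypothetical harmonized portfolio of one firm in two consecutive years.
portfolio <- data.frame(
  ID      = "f01",
  year    = c(2008, 2008, 2009, 2009),
  CN8plus = c("A", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Products in the lower and the upper limit of the period.
lower <- unique(portfolio$CN8plus[portfolio$year == 2008])
upper <- unique(portfolio$CN8plus[portfolio$year == 2009])

counts <- c(
  same_products             = length(intersect(lower, upper)),
  new_products              = length(setdiff(upper, lower)),
  dropped_products          = length(setdiff(lower, upper)),
  nbr_of_products_period_LL = length(lower),
  nbr_of_products_period_UL = length(upper)
)
print(counts)
```

In this example, "B" stays in the portfolio, "C" is added and "A" is dropped.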
_Table that summarizes the second data frame described above (firm data extended by harmonized data):_
| Variable | Explanation|
|:----|:--------------|
|firmID | specific code that describes a firm over the years (this code does not change over time, provided by user)|
|year | year in which the firm produced a product (provided by user)|
|CN8 | CN8 code of firm product (provided by user)|
|PC8 | PC8 code of firm product (provided by user)|
|(value) | value of the corresponding product code (may be provided by user)|
| ... | additional columns from original firm data (provided by user)|
|CN8plus | final harmonization, which refers to the last year of the time period|
|PC8plus | final harmonization, which refers to the last year of the time period|
|flag | integer from 0 to 3; 1 indicates that this code remained the same in notation over the whole time period but was split or merged in addition; 2 indicates that this code is either new or was dropped during the period of interest; 3 indicates the code had at least one simple change, but is not associated with a family |
|HS6 | provides the HS6 classification of the PC8plus / CN8plus code |
|HS6plus | also adjusts for the change lists of HS6 |
|BEC | provides the BEC classification on a high aggregated level (1 digit) |
|BEC\_agr | provides the BEC classification on a less aggregated level (up to 3 digits) |
|SNA | provides information if the code is classified as consumption, capital or intermediate good in SNA |
## Data Sets
By default, the package provides several data sets for CN8-, PC8-, HS6- and BEC-classification. This data allows for harmonization of CN8 product codes between 1995 and 2022 and PC8 product codes between 2007 and 2017. All available data sets are stored within the package. The function `get_data_directory()` provides support to access the data more easily.
All data included in the package was downloaded from [EU server Ramon](https://ec.europa.eu/eurostat/ramon/) originally and altered if needed.
Provided data in more detail:
1. __CN8 data__ is provided in the corresponding CN8 folder. This folder contains two different types of files. Firstly, a list of all existing CN8 codes for every year, e.g. for the year 2000, _CN8\_2000.rds_. More technically speaking, these files provide a data frame with one column and _n_ rows, where _n_ is the number of existing CN8 codes in a given year. An example (first six lines) of the year 2000 is the following:
```{r, echo=FALSE}
head(example_cn8)
```
Secondly, the CN8 folder contains a concordance list of all CN8 codes over time, a _.csv_ file, where the separator is a semicolon, i.e. "_;_". A header is necessary. The header names must be the following: _from_, _to_, _obsolete_, _new_. The period between "from" and "to" is always one year and describes when the code changed. The "obsolete" and "new" codes represent the outdated code and the replacement, respectively. The first six lines of the default csv-file look like the following:
```{r comment='', echo=FALSE}
cat(readLines(paste0(file_path, "/CN8/CN8_concordances_1988_2022.csv"), n = 6, encoding = "UTF-8"), sep = '\n')
```
2. __PC8 data__ is provided in the corresponding PC8 folder. This folder contains three different types of files. Firstly, a list of all existing PC8 codes for every year, e.g. for the year 2010, _PC8\_2010.rds_. More technically speaking, these files provide a data frame with one column and _n_ rows, where _n_ is the number of existing PC8 codes in a given year. An example (first six lines) of the year 2010 is the following:
```{r, echo = FALSE}
head(example_pc8)
```
Secondly, a concordance between every year is necessary. These files contain two years in their filenames, with a period of one year in between, e.g. between 2010 and 2011 this results in _PC8\_2010\_2011.rds_. More technically speaking, these files are data frames with two columns, which must be named "new" and "old" and _n_ rows, where _n_ is the number of changes in a given year. An example (first six lines) of the changes between 2010 and 2011 is the following:
```{r, echo = FALSE}
head(example_pc8_changes)
```
Thirdly, the PC8 folder contains concordance lists between the PC8 and CN8 classifications for every year. This data is needed to translate PC8 into BEC. An example for the year 2010 is _PC8\_CN8\_2010.rds_. Technically, each yearly file provides a data frame with two columns, named "PRCCODE" for PC8 codes and "CNCODE" for CN8 codes, and _n_ rows, where _n_ is the number of concordances between specific codes. However, some PC8 codes may have no corresponding CN8 code. In this case, the missing value is filled with _NA_. Some examples from the file for the year 2010 can be found below:
```{r, echo = FALSE}
head(example_pc8_cn8_concordance)
example_pc8_cn8_concordance[2400:2405, ]
```
3. __HS6 data__ is provided in the corresponding HS6 folder. This folder contains only one type of file: correspondence lists that document the changes of HS6 codes over time. These changes took place in several years: 1992, 1996, 2002, 2007, 2012 and 2017. For every period, a separate concordance list is necessary. The data is provided as csv-files, where the separator is a semicolon, i.e. "_;_", and the filenames contain both years. For example, for the period between 1992 and 1996, the file is called _HS\_1996\_to\_HS\_1992.csv_. Headers are included in this file as well. For this specific case, they are "HS 1996" and "HS 1992". For other periods the headers change accordingly. An example (first six lines) of the changes between 1996 and 1992 is the following:
```{r comment='', echo=FALSE}
cat(readLines(paste0(file_path, "/HS6/HS_1996_to_HS_1992.csv"), n = 6, encoding = "UTF-8"), sep = '\n')
```
4. __BEC data__ is provided in the HS6toBEC folder. This folder contains only one type of file: correspondence lists between the HS6 and BEC classifications for the years in which HS6 codes changed (i.e. 2002, 2007, 2012, 2017). For each year, a separate concordance list is necessary. The data is provided as csv-files, where the separator is a semicolon, i.e. "_;_", and the filenames contain the year. For example, for 2002, the file is called _HS2002toBEC.csv_. Headers are included in this file as well, namely "HS" for the HS6 codes and "BEC" for the BEC codes. An example (first six lines) of the concordance in 2002 is the following:
```{r comment='', echo=FALSE}
cat(readLines(paste0(file_path, "/HS6toBEC/HS2002toBEC.csv"), n = 6), sep = '\n')
```
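The semicolon-separated concordance files described above can be read with base R. The sketch below writes a small hypothetical file with the CN8 concordance header (_from_, _to_, _obsolete_, _new_) to a temporary location and reads it back. Reading all columns as character is a choice made here to keep leading zeros; the file content is invented and does not come from the package data.

```r
# Write a small hypothetical concordance file (semicolon-separated).
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("from;to;obsolete;new",
    "2000;2001;01010001;01010002",
    "2000;2001;01010001;01010003"),
  tmp
)

# Read all columns as character so that leading zeros are not lost.
concordance <- read.csv(tmp, sep = ";", colClasses = "character",
                        encoding = "UTF-8")
str(concordance)
```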
### Custom Data
It is possible to use additional concordance lists or to alter the provided data. However, it is highly recommended to read the instructions in this vignette carefully first.
If new data is added, there are some mandatory aspects and some valuable aspects to acknowledge.
Mandatory aspects:
* New data must be stored inside the package. This can be easily done by adding new files in the appropriate subfolder of the package database. `get_data_directory()` may provide help to find the correct folder to store new data.
* Chosen filenames must be analogous to those of already existing files.
* The structure of the new data is crucial. The section [Data Sets](#iddatasets) may provide more details. In short: file-type, header names, column numbers and datatype (numeric, character, ...) are very important.
* All new added _.csv-files_ must be encoded using _UTF-8_.
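Before a new file is stored, its structure can be checked against these requirements. The helper below is a hypothetical sketch for a CN8 concordance list (it is not a function of the package) that tests the header names, the one-year distance between _from_ and _to_, and the length of the codes.

```r
# Hypothetical helper: returns a character vector of detected problems.
check_cn8_concordance <- function(x) {
  problems <- character(0)
  if (!identical(names(x), c("from", "to", "obsolete", "new"))) {
    problems <- c(problems, "unexpected header names")
  } else {
    if (any(as.integer(x$to) - as.integer(x$from) != 1)) {
      problems <- c(problems, "'from' and 'to' must be one year apart")
    }
    if (any(nchar(x$obsolete) != 8 | nchar(x$new) != 8)) {
      problems <- c(problems, "codes must have eight digits")
    }
  }
  problems
}

# Invented example data: the second table has lost a leading zero.
good <- data.frame(from = "2022", to = "2023",
                   obsolete = "01010001", new = "01010002",
                   stringsAsFactors = FALSE)
bad  <- data.frame(from = "2022", to = "2023",
                   obsolete = "1010001", new = "01010002",
                   stringsAsFactors = FALSE)

check_cn8_concordance(good)
check_cn8_concordance(bad)
```

An empty result means that none of the tested problems was found; it does not guarantee that the content of the file is correct.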
Valuable aspects:
* It is highly recommended to download new data from [EU server Ramon](https://ec.europa.eu/eurostat/ramon/) and only alter content-related data if necessary.
* Product codes need to have the correct length, e.g. CN8 codes must be eight digits long. Some programs tend to interpret codes as numeric values and cut off leading zeros, which leads to completely wrong results.
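If codes have already been converted to numbers and lost their leading zeros, they can be padded back to the full length. The sketch below uses `sprintf()` on illustrative values; it assumes that missing digits are always leading zeros, which should be verified for the data at hand.

```r
# Illustrative CN8 codes that were read as numbers (leading zero lost).
codes_numeric <- c(1012100, 84713000)

# Pad with leading zeros to eight digits and store as character.
codes_fixed <- sprintf("%08d", as.integer(codes_numeric))
print(codes_fixed)
```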