Title: | Handwriting Analysis with Random Forests |
---|---|
Description: | Perform forensic handwriting analysis of two scanned handwritten documents. This package implements the statistical method described by Madeline Johnson and Danica Ommen (2021) <doi:10.1002/sam.11566>. Similarity measures and a random forest produce a score-based likelihood ratio that quantifies the strength of the evidence in favor of the documents being written by the same writer or different writers. |
Authors: | Iowa State University of Science and Technology on behalf of its Center for Statistics and Applications in Forensic Evidence [aut, cph, fnd], Stephanie Reinders [aut, cre] |
Maintainer: | Stephanie Reinders <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.2 |
Built: | 2024-12-04 07:00:35 UTC |
Source: | CRAN |
Compares two handwriting samples, scanned and saved as PNG images, with the following steps:
1. processDocument splits the writing in both samples into component shapes, or graphs.
2. get_clusters_batch groups the graphs into clusters of similar shapes.
3. get_cluster_fill_counts counts the number of graphs assigned to each cluster.
4. get_cluster_fill_rates calculates the proportion of graphs assigned to each cluster. The cluster fill rates serve as a writer profile.
5. A similarity score is calculated between the cluster fill rates of the two documents using a random forest trained with ranger.
6. The similarity score is compared to reference distributions of same writer and different writer similarity scores. The result is a score-based likelihood ratio that conveys the strength of the evidence in favor of same writer or different writer. For more details, see Madeline Johnson and Danica Ommen (2021) <doi:10.1002/sam.11566>.
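The final comparison step can be sketched in base R. This is a minimal illustration rather than the package's implementation: the similarity scores are simulated, eval_density and toy_slr are hypothetical helpers, and restricting density() to the interval [0, 1] is a simplifying assumption.

```r
# Simulated similarity scores in [0, 1]; these values are made up for illustration.
set.seed(1)
same_scores <- rbeta(300, 5, 2)  # same-writer pairs tend to score high
diff_scores <- rbeta(300, 2, 5)  # different-writer pairs tend to score low

# Kernel density estimates of the two reference distributions
same_density <- density(same_scores, from = 0, to = 1)
diff_density <- density(diff_scores, from = 0, to = 1)

# Hypothetical helper: evaluate a density estimate at one score
# by linear interpolation between grid points
eval_density <- function(d, score) {
  approx(d$x, d$y, xout = score)$y
}

# Hypothetical helper: same-writer density divided by
# different-writer density at the observed score
toy_slr <- function(score) {
  eval_density(same_density, score) / eval_density(diff_density, score)
}

high <- toy_slr(0.8)  # a high score favors the same-writer explanation
low <- toy_slr(0.2)   # a low score favors the different-writer explanation
```

A ratio above 1 means the score is more typical of same-writer pairs than of different-writer pairs, and a ratio below 1 means the opposite.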
calculate_slr( sample1_path, sample2_path, rforest = random_forest, project_dir = NULL )
sample1_path |
A file path to a handwriting sample saved in PNG file format. |
sample2_path |
A file path to a second handwriting sample saved in PNG file format. |
rforest |
Optional. A random forest trained with ranger. If rforest is not given, the data object random_forest is used. |
project_dir |
Optional. A path to a directory where helper files will be saved. If no project directory is specified, the helper files will be saved to tempdir() and deleted before the function terminates. |
A number
# Compare two samples from the same writer
sample1 <- system.file(file.path("extdata", "w0030_s01_pWOZ_r01.png"), package = "handwriterRF")
sample2 <- system.file(file.path("extdata", "w0030_s01_pWOZ_r02.png"), package = "handwriterRF")
calculate_slr(sample1, sample2)

# Compare samples from two writers
sample1 <- system.file(file.path("extdata", "w0030_s01_pWOZ_r01.png"), package = "handwriterRF")
sample2 <- system.file(file.path("extdata", "w0238_s01_pWOZ_r02.png"), package = "handwriterRF")
calculate_slr(sample1, sample2)
A dataset containing cluster fill counts for 1,200 handwriting samples from the CSAFE Handwriting Database. The documents were split into graphs with process_batch_dir. The graphs were grouped into clusters with get_clusters_batch. The cluster fill counts were calculated with get_cluster_fill_counts.
cfc
A data frame with 1200 rows and 41 variables:
The file name of the handwriting sample. The file name includes the writer ID, the writing session, prompt, and repetition number of the handwriting sample. There are 1,200 handwriting samples.
Writer ID. There are 100 distinct writer IDs. Each writer has 12 documents.
A document code that records the writing session, prompt, and repetition number of the handwriting sample. There are 12 distinct document codes. Each writer has a writing sample for each of the 12 document codes.
The number of graphs in cluster 1
The number of graphs in cluster 2
The number of graphs in cluster 3
The number of graphs in cluster 4
The number of graphs in cluster 5
The number of graphs in cluster 6
The number of graphs in cluster 7
The number of graphs in cluster 8
The number of graphs in cluster 9
The number of graphs in cluster 10
The number of graphs in cluster 11
The number of graphs in cluster 12
The number of graphs in cluster 13
The number of graphs in cluster 14
The number of graphs in cluster 15
The number of graphs in cluster 16
The number of graphs in cluster 17
The number of graphs in cluster 18
The number of graphs in cluster 19
The number of graphs in cluster 20
The number of graphs in cluster 21
The number of graphs in cluster 22
The number of graphs in cluster 23
The number of graphs in cluster 24
The number of graphs in cluster 25
The number of graphs in cluster 26
The number of graphs in cluster 27
The number of graphs in cluster 28
The number of graphs in cluster 29
The number of graphs in cluster 30
The number of graphs in cluster 31
The number of graphs in cluster 32
The number of graphs in cluster 33
The number of graphs in cluster 34
The number of graphs in cluster 35
The number of graphs in cluster 36
The number of graphs in cluster 37
The number of graphs in cluster 38
The number of graphs in cluster 39
The number of graphs in cluster 40
<https://forensicstats.org/handwritingdatabase/>
A dataset containing cluster fill rates for 1,200 handwriting samples from the CSAFE Handwriting Database. The dataset was created by running get_cluster_fill_rates on the cluster fill counts data frame cfc. Cluster fill rates are the proportion of total graphs assigned to each cluster.
cfr
A data frame with 1200 rows and 42 variables:
file name of the handwriting sample
The total number of graphs in the handwriting sample
The proportion of graphs in cluster 1
The proportion of graphs in cluster 2
The proportion of graphs in cluster 3
The proportion of graphs in cluster 4
The proportion of graphs in cluster 5
The proportion of graphs in cluster 6
The proportion of graphs in cluster 7
The proportion of graphs in cluster 8
The proportion of graphs in cluster 9
The proportion of graphs in cluster 10
The proportion of graphs in cluster 11
The proportion of graphs in cluster 12
The proportion of graphs in cluster 13
The proportion of graphs in cluster 14
The proportion of graphs in cluster 15
The proportion of graphs in cluster 16
The proportion of graphs in cluster 17
The proportion of graphs in cluster 18
The proportion of graphs in cluster 19
The proportion of graphs in cluster 20
The proportion of graphs in cluster 21
The proportion of graphs in cluster 22
The proportion of graphs in cluster 23
The proportion of graphs in cluster 24
The proportion of graphs in cluster 25
The proportion of graphs in cluster 26
The proportion of graphs in cluster 27
The proportion of graphs in cluster 28
The proportion of graphs in cluster 29
The proportion of graphs in cluster 30
The proportion of graphs in cluster 31
The proportion of graphs in cluster 32
The proportion of graphs in cluster 33
The proportion of graphs in cluster 34
The proportion of graphs in cluster 35
The proportion of graphs in cluster 36
The proportion of graphs in cluster 37
The proportion of graphs in cluster 38
The proportion of graphs in cluster 39
The proportion of graphs in cluster 40
<https://forensicstats.org/handwritingdatabase/>
Calculate cluster fill rates from a data frame of cluster fill counts created with get_cluster_fill_counts.
get_cluster_fill_rates(df)
df |
A data frame of cluster fill counts created with get_cluster_fill_counts |
A data frame of cluster fill rates.
rates <- get_cluster_fill_rates(df = cfc)
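The conversion from counts to rates can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's code: the toy data frame and its column names (docname, cluster1, cluster2, cluster3, total_graphs) are hypothetical.

```r
# Toy cluster fill counts for two documents and three clusters.
# The column names are assumptions made for this illustration.
counts <- data.frame(
  docname = c("doc_a", "doc_b"),
  cluster1 = c(2, 0),
  cluster2 = c(6, 5),
  cluster3 = c(2, 5),
  stringsAsFactors = FALSE
)

cluster_cols <- c("cluster1", "cluster2", "cluster3")

# Work on the counts as a numeric matrix: one row per document
count_matrix <- as.matrix(counts[, cluster_cols])

# Total number of graphs in each document
total_graphs <- rowSums(count_matrix)

# Divide each row by its total so that every row sums to 1
rate_matrix <- count_matrix / total_graphs

toy_rates <- data.frame(
  docname = counts$docname,
  total_graphs = total_graphs,
  rate_matrix,
  stringsAsFactors = FALSE
)
```

Each row of toy_rates now describes a document by the share of its graphs in each cluster, which makes documents of different lengths comparable.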
Create a training set from a data frame of cluster fill rates from the CSAFE Handwriting Database.
get_csafe_train_set(df, train_prompt_codes)
df |
A data frame of cluster fill rates created with get_cluster_fill_rates |
train_prompt_codes |
A character vector of which prompt(s) to use in the training set. Available prompts are 'pLND', 'pPHR', 'pWOZ', and 'pCMB'. |
A data frame
train <- get_csafe_train_set(df = cfr, train_prompt_codes = 'pCMB')
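Selecting documents by prompt code can be sketched in base R. This sketch assumes file names of the form writer_session_prompt_repetition, as in the package's example files. The document names below are made up, and the real function may apply further selection rules not shown here.

```r
# Made-up document names following the writer_session_prompt_repetition pattern
docnames <- c("w0030_s01_pWOZ_r01", "w0030_s01_pLND_r01", "w0238_s02_pCMB_r02")

# Split each name on underscores and stack the pieces into a matrix
parts <- do.call(rbind, strsplit(docnames, "_"))
colnames(parts) <- c("writer", "session", "prompt", "repetition")

info <- data.frame(docname = docnames, parts, stringsAsFactors = FALSE)

# Keep only the documents written for the requested prompts
train_prompt_codes <- c("pCMB", "pWOZ")
selected <- info[info$prompt %in% train_prompt_codes, ]
```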
Calculate distances between all pairs of cluster fill rates in a data frame using one or more distance measures. The available distance measures are absolute distance, Manhattan distance, Euclidean distance, maximum distance, and cosine distance.
get_distances(df, distance_measures)
df |
A data frame of cluster fill rates created with get_cluster_fill_rates |
distance_measures |
A vector of distance measures. Use 'abs' to calculate the absolute difference, 'man' for the Manhattan distance, 'euc' for the Euclidean distance, 'max' for the maximum absolute distance, and 'cos' for the cosine distance. The vector can be a single distance, or any combination of these five distance measures. |
The absolute distance between two n-length vectors of cluster fill rates, a and b, is a vector of the same length as a and b. It can be calculated as abs(a-b) where subtraction is performed element-wise, then the absolute value of each element is returned. More specifically, element i of the vector is $|a_i - b_i|$ for $i = 1, 2, \ldots, n$.
The Manhattan distance between two n-length vectors of cluster fill rates, a and b, is $\sum_{i=1}^{n} |a_i - b_i|$. In other words, it is the sum of the elements of the absolute distance vector.
The Euclidean distance between two n-length vectors of cluster fill rates, a and b, is $\sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$. In other words, it is the square root of the sum of the squared elements of the absolute distance vector.
The maximum distance between two n-length vectors of cluster fill rates, a and b, is $\max_{1 \le i \le n} |a_i - b_i|$. In other words, it is the largest element of the absolute distance vector.
The cosine distance between two n-length vectors of cluster fill rates, a and b, is
.
A data frame of distances
# calculate maximum and Euclidean distances between the first 3 documents in cfr.
distances <- get_distances(df = cfr[1:3, ], distance_measures = c('max', 'euc'))

# calculate Manhattan distances between all documents in cfr.
distances <- get_distances(df = cfr, distance_measures = c('man'))
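The absolute, Manhattan, Euclidean, and maximum distances described above can be checked by hand in base R for a single pair of vectors. This is a small sketch using made-up fill rates. It does not call the package and it leaves out the cosine distance.

```r
# Two made-up vectors of cluster fill rates for three clusters
a <- c(0.2, 0.6, 0.2)
b <- c(0.0, 0.5, 0.5)

# Absolute distance: a vector of element-wise absolute differences
abs_dist <- abs(a - b)

# Manhattan distance: sum of the absolute differences
man_dist <- sum(abs_dist)

# Euclidean distance: square root of the sum of squared differences
euc_dist <- sqrt(sum(abs_dist^2))

# Maximum distance: largest absolute difference
max_dist <- max(abs_dist)
```

The absolute distance keeps one value per cluster, while the other three measures reduce the pair of vectors to a single number.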
Verbally interpret an SLR value.
interpret_slr(df)
df |
A data frame created by calculate_slr |
A string
df <- data.frame("score" = 5, "slr" = 20)
interpret_slr(df)

df <- data.frame("score" = 0.12, "slr" = 0.5)
interpret_slr(df)

df <- data.frame("score" = 1, "slr" = 1)
interpret_slr(df)

df <- data.frame("score" = 0, "slr" = 0)
interpret_slr(df)
A list that contains a trained random forest created with ranger, the data frame of distances used to train the random forest, and two densities obtained from the random forest.
random_forest
A list with the following components:
The data frame used to train the random forest. The data frame has 600 rows. Each row contains the absolute and Euclidean distances between the cluster fill rates of two handwriting samples. If both handwriting samples are from the same writer, the class is 'same'. If the handwriting samples are from different writers, the class is 'different'. There are 300 'same' distances and 300 'different' distances in the data frame.
A random forest created with ranger with settings: importance = 'permutation', scale.permutation.importance = TRUE, and num.trees = 200.
A similarity score was obtained for each pair of handwriting samples in the training data frame, dists, by calculating the proportion of decision trees that voted for the 'same' class for the pair. The 'same_writer' density was created by applying the density function to the similarity scores for the 300 same writer pairs in dists. Similarly, the 'diff_writer' density was created by applying the density function to the similarity scores for the 300 different writer pairs in dists. The default settings were used with density.
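The similarity score described above is a proportion of tree votes, which can be sketched with a toy vote matrix in base R. The votes below are made up for illustration. With a real ranger forest, per-tree predictions could presumably be obtained through predict with predict.all = TRUE, but that step is not shown here.

```r
# Made-up votes: one row per pair of documents, one column per decision tree
votes <- matrix(
  c("same", "same", "different", "same",
    "different", "different", "same", "different"),
  nrow = 2,
  byrow = TRUE
)

# Similarity score: share of trees that voted 'same' for each pair
scores <- rowMeans(votes == "same")
```

Here the first pair receives a score of 0.75 and the second a score of 0.25, because three of four trees and one of four trees voted 'same', respectively.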
# view the random forest
random_forest$rf

# view the distances data frame
random_forest$dists

# plot the same writer density
plot(random_forest$densities$same_writer)

# plot the different writer density
plot(random_forest$densities$diff_writer)
A cluster template created by handwriter with 40 clusters. This template was created from 120 handwriting samples from the CSAFE Handwriting Database.
templateK40
A list containing the contents of the cluster template.
An integer for the random number generator used to select the starting cluster centers for the K-Means algorithm.
A vector of cluster assignments for each graph used to create the cluster template. The clusters are numbered sequentially 1, 2,...,K.
The final cluster centers produced by the K-Means algorithm.
The number of clusters in the template.
The number of training graphs used to create the template.
A vector that lists the training document from which each graph originated.
A vector that lists the writer of each graph.
The maximum number of iterations for the K-means algorithm.
A vector of the number of graphs that changed clusters on each iteration of the K-means algorithm.
A vector of the outlier cutoff values calculated on each iteration of the K-means algorithm.
The reason the K-means algorithm terminated.
The within cluster distances on the final iteration of the K-means algorithm. More specifically, the distance between each graph and the center of the cluster to which it was assigned on that iteration. The output of make_clustering_template stores the within cluster distances on each iteration, but the previous iterations were removed here to reduce the file size.
A vector of the within-cluster sum of squares on each iteration of the K-means algorithm.
handwriter splits handwriting samples into component shapes called graphs. The graphs are sorted into 40 clusters with a K-Means algorithm.
# view number of clusters
templateK40$K

# view number of iterations
templateK40$iters

# view cluster centers
templateK40$centers
Train a random forest with ranger from a data frame of cluster fill rates.
train_rf( df, ntrees, distance_measures, output_dir = NULL, run_number = 1, downsample = TRUE )
df |
A data frame of cluster fill rates created with get_cluster_fill_rates |
ntrees |
An integer number of decision trees to use |
distance_measures |
A vector of distance measures. Any combination of 'abs', 'euc', 'man', 'max', and 'cos' may be used. |
output_dir |
A path to a directory where the random forest will be saved. |
run_number |
An integer used for both the set.seed function and to distinguish between different runs on the same input data frame. |
downsample |
Whether to downsample the number of different writer distances before training the random forest. If TRUE, the different writer distances will be randomly sampled, resulting in the same number of different writer and same writer pairs. |
A random forest
train <- get_csafe_train_set(df = cfr, train_prompt_codes = 'pCMB')
rforest <- train_rf(
  df = train,
  ntrees = 200,
  distance_measures = c('euc'),
  run_number = 1,
  downsample = TRUE
)
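The effect of downsample = TRUE can be sketched in base R. This is an illustration with made-up distances, not the package's implementation: the data frame, its column names, and the use of set.seed(1) to stand in for run_number are all assumptions.

```r
set.seed(1)  # assumption: plays the role that run_number plays in train_rf

# Made-up distances: far more different-writer pairs than same-writer pairs
toy_dists <- data.frame(
  euc = runif(12),
  class = c(rep("same", 3), rep("different", 9)),
  stringsAsFactors = FALSE
)

same_rows <- toy_dists[toy_dists$class == "same", ]
different_rows <- toy_dists[toy_dists$class == "different", ]

# Randomly keep as many different-writer rows as there are same-writer rows
keep <- sample(nrow(different_rows), size = nrow(same_rows))
balanced <- rbind(same_rows, different_rows[keep, ])
```

Balancing the classes this way keeps the forest from being dominated by the far more numerous different-writer pairs.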