Title: | Handwriting Analysis with Random Forests |
---|---|
Description: | Perform forensic handwriting analysis of two scanned handwritten documents. This package implements the statistical method described by Madeline Johnson and Danica Ommen (2021) <doi:10.1002/sam.11566>. Similarity measures and a random forest produce a score-based likelihood ratio that quantifies the strength of the evidence in favor of the documents being written by the same writer or different writers. |
Authors: | Iowa State University of Science and Technology on behalf of its Center for Statistics and Applications in Forensic Evidence [aut, cph, fnd], Stephanie Reinders [aut, cre] |
Maintainer: | Stephanie Reinders <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.1.1 |
Built: | 2025-01-29 12:53:57 UTC |
Source: | CRAN |
calculate_slr has been superseded in favor of compare_documents(), which offers more functionality.
calculate_slr( sample1_path, sample2_path, rforest = NULL, reference_scores = NULL, project_dir = NULL )
sample1_path |
A file path to a handwriting sample saved in PNG file format. |
sample2_path |
A file path to a second handwriting sample saved in PNG file format. |
rforest |
Optional. A random forest trained with ranger. If no
random forest is specified, |
reference_scores |
Optional. A dataframe of reference similarity
scores. If reference scores is not specified, |
project_dir |
A path to a directory where helper files will be saved. If no project directory is specified, the helper files will be saved to tempdir() and deleted before the function terminates. |
Compares two handwriting samples scanned and saved as PNG images with the following steps:
processDocument splits the writing in both samples into component shapes, or graphs.
get_clusters_batch groups the graphs into clusters of similar shapes.
get_cluster_fill_counts counts the number of graphs assigned to each cluster.
get_cluster_fill_rates calculates the proportion of graphs assigned to each cluster. The cluster fill rates serve as a writer profile.
A similarity score is calculated between the cluster fill rates of the two documents using a random forest trained with ranger.
The similarity score is compared to reference distributions of same writer and different writer similarity scores. The result is a score-based likelihood ratio that conveys the strength of the evidence in favor of same writer or different writer. For more details, see Madeline Johnson and Danica Ommen (2021) doi:10.1002/sam.11566.
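The last step can be sketched in base R. This is a minimal illustration, not the package's implementation: it assumes the reference distributions are estimated with Gaussian kernel density estimates, and the helper names density_at and slr_sketch, as well as the simulated scores, are hypothetical.

```r
# Simulated reference scores (illustrative only, not package data).
set.seed(1)
same_scores <- rbeta(200, 8, 2)   # same writer pairs tend to score high
diff_scores <- rbeta(2000, 2, 8)  # different writer pairs tend to score low

# Hypothetical helper: evaluate a kernel density estimate at one score.
density_at <- function(scores, x) {
  d <- stats::density(scores, from = 0, to = 1, n = 512)
  stats::approx(d$x, d$y, xout = x)$y
}

# Hypothetical helper: ratio of the same writer density to the
# different writer density at the observed similarity score.
slr_sketch <- function(obs_score, same_scores, diff_scores) {
  density_at(same_scores, obs_score) / density_at(diff_scores, obs_score)
}

slr_high <- slr_sketch(0.7, same_scores, diff_scores)
slr_low <- slr_sketch(0.3, same_scores, diff_scores)
```

Under these assumptions, a ratio above 1 favors the same writer proposition and a ratio below 1 favors the different writers proposition.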
A dataframe
# Compare two samples from the same writer
s1 <- system.file(file.path("extdata", "docs", "w0005_s01_pLND_r03.png"),
  package = "handwriterRF"
)
s2 <- system.file(file.path("extdata", "docs", "w0005_s02_pWOZ_r02.png"),
  package = "handwriterRF"
)
calculate_slr(s1, s2)

# Compare samples from two writers
s1 <- system.file(file.path("extdata", "docs", "w0005_s02_pWOZ_r02.png"),
  package = "handwriterRF"
)
s2 <- system.file(file.path("extdata", "docs", "w0238_s01_pWOZ_r02.png"),
  package = "handwriterRF"
)
calculate_slr(s1, s2)
The cfc dataframe contains cluster fill counts for two documents from the CSAFE Handwriting Database: w0238_s01_pWOZ_r02.rds and w0238_s01_pWOZ_r03.rds.
cfc
A dataframe with 2 rows and 15 variables:
The file name of the handwriting sample.
Writer ID.
The name of the handwriting prompt.
The number of graphs in cluster 3.
The number of graphs in cluster 10.
The number of graphs in cluster 12.
The number of graphs in cluster 15.
The number of graphs in cluster 16.
The number of graphs in cluster 17.
The number of graphs in cluster 19.
The number of graphs in cluster 20.
The number of graphs in cluster 23.
The number of graphs in cluster 25.
The number of graphs in cluster 27.
The number of graphs in cluster 29.
The documents were split into graphs with process_batch_dir. The graphs were grouped into clusters with get_clusters_batch and the cluster template templateK40. The number of graphs in each cluster, the cluster fill counts, were counted with get_cluster_fill_counts. The dataframe cfc has a column for each cluster in templateK40 that has at least one graph from w0238_s01_pWOZ_r02.rds or w0238_s01_pWOZ_r03.rds assigned to it. Empty clusters do not have columns in cfc, so cfc only has 12 cluster columns instead of 40.
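As a rough illustration of how cluster fill counts relate to cluster fill rates, the base R sketch below divides each document's counts by its total number of graphs. The counts, the docname values, and the total_graphs column name are made up for the example; this is not the code behind get_cluster_fill_rates.

```r
# Hypothetical cluster fill counts for two documents (made-up numbers,
# not the values stored in cfc).
counts <- data.frame(
  docname = c("doc_a", "doc_b"),
  `3` = c(4, 2),
  `10` = c(6, 3),
  `12` = c(10, 5),
  check.names = FALSE
)

cluster_cols <- setdiff(names(counts), "docname")

# Divide each row by its total so that every document's rates sum to 1.
totals <- rowSums(counts[, cluster_cols])
rates <- counts
rates[, cluster_cols] <- counts[, cluster_cols] / totals
rates$total_graphs <- totals
```

Both made-up documents end up with the same rates even though one has twice as many graphs, which is why rates rather than counts serve as the writer profile.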
https://forensicstats.org/handwritingdatabase/
Compare two handwritten documents to predict whether they were written by the same person. Use either a similarity score or a score-based likelihood ratio as a comparison method.
compare_documents( sample1, sample2, score_only = TRUE, rforest = NULL, project_dir = NULL, reference_scores = NULL )
sample1 |
A filepath to a handwritten document scanned and saved as a PNG file. |
sample2 |
A filepath to a handwritten document scanned and saved as a PNG file. |
score_only |
TRUE returns only the similarity score. FALSE returns the
similarity score and a score-based likelihood ratio for that score,
calculated using |
rforest |
Optional. A random forest created with |
project_dir |
Optional. A folder in which to save helper files and a CSV file with the results. If no project directory is supplied, helper files will be saved to tempdir() > comparison but deleted before the function terminates. In that case, a CSV file with the results will not be saved, but a dataframe of the results will be returned. |
reference_scores |
Optional. A list of same writer and different writer
similarity scores used for reference to calculate a score-based likelihood
ratio. If reference scores are not supplied, |
A dataframe
# Compare two documents from the same writer with a similarity score
s1 <- system.file(file.path("extdata", "docs", "w0005_s01_pLND_r03.png"),
  package = "handwriterRF"
)
s2 <- system.file(file.path("extdata", "docs", "w0005_s02_pWOZ_r02.png"),
  package = "handwriterRF"
)
compare_documents(s1, s2, score_only = TRUE)

# Compare two documents from the same writer with a score-based
# likelihood ratio
s1 <- system.file(file.path("extdata", "docs", "w0005_s01_pLND_r03.png"),
  package = "handwriterRF"
)
s2 <- system.file(file.path("extdata", "docs", "w0005_s02_pWOZ_r02.png"),
  package = "handwriterRF"
)
compare_documents(s1, s2, score_only = FALSE)
Compare the writer profiles from two handwritten documents to predict whether they were written by the same person. Use either a similarity score or a score-based likelihood ratio as a comparison method.
compare_writer_profiles( writer_profiles, score_only = TRUE, rforest = NULL, reference_scores = NULL )
writer_profiles |
A dataframe of writer profiles or cluster fill rates calculated with get_cluster_fill_rates |
score_only |
TRUE returns only the similarity score. FALSE returns the
similarity score and a score-based likelihood ratio for that score,
calculated using |
rforest |
Optional. A random forest created with |
reference_scores |
Optional. A list of same writer and different writer
similarity scores used for reference to calculate a score-based likelihood
ratio. If reference scores are not supplied, |
A dataframe
compare_writer_profiles(test[1:2, ], score_only = TRUE)
compare_writer_profiles(test[1:2, ], score_only = FALSE)
get_cluster_fill_rates is deprecated. Use get_cluster_fill_rates instead.
get_cluster_fill_rates(df)
df |
A dataframe of cluster fill rates created with
|
A dataframe of cluster fill rates.
## Not run:
rates <- get_cluster_fill_rates(df = cfc)
## End(Not run)
Calculate distances between all pairs of cluster fill rates in a data frame using one or more distance measures. The available distance measures are absolute distance, Manhattan distance, Euclidean distance, maximum distance, and cosine distance.
get_distances(df, distance_measures)
df |
A dataframe of cluster fill rates created with
|
distance_measures |
A vector of distance measures. Use 'abs' to calculate the absolute difference, 'man' for the Manhattan distance, 'euc' for the Euclidean distance, 'max' for the maximum absolute distance, and 'cos' for the cosine distance. The vector can be a single distance, or any combination of these five distance measures. |
The absolute distance between two n-length vectors of cluster fill rates, a and b, is a vector of the same length as a and b. It can be calculated as abs(a-b), where subtraction is performed element-wise and then the absolute value of each element is returned. More specifically, element i of the vector is $|a_i - b_i|$ for $i = 1, \dots, n$.
The Manhattan distance between two n-length vectors of cluster fill rates, a and b, is $\sum_{i=1}^n |a_i - b_i|$. In other words, it is the sum of the elements of the absolute distance vector.
The Euclidean distance between two n-length vectors of cluster fill rates, a and b, is $\sqrt{\sum_{i=1}^n (a_i - b_i)^2}$. In other words, it is the square root of the sum of the squared elements of the absolute distance vector.
The maximum distance between two n-length vectors of cluster fill rates, a and b, is $\max_{1 \le i \le n} |a_i - b_i|$. In other words, it is the largest element of the absolute distance vector.
The cosine distance between two n-length vectors of cluster fill rates, a and b, is
.
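The absolute, Manhattan, Euclidean, and maximum distances can be computed directly in base R. The sketch below uses the standard definitions of these distances on two made-up vectors of cluster fill rates; it is not the package's internal code, and it leaves out the cosine distance.

```r
# Two made-up writer profiles over three clusters (each sums to 1).
a <- c(0.2, 0.3, 0.5)
b <- c(0.1, 0.5, 0.4)

# Absolute distance: a vector with one element per cluster.
abs_dist <- abs(a - b)

# Manhattan distance: sum of the absolute distance vector.
man_dist <- sum(abs_dist)

# Euclidean distance: square root of the sum of squared differences.
euc_dist <- sqrt(sum((a - b)^2))

# Maximum distance: largest element of the absolute distance vector.
max_dist <- max(abs_dist)
```

The absolute distance contributes one value per cluster, while the other three measures each reduce a pair of profiles to a single number.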
A dataframe of distances
rates <- test[1:3, ]
# calculate maximum and Euclidean distances between the first 3 documents in test.
distances <- get_distances(df = rates, distance_measures = c("max", "euc"))

# calculate Manhattan distances between all documents in test.
distances <- get_distances(df = test, distance_measures = c("man"))
Calculate the rates of misleading evidence for score-based likelihood ratios (SLRs) when the ground truth is known.
get_rates_of_misleading_slrs(df, threshold = 1)
df |
A dataframe of SLRs from |
threshold |
A number greater than zero that serves as a decision threshold. If the ground truth for two documents is that they came from the same writer and the SLR is less than the decision threshold, this is misleading evidence that incorrectly supports the defense (false negative). If the ground truth for two documents is that they came from different writers and the SLR is greater than the decision threshold, this is misleading evidence that incorrectly supports the prosecution (false positive). |
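The two rates described for the threshold argument can be sketched in base R as follows. The SLR values are made up, and the column names match and slr are assumptions for this illustration; the function itself may organize its input and output differently.

```r
# Made-up SLRs with known ground truth (illustrative only).
slrs <- data.frame(
  match = c("same", "same", "same", "same",
            "different", "different", "different", "different", "different"),
  slr = c(12, 0.4, 3, 150, 0.02, 2.5, 0.3, 0.001, 0.9)
)
threshold <- 1

same <- slrs[slrs$match == "same", ]
diff <- slrs[slrs$match == "different", ]

# Same writer pairs whose SLR falls below the threshold
# (misleading evidence that supports the defense).
false_negative_rate <- mean(same$slr < threshold)

# Different writer pairs whose SLR exceeds the threshold
# (misleading evidence that supports the prosecution).
false_positive_rate <- mean(diff$slr > threshold)
```

In this made-up data, one of four same writer pairs and one of five different writer pairs produce misleading evidence.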
A list
comparisons <- compare_writer_profiles(test, score_only = FALSE)
get_rates_of_misleading_slrs(comparisons)
Create reference scores of same writer and different writer scores from a dataframe of cluster fill rates.
get_ref_scores(rforest, df, seed = NULL, downsample_diff_pairs = FALSE)
rforest |
A ranger random forest created with
|
df |
A dataframe of cluster fill rates created with
|
seed |
Optional. An integer to set the seed for the random number generator to make the results reproducible. |
downsample_diff_pairs |
If TRUE, the different writer pairs are down-sampled to equal the number of same writer pairs. If FALSE, all different writer pairs are used. |
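To make the effect of downsample_diff_pairs concrete, the base R sketch below enumerates all document pairs for a small made-up set of writers and down-samples the different writer pairs. The data, column names, and sampling approach are assumptions for illustration, not the package's internal code.

```r
# Made-up writer labels: 5 writers with 4 documents each.
docs <- data.frame(
  docname = paste0("doc", 1:20),
  writer = rep(paste0("w", 1:5), each = 4)
)

# Enumerate every unordered pair of documents.
idx <- utils::combn(nrow(docs), 2)
pairs <- data.frame(
  docname1 = docs$docname[idx[1, ]],
  docname2 = docs$docname[idx[2, ]],
  match = ifelse(docs$writer[idx[1, ]] == docs$writer[idx[2, ]],
                 "same", "different")
)

same_pairs <- pairs[pairs$match == "same", ]
diff_pairs <- pairs[pairs$match == "different", ]

# Down-sample the different writer pairs to the number of same writer pairs.
set.seed(1)
keep <- sample(nrow(diff_pairs), size = nrow(same_pairs))
diff_down <- diff_pairs[keep, ]
```

Here 20 documents give 30 same writer pairs and 160 different writer pairs. The same counting applied to 300 writers with 4 documents each gives 300 * choose(4, 2) = 1,800 same writer pairs and choose(1200, 2) - 1,800 = 717,600 different writer pairs, which is why the imbalance grows quickly with the number of writers.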
A list of scores
get_ref_scores(rforest = random_forest, df = validation)
Verbally interpret an SLR value.
interpret_slr(df)
df |
A dataframe created by |
A string
df <- data.frame("score" = 5, "slr" = 20)
interpret_slr(df)

df <- data.frame("score" = 0.12, "slr" = 0.5)
interpret_slr(df)

df <- data.frame("score" = 1, "slr" = 1)
interpret_slr(df)

df <- data.frame("score" = 0, "slr" = 0)
interpret_slr(df)
Plot same writer and different writers reference similarity scores from a validation set. The similarity scores are greater than or equal to zero and less than or equal to one. The interval from 0 to 1 is split into n_bins. The proportion of scores in each bin is calculated and plotted. Optionally, a vertical dotted line may be plotted at an observed similarity score.
plot_scores(scores, obs_score = NULL, n_bins = 50)
scores |
A dataframe of scores calculated with
|
obs_score |
Optional. A similarity score calculated with
|
n_bins |
The number of bins |
The methods used in this package typically produce many times more different writer scores than same writer scores. For example, ref_scores contains 717,600 different writer scores but only 1,800 same writer scores. Histograms, which show the frequency of scores, don't handle this class imbalance well. Instead, the proportion of scores in each bin is plotted.
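The idea of plotting proportions rather than counts can be sketched in base R. The scores below are simulated and the helper bin_rates is hypothetical; this only illustrates the binning step, not how plot_scores builds its ggplot2 output.

```r
# Simulated, heavily imbalanced scores (illustrative only).
set.seed(42)
same_scores <- runif(50, 0.5, 1)
diff_scores <- runif(5000, 0, 0.6)

# Split the interval from 0 to 1 into n_bins equal-width bins.
n_bins <- 10
breaks <- seq(0, 1, length.out = n_bins + 1)

# Hypothetical helper: proportion of scores that fall in each bin.
bin_rates <- function(scores, breaks) {
  bins <- cut(scores, breaks = breaks, include.lowest = TRUE)
  as.numeric(table(bins)) / length(scores)
}

same_rates <- bin_rates(same_scores, breaks)
diff_rates <- bin_rates(diff_scores, breaks)
```

Because each class is divided by its own number of scores, both sets of proportions sum to 1 and can be compared on the same vertical scale despite the hundredfold difference in sample size.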
A ggplot2 plot of histograms
plot_scores(scores = ref_scores)

plot_scores(scores = ref_scores, n_bins = 70)

# Add a vertical line at 0.1 on the horizontal axis.
plot_scores(scores = ref_scores, obs_score = 0.1)
A list that contains a trained random forest created with ranger and the dataframe of distances used to train the random forest.
random_forest
A list with the following components:
rf: A random forest created with ranger with settings: importance = 'permutation', scale.permutation.importance = TRUE, and num.trees = 200.
distance_measures: A vector of the distance measures used to train the random forest: c('abs', 'euc')
# view the random forest
random_forest$rf

# view the distance measures used to train the random forest
random_forest$distance_measures
A list containing two dataframes. The same_writer dataframe contains similarity scores from same writer pairs. The diff_writer dataframe contains similarity scores from different writer pairs. The similarity scores are calculated from the validation dataframe with the following steps:
The absolute and Euclidean distances are calculated between pairs of writer profiles.
random_forest uses the distances between the pair to predict the class of the pair as same writer or different writer.
The proportion of decision trees that predict same writer is used as the similarity score.
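The last step amounts to a simple vote count. As a minimal sketch, assuming we already have one class prediction per decision tree for a single pair of documents (the votes below are made up), the similarity score is the share of trees that voted for same writer:

```r
# Made-up per-tree predictions for one pair of documents,
# using 200 trees as in random_forest.
tree_votes <- c(rep("same", 150), rep("different", 50))

# Similarity score: proportion of trees that predict same writer.
similarity_score <- mean(tree_votes == "same")
```

A score defined this way always lies between 0 and 1, with higher values indicating that more trees judged the pair to come from the same writer.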
ref_scores
A list with the following components:
A dataframe of 1,800 same writer similarity scores. The columns docname1 and writer1 record the file name and the writer ID of the first handwriting sample. The columns docname2 and writer2 record the file name and writer ID of the second handwriting sample. The match column records the class, which is same, of the pairs of handwriting samples. The similarity scores between the pairs of handwriting samples are in the score column.
A dataframe of 717,600 different writer similarity scores. The columns docname1 and writer1 record the file name and the writer ID of the first handwriting sample. The columns docname2 and writer2 record the file name and writer ID of the second handwriting sample. The match column records the class, which is different, of the pairs of handwriting samples. The similarity scores between the pairs of handwriting samples are in the score column.
summary(ref_scores$same_writer)
summary(ref_scores$diff_writer)

plot_scores(ref_scores)
A cluster template created by handwriter with 40 clusters. This template was created from 100 handwriting samples from the CSAFE Handwriting Database, the CVL Handwriting Database, and the IAM Handwriting Database.
templateK40
A list containing the contents of the cluster template.
A vector of cluster assignments for each graph used to create the cluster template. The clusters are numbered sequentially 1, 2,...,40.
The final cluster centers produced by the K-Means algorithm.
The number of clusters in the template (40).
The number of training graphs used to create the template (32,708).
The within cluster distances, the distance between each graph and the nearest cluster center, on the final iteration of the K-means algorithm.
handwriter splits handwriting samples into component shapes called graphs. The graphs are sorted into 40 clusters with a K-Means algorithm.
handwriter::plot_cluster_centers(templateK40)
Writers from the CSAFE Handwriting Database and the CVL Handwriting Database were randomly assigned to train, validation, and test sets.
test
A dataframe with 332 rows and 43 variables:
The file name of the handwriting sample.
Writer ID. There are 83 distinct writer IDs. Each writer has four documents in the dataframe.
The name of the handwriting prompt.
The total number of graphs in the document.
The proportion of graphs in cluster 1
The proportion of graphs in cluster 2
The proportion of graphs in cluster 3
The proportion of graphs in cluster 4
The proportion of graphs in cluster 5
The proportion of graphs in cluster 6
The proportion of graphs in cluster 7
The proportion of graphs in cluster 8
The proportion of graphs in cluster 9
The proportion of graphs in cluster 10
The proportion of graphs in cluster 11
The proportion of graphs in cluster 12
The proportion of graphs in cluster 13
The proportion of graphs in cluster 14
The proportion of graphs in cluster 15
The proportion of graphs in cluster 16
The proportion of graphs in cluster 17
The proportion of graphs in cluster 18
The proportion of graphs in cluster 19
The proportion of graphs in cluster 20
The proportion of graphs in cluster 21
The proportion of graphs in cluster 22
The proportion of graphs in cluster 23
The proportion of graphs in cluster 24
The proportion of graphs in cluster 25
The proportion of graphs in cluster 26
The proportion of graphs in cluster 27
The proportion of graphs in cluster 28
The proportion of graphs in cluster 29
The proportion of graphs in cluster 30
The proportion of graphs in cluster 31
The proportion of graphs in cluster 32
The proportion of graphs in cluster 33
The proportion of graphs in cluster 34
The proportion of graphs in cluster 35
The proportion of graphs in cluster 36
The proportion of graphs in cluster 37
The proportion of graphs in cluster 38
The proportion of graphs in cluster 39
The proportion of graphs in cluster 40
The test dataframe contains cluster fill rates for 332 handwritten documents from the CSAFE Handwriting Database and the CVL Handwriting Database. The documents are from 83 writers. The CSAFE Handwriting Database has nine repetitions of each prompt. Two London Letter prompts and two Wizard of Oz prompts were randomly selected from each writer. The CVL Handwriting Database does not contain multiple repetitions of prompts, and four English language prompts were randomly selected from each writer.
The documents were split into graphs with process_batch_dir. The graphs were grouped into clusters with get_clusters_batch. The cluster fill counts were calculated with get_cluster_fill_counts. Finally, get_cluster_fill_rates calculated the cluster fill rates.
https://forensicstats.org/handwritingdatabase/, https://cvl.tuwien.ac.at/research/cvl-databases/an-off-line-database-for-writer-retrieval-writer-identification-and-word-spotting/
Writers from the CSAFE Handwriting Database and the CVL Handwriting Database were randomly assigned to train, validation, and test sets.
train
A dataframe with 800 rows and 43 variables:
The file name of the handwriting sample.
Writer ID. There are 200 distinct writer IDs. Each writer has 4 documents in the dataframe.
The name of the handwriting prompt.
The total number of graphs in the document.
The proportion of graphs in cluster 1
The proportion of graphs in cluster 2
The proportion of graphs in cluster 3
The proportion of graphs in cluster 4
The proportion of graphs in cluster 5
The proportion of graphs in cluster 6
The proportion of graphs in cluster 7
The proportion of graphs in cluster 8
The proportion of graphs in cluster 9
The proportion of graphs in cluster 10
The proportion of graphs in cluster 11
The proportion of graphs in cluster 12
The proportion of graphs in cluster 13
The proportion of graphs in cluster 14
The proportion of graphs in cluster 15
The proportion of graphs in cluster 16
The proportion of graphs in cluster 17
The proportion of graphs in cluster 18
The proportion of graphs in cluster 19
The proportion of graphs in cluster 20
The proportion of graphs in cluster 21
The proportion of graphs in cluster 22
The proportion of graphs in cluster 23
The proportion of graphs in cluster 24
The proportion of graphs in cluster 25
The proportion of graphs in cluster 26
The proportion of graphs in cluster 27
The proportion of graphs in cluster 28
The proportion of graphs in cluster 29
The proportion of graphs in cluster 30
The proportion of graphs in cluster 31
The proportion of graphs in cluster 32
The proportion of graphs in cluster 33
The proportion of graphs in cluster 34
The proportion of graphs in cluster 35
The proportion of graphs in cluster 36
The proportion of graphs in cluster 37
The proportion of graphs in cluster 38
The proportion of graphs in cluster 39
The proportion of graphs in cluster 40
The train dataframe contains cluster fill rates for 800 handwritten documents from the CSAFE Handwriting Database and the CVL Handwriting Database. The documents are from 200 writers. The CSAFE Handwriting Database has nine repetitions of each prompt. Two London Letter prompts and two Wizard of Oz prompts were randomly selected from each writer. The CVL Handwriting Database does not contain multiple repetitions of prompts and four English language prompts were randomly selected from each writer.
The documents were split into graphs with process_batch_dir. The graphs were grouped into clusters with get_clusters_batch. The cluster fill counts were calculated with get_cluster_fill_counts. Finally, get_cluster_fill_rates calculated the cluster fill rates.
https://forensicstats.org/handwritingdatabase/, https://cvl.tuwien.ac.at/research/cvl-databases/an-off-line-database-for-writer-retrieval-writer-identification-and-word-spotting/
Train a random forest with ranger from a dataframe of writer profiles estimated with get_cluster_fill_rates. train_rf calculates the distance between all pairs of writer profiles using one or more distance measures. Currently, the available distance measures are absolute, Manhattan, Euclidean, maximum, and cosine.
train_rf( df, ntrees, distance_measures, output_dir = NULL, run_number = 1, downsample_diff_pairs = TRUE )
df |
A dataframe of writer profiles created with
|
ntrees |
An integer number of decision trees to use |
distance_measures |
A vector of distance measures. Any combination of 'abs', 'euc', 'man', 'max', and 'cos' may be used. |
output_dir |
A path to a directory where the random forest will be saved. |
run_number |
An integer used for both the set.seed function and to distinguish between different runs on the same input dataframe. |
downsample_diff_pairs |
Whether to downsample the number of different writer distances before training the random forest. If TRUE, the different writer distances will be randomly sampled, resulting in the same number of different writer and same writer pairs. |
The absolute distance between two n-length vectors of cluster fill rates, a and b, is a vector of the same length as a and b. It can be calculated as abs(a-b), where subtraction is performed element-wise and then the absolute value of each element is returned. More specifically, element i of the vector is $|a_i - b_i|$ for $i = 1, \dots, n$.
The Manhattan distance between two n-length vectors of cluster fill rates, a and b, is $\sum_{i=1}^n |a_i - b_i|$. In other words, it is the sum of the elements of the absolute distance vector.
The Euclidean distance between two n-length vectors of cluster fill rates, a and b, is $\sqrt{\sum_{i=1}^n (a_i - b_i)^2}$. In other words, it is the square root of the sum of the squared elements of the absolute distance vector.
The maximum distance between two n-length vectors of cluster fill rates, a and b, is $\max_{1 \le i \le n} |a_i - b_i|$. In other words, it is the largest element of the absolute distance vector.
The cosine distance between two n-length vectors of cluster fill rates, a and b, is
.
A random forest
rforest <- train_rf(
  df = train,
  ntrees = 200,
  distance_measures = c("euc"),
  run_number = 1,
  downsample_diff_pairs = TRUE
)
Writers from the CSAFE Handwriting Database and the CVL Handwriting Database were randomly assigned to train, validation, and test sets.
validation
A dataframe with 1,200 rows and 43 variables:
The file name of the handwriting sample.
Writer ID. There are 300 distinct writer IDs. Each writer has 4 documents in the dataframe.
The name of the handwriting prompt.
The total number of graphs in the document.
The proportion of graphs in cluster 1
The proportion of graphs in cluster 2
The proportion of graphs in cluster 3
The proportion of graphs in cluster 4
The proportion of graphs in cluster 5
The proportion of graphs in cluster 6
The proportion of graphs in cluster 7
The proportion of graphs in cluster 8
The proportion of graphs in cluster 9
The proportion of graphs in cluster 10
The proportion of graphs in cluster 11
The proportion of graphs in cluster 12
The proportion of graphs in cluster 13
The proportion of graphs in cluster 14
The proportion of graphs in cluster 15
The proportion of graphs in cluster 16
The proportion of graphs in cluster 17
The proportion of graphs in cluster 18
The proportion of graphs in cluster 19
The proportion of graphs in cluster 20
The proportion of graphs in cluster 21
The proportion of graphs in cluster 22
The proportion of graphs in cluster 23
The proportion of graphs in cluster 24
The proportion of graphs in cluster 25
The proportion of graphs in cluster 26
The proportion of graphs in cluster 27
The proportion of graphs in cluster 28
The proportion of graphs in cluster 29
The proportion of graphs in cluster 30
The proportion of graphs in cluster 31
The proportion of graphs in cluster 32
The proportion of graphs in cluster 33
The proportion of graphs in cluster 34
The proportion of graphs in cluster 35
The proportion of graphs in cluster 36
The proportion of graphs in cluster 37
The proportion of graphs in cluster 38
The proportion of graphs in cluster 39
The proportion of graphs in cluster 40
The validation dataframe contains cluster fill rates for 1,200 handwritten documents from the CSAFE Handwriting Database and the CVL Handwriting Database. The documents are from 300 writers. The CSAFE Handwriting Database has nine repetitions of each prompt. Two London Letter prompts and two Wizard of Oz prompts were randomly selected from each writer. The CVL Handwriting Database does not contain multiple repetitions of prompts and four English language prompts were randomly selected from each writer.
The documents were split into graphs with process_batch_dir. The graphs were grouped into clusters with get_clusters_batch. The cluster fill counts were calculated with get_cluster_fill_counts. Finally, get_cluster_fill_rates calculated the cluster fill rates.
https://forensicstats.org/handwritingdatabase/, https://cvl.tuwien.ac.at/research/cvl-databases/an-off-line-database-for-writer-retrieval-writer-identification-and-word-spotting/