The CSAFE Handwriting Database assigns each writer a unique ID consisting of the letter “w” followed by a four-digit number. The smallest ID number is 1 and the largest is 720, but some IDs in that range are not used. The ID numbers are padded on the left with zeros to make them four digits long. For example, the first writer ID is “w0001” and the last is “w0720.” In the implementation of the method we use the CSAFE writer IDs so that the true writer of a handwriting sample is easy to track. However, for ease of notation we denote the set of writers in an experiment as {w1, w2, ..., wn}, where n is the total number of writers.
In an abuse of notation, instead of denoting a specific writer as wi for some i ∈ {1, 2, ..., n}, we omit the subscript and denote a given writer as w ∈ {w1, w2, ..., wn}. To refer to two distinct writers we use w and w′.
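The zero-padding rule for the CSAFE writer IDs can be illustrated with a short Python sketch. The function name `writer_id` is hypothetical and is not part of any CSAFE tooling; the range check only reflects the smallest and largest ID numbers stated above and does not know which IDs in that range are unused.

```python
def writer_id(number: int) -> str:
    """Format an ID number as 'w' followed by a zero-padded four-digit number."""
    if not 1 <= number <= 720:
        raise ValueError("CSAFE writer ID numbers range from 1 to 720")
    return f"w{number:04d}"


print(writer_id(1))    # w0001
print(writer_id(720))  # w0720
```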
Each writer wi in the CSAFE Handwriting Database produced twenty-seven handwriting samples over three writing sessions s1, s2, s3. A specific session is denoted s ∈ {s1, s2, s3} and two distinct sessions are denoted s and s′.
During each writing session, each writer copied three prompts: p ∈ {L, P, W} where L refers to the London Letter prompt, P refers to the common phrase prompt, and W refers to the Wizard of Oz prompt.
During each writing session, each writer produced three repetitions of each of the three prompts, for a total of nine samples per session. A specific repetition is denoted r ∈ {r1, r2, r3} and two distinct repetitions are denoted r and r′.
We create a fourth pseudo prompt C by combining the prompts L, P, and W from a specific writer, session, and repetition. For example, for writer w we create one pseudo prompt C by combining the L, P, and W prompts from the first session and first repetition. Each writer has three C prompts per session, for a total of nine C prompts.
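The sample counts above can be checked by enumerating the session, prompt, and repetition labels for a single writer. This is only a bookkeeping sketch in Python; the variable names are illustrative, and no handwriting data is read.

```python
from itertools import product

sessions = ["s1", "s2", "s3"]
prompts = ["L", "P", "W"]
repetitions = ["r1", "r2", "r3"]

# One copied sample per (session, prompt, repetition) combination.
samples = list(product(sessions, prompts, repetitions))

# One pseudo prompt C per (session, repetition): it groups the L, P, and W
# samples that share that session and repetition.
combined = {
    (s, r): [(s, p, r) for p in prompts]
    for s, r in product(sessions, repetitions)
}

print(len(samples))   # 27
print(len(combined))  # 9
```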
Randomly assign writers to either the training or testing set. Mattie assigned 72 writers (80% of 90) to the training set and 18 writers (20% of 90) to the test set. Let ntrain and ntest be the number of writers in the training and testing sets, respectively.
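A minimal Python sketch of a writer-level split follows. The function `split_writers`, the fixed seed, and the placeholder IDs are assumptions made for illustration; they are not the IDs or the random assignment actually used.

```python
import random


def split_writers(writers, train_fraction=0.8, seed=0):
    """Shuffle the writers and split them so that no writer is in both sets."""
    rng = random.Random(seed)
    shuffled = list(writers)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# 90 placeholder writer IDs, for illustration only.
writers = [f"w{i:04d}" for i in range(1, 91)]
train_writers, test_writers = split_writers(writers)
print(len(train_writers), len(test_writers))  # 72 18
```

Splitting by writer rather than by document keeps all of a writer's samples on one side of the split.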
For the ntrain writers in the training set, choose a session s ∈ {s1, s2, s3} and split the writers’ handwriting samples into four subsets based on the prompt p ∈ {L, P, W, C}. Each writer has three repetitions of each of the four prompts for the given session for a total of twelve handwriting samples.
We use the cluster fill rates obtained with the ‘handwriter’ R package as document-level features.
Let $x \in \mathbb{R}^{40}$ and $y \in \mathbb{R}^{40}$ be the cluster fill rates for $\mathrm{doc}_x$ and $\mathrm{doc}_y$, respectively.
Distance measures:
Concatenate distance measures: $d(x, y) = d_A(x, y) \mathbin{+\!\!+} d_E(x, y) \in \mathbb{R}^{41}$, where $\mathbin{+\!\!+}$ denotes vector concatenation.
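The concatenation can be sketched in Python as follows. The text does not define the two distance measures, so this sketch assumes that dA is the element-wise absolute difference (one value per cluster) and dE is the Euclidean distance (a single value). Under that assumption, a pair of 40-dimensional fill-rate vectors yields 40 + 1 = 41 components. All function names are hypothetical.

```python
import math


def abs_distance(x, y):
    """Assumed d_A: element-wise absolute difference, one value per cluster."""
    return [abs(a - b) for a, b in zip(x, y)]


def euclidean_distance(x, y):
    """Assumed d_E: Euclidean distance between the two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def concat_distance(x, y):
    """Concatenate the assumed d_A (vector) with the assumed d_E (scalar)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return abs_distance(x, y) + [euclidean_distance(x, y)]


# Two made-up fill-rate vectors over 40 clusters.
x = [0.5, 0.5] + [0.0] * 38
y = [0.0, 1.0] + [0.0] * 38
d = concat_distance(x, y)
print(len(d))  # 41
```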
Calculate distances between known matching (KM) and known non-matching (KNM) pairs for the common source SLR.
Let $D_{KM}$ and $D_{KNM}$ be the sets of known matching and known non-matching distances, respectively.
Train a random forest $rf(D_{KM}, D_{KNM})$.
For a new pair, x and y, use the random forest to make predictions about $d(x, y) = \mathrm{concat}(d_A, d_E)$. Each decision tree in the random forest predicts “same-writer” or “different-writer” for the distance d. The similarity score is the proportion of decision trees in the random forest that predict “same-writer.” The similarity score is therefore a function that maps $(\mathbb{R}^+)^{41} \to [0, 1]$.
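The vote-counting step can be sketched in Python as follows. The sketch takes the per-tree predictions as given and does not train or query a random forest; `similarity_score` and the example votes are illustrative only.

```python
def similarity_score(votes):
    """Return the fraction of trees that voted 'same-writer'."""
    if not votes:
        raise ValueError("at least one tree vote is required")
    return sum(v == "same-writer" for v in votes) / len(votes)


# Made-up votes from a four-tree forest.
votes = ["same-writer", "same-writer", "same-writer", "different-writer"]
print(similarity_score(votes))  # 0.75
```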
Estimate the “same-writer” and “different-writer” score densities by kernel density estimation, using the density function in R with a Gaussian kernel and the default bandwidth, within the bounds [0, 1].
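A rough Python sketch of the density step follows; it is not the R code used in the method. It makes three assumptions:

- The default bandwidth is R's `bw.nrd0` rule of thumb, which the sketch re-implements.
- Each density is evaluated exactly at the observed score, rather than on the binned grid that R's `density` uses.
- The score-based likelihood ratio is the “same-writer” density divided by the “different-writer” density at that score.

The function names and the example scores are made up.

```python
import math
import statistics


def bw_nrd0(scores):
    """Rule-of-thumb bandwidth following the formula of R's bw.nrd0."""
    sd = statistics.stdev(scores)
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    spread = min(sd, (q3 - q1) / 1.34) or sd or abs(scores[0]) or 1.0
    return 0.9 * spread * len(scores) ** -0.2


def gaussian_kde(scores, bandwidth=None):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    h = bw_nrd0(scores) if bandwidth is None else bandwidth
    norm = len(scores) * h * math.sqrt(2 * math.pi)

    def density(t):
        return sum(math.exp(-0.5 * ((t - s) / h) ** 2) for s in scores) / norm

    return density


def score_likelihood_ratio(score, same_scores, diff_scores):
    """Ratio of the same-writer density to the different-writer density."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("similarity scores lie in [0, 1]")
    return gaussian_kde(same_scores)(score) / gaussian_kde(diff_scores)(score)


# Made-up reference scores, for illustration only.
same_scores = [0.90, 0.85, 0.95, 0.80, 0.70]
diff_scores = [0.10, 0.20, 0.15, 0.30, 0.05]
print(score_likelihood_ratio(0.8, same_scores, diff_scores) > 1)  # True
print(score_likelihood_ratio(0.2, same_scores, diff_scores) < 1)  # True
```

A ratio above 1 favors the same-writer proposition and a ratio below 1 favors the different-writer proposition. If the “different-writer” density underflows to zero at a score, the division fails, so a practical implementation would need to guard against that case.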
Split dataset: 80% training and 20% testing. Split by writers.
Let p ∈ {L, P, W, C} denote a prompt.
For prompt p, session s, and $n_w$ writers there are $n_w {3 \choose 2} = 3 n_w$ “same-writer” scores, because each writer's three repetitions form ${3 \choose 2} = 3$ pairs.
For prompt p, session s, and $n_w$ writers there are $9 {n_w \choose 2}$ “different-writer” scores, because each pair of distinct writers contributes $3 \times 3 = 9$ pairs of repetitions. Downsample the “different-writer” scores by randomly selecting a sample of $3 n_w$ scores.
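The two counts and the downsampling step can be checked with a short Python sketch. The function names, the fixed seed, and the placeholder scores are assumptions for illustration; the value 72 is the training-set size mentioned earlier.

```python
import math
import random


def same_writer_count(n_writers, reps=3):
    """Pairs of repetitions within each writer: n_w * C(reps, 2)."""
    return n_writers * math.comb(reps, 2)


def different_writer_count(n_writers, reps=3):
    """Pairs of repetitions across each pair of writers: reps^2 * C(n_w, 2)."""
    return reps * reps * math.comb(n_writers, 2)


def downsample(scores, k, seed=0):
    """Randomly select k scores without replacement."""
    return random.Random(seed).sample(scores, k)


n_w = 72
print(same_writer_count(n_w))       # 216
print(different_writer_count(n_w))  # 23004

# Placeholder different-writer scores, downsampled to match the 3 * n_w
# same-writer scores.
placeholder_scores = [i / 23004 for i in range(different_writer_count(n_w))]
kept = downsample(placeholder_scores, same_writer_count(n_w))
print(len(kept))  # 216
```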