Title: Word Embedding Research Framework for Psychological Science
Description: An integrative toolbox of word embedding research that provides: (1) a collection of 'pre-trained' static word vectors in the '.RData' compressed format <https://psychbruce.github.io/WordVector_RData.pdf>; (2) a series of functions to process, analyze, and visualize word vectors; (3) a range of tests to examine conceptual associations, including the Word Embedding Association Test <doi:10.1126/science.aal4230> and the Relative Norm Distance <doi:10.1073/pnas.1720347115>, with permutation test of significance; (4) a set of training methods to locally train (static) word vectors from text corpora, including 'Word2Vec' <arXiv:1301.3781>, 'GloVe' <doi:10.3115/v1/D14-1162>, and 'FastText' <arXiv:1607.04606>; (5) a group of functions to download 'pre-trained' language models (e.g., 'GPT', 'BERT') and extract contextualized (dynamic) word vectors (based on the R package 'text').
Authors: Han-Wu-Shuang Bao [aut, cre]
Maintainer: Han-Wu-Shuang Bao <[email protected]>
License: GPL-3
Version: 2023.9
Built: 2024-12-28 06:42:56 UTC
Source: CRAN
Word vectors data: wordvec and embed

PsychWordVec uses two types of word vectors data: wordvec (a data.table with two variables, word and vec) and embed (a matrix with dimensions as columns and words as row names). Note that matrix operations make embed much faster than wordvec. Users are advised to reshape data to embed before using the other functions.
as_embed(x, normalize = FALSE)

as_wordvec(x, normalize = FALSE)

## S3 method for class 'embed'
x[i, j]

pattern(pattern)
x: Object to be reshaped. See examples.
normalize: Normalize all word vectors to unit length? Defaults to FALSE.
i, j: Row (word) and column (dimension) indices for extracting a subset of an embed matrix. See examples.
pattern: Regular expression used to match words (see the embed[pattern(...)] examples).
A wordvec (data.table) or embed (matrix).

as_embed(): from wordvec (data.table) to embed (matrix).

as_wordvec(): from embed (matrix) to wordvec (data.table).
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
dt = head(demodata, 10)
str(dt)

embed = as_embed(dt, normalize=TRUE)
embed
str(embed)

wordvec = as_wordvec(embed, normalize=TRUE)
wordvec
str(wordvec)

df = data.frame(token=LETTERS, D1=1:26/10000, D2=26:1/10000)
as_embed(df)
as_wordvec(df)

dd = rbind(dt[1:5], dt[1:5])
dd  # duplicate words
unique(dd)

dm = as_embed(dd)
dm  # duplicate words
unique(dm)

# more examples for extracting a subset using `x[i, j]`
# (3x faster than `wordvec`)
embed = as_embed(demodata)
embed[1]
embed[1:5]
embed["for"]
embed[pattern("^for.{0,2}$")]
embed[cc("for, in, on, xxx")]
embed[cc("for, in, on, xxx"), 5:10]
embed[1:5, 5:10]
embed[, 5:10]
embed[3, 4]
embed["that", 4]
Cosine similarity/distance between two vectors.
cosine_similarity(v1, v2, distance = FALSE)

cos_sim(v1, v2)

cos_dist(v1, v2)
v1, v2: Numeric vectors (of the same length).
distance: Compute cosine distance instead? Defaults to FALSE (cosine similarity).
Cosine similarity = sum(v1 * v2) / ( sqrt(sum(v1^2)) * sqrt(sum(v2^2)) )

Cosine distance = 1 - cosine_similarity(v1, v2)
A value of cosine similarity/distance.
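As a quick illustration of the formula above, here is a minimal base-R sketch (the package itself provides cosine_similarity()/cos_sim() for actual use):

# A minimal sketch of the cosine similarity formula (not the package implementation):
cos_sim_sketch = function(v1, v2) {
  sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
}
cos_sim_sketch(c(1, 4, 1), c(4, 1, 1))  # 0.5, matching the example below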
cos_sim(v1=c(1,1,1), v2=c(2,2,2))  # 1
cos_sim(v1=c(1,4,1), v2=c(4,1,1))  # 0.5
cos_sim(v1=c(1,1,0), v2=c(0,0,1))  # 0
cos_dist(v1=c(1,1,1), v2=c(2,2,2))  # 0
cos_dist(v1=c(1,4,1), v2=c(4,1,1))  # 0.5
cos_dist(v1=c(1,1,0), v2=c(0,0,1))  # 1
Transform plain text of word vectors into wordvec (data.table) or embed (matrix), saved in a compressed ".RData" file.
Speed: In total (preprocess + compress + save), it can process about 30,000 words/min with the slowest settings (compress="xz", compress.level=9) on a modern computer (HP ProBook 450, Windows 11, Intel i7-1165G7 CPU, 32GB RAM).
data_transform(
  file.load,
  file.save,
  as = c("wordvec", "embed"),
  sep = " ",
  header = "auto",
  encoding = "auto",
  compress = "bzip2",
  compress.level = 9,
  verbose = TRUE
)
file.load: File name of raw text (must be plain text). Data must be in this format (values separated by sep):
  cat 0.001 0.002 0.003 0.004 0.005 ... 0.300
  dog 0.301 0.302 0.303 0.304 0.305 ... 0.600
file.save: File name of to-be-saved R data (must be .RData).
as: Transform the text to which R object? "wordvec" (data.table) or "embed" (matrix). Defaults to "wordvec".
sep: Column separator. Defaults to " " (a space).
header: Is the 1st row a header (e.g., meta-information such as "2000000 300")? Defaults to "auto".
encoding: File encoding. Defaults to "auto".
compress: Compression method for the saved file. Defaults to "bzip2". Options include "gzip", "bzip2", and "xz".
compress.level: Compression level (from low to high). Defaults to 9, the highest level.
verbose: Print information to the console? Defaults to TRUE.
A wordvec (data.table) or embed (matrix).
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
## Not run:
# please first manually download plain text data of word vectors
# e.g., from: https://fasttext.cc/docs/en/crawl-vectors.html
# the text file must be on your disk
# the following code cannot run unless you have the file
library(bruceR)
set.wd()
data_transform(file.load="cc.zh.300.vec",        # plain text file
               file.save="cc.zh.300.vec.RData",  # RData file
               header=TRUE, compress="xz")       # of minimal size
## End(Not run)
Load word vectors data (wordvec or embed) from an ".RData" file.
data_wordvec_load(
  file,
  as = c("wordvec", "embed"),
  normalize = FALSE,
  verbose = TRUE
)

load_wordvec(file, normalize = TRUE)

load_embed(file, normalize = TRUE)
file: File name of .RData transformed by data_transform().
as: Load as "wordvec" (data.table) or "embed" (matrix). Defaults to "wordvec". For load_wordvec() and load_embed(), the class is indicated by the function name.
normalize: Normalize all word vectors to unit length? Defaults to FALSE for data_wordvec_load() and TRUE for load_wordvec() and load_embed().
verbose: Print information to the console? Defaults to TRUE.
A wordvec (data.table) or embed (matrix).
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
d = demodata[1:200]
save(d, file="demo.RData")

d = load_wordvec("demo.RData")
d

d = load_embed("demo.RData")
d

unlink("demo.RData")  # delete file for code check

## Not run:
# please first manually download the .RData file
# (see https://psychbruce.github.io/WordVector_RData.pdf)
# or transform plain text data by using `data_transform()`
# the RData file must be on your disk
# the following code cannot run unless you have the file
library(bruceR)
set.wd()
d = load_embed("../data-raw/GloVe/glove_wiki_50d.RData")
d
## End(Not run)
Extract a subset of word vectors data (with S3 methods). You may specify either a wordvec or embed object (loaded by data_wordvec_load()) or an .RData file (transformed by data_transform()).
data_wordvec_subset(
  x,
  words = NULL,
  pattern = NULL,
  as = c("wordvec", "embed"),
  file.save,
  compress = "bzip2",
  compress.level = 9,
  verbose = TRUE
)

## S3 method for class 'wordvec'
subset(x, ...)

## S3 method for class 'embed'
subset(x, ...)
x: Can be: (1) a wordvec or embed object, or (2) an .RData file transformed by data_transform().
words: [Option 1] Character string(s).
pattern: [Option 2] Regular expression to match words.
as: Reshape to "wordvec" (data.table) or "embed" (matrix). Defaults to "wordvec".
file.save: File name of to-be-saved R data (must be .RData).
compress: Compression method for the saved file. Defaults to "bzip2". Options include "gzip", "bzip2", and "xz".
compress.level: Compression level (from low to high). Defaults to 9, the highest level.
verbose: Print information to the console? Defaults to TRUE.
...: Parameters passed from the subset() methods to data_wordvec_subset().
A subset of wordvec or embed of valid (available) words.
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
## directly use `embed[i, j]` (3x faster than `wordvec`):
d = as_embed(demodata)
d[1:5]
d["people"]
d[c("China", "Japan", "Korea")]

## specify `x` as a `wordvec` or `embed` object:
subset(demodata, c("China", "Japan", "Korea"))
subset(d, pattern="^Chi")

## specify `x` and `pattern`, and save with `file.save`:
subset(demodata, pattern="Chin[ae]|Japan|Korea",
       file.save="subset.RData")

## load the subset:
d.subset = load_wordvec("subset.RData")
d.subset

## specify `x` as an .RData file and save with `file.save`:
data_wordvec_subset("subset.RData",
                    words=c("China", "Chinese"),
                    file.save="new.subset.RData")
d.new.subset = load_embed("new.subset.RData")
d.new.subset

unlink("subset.RData")      # delete file for code check
unlink("new.subset.RData")  # delete file for code check
This demo data contains a sample of 8000 English words with 300-dimension word vectors pre-trained using the "word2vec" algorithm based on the Google News corpus. Most of these words are from the Top 8000 frequent wordlist, whereas a few are selected from less frequent words and appended.
data(demodata)
A data.table (of new class wordvec) with two variables, word and vec, transformed from the raw data (see the URL in Source) into .RData using the data_transform() function.
Google Code - word2vec (https://code.google.com/archive/p/word2vec/)
class(demodata)
demodata

embed = as_embed(demodata, normalize=TRUE)
class(embed)
embed
Expand a dictionary from the most similar words.
dict_expand(data, words, threshold = 0.5, iteration = 5, verbose = TRUE)
data: A wordvec (data.table) or embed (matrix).
words: A single word or a list of words, used to calculate the sum vector.
threshold: Threshold of cosine similarity, used to find all words with similarities higher than this value. Defaults to 0.5.
iteration: Number of maximum iterations. Defaults to 5.
verbose: Print information to the console? Defaults to TRUE.
An expanded list (character vector) of words.
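Conceptually, each iteration computes the sum vector of the current dictionary and adds every word whose cosine similarity to that vector exceeds the threshold. A rough sketch of a single expansion step (an illustration, not the package's exact implementation) assuming a normalized embed matrix:

# One expansion step (sketch only):
m = as_embed(demodata, normalize=TRUE)
dict = c("king", "queen")
v = colSums(m[dict])                             # sum vector of current words
v = v / sqrt(sum(v^2))                           # normalize to unit length
sims = apply(m, 1, function(row) sum(row * v))   # cosine with all (unit-length) rows
new.words = names(sims[sims > 0.5])              # candidates above the threshold
union(dict, new.words)                           # expanded dictionary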
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
dict = dict_expand(demodata, "king")
dict

dict = dict_expand(demodata, cc("king, queen"))
dict

most_similar(demodata, dict)

dict.cn = dict_expand(demodata, "China")
dict.cn  # too inclusive if setting threshold = 0.5

dict.cn = dict_expand(demodata, cc("China, Chinese"), threshold=0.6)
dict.cn  # adequate to represent "China"
Reliability analysis (Cronbach's α and average cosine similarity) and Principal Component Analysis (PCA) of a dictionary, with visualization of cosine similarities between words (ordered by the first principal component loading). Note that Cronbach's α can be misleading when the number of items/words is large.
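That caution follows from the standardized alpha formula: with k items and an average inter-item correlation (here, cosine similarity) r, alpha = k*r / (1 + (k - 1)*r), which approaches 1 as k grows even when r is modest. A small numeric illustration (a sketch of the standardized formula, not necessarily the package's exact computation):

alpha_std = function(k, r) k * r / (1 + (k - 1) * r)
alpha_std(k = 5,  r = 0.3)   # ~0.68
alpha_std(k = 50, r = 0.3)   # ~0.96: high alpha despite modest average similarity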
dict_reliability(
  data,
  words = NULL,
  pattern = NULL,
  alpha = TRUE,
  sort = TRUE,
  plot = TRUE,
  ...
)
data: A wordvec (data.table) or embed (matrix).
words: [Option 1] Character string(s).
pattern: [Option 2] Regular expression to match words.
alpha: Estimate the Cronbach's α? Defaults to TRUE.
sort: Sort items by the first principal component loading (PC1)? Defaults to TRUE.
plot: Visualize the cosine similarities? Defaults to TRUE.
...: Other parameters passed to the plotting function.
A list object of new class reliability:

alpha: Cronbach's α
eigen: Eigenvalues from PCA
pca: PCA (only 1 principal component)
pca.rotation: PCA with varimax rotation (if potential principal components > 1)
items: Item statistics
cos.sim.mat: A matrix of cosine similarities of all word pairs
cos.sim: Lower triangular part of the matrix of cosine similarities
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
Nicolas, G., Bai, X., & Fiske, S. T. (2021). Comprehensive stereotype content dictionaries using a semi-automated method. European Journal of Social Psychology, 51(1), 178–196.
d = as_embed(demodata, normalize=TRUE)

dict = dict_expand(d, "king")
dict_reliability(d, dict)

dict.cn = dict_expand(d, "China", threshold=0.65)
dict_reliability(d, dict.cn)

dict_reliability(d, c(dict, dict.cn))
# low-loading items should be removed
Extract word vector(s), using either a list of words or a regular expression.
get_wordvec(
  data,
  words = NULL,
  pattern = NULL,
  plot = FALSE,
  plot.dims = NULL,
  plot.step = 0.05,
  plot.border = "white"
)
data: A wordvec (data.table) or embed (matrix).
words: [Option 1] Character string(s).
pattern: [Option 2] Regular expression to match words.
plot: Generate a plot to illustrate the word vectors? Defaults to FALSE.
plot.dims: Dimensions to be plotted (e.g., 1:100). Defaults to NULL.
plot.step: Step for value breaks. Defaults to 0.05.
plot.border: Color of tile border. Defaults to "white".
A data.table with words as columns and dimensions as rows.
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
d = as_embed(demodata, normalize=TRUE)

get_wordvec(d, c("China", "Japan", "Korea"))
get_wordvec(d, cc(" China, Japan; Korea "))

## specify `pattern`:
get_wordvec(d, pattern="Chin[ae]|Japan|Korea")

## plot word vectors:
get_wordvec(d, cc("China, Japan, Korea, Mac, Linux, Windows"),
            plot=TRUE, plot.dims=1:100)

## a more complex example:
words = cc("
  China Chinese Japan Japanese
  good bad great terrible
  morning evening king queen
  man woman he she cat dog
")
dt = get_wordvec(
  d, words,
  plot=TRUE,
  plot.dims=1:100,
  plot.step=0.06)

# if you want to change something:
attr(dt, "ggplot") +
  scale_fill_viridis_b(n.breaks=10, show.limits=TRUE) +
  theme(legend.key.height=unit(0.1, "npc"))

# or to save the plot:
ggsave(attr(dt, "ggplot"),
       filename="wordvecs.png",
       width=8, height=5, dpi=500)
unlink("wordvecs.png")  # delete file for code check
Find the Top-N most similar words, replicating the results produced by the most_similar() function of the Python gensim module. (Exact replication of gensim results requires the same word vectors data, not the demodata used here in examples.)
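Conceptually, the function builds the (possibly signed) sum vector of the input words and returns the words whose vectors have the highest cosine similarity to it. A rough sketch (an illustration, not the package's exact implementation), assuming a normalized embed matrix:

m = as_embed(demodata, normalize=TRUE)
v = as.numeric(m["king"]) - as.numeric(m["man"]) + as.numeric(m["woman"])
v = v / sqrt(sum(v^2))                 # normalize the combined vector
sims = as.numeric(m %*% v)             # cosine with all (unit-length) rows
names(sims) = rownames(m)
head(sort(sims, decreasing=TRUE), 10)  # Top-10 (including the input words)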
most_similar(
  data,
  x = NULL,
  topn = 10,
  above = NULL,
  keep = FALSE,
  row.id = TRUE,
  verbose = TRUE
)
data: A wordvec (data.table) or embed (matrix).
x: Can be: (1) NULL (the default), (2) a single word or a list of words, or (3) a formula such as ~ king - man + woman specifying a linear combination of words. See examples.
topn: Top-N most similar words. Defaults to 10.
above: A threshold of cosine similarity (e.g., 0.7) or a reference word (e.g., "Shanghai"); only words with similarities above this value (or above the similarity to the reference word) are returned. Defaults to NULL. If both topn and above are specified, only one criterion is used.
keep: Keep words specified in x in the returned results? Defaults to FALSE.
row.id: Return the row number of each word? Defaults to TRUE.
verbose: Print information to the console? Defaults to TRUE.
A data.table with the most similar words and their cosine similarities.
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
d = as_embed(demodata, normalize=TRUE)

most_similar(d)
most_similar(d, "China")
most_similar(d, c("king", "queen"))
most_similar(d, cc(" king , queen ; man | woman "))

# the same as above:
most_similar(d, ~ China)
most_similar(d, ~ king + queen)
most_similar(d, ~ king + queen + man + woman)

most_similar(d, ~ boy - he + she)
most_similar(d, ~ Jack - he + she)
most_similar(d, ~ Rose - she + he)

most_similar(d, ~ king - man + woman)
most_similar(d, ~ Tokyo - Japan + China)
most_similar(d, ~ Beijing - China + Japan)

most_similar(d, "China", above=0.7)
most_similar(d, "China", above="Shanghai")

# automatically normalized for more accurate results
ms = most_similar(demodata, ~ king - man + woman)
ms
str(ms)
L2-normalization (scaling to unit Euclidean length): the norm of each vector in the vector space will be normalized to 1. It is necessary for any linear operation of word vectors.

R code:

Vector: vec / sqrt(sum(vec^2))

Matrix: mat / sqrt(rowSums(mat^2))
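For instance, a quick sanity check that the scaling above yields unit length:

v = c(3, 4)
v.unit = v / sqrt(sum(v^2))  # L2-normalization
sqrt(sum(v.unit^2))          # 1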
normalize(x)
x: A wordvec (data.table) or embed (matrix).
A wordvec (data.table) or embed (matrix) with normalized word vectors.
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
d = normalize(demodata)

# the same:
d = as_wordvec(demodata, normalize=TRUE)
In order to compare word embeddings from different time periods, we must ensure that the embedding matrices are aligned to the same semantic space (coordinate axes). The Orthogonal Procrustes solution (Schönemann, 1966) is commonly used to align historical embeddings over time (Hamilton et al., 2016; Li et al., 2020).
Note that this kind of rotation does not change the relative relationships between vectors in the space, and thus does not affect semantic similarities or distances within each embedding matrix. But it does influence the semantic relationships between different embedding matrices, and thus would be necessary for some purposes such as the "semantic drift analysis" (e.g., Hamilton et al., 2016; Li et al., 2020).
This function produces the same results as cds::orthprocr(), psych::Procrustes(), and pracma::procrustes().
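For reference, the Orthogonal Procrustes rotation can be sketched with a singular value decomposition. This is a generic illustration of the Schönemann (1966) solution, not the package's exact code:

# Rotate X to best match M (both are words x dimensions matrices
# with the same words in the same order); a minimal SVD-based sketch:
procrustes_rotate = function(M, X) {
  s = svd(t(X) %*% M)   # SVD of the cross-product matrix
  Q = s$u %*% t(s$v)    # optimal orthogonal rotation matrix
  X %*% Q               # X expressed in M's coordinate axes
}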
orth_procrustes(M, X)
M, X: Two embedding matrices of the same size (rows and columns); can be embed (matrix) or wordvec (data.table) objects. M is the reference (baseline) matrix and X is the matrix to be rotated. Note: The function automatically extracts only the intersection (overlapping part) of words in M and X and sorts them in the same order (according to M).
A matrix or wordvec object of X after rotation, depending on the classes of M and X.
Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 1489–1501). Association for Computational Linguistics.
Li, Y., Hills, T., & Hertwig, R. (2020). A brief history of risk. Cognition, 203, 104344.
Schönemann, P. H. (1966). A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1), 1–10.
M = matrix(c(0,0, 1,2, 2,0, 3,2, 4,0), ncol=2, byrow=TRUE)
X = matrix(c(0,0, -2,1, 0,2, -2,3, 0,4), ncol=2, byrow=TRUE)
rownames(M) = rownames(X) = cc("A, B, C, D, E")  # words
colnames(M) = colnames(X) = cc("dim1, dim2")  # dimensions
ggplot() +
  geom_path(data=as.data.frame(M), aes(x=dim1, y=dim2), color="red") +
  geom_path(data=as.data.frame(X), aes(x=dim1, y=dim2), color="blue") +
  coord_equal()

# Usage 1: input two matrices (can be `embed` objects)
XR = orth_procrustes(M, X)
XR  # aligned with M
ggplot() +
  geom_path(data=as.data.frame(XR), aes(x=dim1, y=dim2)) +
  coord_equal()

# Usage 2: input two `wordvec` objects
M.wv = as_wordvec(M)
X.wv = as_wordvec(X)
XR.wv = orth_procrustes(M.wv, X.wv)
XR.wv  # aligned with M.wv

# M and X must have the same set and order of words
# and the same number of word vector dimensions.
# The function extracts only the intersection of words
# and sorts them in the same order according to M.
Y = rbind(X, X[rev(rownames(X)),])
rownames(Y)[1:5] = cc("F, G, H, I, J")
M.wv = as_wordvec(M)
Y.wv = as_wordvec(Y)
M.wv  # words: A, B, C, D, E
Y.wv  # words: F, G, H, I, J, E, D, C, B, A
YR.wv = orth_procrustes(M.wv, Y.wv)
YR.wv  # aligned with M.wv, with the same order of words
Compute a matrix of cosine similarity/distance of word pairs.
pair_similarity(
  data,
  words = NULL,
  pattern = NULL,
  words1 = NULL,
  words2 = NULL,
  distance = FALSE
)
data: A wordvec (data.table) or embed (matrix).
words: [Option 1] Character string(s).
pattern: [Option 2] Regular expression to match words.
words1, words2: [Option 3] Two sets of words for only n1 * n2 word pairs. See examples.
distance: Compute cosine distance instead? Defaults to FALSE (cosine similarity).
A matrix of pairwise cosine similarity/distance.
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
pair_similarity(demodata, c("China", "Chinese"))

pair_similarity(demodata, pattern="^Chi")

pair_similarity(demodata,
                words1=c("China", "Chinese"),
                words2=c("Japan", "Japanese"))
Visualize a (partial correlation) network graph of words.
plot_network(
  data,
  words = NULL,
  pattern = NULL,
  index = c("pcor", "cor", "glasso", "sim"),
  alpha = 0.05,
  bonf = FALSE,
  max = NULL,
  node.size = "auto",
  node.group = NULL,
  node.color = NULL,
  label.text = NULL,
  label.size = 1.2,
  label.size.equal = TRUE,
  label.color = "black",
  edge.color = c("#009900", "#BF0000"),
  edge.label = FALSE,
  edge.label.size = 1,
  edge.label.color = NULL,
  edge.label.bg = "white",
  file = NULL,
  width = 10,
  height = 6,
  dpi = 500,
  ...
)
data: A wordvec (data.table) or embed (matrix).
words: [Option 1] Character string(s).
pattern: [Option 2] Regular expression to match words.
index: Use which index to perform the network analysis? Can be "pcor" (partial correlation, the default), "cor" (correlation), "glasso" (graphical lasso estimation of the partial correlation matrix), or "sim" (cosine similarity).
alpha: Significance level to be used for not showing edges. Defaults to 0.05.
bonf: Bonferroni correction of p value. Defaults to FALSE.
max: Maximum value for scaling edge widths and colors. Defaults to the highest value of the index. Can be set to a fixed numeric value.
node.size: Node size. Defaults to "auto": 8*exp(-nNodes/80)+1.
node.group: Node group(s). Can be a named list (see examples) in which each element is a vector of integers identifying the numbers of the nodes that belong together, or a factor.
node.color: Node color(s). Can be a character vector of colors corresponding to node.group. Defaults to NULL.
label.text: Node label of text. Defaults to the original words.
label.size: Node label font size. Defaults to 1.2.
label.size.equal: Make the font size of all labels equal. Defaults to TRUE.
label.color: Node label color. Defaults to "black".
edge.color: Edge colors for positive and negative values, respectively. Defaults to c("#009900", "#BF0000").
edge.label: Edge label of values. Defaults to FALSE.
edge.label.size: Edge label font size. Defaults to 1.
edge.label.color: Edge label color. Defaults to NULL.
edge.label.bg: Edge label background color. Defaults to "white".
file: File name to be saved, should be png or pdf.
width, height: Width and height (in inches) for the saved file. Defaults to 10 and 6.
dpi: Dots per inch. Defaults to 500.
...: Other parameters passed to qgraph::qgraph().
Invisibly returns a qgraph object, which can further be plotted using plot().
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
d = as_embed(demodata, normalize=TRUE)

words = cc("
  man, woman, he, she,
  boy, girl, father, mother,
  mom, dad, China, Japan
")

plot_network(d, words)

p = plot_network(
  d, words,
  node.group=list(Gender=1:6, Family=7:10, Country=11:12),
  node.color=c("antiquewhite", "lightsalmon", "lightblue"),
  file="network.png")
plot(p)

unlink("network.png")  # delete file for code check

# network analysis with centrality plot (see `qgraph` package)
qgraph::centralityPlot(p, include="all", scale="raw",
                       orderBy="Strength")

# graphical lasso-estimation of partial correlation matrix
plot_network(
  d, words,
  index="glasso",
  # threshold=TRUE,
  node.group=list(Gender=1:6, Family=7:10, Country=11:12),
  node.color=c("antiquewhite", "lightsalmon", "lightblue"))
Visualize cosine similarity of word pairs.
plot_similarity(
  data,
  words = NULL,
  pattern = NULL,
  words1 = NULL,
  words2 = NULL,
  label = "auto",
  value.color = NULL,
  value.percent = FALSE,
  order = c("original", "AOE", "FPC", "hclust", "alphabet"),
  hclust.method = c("complete", "ward", "ward.D", "ward.D2",
                    "single", "average", "mcquitty", "median", "centroid"),
  hclust.n = NULL,
  hclust.color = "black",
  hclust.line = 2,
  file = NULL,
  width = 10,
  height = 6,
  dpi = 500,
  ...
)
data: A wordvec (data.table) or embed (matrix).
words: [Option 1] Character string(s).
pattern: [Option 2] Regular expression to match words.
words1, words2: [Option 3] Two sets of words for only n1 * n2 word pairs. See examples.
label: Position of text labels. Defaults to "auto".
value.color: Color of values added on the plot. Defaults to NULL.
value.percent: Whether to transform values into percentage style for space saving. Defaults to FALSE.
order: Character, the ordering method of the correlation matrix. Defaults to "original". See the order argument of corrplot::corrplot().
hclust.method: Character, the agglomeration method to be used when order = "hclust". Defaults to "complete".
hclust.n: Number of rectangles to be drawn on the plot according to the hierarchical clusters, only valid when order = "hclust".
hclust.color: Color of rectangle border, only valid when hclust.n is specified. Defaults to "black".
hclust.line: Line width of rectangle border, only valid when hclust.n is specified. Defaults to 2.
file: File name to be saved, should be png or pdf.
width, height: Width and height (in inches) for the saved file. Defaults to 10 and 6.
dpi: Dots per inch. Defaults to 500.
...: Other parameters passed to corrplot::corrplot().
Invisibly returns a matrix of cosine similarities between each pair of words.
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
w1 = cc("king, queen, man, woman")

plot_similarity(demodata, w1)
plot_similarity(demodata, w1,
                value.color="grey",
                value.percent=TRUE)
plot_similarity(demodata, w1,
                value.color="grey",
                order="hclust",
                hclust.n=2)

plot_similarity(
  demodata,
  words1=cc("man, woman, king, queen"),
  words2=cc("he, she, boy, girl, father, mother"),
  value.color="grey20"
)

w2 = cc("China, Chinese, Japan, Japanese, Korea, Korean,
         man, woman, boy, girl,
         good, bad, positive, negative")

plot_similarity(demodata, w2, order="hclust", hclust.n=3)
plot_similarity(demodata, w2, order="hclust", hclust.n=7,
                file="plot.png")

unlink("plot.png")  # delete file for code check
Visualize word vectors.
plot_wordvec(x, dims = NULL, step = 0.05, border = "white")
x: Can be: a wordvec or embed object, or a data.table returned by get_wordvec().
dims: Dimensions to be plotted (e.g., 1:100). Defaults to NULL.
step: Step for value breaks. Defaults to 0.05.
border: Color of tile border. Defaults to "white".
A ggplot object.
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
d = as_embed(demodata, normalize=TRUE)

plot_wordvec(d[1:10])

dt = get_wordvec(d, cc("king, queen, man, woman"))
dt[, QUEEN := king - man + woman]
dt[, QUEEN := QUEEN / sqrt(sum(QUEEN^2))]  # normalize
names(dt)[5] = "king - man + woman"
plot_wordvec(dt[, c(1,3,4,5,2)], dims=1:50)

dt = get_wordvec(d, cc("boy, girl, he, she"))
dt[, GIRL := boy - he + she]
dt[, GIRL := GIRL / sqrt(sum(GIRL^2))]  # normalize
names(dt)[5] = "boy - he + she"
plot_wordvec(dt[, c(1,3,4,5,2)], dims=1:50)

dt = get_wordvec(d, cc("
  male, man, boy, he, his,
  female, woman, girl, she, her"))
p = plot_wordvec(dt, dims=1:100)

# if you want to change something:
p + theme(legend.key.height=unit(0.1, "npc"))

# or to save the plot:
ggsave(p, filename="wordvecs.png",
       width=8, height=5, dpi=500)
unlink("wordvecs.png")  # delete file for code check
Visualize word vectors with dimensionality reduced using the t-Distributed Stochastic Neighbor Embedding (t-SNE) method (i.e., projecting high-dimensional vectors into a low-dimensional vector space), implemented by Rtsne::Rtsne(). You should specify a random seed if you expect reproducible results.
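Under the hood this amounts to running Rtsne on the embedding matrix. A minimal sketch of the underlying call (assuming the Rtsne package is installed; the parameter values here are illustrative, not the function's exact defaults):

m = unclass(as_embed(demodata, normalize=TRUE)[1:100])  # plain matrix of 100 words
set.seed(1234)                                          # t-SNE is stochastic
tsne = Rtsne::Rtsne(m, dims=2, perplexity=10, theta=0.5)
head(tsne$Y)                                            # 2-D coordinates, one row per word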
plot_wordvec_tSNE(
  x,
  dims = 2,
  perplexity,
  theta = 0.5,
  colors = NULL,
  seed = NULL,
  custom.Rtsne = NULL
)
x: Can be: a wordvec or embed object, or a data.table returned by get_wordvec().
dims: Output dimensionality: 2 (2-D plot) or 3 (3-D plot). Defaults to 2.
perplexity: Perplexity parameter; should not be larger than (number of words - 1) / 3.
theta: Speed/accuracy trade-off (increase for less accuracy), set to 0 for exact t-SNE. Defaults to 0.5.
colors: A character vector specifying (1) the categories of words (for 2-D plot only) or (2) the exact colors of words (for 2-D and 3-D plots). See examples for its usage.
seed: Random seed for reproducible results. Defaults to NULL.
custom.Rtsne: A user-defined object returned by Rtsne::Rtsne(). Defaults to NULL.
2-D: A ggplot object. You may extract the data from this object using $data.

3-D: Nothing; only the data are invisibly returned, because rgl::plot3d() is "called for the side effect of drawing the plot" and thus cannot return a 3-D plot object.
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
d = as_embed(demodata, normalize=TRUE)

dt = get_wordvec(d, cc("
  man, woman,
  king, queen,
  China, Beijing,
  Japan, Tokyo"))

## 2-D (default):
plot_wordvec_tSNE(dt, seed=1234)

plot_wordvec_tSNE(dt, seed=1234)$data

colors = c(rep("#2B579A", 4), rep("#B7472A", 4))
plot_wordvec_tSNE(dt, colors=colors, seed=1234)

category = c(rep("gender", 4), rep("country", 4))
plot_wordvec_tSNE(dt, colors=category, seed=1234) +
  scale_x_continuous(limits=c(-200, 200),
                     labels=function(x) x/100) +
  scale_y_continuous(limits=c(-200, 200),
                     labels=function(x) x/100) +
  scale_color_manual(values=c("#B7472A", "#2B579A"))

## 3-D:
colors = c(rep("#2B579A", 4), rep("#B7472A", 4))
plot_wordvec_tSNE(dt, dims=3, colors=colors, seed=1)
Calculate the sum vector of multiple words.
sum_wordvec(data, x = NULL, verbose = TRUE)
data: A wordvec (data.table) or embed (matrix).
x: Can be: (1) NULL (the default), (2) a single word or a list of words, or (3) a formula such as ~ king - man + woman specifying a linear combination of words. See examples.
verbose: Print information to the console? Defaults to TRUE.
Normalized sum vector.
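Conceptually, the sum vector is just the (signed) sum of the word vectors, rescaled to unit length. A rough sketch (an illustration, not the exact implementation), assuming a normalized embed matrix:

m = as_embed(demodata, normalize=TRUE)
v = as.numeric(m["king"]) + as.numeric(m["queen"])  # ~ king + queen
v / sqrt(sum(v^2))                                  # normalized sum vector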
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
sum_wordvec(normalize(demodata), ~ king - man + woman)
Tabulate cosine similarity/distance of word pairs.
tab_similarity(
  data,
  words = NULL,
  pattern = NULL,
  words1 = NULL,
  words2 = NULL,
  unique = FALSE,
  distance = FALSE
)
data: A wordvec (data.table) or embed (matrix).
words: [Option 1] Character string(s).
pattern: [Option 2] Regular expression to match words.
words1, words2: [Option 3] Two sets of words for only n1 * n2 word pairs. See examples.
unique: Return only unique word pairs? Defaults to FALSE.
distance: Compute cosine distance instead? Defaults to FALSE (cosine similarity).
A data.table of words, word pairs, and their cosine similarity (cos_sim) or cosine distance (cos_dist).
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
tab_similarity(demodata, cc("king, queen, man, woman"))
tab_similarity(demodata, cc("king, queen, man, woman"),
               unique=TRUE)

tab_similarity(demodata, cc("Beijing, China, Tokyo, Japan"))
tab_similarity(demodata, cc("Beijing, China, Tokyo, Japan"),
               unique=TRUE)

## only n1 * n2 word pairs across two sets of words
tab_similarity(demodata,
               words1=cc("king, queen, King, Queen"),
               words2=cc("man, woman"))
Tabulate data and conduct the permutation test of significance for the Relative Norm Distance (RND; also known as Relative Euclidean Distance). This is an alternative method to Single-Category WEAT.
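Following Garg et al. (2018), the RND compares each target word's Euclidean distance to the average vectors of the two attribute sets. A conceptual sketch (an illustration, not the package's exact implementation), assuming a normalized embed matrix m and word sets T1, A1, A2 that all exist in m:

rnd_sketch = function(m, T1, A1, A2) {
  a1 = colMeans(m[A1])   # average vector of attribute set 1
  a2 = colMeans(m[A2])   # average vector of attribute set 2
  d1 = apply(m[T1], 1, function(t) sqrt(sum((t - a1)^2)))  # distance to A1
  d2 = apply(m[T1], 1, function(t) sqrt(sum((t - a2)^2)))  # distance to A2
  sum(d1 - d2)  # negative: targets closer to A1; positive: closer to A2
}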
test_RND(
  data,
  T1,
  A1,
  A2,
  use.pattern = FALSE,
  labels = list(),
  p.perm = TRUE,
  p.nsim = 10000,
  p.side = 2,
  seed = NULL
)
data: A wordvec (data.table) or embed (matrix).
T1: Target words of a single category (a vector of words or a pattern of regular expression).
A1, A2: Attribute words (a vector of words or a pattern of regular expression). Both must be specified.
use.pattern: Specify T1/A1/A2 as regular expressions rather than vectors of words? Defaults to FALSE.
labels: Labels for target and attribute concepts (a named list), e.g., list(T1="Occupation", A1="Male", A2="Female"). Defaults to an empty list.
p.perm: Permutation test to get the exact or approximate p value of the overall effect. Defaults to TRUE.
p.nsim: Number of samples for resampling in the permutation test. Defaults to 10000. If p.nsim is larger than the number of all possible permutations, an exact permutation test is conducted instead.
p.side: One-sided (1) or two-sided (2) p value. Defaults to 2. In Caliskan et al.'s (2017) article, they reported one-sided p values for WEAT. Here, a two-sided p value is suggested as a more conservative estimate. The users take full responsibility for the choice.
seed: Random seed for reproducible results of the permutation test. Defaults to NULL.
A list object of new class rnd:

words.valid: Valid (actually matched) words
words.not.found: Words not found
data.raw: A data.table of (absolute and relative) norm distances
eff.label: Description for the difference between the two attribute concepts
eff.type: Effect type: RND
eff: Raw effect and p value (if p.perm=TRUE)
eff.interpretation: Interpretation of the RND score
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644.
Bhatia, N., & Bhatia, S. (2021). Changes in gender stereotypes over time: A computational analysis. Psychology of Women Quarterly, 45(1), 106–125.
rnd = test_RND(
  demodata,
  labels=list(T1="Occupation", A1="Male", A2="Female"),
  T1=cc("
    architect, boss, leader, engineer, CEO, officer, manager,
    lawyer, scientist, doctor, psychologist, investigator,
    consultant, programmer, teacher, clerk, counselor,
    salesperson, therapist, psychotherapist, nurse"),
  A1=cc("male, man, boy, brother, he, him, his, son"),
  A2=cc("female, woman, girl, sister, she, her, hers, daughter"),
  seed=1)
rnd
Tabulate data (cosine similarity and standardized effect size) and conduct the permutation test of significance for the Word Embedding Association Test (WEAT) and the Single-Category Word Embedding Association Test (SC-WEAT).

For WEAT, a two-sample permutation test is conducted (i.e., rearrangements of the data).

For SC-WEAT, a one-sample permutation test is conducted (i.e., rearrangements of the +/- signs of the data).
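A conceptual sketch of the WEAT effect size (Caliskan et al., 2017), not the package's exact implementation, assuming a normalized embed matrix m and the cos_sim() function documented above:

weat_d = function(m, T1, T2, A1, A2) {
  assoc = function(w) {
    mean(sapply(A1, function(a) cos_sim(as.numeric(m[w]), as.numeric(m[a])))) -
    mean(sapply(A2, function(a) cos_sim(as.numeric(m[w]), as.numeric(m[a]))))
  }
  s1 = sapply(T1, assoc)                 # association scores of T1 words
  s2 = sapply(T2, assoc)                 # association scores of T2 words
  (mean(s1) - mean(s2)) / sd(c(s1, s2))  # standardized effect size (d)
}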
test_WEAT(
  data,
  T1,
  T2,
  A1,
  A2,
  use.pattern = FALSE,
  labels = list(),
  p.perm = TRUE,
  p.nsim = 10000,
  p.side = 2,
  seed = NULL,
  pooled.sd = "Caliskan"
)
data: A wordvec (data.table) or embed (matrix).
T1, T2: Target words (a vector of words or a pattern of regular expression). If only T1 is specified, the Single-Category WEAT (SC-WEAT) is conducted.
A1, A2: Attribute words (a vector of words or a pattern of regular expression). Both must be specified.
use.pattern: Specify T1/T2/A1/A2 as regular expressions rather than vectors of words? Defaults to FALSE.
labels: Labels for target and attribute concepts (a named list), e.g., list(T1="King", T2="Queen", A1="Male", A2="Female"). Defaults to an empty list.
p.perm: Permutation test to get the exact or approximate p value of the overall effect. Defaults to TRUE.
p.nsim: Number of samples for resampling in the permutation test. Defaults to 10000. If p.nsim is larger than the number of all possible permutations, an exact permutation test is conducted instead.
p.side: One-sided (1) or two-sided (2) p value. Defaults to 2. In Caliskan et al.'s (2017) article, they reported one-sided p values for WEAT. Here, a two-sided p value is suggested as a more conservative estimate. The users take full responsibility for the choice.
seed: Random seed for reproducible results of the permutation test. Defaults to NULL.
pooled.sd: Method used to calculate the pooled SD for the effect size estimate in WEAT. Defaults to "Caliskan", i.e., the SD of the association scores across all target words (following Caliskan et al., 2017).
A list object of new class weat:

words.valid: Valid (actually matched) words
words.not.found: Words not found
data.raw: A data.table of cosine similarities between all word pairs
data.mean: A data.table of mean cosine similarities across all attribute words
data.diff: A data.table of differential mean cosine similarities between the two attribute concepts
eff.label: Description for the difference between the two attribute concepts
eff.type: Effect type: WEAT or SC-WEAT
eff: Raw effect, standardized effect size, and p value (if p.perm=TRUE)
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186.
## cc() is more convenient than c()!

weat = test_WEAT(
  demodata,
  labels=list(T1="King", T2="Queen", A1="Male", A2="Female"),
  T1=cc("king, King"),
  T2=cc("queen, Queen"),
  A1=cc("male, man, boy, brother, he, him, his, son"),
  A2=cc("female, woman, girl, sister, she, her, hers, daughter"),
  seed=1)
weat

sc_weat = test_WEAT(
  demodata,
  labels=list(T1="Occupation", A1="Male", A2="Female"),
  T1=cc("
    architect, boss, leader, engineer, CEO, officer, manager,
    lawyer, scientist, doctor, psychologist, investigator,
    consultant, programmer, teacher, clerk, counselor,
    salesperson, therapist, psychotherapist, nurse"),
  A1=cc("male, man, boy, brother, he, him, his, son"),
  A2=cc("female, woman, girl, sister, she, her, hers, daughter"),
  seed=1)
sc_weat

## Not run:

## the same as the first example, but using regular expression
weat = test_WEAT(
  demodata,
  labels=list(T1="King", T2="Queen", A1="Male", A2="Female"),
  use.pattern=TRUE,  # use regular expression below
  T1="^[kK]ing$",
  T2="^[qQ]ueen$",
  A1="^male$|^man$|^boy$|^brother$|^he$|^him$|^his$|^son$",
  A2="^female$|^woman$|^girl$|^sister$|^she$|^her$|^hers$|^daughter$",
  seed=1)
weat

## replicating Caliskan et al.'s (2017) results
## WEAT7 (Table 1): d = 1.06, p = .018
## (requiring installation of the `sweater` package)
Caliskan.WEAT7 = test_WEAT(
  as_wordvec(sweater::glove_math),
  labels=list(T1="Math", T2="Arts", A1="Male", A2="Female"),
  T1=cc("math, algebra, geometry, calculus, equations, computation, numbers, addition"),
  T2=cc("poetry, art, dance, literature, novel, symphony, drama, sculpture"),
  A1=cc("male, man, boy, brother, he, him, his, son"),
  A2=cc("female, woman, girl, sister, she, her, hers, daughter"),
  p.side=1, seed=1234)
Caliskan.WEAT7
# d = 1.055, p = .0173 (= 173 counts / 10000 permutation samples)

## replicating Caliskan et al.'s (2017) supplemental results
## WEAT7 (Table S1): d = 0.97, p = .027
Caliskan.WEAT7.supp = test_WEAT(
  demodata,
  labels=list(T1="Math", T2="Arts", A1="Male", A2="Female"),
  T1=cc("math, algebra, geometry, calculus, equations, computation, numbers, addition"),
  T2=cc("poetry, art, dance, literature, novel, symphony, drama, sculpture"),
  A1=cc("male, man, boy, brother, he, him, his, son"),
  A2=cc("female, woman, girl, sister, she, her, hers, daughter"),
  p.side=1, seed=1234)
Caliskan.WEAT7.supp
# d = 0.966, p = .0221 (= 221 counts / 10000 permutation samples)

## End(Not run)
Install required Python modules in a new conda environment and initialize the environment, necessary for all text_* functions designed for contextualized word embeddings.
text_init()
Users may first need to manually install Anaconda or Miniconda. The R package text (https://www.r-text.org/) enables users to access HuggingFace Transformers models in R, through the R package reticulate as an interface to Python and the Python modules torch and transformers. For advanced usage, see the documentation of the text package.
## Not run:
text_init()

# You may need to specify the version of Python:
# RStudio -> Tools -> Global/Project Options
# -> Python -> Select -> Conda Environments
# -> Choose ".../textrpp_condaenv/python.exe"
## End(Not run)
Download pre-trained language models (Transformers models, such as GPT, BERT, RoBERTa, DeBERTa, DistilBERT, etc.) from HuggingFace to your local ".cache" folder ("C:/Users/[YourUserName]/.cache/"). The models will never be removed unless you run text_model_remove().
text_model_download(model = NULL)
model: Character string(s) specifying the pre-trained language model(s) to be downloaded. For a full list of options, see HuggingFace. Defaults to NULL, which downloads nothing and checks currently downloaded models. Example choices: "bert-base-uncased", "bert-base-cased", "bert-base-multilingual-cased" (see examples).
Invisibly return the names of all downloaded models.
## Not run:
# text_init()  # initialize the environment

text_model_download()  # check downloaded models

text_model_download(c(
  "bert-base-uncased",
  "bert-base-cased",
  "bert-base-multilingual-cased"
))
## End(Not run)
Remove downloaded models from the local .cache folder.
text_model_remove(model = NULL)
model: Model name. See text_model_download().
## Not run:
# text_init()  # initialize the environment

text_model_remove()
## End(Not run)
Extract hidden layers from a language model and aggregate them to get token (roughly word) embeddings and text embeddings (all reshaped to embed matrices). It is a wrapper function of text::textEmbed().
text_to_vec(
  text,
  model,
  layers = "all",
  layer.to.token = "concatenate",
  token.to.word = TRUE,
  token.to.text = TRUE,
  encoding = "UTF-8",
  ...
)
text: Can be: a character vector of texts, or a file path on disk containing text.
model: Model name at HuggingFace. See text_model_download().
layers: Layers to be extracted from the model and aggregated. Defaults to "all".
layer.to.token: Method to aggregate hidden layers to each token. Defaults to "concatenate".
token.to.word: Aggregate subword token embeddings (if the whole word is out of vocabulary) to whole-word embeddings. Defaults to TRUE.
token.to.text: Aggregate token embeddings to each text. Defaults to TRUE.
encoding: Text encoding (only used if text is a file path). Defaults to "UTF-8".
...: Other parameters passed to text::textEmbed().
A list of:

token.embed: Token (roughly word) embeddings
text.embed: Text embeddings, aggregated from token embeddings
## Not run:
# text_init()  # initialize the environment

text = c("Download models from HuggingFace",
         "Chinese are East Asian",
         "Beijing is the capital of China")
embed = text_to_vec(text, model="bert-base-cased", layers=c(0, 12))
embed

embed1 = embed$token.embed[[1]]
embed2 = embed$token.embed[[2]]
embed3 = embed$token.embed[[3]]

View(embed1)
View(embed2)
View(embed3)
View(embed$text.embed)

plot_similarity(embed1, value.color="grey")
plot_similarity(embed2, value.color="grey")
plot_similarity(embed3, value.color="grey")
plot_similarity(rbind(embed1, embed2, embed3))
## End(Not run)
Note: This function has been deprecated and will not be updated, since I have developed the new package FMAT as the integrative toolbox for the Fill-Mask Association Test (FMAT).

Predict the most likely masked token(s) in a sequence, based on the Python module transformers.
text_unmask(query, model, targets = NULL, topn = 5)
query: A query (sentence/prompt) with masked token(s) [MASK].
model: Model name at HuggingFace. See text_model_download().
targets: Specific target word(s) to be filled in the blank [MASK]. Defaults to NULL.
topn: Number of the most likely predictions to return. Defaults to 5.
Masked language modeling is the task of masking some of the words in a sentence and predicting which words should replace those masks. These models are useful when we want to get a statistical understanding of the language in which the model was trained. See https://huggingface.co/tasks/fill-mask for details.
A data.table of query results:

query_id: (if there is more than one query) query ID, indicating multiple queries
mask_id: (if there is more than one [MASK] in query) [MASK] ID (position in sequence), indicating multiple masks
prob: Probability of the predicted token in the sequence
token_id: Predicted token ID (to replace [MASK])
token: Predicted token (to replace [MASK])
sequence: Complete sentence with the predicted token
## Not run: # text_init() # initialize the environment model = "distilbert-base-cased" text_unmask("Beijing is the [MASK] of China.", model) # multiple [MASK]s: text_unmask("Beijing is the [MASK] [MASK] of China.", model) # multiple queries: text_unmask(c("The man worked as a [MASK].", "The woman worked as a [MASK]."), model) # specific targets: text_unmask("The [MASK] worked as a nurse.", model, targets=c("man", "woman")) ## End(Not run)
Tokenize raw text for training word embeddings.
tokenize( text, tokenizer = text2vec::word_tokenizer, split = " ", remove = "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.", encoding = "UTF-8", simplify = TRUE, verbose = TRUE )
text |
A character vector of text, or a file path on disk containing text. |
tokenizer |
Function used to tokenize the text.
Defaults to text2vec::word_tokenizer. |
split |
Separator between tokens (only used when simplify=TRUE).
Defaults to " ". |
remove |
Strings (in regular expression) to be removed from the text.
Defaults to "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.". |
encoding |
Text encoding (only used if text is a file path).
Defaults to "UTF-8". |
simplify |
Return a character vector (TRUE) or a list of character vectors (FALSE).
Defaults to TRUE. |
verbose |
Print information to the console? Defaults to TRUE. |
simplify=TRUE: A tokenized character vector, with each element as a sentence.
simplify=FALSE: A list of tokenized character vectors, with each element as a vector of tokens in a sentence.
txt1 = c(
  "I love natural language processing (NLP)!",
  "I've been in this city for 10 years. I really like here!",
  "However, my computer is not among the \"Top 10\" list."
)
tokenize(txt1, simplify=FALSE)
tokenize(txt1) %>% cat(sep="\n----\n")

txt2 = text2vec::movie_review$review[1:5]
texts = tokenize(txt2)
txt2[1]
texts[1:20]  # all sentences in txt2[1]
Train static word embeddings using the Word2Vec, GloVe, or FastText algorithm with multi-threading.
train_wordvec( text, method = c("word2vec", "glove", "fasttext"), dims = 300, window = 5, min.freq = 5, threads = 8, model = c("skip-gram", "cbow"), loss = c("ns", "hs"), negative = 5, subsample = 1e-04, learning = 0.05, ngrams = c(3, 6), x.max = 10, convergence = -1, stopwords = character(0), encoding = "UTF-8", tolower = FALSE, normalize = FALSE, iteration, tokenizer, remove, file.save, compress = "bzip2", verbose = TRUE )
text |
A character vector of text, or a file path on disk containing text. |
method |
Training algorithm: "word2vec" (default), "glove", or "fasttext". |
dims |
Number of dimensions of word vectors to be trained.
Common choices include 50, 100, 200, 300, and 500.
Defaults to 300. |
window |
Window size (number of nearby words behind/ahead of the current word).
It defines how many surrounding words are included in training:
[window] words behind and [window] words ahead ([window]*2 in total).
Defaults to 5. See the sketch after this parameter list for a concrete illustration. |
min.freq |
Minimum frequency of words to be included in training.
Words that appear fewer times than this value will be excluded from the vocabulary.
Defaults to 5. |
threads |
Number of CPU threads used for training.
A modest value produces the fastest training.
Too many threads are not always helpful.
Defaults to 8. |
model |
<Only for Word2Vec / FastText> Learning model architecture:
"skip-gram" (default) or "cbow". |
loss |
<Only for Word2Vec / FastText> Loss function (computationally efficient approximation):
"ns" (negative sampling, default) or "hs" (hierarchical softmax). |
negative |
<Only for Negative Sampling in Word2Vec / FastText> Number of negative examples.
Values in the range 5~20 are useful for small training datasets,
while for large datasets the value can be as small as 2~5.
Defaults to 5. |
subsample |
<Only for Word2Vec / FastText> Subsampling of frequent words (threshold for occurrence of words).
Those that appear with higher frequency in the training data will be randomly down-sampled.
Defaults to 1e-04. |
learning |
<Only for Word2Vec / FastText> Initial (starting) learning rate, also known as alpha.
Defaults to 0.05. |
ngrams |
<Only for FastText> Minimal and maximal ngram length.
Defaults to c(3, 6). |
x.max |
<Only for GloVe> Maximum number of co-occurrences to use in the weighting function.
Defaults to 10. |
convergence |
<Only for GloVe> Convergence tolerance for SGD iterations. Defaults to -1. |
stopwords |
<Only for Word2Vec / GloVe> A character vector of stopwords to be excluded from training. |
encoding |
Text encoding. Defaults to "UTF-8". |
tolower |
Convert all upper-case characters to lower-case?
Defaults to FALSE. |
normalize |
Normalize all word vectors to unit length?
Defaults to FALSE. |
iteration |
Number of training iterations.
More iterations make the model more precise,
but the computational cost is linearly proportional to the number of iterations. |
tokenizer |
Function used to tokenize the text (see tokenize). |
remove |
Strings (in regular expression) to be removed from the text (see tokenize). |
file.save |
File name of to-be-saved R data (must be .RData). |
compress |
Compression method for the saved file. Defaults to "bzip2".
Options include "gzip", "bzip2", and "xz". |
verbose |
Print information to the console? Defaults to TRUE. |
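To make the window parameter concrete, the base-R sketch below (an illustration only, independent of the training code) prints the context words that a window of 2 would pair with each target word in a toy sentence.

# Sketch: context words captured by window = 2 for each target word
tokens = c("the", "cat", "sat", "on", "the", "mat")
window = 2
for (i in seq_along(tokens)) {
  context = tokens[setdiff(max(1, i - window):min(length(tokens), i + window), i)]
  cat(tokens[i], "->", paste(context, collapse=", "), "\n")
}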
A wordvec (data.table) with three variables: word, vec, and freq.
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
All-in-one package:
Word2Vec:
GloVe:
FastText:
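If file.save is specified, the trained wordvec is written to an .RData file that can be reloaded in a later session. The sketch below is illustrative: the file name is hypothetical, and base R load()/get() are used as a generic way to restore the saved object.

## Not run (sketch): save during training and reload later.
# `text` is a character vector of raw text (see the examples below).
dt = train_wordvec(text, method="word2vec", dims=50,
                   file.save="movie_review_word2vec.RData", compress="bzip2")
# in a new R session:
obj = load("movie_review_word2vec.RData")  # returns the saved object's name
dt = get(obj[1])
## End(Not run)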
review = text2vec::movie_review  # a data.frame
text = review$review

## Note: All the examples train 50 dims for faster code check.

## Word2Vec (SGNS)
dt1 = train_wordvec(
  text,
  method="word2vec",
  model="skip-gram",
  dims=50, window=5,
  normalize=TRUE)
dt1
most_similar(dt1, "Ive")  # evaluate performance
most_similar(dt1, ~ man - he + she, topn=5)  # evaluate performance
most_similar(dt1, ~ boy - he + she, topn=5)  # evaluate performance

## GloVe
dt2 = train_wordvec(
  text,
  method="glove",
  dims=50, window=5,
  normalize=TRUE)
dt2
most_similar(dt2, "Ive")  # evaluate performance
most_similar(dt2, ~ man - he + she, topn=5)  # evaluate performance
most_similar(dt2, ~ boy - he + she, topn=5)  # evaluate performance

## FastText
dt3 = train_wordvec(
  text,
  method="fasttext",
  model="skip-gram",
  dims=50, window=5,
  normalize=TRUE)
dt3
most_similar(dt3, "Ive")  # evaluate performance
most_similar(dt3, ~ man - he + she, topn=5)  # evaluate performance
most_similar(dt3, ~ boy - he + she, topn=5)  # evaluate performance