NEWS
PsychWordVec 2023.9 (2023-09-27)
Minor Changes
- Use \donttest{} in more examples to avoid unnecessary errors.
- Improved text_unmask(), though it has been deprecated.
PsychWordVec 2023.8 (2023-08-08)
Minor Changes
- Now use "YYYY.M" as package version number.
- Deprecated text_unmask(), since I have developed a new package, FMAT, as an integrative toolbox for the Fill-Mask Association Test (FMAT).
PsychWordVec 0.3.2 (2023-03-04)
Minor Changes
- Changed welcome messages by using packageStartupMessage() so that the messages can be suppressed.
- Improved text_unmask(), but a new package (currently not publicly available) has been developed for a more general purpose of using masked language models to measure conceptual associations. Please wait for the release of this new package and the publication of a related methodological article.
Bug Fixes
- Fixed problematic normalized attribute when using data_wordvec_load().
PsychWordVec 0.3.0 (2022-12-15)
New Features
- New S3 [ method for embed, see new examples in as_embed().
- New S3 unique() method to delete duplicate words.
- New S3 str() method to print the data structure and attributes.
- New pattern() function designed for the S3 [ method of embed: Users can directly use a regular expression, like embed[pattern("^for")], to extract a subset of the embedding matrix.
- New plot_network() function: Visualize a (partial correlation) network graph of words. Very useful for identifying potential semantic clusters from a list of words and even useful for disentangling antonyms from synonyms.
- New targets argument of text_unmask(): Return specific fill-mask results for certain target words (rather than the top n results).
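To give a feel for what pattern-based subsetting of an embedding matrix does, here is a minimal base R sketch. It does not use the package's pattern() function or embed class; the matrix, its values, and the word list are made up for illustration, and the regular expression is applied to the row names with grepl().

```r
# Toy embedding matrix: one row per word, two made-up dimensions
words <- c("for", "forest", "before", "form")
embed <- matrix(seq_len(8), nrow = 4,
                dimnames = list(words, c("dim1", "dim2")))

# Keep only the rows whose word matches the regular expression "^for"
# (drop = FALSE keeps the result a matrix even if one row matches)
sub <- embed[grepl("^for", rownames(embed)), , drop = FALSE]

rownames(sub)  # "for" "forest" "form"
```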
Major Changes
- Most functions have been substantially optimized for speed, especially tab_similarity(), most_similar(), dict_expand(), dict_reliability(), test_WEAT(), and test_RND().
- Improved S3 print() method for embed and wordvec.
- pair_similarity() has been improved by using the matrix operation tcrossprod(embed, embed) to compute cosine similarity, with embed normalized.
- data_wordvec_load() now has two wrapper functions, load_wordvec() and load_embed(), for faster use.
- data_wordvec_normalize() (deprecated) has been renamed to normalize().
- get_wordvecs() (deprecated) has been integrated into get_wordvec().
- tab_similarity_cross() (deprecated) has been integrated into tab_similarity().
- test_WEAT() and test_RND(): Now warn if T1 and T2, or A1 and A2, have duplicate values.
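The pair_similarity() change above relies on a general fact: once every row of a matrix has unit length, the cross-product of the matrix with itself is the full cosine similarity matrix. The base R sketch below shows this with a toy matrix; the words and values are invented here, and the code is an illustration of the idea rather than the package's internal implementation.

```r
# Toy embedding matrix: one row per word (values are made up)
embed <- rbind(
  king  = c(1, 2, 2),
  queen = c(2, 4, 4),
  apple = c(2, -1, 0)
)

# Normalize each row to unit length (L2 norm);
# the vector of row norms is recycled down the columns, so row i is divided by norm i
embed_norm <- embed / sqrt(rowSums(embed^2))

# For unit-length rows, embed_norm %*% t(embed_norm) holds all pairwise cosine similarities
cos_sim <- tcrossprod(embed_norm, embed_norm)
```

In this toy data, the "king" and "queen" vectors point in the same direction, so their cosine similarity is 1, while "king" and "apple" are orthogonal, so theirs is 0.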
Bug Fixes
- Fixed the issue of unexpectedly long loading and processing time in 0.2.0, which was related to duplicate words in .RData, too many words in embed or wordvec, and too many words to be printed to console. All related functions have been substantially improved so that they no longer take an unnecessarily long time.
PsychWordVec 0.2.0 (2022-12-01)
Breaking News
- Most functions now internally use embed (an extended class of matrix) rather than wordvec in order to enhance the speed!
- New series of text_* functions for contextualized word embeddings! Based on the R package text (and using the R package reticulate to call functions from the Python module transformers), a series of new functions have been developed to (1) download HuggingFace Transformers pre-trained language models (PLM; thousands of options such as GPT, BERT, RoBERTa, DeBERTa, DistilBERT, etc.), (2) extract contextualized token (roughly word) embeddings and text embeddings, and (3) fill in the blank mask(s) in a query (e.g., "Beijing is the [MASK] of China.").
  - text_init(): set up a Python environment for PLM
  - text_model_download(): download PLMs from HuggingFace to local ".cache" folder
  - text_model_remove(): remove PLMs from local ".cache" folder
  - text_to_vec(): extract contextualized token and text embeddings
  - text_unmask(): fill in the blank mask(s) in a query
- New orth_procrustes() function: Orthogonal Procrustes matrix alignment. Users can input either two matrices of word embeddings or two wordvec objects as loaded by data_wordvec_load() or transformed from matrices by as_wordvec().
- New dict_expand() function: Expand a dictionary from the most similar words, based on most_similar().
- New dict_reliability() function: Reliability analysis (Cronbach's α) and Principal Component Analysis (PCA) of a dictionary. Note that Cronbach's α may be misleading when the number of items/words is large.
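As a rough illustration of what orthogonal Procrustes alignment does, the base R sketch below finds the orthogonal matrix Q that minimizes the Frobenius norm of X %*% Q - Y, using a singular value decomposition. The matrices X, Y, and R are toy data invented here, and the direction of alignment (rotating X toward Y) is an assumption; orth_procrustes() may organize its inputs and outputs differently.

```r
set.seed(1)

# Toy "embedding" matrix: 5 words x 4 dimensions (random values)
X <- matrix(rnorm(20), nrow = 5)

# A known rotation of the first two dimensions, used to build a rotated copy of X
theta <- pi / 6
R <- diag(4)
R[1:2, 1:2] <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
Y <- X %*% R

# Orthogonal Procrustes solution: with t(X) %*% Y = U D t(V), the minimizer is Q = U t(V)
s <- svd(crossprod(X, Y))
Q <- tcrossprod(s$u, s$v)

# X %*% Q is X rotated into the space of Y
X_aligned <- X %*% Q
```

Because Y was built as an exact rotation of X, the recovered Q matches R up to numerical tolerance, and X_aligned matches Y. With real embeddings from two different models, the alignment would only be approximate.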
New Features
- New sum_wordvec() function: Calculate the sum vector of multiple words.
- New plot_similarity() function: Visualize cosine similarities between word pairs in a style of correlation matrix plot.
- New tab_similarity_cross() function: A wrapper of tab_similarity() to tabulate cosine similarities for only n1 * n2 word pairs from two sets of words (arguments: words1, words2).
- New S3 methods: print.wordvec(), print.embed(), rbind.wordvec(), rbind.embed(), subset.wordvec(), subset.embed()
Major Changes
- as_matrix() has been renamed to as_embed(): Now PsychWordVec supports two classes of data objects -- wordvec (data.table) and embed (matrix). Most functions now use embed (or transform wordvec to embed) internally so as to enhance the speed. Matrix is much faster!
- Deprecated data_wordvec_reshape(): Now use as_wordvec() and as_embed().
Minor Changes
- Defaults changed in data_wordvec_subset(), get_wordvecs(), tab_similarity(), and plot_similarity(): If neither words nor pattern is specified (NULL), then all words in data will be extracted.
- Improved S3 methods print.weat() and print.rnd().
PsychWordVec 0.1.2 (2022-11-03)
New Features
- Added a permutation test of significance for both test_WEAT() and test_RND(): Users can specify the number of permutation samples and choose to calculate either a one-sided or a two-sided p value. The results in Caliskan et al.'s (2017) article can be reproduced well.
- Added the pooled.sd argument for test_WEAT(): Users can choose the method used to calculate the pooled SD for the effect size estimate in WEAT. However, the original approach proposed by Caliskan et al. (2017) is the default and highly recommended.
- Wrapper functions as_matrix() and as_wordvec() for data_wordvec_reshape(), which make it easier to reshape word embedding data from a matrix to a "wordvec" data.table or vice versa.
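For readers unfamiliar with the WEAT, the base R sketch below walks through the standardized effect size and a one-sided permutation p value on toy data. It is not the code of test_WEAT(): all vectors are invented, the helper names cos_to_rows(), assoc(), and stat() are hypothetical, the SD is the sample SD across all target words, and the permutation test enumerates every equal-size split of the targets (feasible only for tiny sets) and counts splits at least as extreme as the observed one.

```r
# Cosine similarity between one vector w and each row of a matrix M
cos_to_rows <- function(w, M) {
  as.vector(M %*% w) / (sqrt(rowSums(M^2)) * sqrt(sum(w^2)))
}

# Association of one target word with attribute set A1 relative to A2
assoc <- function(w, A1, A2) {
  mean(cos_to_rows(w, A1)) - mean(cos_to_rows(w, A2))
}

# Toy 2-D "embeddings" (made-up values, one row per word)
T1 <- rbind(c(1, 0.1), c(0.9, 0.2))  # target set 1
T2 <- rbind(c(0.1, 1), c(0.2, 0.9))  # target set 2
A1 <- rbind(c(1, 0), c(0.8, 0.1))    # attribute set 1
A2 <- rbind(c(0, 1), c(0.1, 0.8))    # attribute set 2

# Association score for every target word (T1 words first, then T2 words)
targets <- rbind(T1, T2)
s <- apply(targets, 1, assoc, A1 = A1, A2 = A2)
n1 <- nrow(T1)

# Standardized effect size: difference in mean association between the two
# target sets, divided by the SD of association scores across all target words
eff_size <- (mean(s[1:n1]) - mean(s[-(1:n1)])) / sd(s)

# Test statistic for a given assignment of words (by index) to the first target set
stat <- function(idx) sum(s[idx]) - sum(s[-idx])
obs <- stat(1:n1)

# Exact permutation distribution over all equal-size splits of the target words
perm <- combn(length(s), n1, FUN = stat)
p_one_sided <- mean(perm >= obs)
```

With larger word sets, enumerating all splits becomes infeasible, which is why a number of random permutation samples is typically drawn instead.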
Major Changes
- Both test_WEAT() and test_RND() have changed the element names and S3 print method of their returned objects (of new class weat and rnd, respectively): The elements $eff.raw, $eff.size, and $eff.sum are now deprecated and replaced by $eff, which is a data.table containing the overall raw/standardized effects and permutation p value. The new S3 print methods print.weat() and print.rnd() produce a tidy report of the test results when you directly type and print the returned object (see code examples).
- Improved command line interfaces using the cli package.
- Improved welcome messages shown on library(PsychWordVec).
PsychWordVec 0.1.0 (2022-08-22)
- CRAN initial release.
- Fixed all issues in the CRAN manual inspection.
PsychWordVec 0.0.8
New Features
- Added wordvec as the primary class of word vectors data: Such objects now have the classes wordvec, data.table, and data.frame, and in practice behave as a data.table.
- New train_wordvec() function: Train word vectors using the Word2Vec, GloVe, or FastText algorithm with multi-threading.
- New tokenize() function: Tokenize raw texts for training word vectors.
- New data_wordvec_reshape() function: Reshape word vectors data from dense (a data.table of new class wordvec with two variables word and vec) to plain (a matrix of word vectors) or vice versa.
- New test_RND() function, and tab_WEAT() is renamed to test_WEAT(): These two functions serve as convenient tools for word semantic similarity analysis and conceptual association tests.
- New plot_wordvec_tSNE() function: Visualize 2-D or 3-D word vectors with dimensionality reduced using the t-Distributed Stochastic Neighbor Embedding (t-SNE) method.
PsychWordVec 0.0.6
New Features
- Enhanced all functions.
- New data_wordvec_subset() function.
- Added the unique argument for tab_similarity().
- Added support for regular expression patterns in test_WEAT().
PsychWordVec 0.0.4
- Initial public release on GitHub with more functions.
PsychWordVec 0.0.1