semanticfa performs exploratory factor analysis on
language model embeddings of psychological scale items. Given item text,
it embeds each item, computes a similarity matrix, and extracts latent
factors — entirely from the text, with no human response data
required.
The package is designed to feel familiar to psych and
EFAtools users.
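The embed, similarity, extraction pipeline described above can be sketched in base R. This is only an illustration under stated assumptions: the embeddings are made-up random numbers, cosine similarity stands in for the similarity step, and a plain eigendecomposition stands in for the factor extraction that sfa() performs (minres with oblimin, according to its printed output).

```r
# Toy stand-in for item embeddings: 6 items x 8 dimensions (made-up numbers).
set.seed(1)
emb <- matrix(rnorm(6 * 8), nrow = 6,
              dimnames = list(paste0("item_", 1:6), NULL))

# Cosine similarity: L2-normalise each row, then take the cross-product.
unit <- emb / sqrt(rowSums(emb^2))
sim <- tcrossprod(unit)

# Eigenvalues of the similarity matrix show how much shared structure
# each candidate factor could capture; they sum to the number of items.
eig <- eigen(sim, symmetric = TRUE)$values
round(eig, 3)
```

Because the similarity matrix has a unit diagonal, it can be handed to the same tools that accept a correlation matrix, which is what makes the psych-style workflow possible.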
The package ships with the 50-item IPIP Big Five inventory and precomputed Qwen3-Embedding-8B embeddings (50 x 4096, rounded to 4 decimal places), so you can try it with zero setup:
library(semanticfa)
data(big5)
fit <- sfa(
  big5$items,
  nfactors = 5,
  embeddings = big5$embeddings,
  scoring = big5$scoring
)
print(fit)
#> Semantic Factor Analysis
#> Encoding: atomic
#> Embedding dim: 4096
#> Factors: 5 (minres + oblimin)
#>
#> Diagnostics:
#> KMO: 0.971 (marvelous - higher is better)
#> TEFI: -27.7158 (lower is better)
#> RMSR: 0.0323 (good - lower is better)
#> CAF: 0.4120 (marginal - higher is better)
#>
#> Factor loadings:
#>
#> Loadings:
#> MR4 MR5 MR3 MR1 MR2
#> item_16 1.003
#> item_19 0.870
#> item_18 0.861
#> item_11 0.817
#> item_15 0.801
#> item_17 0.790
#> item_20 0.575
#> item_29 0.553 0.390
#> item_13 0.490
#> item_24 0.453 0.425
#> item_26 0.391 0.303
#> item_23
#> item_35 0.715
#> item_39 0.675
#> item_31 0.626
#> item_32 0.617
#> item_40 0.604
#> item_37 0.567
#> item_38 0.566 0.337
#> item_36 0.566
#> item_33 0.565
#> item_04 0.545
#> item_34 0.355 0.502
#> item_28 0.458 0.308
#> item_12 0.395 0.347
#> item_49 0.350
#> item_27 0.895
#> item_25 0.787
#> item_21 0.718
#> item_44 0.615 0.404
#> item_08 0.570
#> item_02 0.449
#> item_46 0.438 0.438
#> item_06 0.432
#> item_10 0.408 0.390
#> item_14 0.319
#> item_03 0.751
#> item_30 0.596
#> item_07 0.509
#> item_22 0.484
#> item_09 0.450
#> item_05 0.384
#> item_01 0.309 0.353
#> item_50 0.607
#> item_45 0.300 0.583
#> item_42 0.350 0.563
#> item_43 0.555
#> item_41 0.450
#> item_48 0.425
#> item_47 0.333
#>
#> MR4 MR5 MR3 MR1 MR2
#> SS loadings 6.423 5.320 4.114 3.189 2.742
#> Proportion Var 0.128 0.106 0.082 0.064 0.055
#> Cumulative Var 0.128 0.235 0.317 0.381 0.436
#>
#> Factor correlations (Phi):
#> MR4 MR5 MR3 MR1 MR2
#> MR4 1.000 0.684 0.607 0.576 0.540
#> MR5 0.684 1.000 0.538 0.581 0.555
#> MR3 0.607 0.538 1.000 0.465 0.447
#> MR1 0.576 0.581 0.465 1.000 0.398
#> MR2 0.540 0.555 0.447 0.398 1.000
#>
#> Variance accounted for:
#> MR4 MR5 MR3 MR1 MR2
#> SS loadings 9.240 8.620 6.284 5.612 4.842
#> Proportion Var 0.185 0.172 0.126 0.112 0.097
#> Cumulative Var 0.185 0.357 0.483 0.595 0.692
#> Proportion Explained 0.267 0.249 0.182 0.162 0.140
#> Cumulative Proportion 0.267 0.516 0.698 0.860 1.000
When you omit nfactors, sfa() uses
embedding-adapted parallel analysis (random unit vectors in the
embedding dimension as the null):
fit_auto <- sfa(
  big5$items,
  embeddings = big5$embeddings,
  scoring = big5$scoring
)
cat("Auto-detected factors:", fit_auto$factors, "\n")
#> Auto-detected factors: 5
For a multi-method comparison, use sfa_nfactors():
sim <- sfa_similarity(big5$embeddings, encoding = "atomic_reversed",
                      scoring = big5$scoring)
nf <- sfa_nfactors(sim, big5$embeddings,
                   methods = c("parallel", "kaiser"),
                   parallel_iter = 50)
print(nf)
#> Factor retention analysis (embedding-adapted)
#>
#> Method       n_factors
#> parallel     5
#> kaiser       6
#> ------------------------
#> Consensus    5
#>
#> Eigenvalues: 29.62 1.92 1.74 1.48 1.27 1.01 0.93 0.88 0.78 0.70 ...
The encoding argument controls how embeddings become a
similarity matrix:
sim_ar <- sfa_similarity(big5$embeddings, "atomic_reversed", big5$scoring)
sim_sq <- sfa_similarity(big5$embeddings, "squid", big5$scoring)
#> Warning: 'scoring' has reverse-keyed items but is ignored for encoding =
#> "squid": this method is keying-free by design. Use "atomic_reversed" for keyed
#> sign-flipping.
sim_mcp <- sfa_similarity(big5$embeddings, "mean_centered_pearson", big5$scoring)
#> Warning: 'scoring' has reverse-keyed items but is ignored for encoding =
#> "mean_centered_pearson": this method is keying-free by design. Use
#> "atomic_reversed" for keyed sign-flipping.
cat("atomic_reversed range:", range(sim_ar[lower.tri(sim_ar)]), "\n")
#> atomic_reversed range: -0.7766852 0.9381671
cat("squid range: ", range(sim_sq[lower.tri(sim_sq)]), "\n")
#> squid range: -0.2671767 0.8269712
cat("mean_centered_pearson:", range(sim_mcp[lower.tri(sim_mcp)]), "\n")
#> mean_centered_pearson: 0.3938766 0.9382228
The ranges show what each encoding does to the sign structure.
Sign-flipping (atomic_reversed) manufactures strong
negatives by turning reverse-keyed items into anti-topic vectors, and
SQuID’s questionnaire-mean centering recovers a modest negative range.
Mean-centered Pearson tracks the raw cosines, which stay positive for
these items.
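The effect on sign structure can be reproduced on toy data in base R. The sketch below rests on assumed readings of the three encodings (multiply reverse-keyed rows by -1, subtract the questionnaire's mean vector, correlate items across embedding dimensions); it is not the package's implementation, and the embeddings and keying vector are invented.

```r
set.seed(42)
n_items <- 8
n_dim <- 32

# A shared component makes every raw cosine positive, as with real item embeddings.
shared <- rnorm(n_dim)
emb <- t(replicate(n_items, shared + 0.5 * rnorm(n_dim)))
keys <- c(1, 1, -1, 1, -1, 1, 1, -1)  # hypothetical keying: -1 = reverse-keyed

cosine <- function(x) tcrossprod(x / sqrt(rowSums(x^2)))
off_diag_range <- function(s) range(s[lower.tri(s)])

raw      <- cosine(emb)
flipped  <- cosine(emb * keys)                    # sign-flip reverse-keyed rows
centered <- cosine(sweep(emb, 2, colMeans(emb)))  # remove the questionnaire mean vector
pearson  <- cor(t(emb))                           # centre each item over its own dimensions

rbind(raw      = off_diag_range(raw),
      flipped  = off_diag_range(flipped),
      centered = off_diag_range(centered),
      pearson  = off_diag_range(pearson))
```

Flipping produces negatives as large as the raw positives, while subtracting the mean vector removes the shared component and so lets some pairs fall below zero.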
[Figure: Scree plot with parallel analysis threshold]
[Figure: Factor loading heatmap]
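The parallel-analysis threshold in the scree plot comes from the null described earlier: random unit vectors in the embedding dimension. A rough base R sketch of that idea follows. The 95th-percentile cutoff, the toy dimensions, and the two planted clusters are all assumptions made for illustration, and sfa_nfactors() may differ in its details.

```r
set.seed(123)
n_items <- 20
n_dim <- 64
n_iter <- 50

cosine <- function(x) tcrossprod(x / sqrt(rowSums(x^2)))

# Null eigenvalues: cosine matrices of random directions in the embedding space.
null_eigs <- replicate(n_iter, {
  z <- matrix(rnorm(n_items * n_dim), nrow = n_items)
  eigen(cosine(z), symmetric = TRUE, only.values = TRUE)$values
})
# One threshold per eigenvalue rank (95th percentile is an assumption).
threshold <- apply(null_eigs, 1, quantile, probs = 0.95)

# Toy "observed" embeddings: two planted clusters of 10 items each.
centers <- matrix(rnorm(2 * n_dim), nrow = 2)
obs <- centers[rep(1:2, each = 10), ] +
  0.7 * matrix(rnorm(n_items * n_dim), nrow = n_items)
obs_eigs <- eigen(cosine(obs), symmetric = TRUE, only.values = TRUE)$values

# Retain factors up to the first observed eigenvalue that fails to beat the null.
n_retain <- which(obs_eigs <= threshold)[1] - 1
n_retain
```

Plotting obs_eigs against threshold gives a scree plot of the same kind as the one in the figure above.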
The $loadings component works directly with
psych functions:
# Run human-data EFA (not run — requires response data)
human_fit <- psych::fa(response_data, nfactors = 5, rotate = "oblimin")
# Compare
psych::factor.congruence(fit$loadings, human_fit$loadings)
For NMI, ARI, Frobenius, and disattenuated correlation:
Pass any embedding model's output via embeddings=, or supply your own embedding function via embed=:
# With sentence-transformers (requires reticulate + Python).
# The default model is "Qwen/Qwen3-Embedding-0.6B"; larger models such as
# "Qwen/Qwen3-Embedding-4B" (8 GB RAM) or "Qwen/Qwen3-Embedding-8B" (16 GB RAM)
# recover factor structure more accurately.
emb <- sfa_embed(my_items, embed = "sbert", model = "Qwen/Qwen3-Embedding-0.6B")
fit <- sfa(my_items, embeddings = emb, scoring = my_scoring)
# Or bring your own function
my_embedder <- function(texts) {
  # ... your embedding logic ...
  # must return a numeric matrix (n_items x dim)
}
fit <- sfa(my_items, embed = my_embedder, scoring = my_scoring)
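The contract for a custom embedder is a character vector in and a numeric matrix (n_items x dim) out. The hypothetical toy_embedder() below satisfies that contract with letter-frequency vectors. It is deliberately crude and carries no real semantics, so it is useful only for checking the plumbing before you swap in a real model. The item texts are illustrative.

```r
# Hypothetical toy embedder: one 26-dimensional letter-count vector per text.
toy_embedder <- function(texts) {
  emb <- t(vapply(texts, function(txt) {
    chars <- strsplit(tolower(txt), "")[[1]]
    as.numeric(table(factor(chars, levels = letters)))
  }, numeric(26)))
  rownames(emb) <- NULL
  emb
}

items <- c("I am the life of the party.",
           "I do not talk a lot.",
           "I feel comfortable around people.")
emb <- toy_embedder(items)

# Check the shape contract before handing the function to sfa().
stopifnot(is.matrix(emb), is.numeric(emb), nrow(emb) == length(items))
dim(emb)
```

A function with this shape can be passed as embed = toy_embedder in the same way as my_embedder above.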