Getting Started with semanticfa

Overview

semanticfa performs exploratory factor analysis on language model embeddings of psychological scale items. Given item text, it embeds each item, computes an item-by-item similarity matrix, and extracts latent factors from that matrix. Everything is derived from the item text; no human response data is required.

The interface is designed to feel familiar to users of the psych and EFAtools packages.
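
The three steps named in the overview (embed, similarity, extraction) can be sketched in base R. This is only an illustration under stated assumptions: the embeddings are random toy vectors, plain cosine similarity stands in for whichever encoding sfa() applies, and an eigendecomposition stands in for factor extraction. None of the object names below are part of semanticfa.

```r
set.seed(1)

# Toy "embeddings": 6 items in 8 dimensions, built from two hypothetical
# topics so that items 1-3 and items 4-6 point in similar directions.
topic_a <- rnorm(8)
topic_b <- rnorm(8)
emb <- rbind(
  t(replicate(3, topic_a + rnorm(8, sd = 0.3))),
  t(replicate(3, topic_b + rnorm(8, sd = 0.3)))
)

# Step 1: scale each row to unit length, so dot products are cosines.
unit <- emb / sqrt(rowSums(emb^2))

# Step 2: item-by-item cosine similarity matrix (6 x 6, unit diagonal).
sim <- unit %*% t(unit)

# Step 3: eigendecomposition of the similarity matrix, used here as a
# simple stand-in for factor extraction.
ev <- eigen(sim, symmetric = TRUE)
round(ev$values, 2)
```

In this toy setup the two planted topics should show up as two eigenvalues that are much larger than the rest.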

Quick start with bundled data

The package ships with the 50-item IPIP Big Five inventory and precomputed Qwen3-Embedding-8B embeddings (50 x 4096, rounded to 4 decimal places), so you can try it with zero setup:

library(semanticfa)
data(big5)

fit <- sfa(
  big5$items,
  nfactors    = 5,
  embeddings  = big5$embeddings,
  scoring     = big5$scoring
)
print(fit)
#> Semantic Factor Analysis
#>   Encoding: atomic 
#>   Embedding dim: 4096 
#>   Factors:5 (minres + oblimin)
#> 
#> Diagnostics:
#>   KMO:  0.971 (marvelous - higher is better)
#>   TEFI: -27.7158 (lower is better)
#>   RMSR: 0.0323 (good - lower is better)
#>   CAF:  0.4120 (marginal - higher is better)
#> 
#> Factor loadings:
#> 
#> Loadings:
#>         MR4    MR5    MR3    MR1    MR2   
#> item_16  1.003                            
#> item_19  0.870                            
#> item_18  0.861                            
#> item_11  0.817                            
#> item_15  0.801                            
#> item_17  0.790                            
#> item_20  0.575                            
#> item_29  0.553                0.390       
#> item_13  0.490                            
#> item_24  0.453                0.425       
#> item_26  0.391                0.303       
#> item_23                                   
#> item_35         0.715                     
#> item_39         0.675                     
#> item_31         0.626                     
#> item_32         0.617                     
#> item_40         0.604                     
#> item_37         0.567                     
#> item_38         0.566  0.337              
#> item_36         0.566                     
#> item_33         0.565                     
#> item_04         0.545                     
#> item_34  0.355  0.502                     
#> item_28         0.458         0.308       
#> item_12         0.395         0.347       
#> item_49         0.350                     
#> item_27                0.895              
#> item_25                0.787              
#> item_21                0.718              
#> item_44                0.615         0.404
#> item_08                0.570              
#> item_02                0.449              
#> item_46                0.438         0.438
#> item_06                0.432              
#> item_10                0.408  0.390       
#> item_14                0.319              
#> item_03                       0.751       
#> item_30                       0.596       
#> item_07                       0.509       
#> item_22                       0.484       
#> item_09                       0.450       
#> item_05                       0.384       
#> item_01         0.309         0.353       
#> item_50                              0.607
#> item_45         0.300                0.583
#> item_42                0.350         0.563
#> item_43                              0.555
#> item_41                              0.450
#> item_48                              0.425
#> item_47                              0.333
#> 
#>                  MR4   MR5   MR3   MR1   MR2
#> SS loadings    6.423 5.320 4.114 3.189 2.742
#> Proportion Var 0.128 0.106 0.082 0.064 0.055
#> Cumulative Var 0.128 0.235 0.317 0.381 0.436
#> 
#> Factor correlations (Phi):
#>       MR4   MR5   MR3   MR1   MR2
#> MR4 1.000 0.684 0.607 0.576 0.540
#> MR5 0.684 1.000 0.538 0.581 0.555
#> MR3 0.607 0.538 1.000 0.465 0.447
#> MR1 0.576 0.581 0.465 1.000 0.398
#> MR2 0.540 0.555 0.447 0.398 1.000
#> 
#> Variance accounted for:
#>                         MR4   MR5   MR3   MR1   MR2
#> SS loadings           9.240 8.620 6.284 5.612 4.842
#> Proportion Var        0.185 0.172 0.126 0.112 0.097
#> Cumulative Var        0.185 0.357 0.483 0.595 0.692
#> Proportion Explained  0.267 0.249 0.182 0.162 0.140
#> Cumulative Proportion 0.267 0.516 0.698 0.860 1.000
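
The rows of the "Variance accounted for" table are related by simple arithmetic, which the sketch below reproduces from the printed SS loadings. It assumes the denominator for "Proportion Var" is the number of items (50), an assumption that matches the printed values to three decimals.

```r
# SS loadings as printed in the "Variance accounted for" table above.
ss <- c(MR4 = 9.240, MR5 = 8.620, MR3 = 6.284, MR1 = 5.612, MR2 = 4.842)
n_items <- 50  # the bundled inventory has 50 items

prop_var       <- ss / n_items   # share of total item variance
prop_explained <- ss / sum(ss)   # share of the variance the five factors explain

round(rbind(
  "Proportion Var"        = prop_var,
  "Cumulative Var"        = cumsum(prop_var),
  "Proportion Explained"  = prop_explained,
  "Cumulative Proportion" = cumsum(prop_explained)
), 3)
```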

Factor retention

When you omit nfactors, sfa() chooses the number of factors by embedding-adapted parallel analysis, in which random unit vectors of the same dimension as the embeddings serve as the null model:

fit_auto <- sfa(
  big5$items,
  embeddings = big5$embeddings,
  scoring    = big5$scoring
)
cat("Auto-detected factors:", fit_auto$factors, "\n")
#> Auto-detected factors: 5
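
The idea of a random-unit-vector null can be sketched in base R as follows. This is not the package's implementation: the helper names null_eigenvalues() and parallel_retain(), the 95% quantile threshold, and the stopping rule are all assumptions made for illustration.

```r
set.seed(42)

# Hypothetical helper: eigenvalues of the cosine similarity matrix of
# n random unit vectors in `dim` dimensions.
null_eigenvalues <- function(n, dim) {
  x <- matrix(rnorm(n * dim), nrow = n)
  x <- x / sqrt(rowSums(x^2))          # scale rows onto the unit sphere
  eigen(x %*% t(x), symmetric = TRUE, only.values = TRUE)$values
}

# Hypothetical helper: retain factors until an observed eigenvalue is no
# longer above the chosen quantile of the matching null eigenvalues.
parallel_retain <- function(sim, dim, n_iter = 50, q = 0.95) {
  obs   <- eigen(sim, symmetric = TRUE, only.values = TRUE)$values
  null  <- replicate(n_iter, null_eigenvalues(nrow(sim), dim))
  thr   <- apply(null, 1, quantile, probs = q)
  above <- obs > thr
  if (all(above)) length(obs) else which(!above)[1] - 1
}

# Toy data: 10 items in 64 dimensions built around two planted topics.
topic_a <- rnorm(64)
topic_b <- rnorm(64)
toy <- rbind(
  t(replicate(5, topic_a + rnorm(64, sd = 0.3))),
  t(replicate(5, topic_b + rnorm(64, sd = 0.3)))
)
toy <- toy / sqrt(rowSums(toy^2))

n_retained <- parallel_retain(toy %*% t(toy), dim = ncol(toy))
n_retained
```

With this toy data the procedure should recover the two planted topics, because only the first two observed eigenvalues stand well above what random directions produce.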

For a multi-method comparison, use sfa_nfactors():

sim <- sfa_similarity(big5$embeddings, encoding = "atomic_reversed",
                      scoring = big5$scoring)
nf <- sfa_nfactors(sim, big5$embeddings,
                   methods = c("parallel", "kaiser"),
                   parallel_iter = 50)
print(nf)
#> Factor retention analysis (embedding-adapted)
#> 
#>   Method       n_factors
#>   parallel     5
#>   kaiser       6
#>   ------------------------
#>   Consensus    5
#> 
#> Eigenvalues: 29.62   1.92   1.74   1.48   1.27   1.01   0.93   0.88   0.78   0.70  ...
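
The kaiser row can be checked by hand from the printed eigenvalues, assuming the method is the usual eigenvalue-greater-than-one rule (the printed values are consistent with that reading):

```r
# Leading eigenvalues as printed above. The output truncates the rest,
# which are smaller because the values are listed in decreasing order.
ev <- c(29.62, 1.92, 1.74, 1.48, 1.27, 1.01, 0.93, 0.88, 0.78, 0.70)

# Kaiser criterion: count the eigenvalues greater than 1.
sum(ev > 1)
#> [1] 6
```

The sixth eigenvalue (1.01) only just clears the cutoff, which is one reason the two methods can disagree here.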

Encoding methods

The encoding argument controls how embeddings become a similarity matrix:

sim_ar  <- sfa_similarity(big5$embeddings, "atomic_reversed", big5$scoring)
sim_sq  <- sfa_similarity(big5$embeddings, "squid", big5$scoring)
#> Warning: 'scoring' has reverse-keyed items but is ignored for encoding =
#> "squid": this method is keying-free by design. Use "atomic_reversed" for keyed
#> sign-flipping.
sim_mcp <- sfa_similarity(big5$embeddings, "mean_centered_pearson", big5$scoring)
#> Warning: 'scoring' has reverse-keyed items but is ignored for encoding =
#> "mean_centered_pearson": this method is keying-free by design. Use
#> "atomic_reversed" for keyed sign-flipping.

cat("atomic_reversed range:", range(sim_ar[lower.tri(sim_ar)]), "\n")
#> atomic_reversed range: -0.7766852 0.9381671
cat("squid range:          ", range(sim_sq[lower.tri(sim_sq)]), "\n")
#> squid range:           -0.2671767 0.8269712
cat("mean_centered_pearson:", range(sim_mcp[lower.tri(sim_mcp)]), "\n")
#> mean_centered_pearson: 0.3938766 0.9382228

The ranges show what each encoding does to the sign structure. Sign-flipping (atomic_reversed) manufactures strong negatives by turning reverse-keyed items into anti-topic vectors, and SQuID’s questionnaire-mean centering recovers a modest negative range. Mean-centered Pearson tracks the raw cosines, which stay positive for these items.
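
A toy example in base R illustrates the two mechanisms described above: negating reverse-keyed vectors, and centering on the questionnaire mean. The vectors, the keys vector, and the cosine() helper are hypothetical, and the code is a sketch of the general idea rather than of what sfa_similarity() does internally.

```r
# Cosine similarity between the rows of a matrix.
cosine <- function(x) {
  u <- x / sqrt(rowSums(x^2))
  u %*% t(u)
}

# Three toy item vectors that share a large common component, so that
# every raw cosine is positive.
emb <- rbind(
  item_a = c(1.0, 0.2, 0.9),
  item_b = c(0.9, 0.3, 1.0),   # treated as reverse-keyed below
  item_c = c(0.2, 1.0, 0.9)
)
keys <- c(1, -1, 1)            # hypothetical keying: -1 marks a reverse-keyed item

raw      <- cosine(emb)                            # all entries positive
flipped  <- cosine(emb * keys)                     # negate reverse-keyed rows
centered <- cosine(sweep(emb, 2, colMeans(emb)))   # subtract the mean item vector

round(raw, 2)
round(flipped, 2)
round(centered, 2)
```

Negating a row flips the sign of every cosine involving that item and leaves its magnitude unchanged, which is why sign-flipping produces strong negatives. Removing the shared mean direction lets items that differ from the average in opposite ways come out negatively related.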

Visualization

plot(fit, type = "scree")
Figure: scree plot with parallel analysis threshold.

plot(fit, type = "loadings")
Figure: factor loading heatmap.

Comparing with empirical factor analysis

The $loadings component works directly with psych functions:

# Run human-data EFA (not run — requires response data)
human_fit <- psych::fa(response_data, nfactors = 5, rotate = "oblimin")

# Compare
psych::factor.congruence(fit$loadings, human_fit$loadings)
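
psych::factor.congruence() reports the cosine between columns of two loading matrices, often called Tucker's congruence coefficient. A minimal base R sketch of that quantity is shown below; tucker_congruence() and the two toy loading matrices are hypothetical and serve only to show what is being computed.

```r
# Congruence (vector cosine) between every pair of columns of two loading
# matrices that list the same items in the same row order.
tucker_congruence <- function(x, y) {
  x <- unclass(as.matrix(x))
  y <- unclass(as.matrix(y))
  crossprod(x, y) / sqrt(outer(colSums(x^2), colSums(y^2)))
}

# Toy loadings: the second solution has the same two factors in swapped order.
a <- cbind(F1 = c(0.8, 0.7, 0.1, 0.0), F2 = c(0.1, 0.0, 0.9, 0.6))
b <- cbind(G1 = c(0.1, 0.1, 0.8, 0.7), G2 = c(0.9, 0.6, 0.0, 0.1))

round(tucker_congruence(a, b), 2)
```

Because factor order and sign are arbitrary, the large values may sit off the diagonal, as they do in this toy case where F1 matches G2.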

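sfa_congruence() adds further agreement metrics: normalized mutual information (NMI), the adjusted Rand index (ARI), Frobenius, and disattenuated correlation. The example below requests NMI and ARI: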

cong <- sfa_congruence(fit, big5$factors, metrics = c("nmi", "ari"))
print(cong)
#> Factor structure congruence
#> 
#>   NMI:            0.527 (moderate - higher is better)
#>   ARI:            0.402 (weak - higher is better)
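
To make the ARI figure more concrete, here is a base R sketch of the adjusted Rand index between two item groupings. It assumes each item is assigned to the factor with its largest absolute loading; that assignment rule, the adjusted_rand() helper, and the toy loadings are illustrative assumptions, not a description of how sfa_congruence() works.

```r
# Adjusted Rand index between two groupings of the same items.
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  n_pairs <- function(n) n * (n - 1) / 2   # choose(n, 2)
  sum_ij   <- sum(n_pairs(tab))            # pairs grouped together in both
  sum_a    <- sum(n_pairs(rowSums(tab)))   # pairs grouped together in a
  sum_b    <- sum(n_pairs(colSums(tab)))   # pairs grouped together in b
  expected <- sum_a * sum_b / n_pairs(length(a))
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

# Toy loadings for four items on two factors, plus hypothetical keyed labels.
loadings <- rbind(c(0.8, 0.1), c(0.7, 0.0), c(0.2, 0.6), c(0.1, 0.9))
assigned <- apply(abs(loadings), 1, which.max)   # 1 1 2 2
keyed    <- c("E", "E", "N", "N")

adjusted_rand(assigned, keyed)
#> [1] 1
```

The index is 1 when the two groupings agree exactly (labels do not need to match), about 0 for chance-level agreement, and can be negative when agreement is worse than chance.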

Using your own embeddings

Pass any embedding model’s output via embeddings=, or pass an embedding function via embed=:

# With sentence-transformers (requires reticulate + Python).
# The default model is "Qwen/Qwen3-Embedding-0.6B"; larger models such as
# "Qwen/Qwen3-Embedding-4B" (8 GB RAM) or "Qwen/Qwen3-Embedding-8B" (16 GB RAM)
# recover factor structure more accurately.
emb <- sfa_embed(my_items, embed = "sbert", model = "Qwen/Qwen3-Embedding-0.6B")
fit <- sfa(my_items, embeddings = emb, scoring = my_scoring)

# Or bring your own function
my_embedder <- function(texts) {
  # ... your embedding logic ...
  # must return a numeric matrix (n_items x dim)
}
fit <- sfa(my_items, embed = my_embedder, scoring = my_scoring)
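
As a concrete illustration of the contract stated in the template above (character vector in, numeric matrix of n_items x dim out), here is a hypothetical stand-in embedder in base R. toy_embedder() hashes words into buckets and carries no semantic information, so it is useful only for checking shapes and wiring, not for analysis.

```r
# Hypothetical stand-in embedder: hashes each word into one of `dim`
# buckets and counts the hits. It only illustrates the expected shape of
# the return value.
toy_embedder <- function(texts, dim = 16) {
  out <- matrix(0, nrow = length(texts), ncol = dim)
  for (i in seq_along(texts)) {
    words <- strsplit(tolower(texts[i]), "[^a-z]+")[[1]]
    words <- words[nzchar(words)]
    for (w in words) {
      bucket <- sum(utf8ToInt(w)) %% dim + 1   # crude hash of the word
      out[i, bucket] <- out[i, bucket] + 1
    }
  }
  out
}

# Example item texts (illustrative only).
items <- c("I am the life of the party.", "I don't talk a lot.")
emb <- toy_embedder(items)
dim(emb)
#> [1]  2 16
```

A real embedder would call a model inside the function but return the same kind of object: one numeric row per item, in the same order as the input.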