A very small update to address an issue with RcppArmadillo that was flagged and fixed by Dirk Eddelbuettel long ago. Thank you Dirk!
A very small update to fix invalid numeric inputs, made by Kurt Hornik. Thank you Kurt!
Also a small bug fix from ADernild. Thank you!
Another very small release to address a concern with file download speeds.
Took the opportunity to fix a number of small issues. Thanks to Julia Silge for pointing out an issue with formula.
Super big thanks to Brooks Ambrose for not only making a speed improvement but then noticing I hadn't actually fixed it in 1.3.4.
Thanks to Sergei Pashakhin for catching an issue with the way summary.estimateEffect() indexed things.
A small release to correct the class() issue in preparation for R 4.0.0. Thanks to the R core team for clear directions.
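A minimal base R sketch of the kind of class() check that R 4.0.0 affects. This is an illustration only; the entry above does not say which stm code was changed. From R 4.0.0 a matrix has class c("matrix", "array"), so comparing class(x) to a single string no longer yields a single logical value:

```r
x <- matrix(1:4, nrow = 2)

# Fragile: under R >= 4.0.0 this comparison has length 2, which breaks if().
fragile <- class(x) == "matrix"

# Robust on any R version: each of these returns a single TRUE.
ok_inherits <- inherits(x, "matrix")
ok_is_matrix <- is.matrix(x)
```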
A small release to coincide with the JSS vignette being posted. Larger release to come soon hopefully.
Hat tip to Wouter van Atteveldt for a pull request that fixed a problem in summary.estimateEffect.
Thanks to Brooks Ambrose for a great catch in the spectral algorithm that should increase memory efficiency and speed during initialization for large vocabulary models.
Thanks to Elena Savelyeva for improvements to the toLDAvis function.
Added a function that allows for custom punctuation removal in textProcessor and also a flag which allows access to the broader array of Unicode punctuation (ucp=TRUE).
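The difference the ucp flag makes can be sketched in base R. This illustrates ASCII versus Unicode punctuation classes and is not textProcessor's actual implementation; the sample string is made up.

```r
# Text with curly quotes (U+201C, U+201D), which are not ASCII punctuation.
txt <- "\u201cquoted\u201d text, with a comma."

# ASCII punctuation only: the curly quotes survive.
ascii_only <- gsub("[!-/:-@\\[-`{-~]", "", txt, perl = TRUE)

# Unicode punctuation property: the curly quotes are removed as well.
unicode_all <- gsub("\\p{P}", "", txt, perl = TRUE)
```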
Multiple improvements from Yuval Dugan. Allow topics to be scaled by vertex size in plot.topicCorr. toLDAvis now allows renumbering of topics.
Numerous documentation updates.
In order to facilitate interoperability, more functions can now take quanteda and Matrix objects. Thanks to Julia Silge for the suggestion!
Fixing a problem with ngroups and adding a test to detect such problems in the future. Thanks to Igor Elbert for finding this.
Fixing a bug in K=0 where the method could fail if the perplexity was too large. It now automatically tries to reset itself. To help with this in the future the initial dimension and perplexity values can now be set through the control options in stm. Thanks to Mario Santoro for the bug catch and excellent replication file.
Fixed a very small edge case bug with an internal function that powers prepDocuments() which would cause document sets where every single document had the same number of unique words to fail.
Fixed an issue with heldout document creation where a document could be completely emptied out. This in turn caused issues in searchK. Hat tip to KB Park for reporting.
Fixed an issue with heldout evaluation that caused the number of tokens in the heldout set to be reported incorrectly by eval.heldout(). Note the incorrect calculation had no downstream consequences because it was only for storing the term as part of the output object. Thanks to Jungmin Lee for spotting.
Fixed a small bug that kept the cov.value1 and cov.value2 arguments of plot.estimateEffect from working when factor variables were passed to them instead of character variables. Hat tip to Rob Williams for finding that.
In version 1.3.0 we changed one of the calculations for K=0 to make it faster. This broke an old bug fix for an unusual edge case. We have now fixed that. Thanks to Thien Vuong Nguyen for sending info on such an edge case.
Changed the default initialization to Spectral (thanks to Carsten Schwemmer for the reminder to do that).
Important bug fix from Chris Baker that caused summary.estimateEffect to not work sometimes.
A fix from Jeffrey Arnold that makes the ngroups memoization functionality work more in line with the original Hughes and Sudderth paper.
Changed documentation for the ngroups memoization functionality in accordance with info from Adel Dauod who showed that in his larger document examples setting ngroups > 1 increased the number of iterations needed for convergence.
Changed the K=0 spectral initialization to use a randomized projection method to calculate the PCA which initializes the t-SNE projection. This should make the initializations much faster for K=0 and large vocabs.
You can now run for a fixed number of EM steps by setting emtol=0L. You can also prevent negative changes in the bound from triggering convergence using the control option allow.neg.change=FALSE.
Bumped the required version of R to be consistent with package dependencies. Hat tip to Sean Westwood.
Fixed an edge case in make.heldout where you can end up with documents being classed as non-integers. Hat tip to GitHub user LouHb.
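A hedged usage sketch of the two options, assuming the stm package is installed and using its bundled poliblog5k data. K = 5 and max.em.its = 10 are arbitrary values chosen only to keep the run short.

```r
if (requireNamespace("stm", quietly = TRUE)) {
  library(stm)

  # Fixed number of EM steps: emtol = 0 turns off the tolerance-based stop,
  # so the run ends when max.em.its is reached.
  fixed_steps <- stm(poliblog5k.docs, poliblog5k.voc, K = 5,
                     max.em.its = 10, emtol = 0, verbose = FALSE)

  # Default tolerance, but a decrease in the bound is not allowed
  # to count as convergence.
  no_neg <- stm(poliblog5k.docs, poliblog5k.voc, K = 5,
                max.em.its = 10,
                control = list(allow.neg.change = FALSE), verbose = FALSE)
}
```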
Fixed a bug that caused R to crash for very large models on Windows machines only. Huge thanks to Adel Dauod for reporting the bug and doing an enormous amount of testing to isolate it.
Added the alignCorpus() function to help prep unseen documents for fitNewDocuments().
Thanks to Vincent Arel-Bundock for adding parallel computing to searchK.
Long-overdue thanks to Stephen Woloszynek for fixing some bugs in thetaPosterior when local uncertainty is chosen.
Setting max.em.its=0 will now return the results of the initialization procedure.
You can now pass your own custom initializations of beta.
We've removed the text file reader from textProcessor but encourage users who want to read in texts to check out the excellent readtext package.
stm can now take term matrices from the corpus and text2vec packages, thanks to a contribution from Patrick Perry.
Additional internal improvements in how inputs are handled by Patrick Perry.
A small bug fix to fitNewDocuments(), thanks to GitHub user OO00OO00, for a bug that caused the functionality not to work for CTMs.
A small release correcting some minor bugs.
Fixed a bug in estimateEffect when a character or factor variable had 2+ levels with the same number of observations. Thanks to APuzyk on GitHub for catching this.
The last release turned on a different recovery method for the spectral algorithm by default. Changed the default back to exponential gradient as documented. Thanks to Simone Zhang for this catch.
Better defaults for some of the labeling in plot.STM
Better argument matching and errors for plot.estimateEffect
Small updates to the documentation and vignette.
Thanks to efforts by Ken Benoit, stm can now take a quanteda dfm object.
Thanks to help from Chris Baker we are now using roxygen for our documentation.
Jeffrey Arnold fixed a small bug in toLDAvis. Thanks!
Carsten Schwemmer helped us find a bug where plot.estimateEffect() didn't work when dplyr was loaded. This is now fixed.
In making the change to roxygen we unfortunately broke backwards compatibility. The package's S3 methods such as plot.estimateEffect() and plot.STM() can now only be called through the generic plot() rather than by their full name.
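The calling convention can be illustrated with a toy S3 class in base R. The class toyEffect and its method are hypothetical stand-ins for objects such as those returned by estimateEffect(); the point is only that the method is reached through the generic.

```r
# A hypothetical object carrying an S3 class attribute.
obj <- structure(list(n_terms = 3), class = "toyEffect")

# An S3 method for the plot() generic. It draws nothing and just
# returns a description invisibly, to keep the sketch self-contained.
plot.toyEffect <- function(x, ...) {
  invisible(paste("toyEffect with", x$n_terms, "terms"))
}

# Call the generic; R dispatches to plot.toyEffect based on class(obj).
res <- plot(obj)
```

With a fitted object, this corresponds to writing plot(x, ...) rather than plot.estimateEffect(x, ...).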
We document and export optimizeDocument which provides access to the document level E-step.
We have documented and exported several of the labeling functions including calcfrex, calcscore, calclift and js.estimate. These are marked with keyword internal because they don't have much error checking and most users will want labelTopics anyhow. But they can be accessed with ? and are linked from labelTopics.
After much popular demand we have released a fitNewDocuments() function which will calculate topic proportions for documents not used to fit the models. There are many different options here.
estimateEffect now has a summary function which will make regression tables.
A number of the internals to plot.estimateEffect have been improved, which should eliminate some edge case bugs.
Much of the documentation has been updated as has the vignette.
Our wrapper s() for the splines package function bs() now has predict functions associated with it so it should work in contexts like lm()
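A sketch of the behavior this enables, shown with splines::bs() directly rather than the s() wrapper. The data are simulated and df = 5 is an arbitrary choice. Because prediction support exists for the basis, predict() on new data reuses the knots from the original fit:

```r
library(splines)

set.seed(1)
d <- data.frame(x = seq(0, 10, length.out = 50))
d$y <- sin(d$x) + rnorm(50, sd = 0.1)

# Fit a linear model on a B-spline basis of x.
fit <- lm(y ~ bs(x, df = 5), data = d)

# Predict at new x values inside the original range.
new_x <- data.frame(x = c(2.5, 7.5))
pred <- predict(fit, newdata = new_x)
```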
We have documented and exported all the metrics for searchK()
We made a change in the spectral initialization which ensures that only the top 10000 (a modifiable default) words are used in the initialization. This allows it to be used effectively with much larger vocab.
Added a modifiable max iteration timeout error for the prevalence regression in stm. This will only matter for people using covariate sets which are very, very large.
Added a new recovery algorithm for spectral initialization as an alternative to gradient descent: a more accurate and generally faster one based on quadratic programming. This can be turned on by: control=list(recoverEG=FALSE). Eventually we may change the default recovery method. Note: while this more accurately solves the actual problem, we've seen better results with the early stopping produced by exponential gradient not fully converging. This has been confirmed by the Arora group as well.
Clarified some of the documentation in textProcessor thanks to James Gibbon.
Fixed a problem with registration of S3 methods for textProcessor()
Added access to the information criterion parameter for L1 mode prevalence covariance in stm. See the gamma.ic.k option in the control parameters of stm
searchK() can now be used with content covariates thanks to GitHub user rosemm
Added a querying function based on data.table into findThoughts()
Fixed a rare bug in the K=0 feature for spectral initialization where words with the exact same appearance pattern would cause the projection to fail.
Fixed the unexported findTopic()
Improved some documentation
Small fine-tuning in toLDAvis
Fixed a small bug that caused readCorpus to fail on dense document term matrices
Fixed a small bug in the random projections algorithm
Improved warnings in stm when restarting models (Hat tip to Andrew Goldstone)
Added the convertCorpus function for converting stm to other formats.
Formatting changes to the vignette
Updated Vignette
Performance improvements via various optimizations including porting some components to C++
Various new experimental features including K=0
Improved documentation including a new version of the vignette.
Better error messages in several places
Experimental options for random projections with spectral initializations
Fixes a problem in make.heldout where a document could be completely emptied by the procedure. Hat tip to Jesse Rhodes for the bug report.
When gamma.prior="L1", coerce the mu object back to a matrix class object. Should fix a speed hit introduced in 1.0.10 for this case.
Prevalence covariates can now use sparse matrices which will result in better performance for large factors.
textProcessor() and prepDocuments() now do a better job of preserving labels and keeping track of dropped elements. Special thanks to GitHub users gtuckerkellog and elbamos for pull requests.
Fixed an edge case in init.type="Spectral" where words appearing only in documents by themselves would throw an error. The error was correct but hard to address in certain cases, so now it temporarily removes the words and then reintroduces them before starting inference, spreading a tiny bit of mass evenly across the topics. Hat tip to Nathan Sanders for bringing this to our attention.
New function findTopic() which helps locate topics containing particular words or phrases.
New function topicLasso() helps build predictive models with topics.
Fixed a minor bug in prepDocuments which arises in cases where there are vocab elements which do not appear in the data.
Fixed a minor bug in frex calculation that caused some models not to label.
Fixed a minor bug in searchK that caused heldout results to report incorrectly.
Rewrite of plot.estimateEffect() which fixed a bug in some interaction models. Also returns results invisibly for creating custom plots.
Increased the stability of the spectral methods for stm initialization.
Complete rewrite of plotRemoved() which makes it much faster for larger datasets.
A minor patch to deal with textProcessor() in older versions of R.
Large changes, many of which are not backwards compatible.
Numerous speed improvements to the core algorithm.
Introduction of several new options for the core stm function including spectral initialization, memoized inference, and model restarts.
Content covariate models are now estimated using the distributed multinomial formulation which is dramatically faster. Default prior also changed to L1.
Handling of document level convergence was changed to ensure positive definiteness in the document-level covariance matrices
Fixed bug in binary/binary interactions.
Numerous new diagnostic and summary functions
Expanding the console printing of many of the preprocessing functions
Fix an error with vignettes building on linux machines
sageLabels exported but not documented
factorCheck diagnostic function exported
Bug fix in the semanticCoherence function that affected content covariate models.
Bug fix to plot.STM() where for content covariate models with only a subset of topics requested the labels would show up as mostly NA. Thanks to Jetson Leder-Luis for pointing this out.
Bug fix for the readCorpus() function with txtorg vocab. Thanks to Justin Farrell for pointing this out.
Added some diagnostics to notify the user when words have been dropped in preprocessing.
Automatically coerce dates to numeric in spline function.
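The coercion can be sketched in base R as follows. This is an illustration using splines::bs() with made-up dates, not the package's internal code; df = 4 is arbitrary.

```r
library(splines)

# Ten consecutive dates (hypothetical data).
dates <- as.Date("2014-01-01") + 0:9

# Dates become the number of days since 1970-01-01.
days <- as.numeric(dates)

# The numeric values can then be passed to the spline basis.
basis <- bs(days, df = 4)
```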
Very minor change with textProcessor() to accommodate API change in tm version 0.6
New option for plot.STM() which plots the distribution of theta values. Thanks to Antonio Coppola for coauthoring this component.
Deprecated option "custom" in "labeltype" of plot.STM(). Now you can simply specify the labels. Added additional functionality to specify custom topic names rather than the default "Topic #:"
Bug fixes to various portions of plot.STM() that would cause labels to not print.
Added numerous error messages.
Added permutationTest() function and associated plot capabilities
Updates to the vignette.
Added functionality to a few plotting functions.
When using summary() and labelTopics() content covariate models now have labels thresholded by a small value. Thus one may see no labels or very few labels particularly for topic-covariate interactions which indicates that there are no sizable positive deviations from the baseline.
S3 method for findThoughts and ability to threshold by theta.
Allow estimateEffect() to receive a data frame. (Thanks to Baoqiang Cao for pointing this out)
Major updates to the vignette
Minor Updates to several plotting functions
Fixed an error where labelTopics() would mislabel when passed topic numbers out of order (Thanks to Jetson Leder-Luis for pointing this out)
Introduction of the termitewriter function.
Version for submission to CRAN (2/28/2014)
Introduced new dataset poliblog5k and shrunk the footprint of the package
Numerous alternate options changed and some slight syntax changes to stm to finalize the API.
New build 2/14/2014
Fixing a small bug introduced in the last version which kept defaults of manyTopics() from working.
Updated version posted to Github (2/13/2014)
Various improvements to plotting functions.
Setting the seed in selectModel() threw an error. This is now corrected. Thanks to Mark Bell for pointing this out.
First public version released on Github (2/5/2014)
This is a beta release and we may change some of the API before submission to CRAN.