NEWS
stringdist 0.9.11
- Fixed a warning in gcc-13: changed specifier from %d to %ld.
(Thanks to Kurt Hornik for the heads-up.)
stringdist 0.9.10 (2022-11-07)
- Fixed another warning generated by the new C compiler that I overlooked.
(Thanks to the CRAN team for the heads-up.)
stringdist 0.9.9 (2022-10-20)
- Fixed warnings generated by the new C compiler (function prototypes must
now be defined completely). (Thanks to Kurt Hornik for the heads-up.)
stringdist 0.9.8 (2021-09-09)
- Fixed some issues on C-level causing problems with the
CLANG compiler. (Thanks to Brian Ripley for not only
reporting this, but also sending updated code with
fixes).
stringdist 0.9.7 (2021-07-28)
- Fixes in use of INTEGER() and VECTOR_ELT() after updates in R's C API.
This affected 'afind' and 'max_length' (internally). (Thanks to Luke
Tierney and Kurt Hornik for the notification).
- Fix in 'amatch' causing utf-8 characters to be ignored in some
cases (thanks to Joan Mime for reporting #78).
- Fix: segfault when 'afind' was called with many search patterns or many
texts to be searched.
- Fix: stringsimmatrix was not normalized correctly (Thanks to Tamas Ferenci
for reporting GH).
stringdist 0.9.6.3 (2020-10-09)
- Resubmit. Fixed a URL redirect that was detected by CRAN.
stringdist 0.9.6.2
- Resubmit. Fixed URL issues detected by CRAN, added DOI to description
as per CRAN request.
stringdist 0.9.6.1
- Bugfix: afind/grab/grabl returned wrong results on MacOS only.
(thanks to Prof. Brian Ripley for the notification and for running tests
on his personal machine and to Tomas Kalibera for making the
ubuntu-rchk docker image available).
stringdist 0.9.6 (2020-07-16)
- New function 'afind': find approximate matches in text based on string distance.
- New functions 'grab', 'grabl': fuzzy matching equivalent to 'grep' and 'grepl'.
- New function 'extract': fuzzy matching equivalent of stringr::str_extract.
- New algorithm 'running_cosine': fast fuzzy text search using cosine distance.
- New function 'stringsimmatrix' (Thanks to Johannes Gruber).
- Number of threads used is now reported when loading 'stringdist'.
- Internal fixes (in some cases class() == 'class' was used).
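A minimal sketch of how the fuzzy search functions introduced in 0.9.6 might
be called. The sample texts, the pattern, and the maxDist value are
illustrative assumptions; see ?afind for the authoritative argument list.

```r
library(stringdist)

# Hypothetical example texts and search pattern
texts   <- c("the quick brown fox", "a lazy dog", "quack said the duck")
pattern <- "quick"

# afind slides a window over each text and reports, per text/pattern pair,
# the location, distance, and content of the best-matching window
res <- afind(texts, pattern, method = "osa")
res$location
res$distance
res$match

# grab / grabl: approximate analogues of grep / grepl
hits   <- grab(texts, pattern, maxDist = 1)    # indices of texts with a match
is_hit <- grabl(texts, pattern, maxDist = 1)   # logical, one value per text

# extract: approximate analogue of stringr::str_extract
found <- extract(texts, pattern, maxDist = 1)
```

With maxDist = 1 a text counts as a hit when some window of it lies within
one edit of the pattern.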
stringdist 0.9.5.5 (2019-10-21)
- Changed two URLs to canonical form in README.md (https://) to comply with
CRAN policy.
stringdist 0.9.5.4
- Some tests using seq_dist() would fail unpredictably when the input was
defined with lazily evaluated arguments, e.g. list(1:3, 2:4); but only in the
context of NSE by a test suite ('tinytest', 'testthat'). Tests were replaced by
literal versions, e.g. list(c(1,2,3), c(2,3,4)).
stringdist 0.9.5.3 (2019-10-11)
- Update in test suite to stay on CRAN
stringdist 0.9.5.2 (2019-06-06)
- RJournal paper and C/C++ api docs are now presented as vignette.
- Switched to tinytest framework
- Fix: stringdist could cause a segfault for edit distances between very long
strings. (Thanks to GH user gllipatz)
stringdist 0.9.5.1 (2018-06-08)
- Fixed header file for C API
stringdist 0.9.5.0 (2018-06-07)
- New contributor: Chris Muir
- C/C++ API now exposed for packages LinkingTo stringdist. See '?stringdist_api'
- Arguments 'maxDist', 'ncores', 'cluster' of functions 'stringdist' and
'stringdistmatrix' have been deprecated for several years and are now
removed.
- Fixed edge case where cosine distance with q=1, between strings of repeating characters
yielded Inf (Thanks to Markus Dumke)
stringdist 0.9.4.6 (2017-07-31)
- Fixed argument passing error in lower_tri (thanks to Kurt Hornik)
stringdist 0.9.4.5 (2017-07-27)
- New argument 'bt' implementing Winkler's boost threshold for the Jaro-Winkler distance
- stringdist(a,b,method="qgram") returns correct value when q>nchar(a) (or b).
(Thanks to Giora Simchoni). Also affects stringdistmatrix, amatch, seq_dist,
and seq_distmatrix.
- registered native routines as now recommended by CRAN
stringdist 0.9.4.4 (2016-12-16)
- updated default nr of threads to comply with CRAN policy (thanks to Kurt Hornik).
The default nr of cores now equals OMP_NUM_THREADS if set. See
?'stringdist-parallelization' for the full policy.
stringdist 0.9.4.2 (2016-09-09)
- bugfix in stringdistmatrix(a): value of p for jw-distance was ignored
(thanks to Max Fritsche)
- bugfix in stringdistmatrix(a): Would segfault on q-gram w/input > ~7k strings
and q>1 (thanks to Connor McKay)
- bugfix in jaccard distance: distance not always correct when passing multiple
strings (thanks to Robert Carlson)
stringdist 0.9.4.1 (2016-01-02)
- stringdistmatrix(a) now outputs long vectors (issue #45, thanks to Wouter
Touw). For stringdistmatrix(a,b) this was already the case, but the length
of rows and columns remains restricted to 2^31-1 since long input vectors are
not supported (yet).
- bugfix in osa/dl/lv distances w/unequal edit weights (thanks to Nathalia Potocka)
stringdist 0.9.4 (2015-10-26)
- bugfix: edge case for zero-size lower triangular dist matrices (caused
UBSAN to fire, but gave correct results).
- bugfix in jw distance: not symmetric for certain cases (thanks to github user gtumuluri)
stringdist 0.9.3 (2015-08-21)
- new function for tokenizing integer sequences: seq_qgrams
- new function for matching integer sequences: seq_amatch
- new functions computing distances between integer sequences: seq_dist, seq_distmatrix
- q-gram based distances are now always 0 when q=0 (used to be Inf if at least
one of the arguments was not the empty string)
- stringdist, stringdistmatrix now emit warning when presented with 'list' argument
- small c-side code optimizations
- bugfix in dl, lv, osa distance: weights were not taken into account properly
(thanks to Zach Price)
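A short sketch of the integer-sequence functions listed for 0.9.3. The input
sequences, the choice of method, and the maxDist value are made up for
illustration.

```r
library(stringdist)

# Integer sequences are passed as a list of integer vectors
a <- list(c(1L, 2L, 3L), c(4L, 5L))
b <- list(c(1L, 2L, 4L))

# Element-wise distances, recycling the shorter argument
d <- seq_dist(a, b, method = "lv")

# All pairwise distances between elements of a and b
m <- seq_distmatrix(a, b, method = "lv")

# Position in 'a' of the closest sequence to each element of 'b',
# provided it lies within maxDist
idx <- seq_amatch(b, a, maxDist = 1)

# Table of 2-grams occurring in the sequences
qg <- seq_qgrams(a = c(1L, 2L, 3L), b = c(1L, 2L, 4L), q = 2L)
```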
stringdist 0.9.2 (2015-06-24)
- Update fixing some errors (missing documentation, tests) in the 0.9.1 release.
- Fixed a few possible memory leaks.
stringdist 0.9.1 (2015-06-22)
- Argument 'useNames' of 'stringdistmatrix' now accepts 'none', 'strings', and 'names'
- New function 'stringsim' computes string similarities between 0 and 1 based on 'stringdist'
- Calling 'stringdistmatrix' with a single argument returns an object of class 'dist'
- Argument 'cluster' to stringdistmatrix is phased out. It is now ignored with a message.
- Specifying 'ncores' was already ignored but now also causes a warning
- internal: rewrite of the R/C interface, saving about 1/3 of C-code, making extending easier
- bugfix in stringdistmatrix: output was transposed when length(a)==1 (Thanks to github user cpoonolly)
- Safer core detection to avoid a failure under Cygwin (thanks to Lauri Koobas)
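The interface changes in 0.9.1 can be tried out roughly as follows. The input
vector is an arbitrary example, and the default distance method is assumed.

```r
library(stringdist)

x <- c("foo", "bar", "baz")   # illustrative input

# Single-argument form: returns an object of class 'dist'
d <- stringdistmatrix(x)
class(d)

# Two-argument form with dimnames taken from the input strings
m <- stringdistmatrix(x, x, useNames = "strings")
dimnames(m)

# Similarity scaled to [0, 1]; identical strings score 1
s <- stringsim("foo", c("foo", "bar"))
```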
stringdist 0.9.0 (2015-01-10)
- C-code underlying stringdist and amatch now automatically uses multithreading based on openMP.
The default number of threads is governed by options('sd_num_thread').
- stringdist, stringdistmatrix, amatch and ain gain nthread argument which can
overwrite the default maximum number of threads.
- Argument 'maxDist' is phased out for 'stringdist' and 'stringdistmatrix'.
Specifying it causes a message.
- Argument 'ncores' is phased out for 'stringdistmatrix'. It is now ignored and
specifying it causes a message.
- bugfix in amatch/dl. In certain cases, the best match went undetected.
- Documentation improved and rearranged with string metrics, encoding, and
parallelization now documented as separate topics.
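As a sketch of the threading controls described for 0.9.0: the thread counts
below are arbitrary, and the strings are only an example.

```r
library(stringdist)

# The session-wide default number of threads is stored in an option
getOption("sd_num_thread")

# Override the default for a single call via the nthread argument
d1 <- stringdist("kitten", "sitting", method = "lv", nthread = 1)

# Or change the default for the rest of the session
options(sd_num_thread = 2)
d2 <- stringdist("kitten", "sitting", method = "lv")
```

The number of threads affects only speed; both calls should return the same
distance.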
stringdist 0.8.2 (2014-12-16)
- Fixed a few warnings issued by the CLANG compiler (thanks to Brian Ripley).
This fixes a bug in amatch/jaccard
- Fixed a bug in stringdist/osa, dl: NA incorrectly returned (thanks to Lauri
Koobas).
stringdist 0.8.1 (2014-10-07)
- stringdistmatrix returns dimensionless matrix when both arguments have length
zero (thanks to Richie Cotton)
- stringdistmatrix gains argument 'useNames' (thanks to Richie Cotton)
- Package now 'Imports' parallel rather than 'Depends' on it.
- bugfix in optimal string alignment distance: the nr of transpositions was
sometimes overcounted (thanks to Frank Binder)
- rearranged the documentation.
stringdist 0.8.0 (2014-08-08)
- Added soundex-based string distance (thanks to Jan van der Laan)
- New function 'phonetic' translates strings to phonetic codes using soundex
(thanks to Jan van der Laan)
- New function 'printable_ascii' detects non-printable ascii or non-ascii
characters.
- Precision issue: cosine distance between equal strings would be O(1e-16)
instead of 0.0 (thanks to Ben Haller).
- Code cleaning: somewhat better performance when maxDist is unspecified in
stringdist. It remains deprecated.
- Row names in the output array of 'qgrams' are now in system native encoding
(used to be utf8 for all systems).
- updated CITATION with page number info as the R Journal is now out.
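A brief sketch of the soundex-related additions in 0.8.0. The names used as
input are illustrative only.

```r
library(stringdist)

# Soundex codes for a few names (illustrative input)
codes <- phonetic(c("Robert", "Rupert"))

# Soundex-based distance: 0 when the codes agree, 1 otherwise
d <- stringdist("Robert", "Rupert", method = "soundex")

# Flag strings containing characters outside the printable ASCII range
ok <- printable_ascii(c("Robert", "Ren\u00e9"))
```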
stringdist 0.7.3 (2014-05-16)
- bugfix in jw-distance: out-of-range access in C-code caused R to crash in
some cases (thanks to Carol Gan)
- bugfix in dl distance: in some cases, distances could be one unit too high.
- Updated CITATION file: paper to appear in The R Journal vol 6 (2014).
- Some updates in documentation.
stringdist 0.7.2 (2014-03-02)
- function 'qgrams' gains .list argument
- bugfix in multicore option of stringdistmatrix
- bugfix in substitution weight of DL-distance (undercounted when w4 != 1 in
some cases)
- bugfix in dl.c: C-function read outside of array.
stringdist 0.7.0 (2013-09-06)
- added useBytes option: up to ~3-fold speed gain at the cost of possible
encoding-dependent results.
- new memory allocation method for q-grams increases speed between ~5% and ~30%
depending on q and input string.
- function 'qgrams' gains useNames option.
- jaro-winkler distance gains weight argument.
- C-code optimization in edit-based distances: 10~20% speed increase depending
on input.
- bugfix in amatch: sometimes NA was erroneously returned.
- bugfix in amatch/lcs: hamming distance method was called erroneously.
stringdist 0.6.1 (2013-08-09)
- bugfix in parallel version of stringdistmatrix: parameter p was not passed
(thanks to Ricardo Saporta)
- bugfix in lv/osa/dl: maxDist ignored in certain cases
stringdist 0.6.0 (2013-07-19)
- added amatch function: approximate matching version of 'match'
- added ain function: approximate matching version of '%in%'
- qgrams now accepts arbitrary number of arguments. Outputs array, not table
- added cosine distance
- added Jaccard distance
- added Jaro and Jaro-Winkler distances
- small performance tweaks in underlying C code
- Edge case in stringdistmatrix: output is now always of class matrix
- Default maxDist is now Inf (this is only to make it more intuitive and does
not break previous code)
- BREAKING CHANGE: output -1 is replaced by Inf for all distance methods
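A minimal sketch of the approximate matching functions added in 0.6.0 and of
the Inf convention. The strings and the maxDist value are chosen for
illustration, and the default distance method is assumed.

```r
library(stringdist)

# amatch: approximate analogue of match(); returns an index into the table
i <- amatch("leia", c("uhura", "leela"), maxDist = 2)

# ain: approximate analogue of %in%
found <- ain("leia", c("uhura", "leela"), maxDist = 2)

# Undefined distances are reported as Inf; the Hamming distance between
# strings of unequal length is one such case
h <- stringdist("abc", "abcd", method = "hamming")
is.infinite(h)
```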
stringdist 0.5.0 (2013-06-21)
- added qgram counting function 'qgrams'
- faster edge case handling in osa method.
- edge case in lv/osa/dl methods: distance returned length(b) instead of -1
when length(a) == 0, maxDist < length(b).
- bugfix in lv/osa/dl method: maxDist returned when length(a) > maxDist > 0
(thanks to Daniel Reckhard).
- Hamming distance (method='h') now returns -1 for strings of unequal lengths
(used to emit error).
- added longest common substring distance (method='lcs').
- added qgram distance method.
- stringdistmatrix gains cluster argument.
stringdist 0.4.2
- Fix in error message for hamming distance
- Workaround for system-dependent translation of utf8 NA characters
stringdist 0.4.0