# textrecipes 0.5.0

## New steps

- `step_dummy_hash()` generates binary indicators (possibly signed) from simple factor or character vectors.
- `step_tokenize()` has gotten a couple of cousin functions: `step_tokenize_bpe()`, `step_tokenize_sentencepiece()`, and `step_tokenize_wordpiece()`, which wrap {tokenizers.bpe}, {sentencepiece}, and {wordpiece} respectively (#147).
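A minimal sketch of one of the new tokenizer wrappers; the data, column name, and `vocabulary_size` value below are illustrative, not taken from the changelog:

```r
library(recipes)
library(textrecipes)

df <- tibble::tibble(text = c("a first example sentence", "and a second one"))

# BPE tokenization via the {tokenizers.bpe} wrapper; vocabulary_size is
# illustrative and must be large enough for the training data.
rec <- recipe(~ text, data = df) |>
  step_tokenize_bpe(text, vocabulary_size = 60) |>
  prep()

bake(rec, new_data = NULL)
```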
## Improvements and Other Changes

- Added `all_tokenized()` and `all_tokenized_predictors()` to more easily select tokenized columns (#132).
- Use `show_tokens()` to more easily debug a recipe involving tokenization.
- Reorganized documentation for all recipe step `tidy` methods (#126).
- Steps now have a dedicated subsection detailing what happens when `tidy()` is applied (#163).
- All recipe steps now officially support empty selections to be more aligned with dplyr and other packages that use tidyselect (#141).
- `step_ngram()` has been given a speed increase to put it in line with the performance of other packages.
- `step_tokenize()` will now try to error if the vocabulary size is too low when using `engine = "tokenizers.bpe"` (#119).
- The warning given by `step_tokenfilter()` when filtering fails to apply now correctly refers to the right argument name (#137).
- `step_tf()` now returns 0 instead of NaN when there aren't any tokens present (#118).
- `step_tokenfilter()` now has a new argument, `filter_fun`, which takes a function that can be used to filter tokens (#164).
- `tidy.step_stem()` now correctly shows whether a custom stemmer was used.
- Added a `keep_original_cols` argument to `step_lda()`, `step_texthash()`, `step_tf()`, `step_tfidf()`, `step_word_embeddings()`, `step_dummy_hash()`, `step_sequence_onehot()`, and `step_textfeatures()` (#139).
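A hedged sketch of debugging a tokenization recipe with `show_tokens()`; the column name and data are made up for illustration:

```r
library(recipes)
library(textrecipes)

df <- tibble::tibble(text = c("to be or not to be", "that is the question"))

# show_tokens() preps the recipe and prints the tokens produced for the
# selected column, which helps verify a tokenization setup.
recipe(~ text, data = df) |>
  step_tokenize(text) |>
  show_tokens(text)
```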
## Breaking Changes

- Steps with a `prefix` argument now create names according to the pattern `prefix_variablename_name/number` (#124).
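To illustrate the new naming pattern, a sketch using `step_tf()` (whose default prefix is `"tf"`); the variable name and token values are hypothetical:

```r
library(recipes)
library(textrecipes)

df <- tibble::tibble(medium = c("cat dog", "dog bird"))

rec <- recipe(~ medium, data = df) |>
  step_tokenize(medium) |>
  step_tf(medium) |>
  prep()

# Column names now follow prefix_variablename_name, e.g. tf_medium_cat,
# tf_medium_dog, tf_medium_bird.
names(bake(rec, new_data = NULL))
```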