tok provides bindings to the 🤗 tokenizers library. It uses the same Rust library that powers the Python implementation.
We still don’t provide the full API of tokenizers. Please open an issue if there’s a feature you are missing.
You can install tok from CRAN using:

``` r
install.packages("tok")
```
Installing tok from source requires a working Rust toolchain. We recommend using rustup.
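If Rust is not installed yet, rustup’s standard installer (the command published at rustup.rs) sets up the toolchain; this is a setup sketch, not something tok itself requires you to run this exact way:

``` shell
# Download and run the rustup installer, then follow its prompts
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

After installation, restart your shell (or source the env file rustup prints) so that `cargo` and `rustc` are on your `PATH`.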
On Windows, you’ll also have to add the `i686-pc-windows-gnu` and `x86_64-pc-windows-gnu` targets:
``` shell
rustup target add x86_64-pc-windows-gnu
rustup target add i686-pc-windows-gnu
```
Once Rust is working, you can install this package via:

``` r
remotes::install_github("dfalbel/tok")
```
We still don’t have complete support for the 🤗tokenizers API. Please open an issue if you need a feature that is currently not implemented.
tok can be used to load and use tokenizers that have been previously serialized. For example, HuggingFace model weights are usually accompanied by a ‘tokenizer.json’ file that can be loaded with this library.
To load a pre-trained tokenizer from a json file, use:

``` r
path <- testthat::test_path("assets/tokenizer.json")
tok <- tok::tokenizer$from_file(path)
```
Use the `encode` method to tokenize sentences and `decode` to transform them back.
``` r
enc <- tok$encode("hello world")
tok$decode(enc$ids)
#> [1] "hello world"
```
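The `enc$ids` field used above is a plain integer vector, so you can inspect it directly; a small sketch, assuming `tok` has already been loaded from a ‘tokenizer.json’ as shown above:

``` r
# Encode a sentence with the previously loaded tokenizer
enc <- tok$encode("hello world")

# `enc$ids` holds the integer token ids that `decode()` consumes;
# their count is the token length of the sentence
n_tokens <- length(enc$ids)
n_tokens
```

This is handy, for example, to check that an input fits within a model’s context length before sending it to the model.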
You can also load any tokenizer available on the HuggingFace Hub using the `from_pretrained` static method. For example, let’s load the GPT-2 tokenizer with:
``` r
tok <- tok::tokenizer$from_pretrained("gpt2")
enc <- tok$encode("hello world")
tok$decode(enc$ids)
#> [1] "hello world"
```