BPE tokenizer in Rust - would love feedback from the community

Posted by farhan-dev@reddit | LocalLLaMA

Hey everyone,

I've been working on a side project called Splintr - a BPE tokenizer written in Rust with Python bindings. It's compatible with OpenAI's tiktoken vocabularies (cl100k_base, o200k_base).
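For anyone unfamiliar with how BPE works under the hood, here's a minimal pure-Python sketch of the core merge loop (an illustration of the general algorithm with a toy merge table, not splintr's actual Rust implementation):

```python
def bpe_encode(text, merges):
    # Start from individual UTF-8 bytes and repeatedly apply the
    # highest-priority (lowest-rank) adjacent merge until none apply.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    while True:
        best = None
        for i in range(len(tokens) - 1):
            pair = tokens[i] + tokens[i + 1]
            rank = merges.get(pair)
            if rank is not None and (best is None or rank < best[1]):
                best = (i, rank)
        if best is None:
            return tokens
        i, _ = best
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]

# Toy merge table: lower rank = merged first (hypothetical, for illustration).
merges = {b"he": 0, b"ll": 1, b"llo": 2}
print(bpe_encode("hello", merges))  # → [b'he', b'llo']
```

Real vocabularies like cl100k_base are just much bigger versions of that merge table; the speed of a tokenizer mostly comes down to how efficiently this loop is implemented.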

Quick example:

pip install splintr-rs

from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")
tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)

# Batch encode (where it really shines)
texts = ["Hello", "World"] * 1000
batch_tokens = tokenizer.encode_batch(texts)

I spent some time benchmarking and optimizing - it turns out sequential encoding beats parallel for most text sizes (the Rayon overhead only pays off at ~1 MB+). Sometimes simpler is faster.
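The size-based dispatch that finding suggests can be sketched in pure Python (the 1 MB threshold, helper names, and the stand-in encoder here are all hypothetical, not splintr's internals):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cutoff mirroring the ~1 MB observation above.
PARALLEL_THRESHOLD = 1_000_000

def encode_one(text):
    # Stand-in for a real per-string encoder.
    return [ord(c) for c in text]

def encode_batch(texts):
    total = sum(len(t) for t in texts)
    if total < PARALLEL_THRESHOLD:
        # Sequential path: no pool spin-up or scheduling overhead
        # for small inputs, which is the common case.
        return [encode_one(t) for t in texts]
    # Parallel path: only worth the coordination cost once the
    # input is large enough to amortize it.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(encode_one, texts))
```

The design lesson is the same as in the Rust version: measure before parallelizing, and keep a cheap sequential fast path.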

GitHub: https://github.com/farhan-syah/splintr

Would really appreciate it if you could give it a try and let me know what you think.

Still early days, but happy to hear any feedback. Thanks for reading!