Is there a way to cache the bpe when using num_tokens_from_messages #81

Open
xd009642 opened this issue Aug 13, 2024 · 3 comments

@xd009642

We've found that num_tokens_from_messages is a significant bottleneck in our application. Benchmarking shows that it appears to load the BPE every time it's called, and that load dominates the runtime. Is there a way to cache the BPE so we only pay the loading cost once, or would that require a change to tiktoken-rs' internals (and if so, what would the change need to be)?

Section of a flamegraph showing the breakdown of num_tokens_from_messages

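For context, the kind of caller-side caching we're after looks roughly like this. A sketch only: it assumes tiktoken-rs exposes a `cl100k_base()` constructor and `CoreBPE::encode_with_special_tokens`, and it ignores the per-message overhead that num_tokens_from_messages accounts for, so the counts are approximate.

```rust
use std::sync::OnceLock;

use tiktoken_rs::CoreBPE;

// Load the BPE once and reuse it for every subsequent call.
// `cl100k_base()` is an assumption; use whichever encoding matches the model.
fn cached_bpe() -> &'static CoreBPE {
    static BPE: OnceLock<CoreBPE> = OnceLock::new();
    BPE.get_or_init(|| tiktoken_rs::cl100k_base().expect("failed to load cl100k_base"))
}

// Rough count using the cached encoder; the per-message constants that
// num_tokens_from_messages adds are deliberately left out of this sketch.
fn approx_token_count(texts: &[&str]) -> usize {
    let bpe = cached_bpe();
    texts
        .iter()
        .map(|t| bpe.encode_with_special_tokens(t).len())
        .sum()
}
```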

@xd009642
Author

As an aside on performance: having a mutex around the singleton models seems counterproductive. Generally, with model code you want the model to be either:

  1. Callable from multiple threads/contexts, with any mutable state stored in a separate type so the model type itself stays immutable, or
  2. Cheap to clone (usually via reference-counted parameters), so that even if mutable state can't be separated out the model is still reusable.

Having mutexes around the singletons drastically limits their utility, given that the tokenisers run on the CPU and aren't that complex.
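For illustration, the mutex-free shape would be roughly the following. This is a sketch that assumes CoreBPE's encode methods take `&self` and that the type is Send + Sync, so an `Arc` alone is enough to share it.

```rust
use std::sync::Arc;
use std::thread;

use tiktoken_rs::CoreBPE;

fn main() {
    // Build the encoder once; if encoding only needs &self, the Arc alone is
    // enough to share it across threads -- no Mutex required.
    let bpe: Arc<CoreBPE> =
        Arc::new(tiktoken_rs::cl100k_base().expect("failed to load cl100k_base"));

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let bpe = Arc::clone(&bpe);
            thread::spawn(move || bpe.encode_with_special_tokens(&format!("worker {i}")).len())
        })
        .collect();

    for handle in handles {
        println!("tokens: {}", handle.join().unwrap());
    }
}
```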

Also, for benchmarking you should look at criterion or divan; the built-in Rust bench harness isn't stable and gives you less useful statistics/measurements.
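For example, a divan harness in the shape of the tables below is only a few lines. This sketch benches a plain load-and-encode via `cl100k_base()` as a placeholder; swap in the actual num_tokens_from_messages call and a realistic message fixture.

```rust
// benches/token_counting.rs -- needs divan as a dev-dependency and a
// [[bench]] entry in Cargo.toml with `harness = false`.
fn main() {
    divan::main();
}

// Placeholder benchmark: loads the BPE and encodes a short string each
// iteration, so it includes the initialization cost being discussed.
#[divan::bench]
fn openai35(bencher: divan::Bencher) {
    bencher.bench(|| {
        let bpe = tiktoken_rs::cl100k_base().unwrap();
        divan::black_box(bpe.encode_with_special_tokens("Hello, how are you?").len())
    });
}
```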

@xd009642
Author

Some benchmarking results to back this up: we took the implementation of num_tokens_from_messages and passed in an already-loaded model to avoid reloading, and we got:

assistant_benches       fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ token_counting                     │               │               │               │         │
│  ├─ Azure35           2.299 ms      │ 97.9 ms       │ 4.02 ms       │ 5.278 ms      │ 100     │ 100
│  ├─ Azure4            2.048 ms      │ 11.42 ms      │ 4.429 ms      │ 4.66 ms       │ 100     │ 100
│  ├─ Llama2            2.373 ms      │ 33.58 ms      │ 5.256 ms      │ 6.082 ms      │ 100     │ 100
│  ├─ OpenAI35          3.006 ms      │ 15.72 ms      │ 8.08 ms       │ 8.418 ms      │ 100     │ 100
│  ╰─ OpenAI4           3.22 ms       │ 30 ms         │ 7.46 ms       │ 8.057 ms      │ 100     │ 100

Using the version in this library we get:

Timer precision: 25 ns
assistant_benches       fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ token_counting                     │               │               │               │         │
│  ├─ Azure35           82.79 ms      │ 174.3 ms      │ 92.74 ms      │ 97.51 ms      │ 100     │ 100
│  ├─ Azure4            208.7 ms      │ 694.2 ms      │ 238.2 ms      │ 274.8 ms      │ 100     │ 100
│  ├─ Llama2            83.74 ms      │ 123.6 ms      │ 99.9 ms       │ 99.07 ms      │ 100     │ 100
│  ├─ OpenAI35          83.85 ms      │ 125.5 ms      │ 97.92 ms      │ 98.48 ms      │ 100     │ 100
│  ╰─ OpenAI4           208.9 ms      │ 528.9 ms      │ 241.8 ms      │ 249.7 ms      │ 100     │ 100

@OTheDev

OTheDev commented Oct 14, 2024

I can reproduce the performance gains of using a pre-initialized BPE.

In fact, it would be possible to implement something like a num_tokens_from_messages_cached_bpe that initializes the two chat BPEs on its first call and retrieves references to the already-initialized BPEs on subsequent calls. The arguments to this function would be exactly the same as those of num_tokens_from_messages (no BPE would need to be supplied), which makes it easier for users who don't want to obtain a BPE themselves.
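Roughly the shape I have in mind. Just a sketch: the statics, the model-to-encoding mapping, and the message-field access are assumptions rather than the library's current internals, and the real per-message token accounting is elided.

```rust
use std::sync::OnceLock;

use anyhow::Result;
use tiktoken_rs::{ChatCompletionRequestMessage, CoreBPE};

// Chat BPEs built on first use and reused afterwards.
static CL100K: OnceLock<CoreBPE> = OnceLock::new();
static O200K: OnceLock<CoreBPE> = OnceLock::new();

fn chat_bpe(model: &str) -> &'static CoreBPE {
    // Simplified model-to-encoding mapping, for illustration only.
    if model.starts_with("gpt-4o") {
        O200K.get_or_init(|| tiktoken_rs::o200k_base().expect("failed to load o200k_base"))
    } else {
        CL100K.get_or_init(|| tiktoken_rs::cl100k_base().expect("failed to load cl100k_base"))
    }
}

// Same arguments as num_tokens_from_messages, but no BPE reload per call.
// The per-message/per-name overhead constants are omitted from this sketch.
pub fn num_tokens_from_messages_cached_bpe(
    model: &str,
    messages: &[ChatCompletionRequestMessage],
) -> Result<usize> {
    let bpe = chat_bpe(model);
    let mut total = 0;
    for message in messages {
        if let Some(content) = &message.content {
            total += bpe.encode_with_special_tokens(content).len();
        }
    }
    Ok(total)
}
```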
