Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add get_encoding_name_for_model to tiktoken #136

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

noseworthy
Copy link

@noseworthy noseworthy commented Feb 16, 2025

The tiktoken-js library includes a very helpful function,
getEncodingNameForModel(). This function is buried in the
implementation of encoding_for_model() in the rust based
tiktoken package.

This function is very useful when implementing an encoding cache based
on the model used. In this case, having a mapping from model ->
encoding and then caching based on the encoding name conserves
resources since so many models re-use the same encoding.

I've exposed a new get_encoding_name_for_model() function
that behaves similarly to the one in the tiktoken-js package, and used
it inside of encoding_for_model().

Finally, I've also added a test to ensure that this function can be
called properly from typescript code, and that it properly throws
exceptions in the case of invalid model names.

Fixes: #123

@noseworthy
Copy link
Author

Hey, @dqbd and @jens-f 👋

I think this should fix #123. It's a feature we've been looking for as well, so I figured I'd take a crack at it.

I'm not a rust developer, so please excuse any obvious blunders on my part. I'd love to know what you think. It'd be awesome to have this functionality exposed!

Thanks in advance for your consideration of the PR 🙏

The `tiktoken-js` library includes a very helpful function,
`getEncodingNameForModel()`. This function is buried in the
implementation of `encoding_for_model()` in the rust based
`tiktoken` package.

This function is very useful when implementing an encoding cache based
on the model used. In this case, having a mapping from model ->
encoding and then caching based on the encoding name conserves
resources since so many models re-use the same encoding.

I've exposed a new `get_encoding_name_for_model()` function
that behaves similarly to the one in the `tiktoken-js` package, and used
it inside of `encoding_for_model()`.

Finally, I've also added a test to ensure that this function can be
called properly from typescript code, and that it properly throws
exceptions in the case of invalid model names.

Fixes: dqbd#123
@noseworthy noseworthy force-pushed the expose-encoding-name-for-model branch from e405f8b to e512c0d Compare February 17, 2025 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose getEncodingNameForModel in wasm version?
1 participant