Move cl100k_base tokenizer to docker image (#3411)

### What problem does this PR solve?

Move the cl100k_base tiktoken encoding file into the Docker image and point `TIKTOKEN_CACHE_DIR` at it, so the encoder can be loaded locally instead of being downloaded at runtime.

issue: #3338 
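
For context (not part of this PR's diff): tiktoken's loader looks inside `TIKTOKEN_CACHE_DIR` for a file named after the SHA-1 hex digest of the encoding's download URL, so baking that file into the image lets `get_encoding("cl100k_base")` resolve offline. Below is a minimal sketch of how such a cache file could be pre-seeded at image build time; the `/ragflow` path, the local `cl100k_base.tiktoken` source file, and the hash-naming convention are assumptions about tiktoken's internal loader, not something stated in this PR.

```python
import hashlib
import os
import shutil

# tiktoken's loader (an internal detail that may change between versions) caches
# encodings inside TIKTOKEN_CACHE_DIR under the SHA-1 hex digest of the source URL.
BLOB_URL = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cache_dir = "/ragflow"  # hypothetical: wherever get_project_base_directory() resolves in the image
cache_key = hashlib.sha1(BLOB_URL.encode()).hexdigest()

os.makedirs(cache_dir, exist_ok=True)
# Place a pre-downloaded cl100k_base.tiktoken file under its hashed name so that
# tiktoken.get_encoding("cl100k_base") is served from disk instead of the network.
shutil.copy("cl100k_base.tiktoken", os.path.join(cache_dir, cache_key))
```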

### Type of change

- [x] Refactoring

Signed-off-by: jinhai <haijin.chn@gmail.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
Authored by Jin Hai on 2024-11-15 10:18:40 +08:00, committed by GitHub
parent 220aaddc62, commit 996c94a8e7
4 changed files with 15 additions and 4 deletions


@@ -17,7 +17,7 @@
 import os
 import re
 import tiktoken
+from api.utils.file_utils import get_project_base_directory

 def singleton(cls, *args, **kw):
     instances = {}

@@ -71,9 +71,10 @@ def findMaxTm(fnm):
         pass
     return m

-encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")
+tiktoken_cache_dir = get_project_base_directory()
+os.environ["TIKTOKEN_CACHE_DIR"] = tiktoken_cache_dir
+# encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")
+encoder = tiktoken.get_encoding("cl100k_base")

 def num_tokens_from_string(string: str) -> int:
     """Returns the number of tokens in a text string."""