
For my particular project, it would be very helpful to know how many tokens the BGE-M3 embedding model will break a string into before I embed the text. I could embed the string and count the tokens using the following code

Settings.callback_manager = callback_manager
embedding_vector = Settings.embed_model.get_text_embedding(text)
embedding_tokens = token_counter.total_embedding_token_count

but unfortunately, embedding large amounts of text is relatively computationally heavy, so I would prefer not to use this method. After doing some digging, I realized that I could use

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

to tokenize text, but oddly enough this gives a different token count than the previously mentioned method. I think the crux of my issue might be that the BGE-M3 embedding model pre-processes text prior to embedding. I have tried to Google exactly what this pre-processing step looks like, but have been unable to find it so far. Below is some code that re-creates the issue I am talking about. Please note that this script assumes you have the model saved in ./embeddings. If you don't, it will attempt to download the model automatically, which is > 2 GB.

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from transformers import AutoTokenizer
import os

# String to test on
text = "Random words. This is a test! A very exciting test, indeed."
# just set chunk_size to 512
chunk_size = 512


# Load in the embedding model. If it does not exist, go ahead and download it
def create_embedding_model(_chunk_size=None):
    print('loading embeddings...')
    if os.path.exists('./embeddings/models--BAAI--bge-m3'):
        _cache_path = f"./embeddings/models--BAAI--bge-m3/snapshots/{os.listdir('./embeddings/models--BAAI--bge-m3/snapshots')[0]}"
        _embed_model = HuggingFaceEmbedding(model_name=_cache_path)
    else:
        os.makedirs("./embeddings", exist_ok=True)
        _emb_model_name = "BAAI/bge-m3"
        _embed_model = HuggingFaceEmbedding(model_name=_emb_model_name, max_length=_chunk_size,
                                            cache_folder='./embeddings')
    print('embeddings loaded')
    return _embed_model


# Grab embedding model
embed_model = create_embedding_model(_chunk_size=chunk_size)

# Create a token counting handler
token_counter = TokenCountingHandler()
callback_manager = CallbackManager([token_counter])

Settings.embed_model = embed_model
Settings.callback_manager = callback_manager

# Grab the embedding vector from the model
embedding_vector = Settings.embed_model.get_text_embedding(text)

# Grab the count of tokens from the embedding vector
embedding_tokens = token_counter.total_embedding_token_count

# Just the tokenizer
model_name = "BAAI/bge-m3"
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenized_text = tokenizer(text)
token_count = len(tokenized_text['input_ids'])

print(f"Original text: {text}")
print(f"The embedding model broke the text into: {embedding_tokens} tokens")
print(f"The tokenizer broke the text into {token_count} tokens")

When I run the above, I get the following output

Original text: Random words. This is a test! A very exciting test, indeed.
The embedding model broke the text into: 15 tokens
The tokenizer broke the text into 18 tokens

How can I replicate the token count that the BGE-M3 embedding model uses, without running the embedding itself? Is there a way to pre-process the text in the same way the embedding model does, so that the tokenizer gives the same token count?

2 Answers


What you face here is a good example of the Python principle that explicit is better than implicit.

The TokenCountingHandler initializes a TokenCounter object with the tokenizer you provide as a parameter, or with the default tokenizer (code reference). The default tokenizer of llama_index is a tiktoken (i.e. OpenAI) tokenizer, not the one your model uses:

from llama_index.core import Settings

print(Settings.tokenizer)

Output:

functools.partial(<bound method Encoding.encode of <Encoding 'cl100k_base'>>, allowed_special='all')
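As a quick sanity check (assuming tiktoken is installed, which llama_index depends on anyway), counting the question's string with that cl100k_base encoding should reproduce the 15 tokens the handler reported, i.e. the handler was counting OpenAI tokens, not BGE-M3 tokens:

import tiktoken

# Count the question's string with the default cl100k_base encoding.
# This should print 15, matching the count the TokenCountingHandler
# reported in the question.
enc = tiktoken.get_encoding("cl100k_base")
text = "Random words. This is a test! A very exciting test, indeed."
print(len(enc.encode(text)))  # 15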

In order to get the correct number of tokens, which is 18, you need to initialize the TokenCounter object with the tokenizer your model actually uses. Since the llama_index TokenCounter implementation requires the tokenizer to return the list of input_ids instead of Hugging Face's "standard" BatchEncoding object, you need to wrap the tokenizer first (otherwise the count will always come out as 2, the number of keys: input_ids and attention_mask).
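You can see this concretely with a short sketch (assuming the BGE-M3 tokenizer returns the usual input_ids and attention_mask fields): the unwrapped tokenizer returns a dict-like BatchEncoding, so taking its length counts the keys rather than the tokens.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
encoding = tokenizer("Random words. This is a test! A very exciting test, indeed.")

# BatchEncoding behaves like a dict, so len() counts its keys, not the tokens
print(len(encoding))               # 2 -> input_ids, attention_mask
print(len(encoding["input_ids"]))  # 18 -> the actual token count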

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from transformers import AutoTokenizer
from transformers import XLMRobertaTokenizerFast
import os

text = "Random words. This is a test! A very exciting test, indeed."
chunk_size = 512

model_id = "BAAI/bge-m3"


# AutoTokenizer is just a factory
# BAAI/bge-m3 uses an XLMRobertaTokenizer
class MyTokenCounterTokenizerForLlamaIndex(XLMRobertaTokenizerFast):
    def __call__(self, *args, **kwargs):
        return super().__call__(*args, **kwargs).input_ids

llama_index_tokenizer = MyTokenCounterTokenizerForLlamaIndex.from_pretrained(model_id)

embed_model = HuggingFaceEmbedding(model_name=model_id, max_length=chunk_size)

# Create a token counting handler
# You could of course also make it the default via the settings object
token_counter = TokenCountingHandler(tokenizer=llama_index_tokenizer)
callback_manager = CallbackManager([token_counter])

Settings.embed_model = embed_model
Settings.callback_manager = callback_manager

# Grab the embedding vector from the model
embedding_vector = Settings.embed_model.get_text_embedding(text)

# Grab the count of tokens from the embedding vector
embedding_tokens = token_counter.total_embedding_token_count

# Just the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
token_count = len(tokenizer(text).input_ids)

print(f"Original text: {text}")
print(f"The embedding model broke the text into: {embedding_tokens} tokens")
print(f"The tokenizer broke the text into {token_count} tokens")

Output:

Original text: Random words. This is a test! A very exciting test, indeed.
The embedding model broke the text into: 18 tokens
The tokenizer broke the text into 18 tokens
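As the comment in the snippet notes, instead of passing the tokenizer to each TokenCountingHandler you could also make it the global default via the Settings object. A minimal sketch, assuming Settings.tokenizer sets the same global tokenizer the handler falls back to when none is passed:

from llama_index.core import Settings

# Make the wrapped BGE-M3 tokenizer the global default, so any
# TokenCountingHandler created without an explicit tokenizer uses it.
Settings.tokenizer = llama_index_tokenizer
token_counter = TokenCountingHandler()  # now counts BGE-M3 tokens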

1 Comment

This is fantastic, thank you for the input!

The difference is that BGE-M3 does extra preprocessing (lowercasing, stripping, adding [CLS]/[SEP]) before tokenization. To match its token count:

from transformers import AutoTokenizer

text = "Random words. This is a test! A very exciting test, indeed."
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

tokens = tokenizer(
    text.lower().strip(),
    add_special_tokens=True,
    truncation=True,
    max_length=512
)

print(len(tokens["input_ids"]))

This will align with the token count reported by the embedding model.

Ref: BGE-M3 model card

1 Comment

Thanks for your input! Using your method I get 17 tokens, as opposed to my initial 18 tokens. This looks like an improvement, considering BGE-M3 breaks the string down to 15 tokens, but there is still a mismatch. I am wondering whether A: more pre-processing needs to be done, or B: I am incorrectly counting the tokens generated by the embedding model.
