Improve throughput of computing embeddings with BetterTransformer
#15
base: main
Conversation
move tests from the previous repo
I tested this out (in the multi-gpu case), and I'm not seeing any improvements with this change. I'm running:

```python
import time

# `cf` is this project's top-level package, imported elsewhere in the full script.

if __name__ == "__main__":
    torch_mem = 40
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    dataset = "quora"

    start = time.time()
    model = cf.SentenceTransformerModel(model_name, max_mem_gb=torch_mem)
    with cf.Distributed(rmm_pool_size=f"{torch_mem}GB", n_workers=2):
        cf.embed(
            dataset,
            model=model,
            vector_search=False,
            sorted_data_loader=True,
            overwrite=True,
        )
    print("total time", time.time() - start)
```

With

Without
That could be because the Flash Attention kernel is not available in your environment. Try running the following, which explicitly activates the flash attention kernel; if it raises an error, the flash kernel isn't usable there:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2").to("cuda")

# convert the model to BetterTransformer
model = BetterTransformer.transform(model.to(torch.float16))

input_text = "Example sentence"
inputs = {k: v.cuda() for k, v in tokenizer(input_text, return_tensors="pt").items()}

# restrict PyTorch to the flash SDP kernel only; this fails if it can't be used
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    outputs = model(**inputs)
```
I didn't get any errors, so I assume the flash attention kernel is available. I'm running everything in
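For what it's worth, a lighter-weight way to confirm that assumption without loading the model is to run plain scaled-dot-product attention with only the flash backend enabled; a small sketch (the tensor shapes are arbitrary, not from this thread):

```python
import torch

# shape: (batch, heads, seq_len, head_dim); fp16 on CUDA is required for flash
q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# raises a RuntimeError if the flash kernel cannot run on this GPU/inputs
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    torch.nn.functional.scaled_dot_product_attention(q, k, v)

print("flash attention kernel is available")
```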
Hmm, I've been trying things out with
Yeah, it looks like the model size has something to do with it. I do see ~2x improvements with large models like
This confirms my suspicion that which serving mechanism is most efficient for a particular model depends on various factors. I wonder if we could implement something similar to how we tried to estimate a model's memory consumption by calling it with various batch/seq-len combinations: take the model and some synthetic data, record the latency of all the batch-prediction techniques we offer, and pick the fastest.
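A rough sketch of that idea, assuming a hypothetical `pick_fastest_predictor` helper and candidate callables that don't exist in the repo yet:

```python
import time
import torch

def pick_fastest_predictor(candidates, synthetic_batches, warmup=2, iters=5):
    """Time each candidate predict_fn on synthetic batches and return the fastest.

    `candidates` maps a label to a callable taking one batch. All names here
    are illustrative, not an existing API in this repo.
    """
    timings = {}
    for name, predict_fn in candidates.items():
        # Warm up to exclude one-off costs (kernel selection, caching, ...).
        for batch in synthetic_batches[:warmup]:
            predict_fn(batch)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            for batch in synthetic_batches:
                predict_fn(batch)
        torch.cuda.synchronize()
        timings[name] = (time.time() - start) / iters
    best = min(timings, key=timings.get)
    return best, timings
```

We could run this once at setup with synthetic batches covering the same batch/seq-len grid the memory estimator already uses, and cache the winner per model.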
Improve throughput of computing embeddings with BetterTransformer in SentenceTransformerModel. This improved throughput by about 6x for me when processing the FIQA BEIR dataset (57k documents).
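For reference, this is roughly how the optimum BetterTransformer API is applied to the Hugging Face model wrapped by sentence-transformers; a generic sketch under that assumption, not the exact change inside SentenceTransformerModel:

```python
import torch
from sentence_transformers import SentenceTransformer
from optimum.bettertransformer import BetterTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")

# The underlying Hugging Face transformer lives in the first module; swap it
# for its BetterTransformer (fused/flash-attention) equivalent in fp16.
hf_module = model._first_module()
hf_module.auto_model = BetterTransformer.transform(hf_module.auto_model.to(torch.float16))

embeddings = model.encode(["Example sentence"], batch_size=256)
```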