CUDA: Fix new mma detection for Turing cards with Volta PTX #12187

Open

wants to merge 1 commit into master
Conversation

@neilmehta24 commented Mar 4, 2025

We are seeing that this change incorrectly disables flash attention for Turing cards (cc=75) when llama.cpp is compiled for Volta cards (cc=70) only. The fix is to check that the code was compiled for Volta or greater and that the card is Turing or greater. If there is a better way to fix this, please advise.
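A minimal sketch of the intended condition, with illustrative names rather than the exact ggml-cuda identifiers: Volta PTX (cc 7.0) is forward-compatible with Turing (cc 7.5), so the new-mma / flash-attention path should stay enabled when the binary was compiled for Volta or newer and the physical device is Turing or newer.

```cpp
#include <cstdio>

// Illustrative constants; not the exact llama.cpp/ggml-cuda names.
constexpr int CC_VOLTA  = 700;  // compute capability 7.0
constexpr int CC_TURING = 750;  // compute capability 7.5

// compiled_cc: highest architecture the PTX/binary was built for
// device_cc:   compute capability of the GPU at run time
bool new_mma_available(int compiled_cc, int device_cc) {
    return compiled_cc >= CC_VOLTA && device_cc >= CC_TURING;
}

int main() {
    // The scenario from this PR: built only for cc 7.0, run on a cc 7.5 card.
    // A check that only looks at the compiled architecture (compiled_cc >=
    // CC_TURING) returns false here and disables flash attention; the
    // condition above keeps it enabled.
    std::printf("%d\n", new_mma_available(/*compiled_cc=*/700, /*device_cc=*/750));
    return 0;
}
```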

To reproduce the breakage on the current build, compile with architecture 70 and without architecture 75, and generate with flash attention on a Turing card.
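A possible reproduction recipe, assuming a CMake CUDA build of llama.cpp; the binary name, model path, and prompt below are placeholders and may differ for your setup:

```sh
# Build only for Volta (cc 7.0), deliberately omitting Turing (cc 7.5).
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=70
cmake --build build --config Release

# Generate with flash attention enabled on a Turing GPU.
./build/bin/llama-cli -m model.gguf -fa -p "hello"
```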

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Mar 4, 2025
@JohannesGaessler (Collaborator)
Please confirm whether or not #12222 fixes the issue. The fix in this PR is definitely not correct for all scenarios.

Labels: Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning)
2 participants