Trying this with bumblebee, some of the default neural tasks crash #58
Comments
I also get the crash on my end, but it's silent as well. I suspect memory issues or something like that.
I tried running the code in
I get a crash when using Nx.BinaryBackend as well, which is possibly related:
We can use EMLX.clip on the indices over the corresponding axes to avoid the segfault. Alternatively, if we use a process dictionary key to track whether we're inside an EMLX compilation workflow, we can do a runtime check on the concrete value and raise an error instead.
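The clamping idea can be sketched with plain Nx calls (a minimal illustration, not EMLX internals; the tensor shapes and `axis_size` here are made up for the example):

```elixir
# Out-of-bounds indices (e.g. the max uint32 value seen in the crash)
# are clamped into [0, axis_size - 1] before the gather, so the
# underlying NIF never dereferences memory outside the tensor:
axis_size = 4
indices = Nx.tensor([[0], [3], [4_294_967_295]])

safe_indices = Nx.clip(indices, 0, axis_size - 1)

t = Nx.iota({axis_size})
Nx.gather(t, safe_indices)
# every gathered entry now comes from a valid position of `t`
```

The runtime-check variant would instead compare the concrete indices against the axis size and raise before calling into the NIF, which is only possible outside a compiled defn.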
What strikes me as odd is that the index being accessed is the max uint32 value (4294967295).
Good call, there was a case where Bumblebee would generate out-of-bounds indices. At the same time, whatever gather returned for those specific indices was ignored, so it worked fine as long as gather returned anything. I've just fixed the indices on Bumblebee main. That said, we definitely don't want EMLX to segfault :D I updated the notebook to install
Oh, got it, the other error is just a configuration issue in the notebook (`Nx.default_backend/1` only affects the calling process, while `Nx.global_default_backend/1` applies to all processes, including those spawned by `Nx.Serving`):

```diff
-Nx.default_backend(EMLX.Backend)
+Nx.global_default_backend(EMLX.Backend)
```
Thank you, this is amazing! Following your instructions (hopefully I got them right), I still experience crashes when I hit run on the neural network task: https://gist.github.com/nickkaltner/e1f69a7a73530bb443584c057eae4583
It seems to work on some runs and crash on others (and sometimes crashes midway through generation), so there's possibly another segfault in EMLX, though less deterministic.
Yes, same. This seems to reproduce it:

```elixir
Mix.install([
  {:nx, "~> 0.9.2"},
  {:bumblebee, github: "elixir-nx/bumblebee", override: true},
  {:emlx, github: "elixir-nx/emlx"},
  {:kino_bumblebee, "~> 0.5.1"}
])

Nx.global_default_backend(EMLX.Backend)
# Nx.global_default_backend({EMLX.Backend, device: :gpu})
Nx.Defn.default_options(compiler: EMLX)

{:ok, model_info} = Bumblebee.load_model({:hf, "openai-community/gpt2"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai-community/gpt2"})

{:ok, generation_config} =
  Bumblebee.load_generation_config({:hf, "openai-community/gpt2"})

generation_config = Bumblebee.configure(generation_config, max_new_tokens: 20)

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 100],
    stream: true
  )

Nx.Serving.run(serving, "What is the capital of queensland?")
|> Enum.map(fn x -> IO.puts(x) end)
```

and then just run it.
After a bunch of digging it looks like going into
Not sure if this helps anyone; there are lots of moving parts that I'm trying to understand :) Another thing I worked out is that I can jump into lldb if I put these lines at the top of the above file;
then I run the command it outputs in another tab, type
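The exact lines were lost from the quote above, but the usual trick (an assumption based on the standard BEAM-plus-lldb workflow; `System.pid/0` returns the OS pid of the running VM) looks something like:

```elixir
# Print a ready-to-paste attach command for another terminal:
IO.puts("In another terminal, run: lldb -p #{System.pid()}")

# Pause so there is time to attach and type `continue` in lldb
# before the crashing code runs:
IO.gets("Press enter once lldb is attached> ")
```

Once attached, lldb will stop on the segfault with a native stack trace into the NIF.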
https://dockyard.com/blog/2024/05/22/debugging-elixir-nifs-with-lldb might help. It's my usual workflow for using LLDB with Elixir.
Also, we might need to introduce a debug-enabled build of libmlx for proper debugging here (cc @cocoa-xu). Although it might be sufficient to enable `-g` on the NIF side, which would at least tell us which NIF call on our side is the problematic one.
Ah, we have that already; it can be enabled by setting
As for the
Maybe related to ml-explore/mlx#1448? I've successfully run
Both seem to prevent NIFs from running in separate threads (I'm not sure if this is always the case). Note that, in my experiment, I removed
I ran into this trying the streaming. I think I also hit it in a weird way while developing, when the dev server would rebuild various parts of the application: suddenly it stopped rebuilding, and when I checked the terminal the app was dead from a segmentation fault. I won't be able to reproduce it, but I imagine it's related or similar.
I also ran into this, though my stacktrace is a bit different from the one above:
I found that removing
Crashes for me in Livebook too. I am trying the Bumblebee default examples, just modified to use EMLX:

```elixir
Mix.install([
  {:nx, "~> 0.9.2"},
  {:bumblebee, github: "elixir-nx/bumblebee", override: true},
  {:emlx, github: "elixir-nx/emlx"},
  {:kino_bumblebee, "~> 0.5.1"}
])

# Nx.global_default_backend(EMLX.Backend)
Nx.global_default_backend({EMLX.Backend, device: :gpu})
Nx.default_backend({EMLX.Backend, device: :gpu})
```

```elixir
repo_id = "CompVis/stable-diffusion-v1-4"
opts = [params_variant: "fp16", type: :bf16, backend: EMLX.Backend]

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-large-patch14"})
{:ok, clip} = Bumblebee.load_model({:hf, repo_id, subdir: "text_encoder"}, opts)
{:ok, unet} = Bumblebee.load_model({:hf, repo_id, subdir: "unet"}, opts)
{:ok, vae} = Bumblebee.load_model({:hf, repo_id, subdir: "vae"}, [architecture: :decoder] ++ opts)
{:ok, scheduler} = Bumblebee.load_scheduler({:hf, repo_id, subdir: "scheduler"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, repo_id, subdir: "feature_extractor"})
{:ok, safety_checker} = Bumblebee.load_model({:hf, repo_id, subdir: "safety_checker"}, opts)

:ok
```

```elixir
serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 2,
    num_images_per_prompt: 1,
    safety_checker: safety_checker,
    safety_checker_featurizer: featurizer,
    compile: [batch_size: 1, sequence_length: 60]
    # Option 1
    # defn_options: [compiler: EMLX]
    # Option 2 (reduces GPU usage, but runs noticeably slower)
    # Also remove `backend: EXLA.Backend` from the loading options above
    # defn_options: [compiler: EXLA, lazy_transfers: :always]
  )

Kino.start_child({Nx.Serving, name: StableDiffusion, serving: serving})
```

```elixir
prompt_input =
  Kino.Input.text("Prompt", default: "numbat, forest, high quality, detailed, digital art")

negative_prompt_input = Kino.Input.text("Negative Prompt", default: "darkness, rainy, foggy")

Kino.Layout.grid([prompt_input, negative_prompt_input])
```

```elixir
prompt = Kino.Input.read(prompt_input)
negative_prompt = Kino.Input.read(negative_prompt_input)

output =
  Nx.Serving.batched_run(StableDiffusion, %{prompt: prompt, negative_prompt: negative_prompt})

for result <- output.results do
  Kino.Image.new(result.image)
end
|> Kino.Layout.grid(columns: 2)
```
I'm on an M1 MacBook Pro; here is a gist of what I did:
https://gist.github.com/nickkaltner/38d0801b11c12407ac50e3517685e623
The first and second cells work; the third crashes the runtime.
Is there a debug option I should be using?
Thanks!