Trying this with bumblebee, some of the default neural tasks crash #58

Open

nickkaltner opened this issue Nov 30, 2024 · 19 comments

@nickkaltner

I'm on an M1 MacBook Pro; here is a gist of what I did:

https://gist.github.com/nickkaltner/38d0801b11c12407ac50e3517685e623.

The first and second cells work; the third crashes the runtime.

Is there a debug option I should be using?

thanks!

@polvalente
Collaborator

I also get the crash on my end, but it's silent as well. I suspect memory issues or something like that.
@jonatanklosko any thoughts on how to debug this?

@samrat
Contributor

samrat commented Nov 30, 2024

I tried running the code in iex; it looks like it segfaults. I haven't found the root cause yet, but here is the stack trace:

(lldb)
* thread #40, stop reason = EXC_BAD_ACCESS (code=1, address=0xc0341497400)
  * frame #0: 0x00000001977d3230 libsystem_platform.dylib`_platform_memmove + 144
    frame #1: 0x000000011c81cbf0 libmlx.dylib`float* std::__1::__constexpr_memmove[abi:ue170006]<float, float const, 0>(float*, float const*, std::__1::__element_count) + 80
    frame #2: 0x000000011c81cb68 libmlx.dylib`std::__1::pair<float const*, float*> std::__1::__copy_trivial_impl[abi:ue170006]<float const, float>(float const*, float const*, float*) + 72
    frame #3: 0x000000011c81cacc libmlx.dylib`std::__1::pair<float const*, float*> std::__1::__copy_trivial::operator()[abi:ue170006]<float const, float, 0>(float const*, float const*, float*) const + 44
    frame #4: 0x000000011c81ca44 libmlx.dylib`std::__1::pair<float const*, float*> std::__1::__unwrap_and_dispatch[abi:ue170006]<std::__1::__overload<std::__1::__copy_loop<std::__1::_ClassicAlgPolicy>, std::__1::__copy_trivial>, float const*, float const*, float*, 0>(float const*, float const*, float*) + 88
    frame #5: 0x000000011c81c9d0 libmlx.dylib`std::__1::pair<float const*, float*> std::__1::__dispatch_copy_or_move[abi:ue170006]<std::__1::_ClassicAlgPolicy, std::__1::__copy_loop<std::__1::_ClassicAlgPolicy>, std::__1::__copy_trivial, float const*, float const*, float*>(float const*, float const*, float*) + 40
    frame #6: 0x000000011c81c98c libmlx.dylib`std::__1::pair<float const*, float*> std::__1::__copy[abi:ue170006]<std::__1::_ClassicAlgPolicy, float const*, float const*, float*>(float const*, float const*, float*) + 40
    frame #7: 0x000000011c81ae20 libmlx.dylib`float* std::__1::copy[abi:ue170006]<float const*, float*>(float const*, float const*, float*) + 40
    frame #8: 0x000000011d077724 libmlx.dylib`void mlx::core::gather<float, unsigned int>(mlx::core::array const&, std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&, std::__1::vector<int, std::__1::allocator<int>> const&, std::__1::vector<int, std::__1::allocator<int>> const&) + 1604
    frame #9: 0x000000011d05ca3c libmlx.dylib`void mlx::core::dispatch_gather<unsigned int>(mlx::core::array const&, std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&, std::__1::vector<int, std::__1::allocator<int>> const&, std::__1::vector<int, std::__1::allocator<int>> const&) + 408
    frame #10: 0x000000011d05c0b8 libmlx.dylib`mlx::core::Gather::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&) + 556
    frame #11: 0x000000011d1cb9e0 libmlx.dylib`mlx::core::Gather::eval_cpu(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&) + 40
    frame #12: 0x000000011c8deef8 libmlx.dylib`mlx::core::UnaryPrimitive::eval_cpu(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>&) + 76
    frame #13: 0x000000011c9eac6c libmlx.dylib`mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_4::operator()() + 272
    frame #14: 0x000000011c9eab50 libmlx.dylib`decltype(std::declval<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_4&>()()) std::__1::__invoke[abi:ue170006]<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_4&>(mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_4&) + 24
    frame #15: 0x000000011c9eab08 libmlx.dylib`void std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ue170006]<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_4&>(mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_4&) + 24
    frame #16: 0x000000011c9eaae4 libmlx.dylib`std::__1::__function::__alloc_func<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_4, std::__1::allocator<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_4>, void ()>::operator()[abi:ue170006]() + 28
    frame #17: 0x000000011c9e9914 libmlx.dylib`std::__1::__function::__func<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_4, std::__1::allocator<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_4>, void ()>::operator()() + 28
    frame #18: 0x000000011c9c5f80 libmlx.dylib`std::__1::__function::__value_func<void ()>::operator()[abi:ue170006]() const + 68
    frame #19: 0x000000011c9c5738 libmlx.dylib`std::__1::function<void ()>::operator()() const + 24
    frame #20: 0x000000011c9c4a10 libmlx.dylib`mlx::core::scheduler::StreamThread::thread_fn() + 288
    frame #21: 0x000000011c9c69d8 libmlx.dylib`decltype(*std::declval<mlx::core::scheduler::StreamThread*>().*std::declval<void (mlx::core::scheduler::StreamThread::*)()>()()) std::__1::__invoke[abi:ue170006]<void (mlx::core::scheduler::StreamThread::*)(), mlx::core::scheduler::StreamThread*, void>(void (mlx::core::scheduler::StreamThread::*&&)(), mlx::core::scheduler::StreamThread*&&) + 116
    frame #22: 0x000000011c9c6918 libmlx.dylib`void std::__1::__thread_execute[abi:ue170006]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::StreamThread::*)(), mlx::core::scheduler::StreamThread*, 2ul>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::StreamThread::*)(), mlx::core::scheduler::StreamThread*>&, std::__1::__tuple_indices<2ul>) + 48
    frame #23: 0x000000011c9c6214 libmlx.dylib`void* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::StreamThread::*)(), mlx::core::scheduler::StreamThread*>>(void*) + 84
    frame #24: 0x00000001977a2f94 libsystem_pthread.dylib`_pthread_start + 136

@samrat
Contributor

samrat commented Dec 1, 2024

I get a crash when using Nx.BinaryBackend as well, which is possibly related:

** (ArgumentError) index 4294967295 is out of bounds for axis 0 in shape {1024, 768}
    (nx 0.9.1) lib/nx/binary_backend.ex:1981: anonymous fn/3 in Nx.BinaryBackend.index_to_binary_offset/2
    (elixir 1.15.7) lib/enum.ex:2510: Enum."-reduce/3-lists^foldl/2-0-"/3
    (nx 0.9.1) lib/nx/binary_backend.ex:1977: Nx.BinaryBackend.index_to_binary_offset/2
    (nx 0.9.1) lib/nx/binary_backend.ex:1966: Nx.BinaryBackend."-gather/3-lbc$^0/2-0-"/9
    (nx 0.9.1) lib/nx/binary_backend.ex:1961: Nx.BinaryBackend.gather/3
    (nx 0.9.1) lib/nx/defn/evaluator.ex:434: Nx.Defn.Evaluator.eval_apply/4
    (nx 0.9.1) lib/nx/defn/evaluator.ex:242: Nx.Defn.Evaluator.eval/3
    iex:11: (file)
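
For anyone trying to reduce this further: a minimal sketch of the same class of error on Nx.BinaryBackend, with made-up shapes rather than the actual Bumblebee code path:

# Eager gather on the binary backend with an index past the end of axis 0.
# This raises an ArgumentError like the one above instead of segfaulting.
t = Nx.iota({2, 3}, backend: Nx.BinaryBackend)
idx = Nx.tensor([[2, 0]], type: :u32, backend: Nx.BinaryBackend)
Nx.gather(t, idx)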

@polvalente
Collaborator

We can use EMLX.clip on the indices over the corresponding axes to avoid the segfault.

Alternatively, if we use a process dictionary key to track whether we're inside an EMLX compilation workflow, we can do the runtime check on the concrete value and raise an error instead.
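
For illustration, a sketch of the clipping idea in plain Nx rather than the internal EMLX call, with made-up shapes; the real fix would live inside the EMLX gather implementation:

# Clamp each gather index into the valid range of its axis so an out-of-bounds
# value (e.g. max uint32) can no longer read past the end of the buffer.
t = Nx.iota({1024, 768}, type: :f32)
idx = Nx.tensor([[4_294_967_295, 0]], type: :u32)

{rows, cols} = Nx.shape(t)
max_per_axis = Nx.tensor([rows - 1, cols - 1], type: :u32)

safe_idx = Nx.min(idx, max_per_axis)
Nx.gather(t, safe_idx)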

@polvalente
Collaborator

What strikes me as odd is that the index being accessed is max uint32

@jonatanklosko
Member

jonatanklosko commented Dec 2, 2024

What strikes me as odd is that the index being accessed is max uint32

Good call, there was a case where Bumblebee would generate out-of-bounds indices. At the same time, whatever gather returned for these specific indices was ignored, so it worked fine as long as gather returned anything. I've just fixed the indices on Bumblebee main.

That said, we definitely don't want EMLX to segfault :D

I updated the notebook to install {:bumblebee, github: "elixir-nx/bumblebee", override: true}, but now there is a different error.

@jonatanklosko
Member

Oh, got it. The other error is just a configuration issue in the notebook:

-Nx.default_backend(EMLX.Backend)
+Nx.global_default_backend(EMLX.Backend)

@nickkaltner
Author

nickkaltner commented Dec 2, 2024

Thank you, this is amazing! Following your instructions (hopefully I got them correct), I still experience crashes when I hit run on the neural network task:

https://gist.github.com/nickkaltner/e1f69a7a73530bb443584c057eae4583

@jonatanklosko
Member

It seems to work on some runs and crash on others (and sometimes crashes midway through generation), so there's possibly another segfault in EMLX, though less deterministic.

@nickkaltner
Author

Yes, same here.

This seems to reproduce it:

Mix.install([
  {:nx, "~> 0.9.2"},
  {:bumblebee, github: "elixir-nx/bumblebee", override: true},
  {:emlx, github: "elixir-nx/emlx"},
  {:kino_bumblebee, "~> 0.5.1"}
])

Nx.global_default_backend(EMLX.Backend)
# Nx.global_default_backend({EMLX.Backend, device: :gpu})

Nx.Defn.default_options(compiler: EMLX)

{:ok, model_info} = Bumblebee.load_model({:hf, "openai-community/gpt2"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai-community/gpt2"})

{:ok, generation_config} =
  Bumblebee.load_generation_config({:hf, "openai-community/gpt2"})

generation_config = Bumblebee.configure(generation_config, max_new_tokens: 20)

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 100],
    stream: true
  )

Nx.Serving.run(serving, "What is the capital of queensland?") |> Enum.map(fn x -> IO.puts(x) end)

and then just run elixir script.exs

@nickkaltner
Author

nickkaltner commented Dec 3, 2024

After a bunch of digging, it looks like opening the macOS Console shows the crash dump:

Crashed Thread:        43

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000000000000
Exception Codes:       0x0000000000000001, 0x0000000000000000

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   exc handler [29414]

Thread 43 Crashed:
0   libmlx.dylib                  	       0x33976c5d0 std::__1::__function::__func<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_4, std::__1::allocator<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_4>, void ()>::operator()() + 172
1   libmlx.dylib                  	       0x339760268 mlx::core::scheduler::StreamThread::thread_fn() + 488
2   libmlx.dylib                  	       0x339760424 void* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::StreamThread::*)(), mlx::core::scheduler::StreamThread*>>(void*) + 72
3   libsystem_pthread.dylib       	       0x1975f72e4 _pthread_start + 136
4   libsystem_pthread.dylib       	       0x1975f20fc thread_start + 8

Not sure if this helps anyone; there are lots of moving parts that I'm trying to understand :)

Another thing I worked out is that I can jump into lldb if I put these lines at the top of the above file:

IO.puts("sudo lldb --attach-pid #{System.pid()}")
IO.gets("hit enter to continue")

Then I run the command it outputs in another tab, type continue in lldb, and hit enter in this tab to continue. I haven't worked out how to see which variables are passed into the current function yet; my ARM assembly is rusty.

@polvalente
Collaborator

https://dockyard.com/blog/2024/05/22/debugging-elixir-nifs-with-lldb

This might help. It's my usual workflow for using LLDB with Elixir stuff

@polvalente
Collaborator

Also, we might need to introduce a debug-enabled version of libmlx for proper debugging here (cc @cocoa-xu). Although it might be sufficient to enable -g on the NIF side, which would at least let us know which NIF call on our side is the problematic one.

@cocoa-xu
Member

cocoa-xu commented Dec 3, 2024

we might need to introduce a debug-enabled version of libmlx for proper debugging here

Ah, we have that already; it can be enabled by setting LIBMLX_ENABLE_DEBUG to true.
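
For reference, a sketch of how that might be wired into the repro script, under the assumption that LIBMLX_ENABLE_DEBUG is read as an OS environment variable while the emlx dependency is being compiled (force: true rebuilds it if it's already cached):

# Assumption: the flag has to be set before Mix.install compiles emlx/libmlx.
System.put_env("LIBMLX_ENABLE_DEBUG", "true")

Mix.install(
  [
    {:nx, "~> 0.9.2"},
    {:emlx, github: "elixir-nx/emlx"}
  ],
  force: true
)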

@cocoa-xu
Member

cocoa-xu commented Dec 3, 2024

As for the -g version, I can send a PR real quick.

@irisTa56

Maybe related to ml-explore/mlx#1448?

I've successfully run elixir script.exs tens of times with either of the following settings:

  • Change the :stream option to false
  • Call :erlang.system_flag(:schedulers_online, 1) or set ELIXIR_ERL_OPTIONS="+S"

Both seem to prevent NIFs from running in separate threads (I'm not sure if this is always the case).
The former prevents a separate streaming process from being spawned, so only the main process calls NIFs.
The latter ensures that the main process and streaming process run on the same scheduler thread.

Note that, in my experiment, I removed Nx.Defn.default_options(compiler: EMLX) (i.e., I'm using Nx.Defn.Evaluator) because the compiled mode has been implemented since the issue was first reported.
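
A sketch of the second workaround applied to the top of the repro script above (everything after Mix.install stays the same, minus the compiler option):

# Force a single scheduler thread so the main process and the streaming process
# run on the same scheduler (roughly what starting the VM with fewer schedulers does).
:erlang.system_flag(:schedulers_online, 1)

Mix.install([
  {:nx, "~> 0.9.2"},
  {:bumblebee, github: "elixir-nx/bumblebee", override: true},
  {:emlx, github: "elixir-nx/emlx"},
  {:kino_bumblebee, "~> 0.5.1"}
])

Nx.global_default_backend(EMLX.Backend)
# Nx.Defn.default_options(compiler: EMLX) removed, so Nx.Defn.Evaluator is used

# ...rest of the script unchanged. Alternatively, keep the schedulers as they are
# and pass stream: false to Bumblebee.Text.generation/4 instead.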

@lawik

lawik commented Mar 5, 2025

I ran into this when trying the streaming. I think I also hit it in a weird way while developing, when the dev server would rebuild various parts of the application: suddenly it stopped rebuilding, and when I checked the terminal the app was dead from a segmentation fault. I won't be able to reproduce it, but I would imagine it is related or similar.

@pejrich

pejrich commented Mar 13, 2025

I also ran into this, though my stack trace is a bit different from the one above:

Thread 23 Crashed:: erts_dcpus_10
0   beam.smp                      	       0x1043cbd60 erts_list_length + 12
1   beam.smp                      	       0x10447de6c enif_get_list_length + 24
2   libemlx.so                    	       0x10b6641bc nx::nif::get_list(enif_environment_t*, unsigned long, std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>&) + 40
3   libemlx.so                    	       0x10b68b510 std::__1::__function::__func<compile(enif_environment_t*, int, unsigned long const*)::$_0, std::__1::allocator<compile(enif_environment_t*, int, unsigned long const*)::$_0>, std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> (std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&)>::operator()(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&) + 660
4   libmlx.dylib                  	       0x110cc9db8 mlx::core::detail::compile_trace(std::__1::function<std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> (std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&)> const&, std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, bool) + 836
5   libmlx.dylib                  	       0x110cd3ae8 std::__1::__function::__func<mlx::core::detail::compile(std::__1::function<std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> (std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&)>, unsigned long, bool, std::__1::vector<unsigned long long, std::__1::allocator<unsigned long long>>)::$_14, std::__1::allocator<mlx::core::detail::compile(std::__1::function<std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> (std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&)>, unsigned long, bool, std::__1::vector<unsigned long long, std::__1::allocator<unsigned long long>>)::$_14>, std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> (std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&)>::operator()(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&) + 196
6   libemlx.so                    	       0x10b66dfe0 call_compiled(enif_environment_t*, int, unsigned long const*) + 116
7   beam.smp                      	       0x10447ad80 erts_call_dirty_nif + 372
8   beam.smp                      	       0x104372460 erts_dirty_process_main + 592
9   beam.smp                      	       0x1042fdaa4 sched_dirty_cpu_thread_func + 188
10  beam.smp                      	       0x10454954c thr_wrapper + 192
11  libsystem_pthread.dylib       	       0x188ee42e4 _pthread_start + 136
12  libsystem_pthread.dylib       	       0x188edf0fc thread_start + 8

I found that removing Nx.Defn.default_options(compiler: EMLX) fixed it in my case, though I'm not sure what performance impact not using the EMLX compiler has.

@pshoukry

pshoukry commented Mar 13, 2025

Crashes for me in Livebook too. I am trying the default Bumblebee examples, just modified to use EMLX.

Mix.install([
  {:nx, "~> 0.9.2"},
  {:bumblebee, github: "elixir-nx/bumblebee", override: true},
  {:emlx, github: "elixir-nx/emlx"},
  {:kino_bumblebee, "~> 0.5.1"}
])

#Nx.global_default_backend(EMLX.Backend)
Nx.global_default_backend({EMLX.Backend, device: :gpu})
Nx.default_backend({EMLX.Backend, device: :gpu})
repo_id = "CompVis/stable-diffusion-v1-4"
opts = [params_variant: "fp16", type: :bf16, backend: EMLX.Backend]

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-large-patch14"})
{:ok, clip} = Bumblebee.load_model({:hf, repo_id, subdir: "text_encoder"}, opts)
{:ok, unet} = Bumblebee.load_model({:hf, repo_id, subdir: "unet"}, opts)
{:ok, vae} = Bumblebee.load_model({:hf, repo_id, subdir: "vae"}, [architecture: :decoder] ++ opts)
{:ok, scheduler} = Bumblebee.load_scheduler({:hf, repo_id, subdir: "scheduler"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, repo_id, subdir: "feature_extractor"})
{:ok, safety_checker} = Bumblebee.load_model({:hf, repo_id, subdir: "safety_checker"}, opts)

:ok
serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 2,
    num_images_per_prompt: 1,
    safety_checker: safety_checker,
    safety_checker_featurizer: featurizer,
    compile: [batch_size: 1, sequence_length: 60],
    # Option 1
    # defn_options: [compiler: EMLX]
    # Option 2 (reduces GPU usage, but runs noticeably slower)
    # Also remove `backend: EXLA.Backend` from the loading options above
    # defn_options: [compiler: EXLA, lazy_transfers: :always]
  )

Kino.start_child({Nx.Serving, name: StableDiffusion, serving: serving})
prompt_input =
  Kino.Input.text("Prompt", default: "numbat, forest, high quality, detailed, digital art")

negative_prompt_input = Kino.Input.text("Negative Prompt", default: "darkness, rainy, foggy")

Kino.Layout.grid([prompt_input, negative_prompt_input])
prompt = Kino.Input.read(prompt_input)
negative_prompt = Kino.Input.read(negative_prompt_input)

output =
  Nx.Serving.batched_run(StableDiffusion, %{prompt: prompt, negative_prompt: negative_prompt})

for result <- output.results do
  Kino.Image.new(result.image)
end
|> Kino.Layout.grid(columns: 2)
