
Error when running sampling example #9

Open
maraoz opened this issue Apr 30, 2020 · 28 comments

@maraoz

maraoz commented Apr 30, 2020

I followed the Install instructions and then ran the sampling command and got:

$ python jukebox/sample.py --model=5b_lyrics --name=sample_5b --levels=3 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=6 --hop_fraction=0.5,0.5,0.125
Caught error during NCCL init (attempt 0 of 5): Distributed package doesn't have NCCL built in
Caught error during NCCL init (attempt 1 of 5): Distributed package doesn't have NCCL built in
Caught error during NCCL init (attempt 2 of 5): Distributed package doesn't have NCCL built in
Caught error during NCCL init (attempt 3 of 5): Distributed package doesn't have NCCL built in
Caught error during NCCL init (attempt 4 of 5): Distributed package doesn't have NCCL built in
Traceback (most recent call last):
  File "jukebox/sample.py", line 237, in <module>
    fire.Fire(run)
  File "/Users/manu/opt/anaconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/Users/manu/opt/anaconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/Users/manu/opt/anaconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "jukebox/sample.py", line 229, in run
    rank, local_rank, device = setup_dist_from_mpi(port=port)
  File "/Users/manu/git/jukebox/jukebox/utils/dist_utils.py", line 86, in setup_dist_from_mpi
    raise RuntimeError("Failed to initialize NCCL")
RuntimeError: Failed to initialize NCCL

I tried googling around about this NCCL error (I have no idea what NCCL is), but couldn't find any solutions. Any idea on how to fix this? Thanks!

@carchrae

Given the pain of getting the correct versions of the NVIDIA drivers all lined up on a system, I wonder if a Docker image of this repo would help (or, alternatively, running this project inside the TF Docker image).

https://www.tensorflow.org/install/docker

Setting Docker up for GPU access requires some extra steps (see step 2 in the link above), but it was pretty straightforward.

@diffractometer

I'm having the same issue. I agree a Docker image would be great. Does anyone know whether NCCL is a dependency?

@diffractometer

@carchrae I think running a Docker image on an EC2 instance is the way to go if you don't have CUDA on a Mac, right?

@carchrae

carchrae commented May 4, 2020

@diffractometer - sounds right.

If you do have a CUDA-capable NVIDIA card (but not the CUDA libraries installed), you could probably still run the Docker + GPU extensions locally.

Otherwise, if you deploy one of these AMIs, it sounds like you get GPU-enabled Docker pre-installed: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html

@diffractometer

@carchrae awesome, I'll check that out. My other friend said I should just concentrate on getting it running locally in a Jupyter notebook.

@carchrae

carchrae commented May 5, 2020

I suspect you'd hit the same error in Jupyter, as it seems the code requires CUDA/NCCL:

https://github.com/openai/jukebox/blob/master/jukebox/utils/dist_utils.py#L42

I guess this is really a documentation bug (a common one): the code requires an NVIDIA GPU. Looking at the other issues being reported, I think you also need a GPU with a lot of RAM. I have yet to try it on my card (it has only 6 GB of RAM).

@diffractometer

Ah, dang. That makes sense, especially given the results. I'll keep poking...

@carchrae

carchrae commented May 5, 2020

Hmm, or maybe not? There is a CPU-only path (I bet it'll be damn slow, though!):

device = torch.device("cuda", local_rank) if use_cuda else torch.device("cpu")

@diffractometer

Ah, yeah, I saw that line and tried to change it, but every time I ran it the same error still bubbled up: Distributed package doesn't have NCCL built in.

@carchrae

carchrae commented May 5, 2020

Did you get any more useful error from this output?

            print(f"Caught error during NCCL init (attempt {attempt_idx} of {n_attempts}): {e}")

Also, love the comment on the next line:

            sleep(1 + (0.01 * mpi_rank))  # Sleep to avoid thundering herd
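
For context, that staggered sleep avoids a thundering herd: each MPI rank waits a slightly different amount of time before retrying, so the ranks don't all hit the master port at the same instant. A scaled-down toy version of that retry loop (the mpi_rank value and the simulated failure below are illustrative, not jukebox's actual code):

```python
import time

# Toy illustration of the retry loop in jukebox/utils/dist_utils.py:
# each rank sleeps a rank-dependent time before retrying the init.
n_attempts = 5
mpi_rank = 3  # stand-in; in jukebox this comes from the MPI communicator
logs = []

for attempt_idx in range(n_attempts):
    try:
        # Simulate the failing NCCL init on a machine without NCCL.
        raise RuntimeError("Distributed package doesn't have NCCL built in")
    except RuntimeError as e:
        logs.append(f"Caught error during NCCL init (attempt {attempt_idx} of {n_attempts}): {e}")
        # Scaled down from sleep(1 + 0.01 * mpi_rank) so the demo is fast.
        time.sleep(0.001 * (1 + 0.01 * mpi_rank))

print("\n".join(logs))
```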

@carchrae

carchrae commented May 5, 2020

So, I went through the install steps, and the sample seems to work for me (it is still running/downloading stuff).

My system: Ubuntu 18.04, CUDA 10.2.89 installed, GTX 1060 with 6 GB.

tom@saturn:~/projects/learning/jukebox$ python jukebox/sample.py --model=5b_lyrics --name=sample_5b --levels=3 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=6 --hop_fraction=0.5,0.5,0.125
Using cuda True
{'name': 'sample_5b', 'levels': 3, 'sample_length_in_seconds': 20, 'total_sample_length_in_seconds': 180, 'sr': 44100, 'n_samples': 6, 'hop_fraction': (0.5, 0.5, 0.125)}
Setting sample length to 881920 (i.e. 19.998185941043083 seconds) to be multiple of 128
Downloading from gce
Restored from /home/tom/.cache/jukebox-assets/models/5b/vqvae.pth.tar
0: Loading vqvae in eval mode
Conditioning on 1 above level(s)
Checkpointing convs
Checkpointing convs
Loading artist IDs from /home/tom/projects/learning/jukebox/jukebox/data/ids/v2_artist_ids.txt
Loading artist IDs from /home/tom/projects/learning/jukebox/jukebox/data/ids/v2_genre_ids.txt
Level:0, Cond downsample:4, Raw to tokens:8, Sample length:65536
Downloading from gce

@prafullasd
Collaborator

prafullasd commented May 5, 2020

The project does require a GPU to run; it could work on CPU, but that hasn't been tested and will almost surely be very slow.

@maraoz The NCCL error you see is in initialising torch.distributed, which technically isn't needed for sampling but is unfortunately still present in the code. Maybe initialise it with a different backend, e.g. setup_dist_from_mpi(backend="gloo"), or remove distributed/MPI altogether, as done here: #36 (comment)
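
A minimal, untested sketch of the gloo route: torch.distributed can initialise with the "gloo" backend on machines without any NVIDIA libraries. The single-process setup below is only illustrative; in jukebox you would pass backend="gloo" through setup_dist_from_mpi instead:

```python
import os
import torch.distributed as dist

# The default "env://" init method reads the rendezvous address from the
# environment; point it at localhost for a single-process run.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# gloo works on CPU-only builds of PyTorch, unlike the NCCL backend.
dist.init_process_group(backend="gloo", rank=0, world_size=1)
ok = dist.is_initialized()
print("gloo initialised:", ok)
dist.destroy_process_group()
```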

@prafullasd prafullasd self-assigned this May 5, 2020
@diffractometer

@prafullasd @carchrae thank you for your input. It looks like I need to spend a couple of days working on my tooling before I can attempt a build, so I'm going to familiarize myself with notebooks. If there's any way I can help with a Docker build in the meantime (testing at least, haha ;)), let me know.

@Jimmiexjames

Wowzers... this seems a little more complicated than I thought. I have the hardware and RAM, and even the choice of iOS vs Mac... but I don't code and I usually don't pirate, so I'm "overencumbered" by this foggy paranoia about this entire thing.

@stevebanik

@prafullasd how do you initialize with the gloo backend? Is that option passed to sample.py?

@diffractometer

@Jimmiexjames I'm having good luck using the Colab notebook, at least just getting it running. I ended up using the paid plan to stop timeouts.

@btrude

btrude commented May 10, 2020

I made a jukebox Docker image after proving that my local 2080 Ti wasn't going to cut it for training. I have only had the opportunity to test it on vast.ai, with a 1070 and then 2x V100s, but both sampling and training seem to be working. You can spin up a vast instance for less than a dollar an hour and start messing around with it using btrude/jukebox-docker:latest as your image. IMO this is the easiest way to get going with this project from a hobbyist perspective. There are some minor tweaks to be made to the image, but overall it works straight out of the box on vast, so if anyone tries it, please let me know if it is working for you (especially outside of vast).

@diffractometer

@btrude thanks for the update. I'm not sure where to actually find that Docker image from your description; can you share, please? Cheers.

@btrude

btrude commented May 12, 2020

> @btrude thanks for the update. I'm not sure where to actually find that Docker image from your description; can you share, please? Cheers.

https://hub.docker.com/r/btrude/jukebox-docker I also added btrude/jukebox-docker:apex for faster training.

@perlman-izzy

Hi, noob programmer here -- can I run this on a vast.ai server? How?

@btrude

btrude commented May 23, 2020

> Hi, noob programmer here -- can I run this on a vast.ai server? How?

Buy credits -> Create -> Edit Image & Config -> scroll to the last option and click the right-hand Select, then type the name of either of my images into the prompt -> allocate ~30 GB of disk space (at least 15+ GB of that is for the models, which get downloaded each time) and click the bottom-most Select button.

After that you're on your own 🥼

@perlman-izzy

perlman-izzy commented May 25, 2020 via email

@perlman-izzy

perlman-izzy commented May 25, 2020 via email

@btrude

btrude commented May 25, 2020

> I should say, I installed the libraries on my Mac. I haven't installed anything on the Vast.ai server yet since I don't know how or even if I'm supposed to. Just clarifying. Thank you! Best wishes

When using vast.ai, or just Docker on its own, you are using virtual machines that have little or no connection to your local machine. So in this case you don't need to install drivers or any software other than ssh in order to connect to a vast instance and run the code (meaning you can safely remove NVIDIA-related software from your Mac, and you already have ssh, as it is built into macOS). Also, the entire point of Docker images is that you should not have to install anything and can just begin using them immediately after they are loaded (unless you have some specific need, like I outline below). Picking up from my instructions above, you should do the following:

Follow this guide https://help.github.com/en/github/authenticating-to-github/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent through to this page https://help.github.com/en/github/authenticating-to-github/adding-a-new-ssh-key-to-your-github-account but instead of putting the ssh key into your github account put it into the ssh key box on this page: https://vast.ai/console/account/

Create a vast instance (you need 16 GB of VRAM for anything more than 1b_lyrics with n_samples ~9, so go for a P100 or V100 if that's what you care about; otherwise a 2080 Ti/1080 Ti will be cheapest. And remember at least 30 GB of disk space, or you will get errors and nothing will work), then go to https://vast.ai/console/instances/ and wait for it to spin up. Sometimes you have to click the start button quite a few times before it will actually begin (maybe this isn't necessary and it will just do it on its own?). My images should be cached, so if you don't see the blue button transition to "Connect" within a few minutes, it's most likely broken; destroy the instance and start over until you are given the "Connect" option after it says the image has successfully loaded. (I bring this up because sometimes instances fail to load and it's not obvious through the UI; this will probably save someone time talking to customer support or wasting credits.)

Click "Connect" and a modal will pop up with an ssh command. Copy that command into your terminal, type "yes" when prompted, and then cd /opt/jukebox/, as vast does not take you to the Docker image's workdir for whatever reason. You should now be connected to the vast instance inside a tmux session. tmux allows your processes to keep running even after you have disconnected from the instance, which is potentially important depending on how long you intend to use it. See https://tmuxcheatsheet.com/ for the important tmux commands, or just press Ctrl+b, then d to detach from the session, and type exit to close ssh when you are done. Generally, when I detach from an instance, I just run nvidia-smi in a separate terminal window (you can connect to the instance in multiple windows by reusing the original ssh command as many times as you need) to determine whether the process has finished, i.e. when the GPU utilization has gone to zero; if you need to reattach, follow the instructions in the cheat sheet.

In order to pass in your own dataset, prompt, or original code, or to recover any samples you made, you will have to use scp (which should also be built into macOS). Take the ssh command provided to you by vast, e.g. ssh -p 16090 [email protected] -L 8080:localhost:8080, and pass the relevant info to scp like:

scp -P 16090 [email protected]:/opt/jukebox/path/to/file.wav ~/path/on/my/local/mac

So if you wanted to transfer a file from the default example in this repo's readme to your desktop it would look like this:

scp -P 16090 [email protected]:/opt/jukebox/sample_5b/level_0/item_0.wav ~/Desktop

depending on which specific file, or just:

scp -r -P 16090 [email protected]:/opt/jukebox/sample_5b/ ~/Desktop

if you want to transfer an entire directory. You can also go in the opposite direction if you need to send things to the instance, like:

scp -r -P 16090 ~/Desktop/my_audio_dataset/ [email protected]:/opt/jukebox/

Anyone just messing with sampling should note that the metadata in sample.py is hard-coded, so you may want to install nano (apt-get install nano), run nano jukebox/sample.py (nano is a command-line text editor), arrow down to line 188, and change the defaults to whatever you want (see here for the options: https://github.com/openai/jukebox/tree/master/jukebox/data/ids; v3 = 1b, v2 = 5b). Ctrl+x, then y, to save and exit nano.
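
For reference, the hard-coded block is a list of dicts along these lines. The field names follow the repo, but the artist, genre, and lyrics values below are purely illustrative placeholders and must be swapped for entries that exist in the ID files linked above:

```python
# Illustrative shape of the hard-coded sampling metadata in jukebox/sample.py.
# All values here are placeholders -- pick an artist/genre that appear in
# jukebox/data/ids/ (v3_* files for the 1b model, v2_* files for 5b).
sr = 44100
total_sample_length_in_seconds = 180

metas = [dict(
    artist="Rick Astley",   # placeholder; must match a line in the artist ID file
    genre="Pop",            # placeholder; must match a line in the genre ID file
    total_length=total_sample_length_in_seconds * sr,  # length in samples
    offset=0,
    lyrics="Never gonna give you up...",  # placeholder lyrics
)]

print(metas[0]["total_length"])  # 7938000 samples at 44.1 kHz
```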

@LeapGamer

btrude, first of all, you are amazing. I got everything working through vast.ai and your Docker image for the 1B model from your instructions above. When I get a P100 instance going and try to use the 5B model, it says Killed while trying to load the priors. I read that the notebook handles this differently than sample.py, and that you can solve the problem either by increasing swap or by editing the code in sample.py to match the notebook's handling of prior loading. Did you run into this problem with the 5B model on vast.ai? I tried changing the size of my swap, but it seems the recommended way of doing that is through the Docker config; vast.ai only has 1 GB of swap by default, and it doesn't seem like you can change it once connected.

@btrude

btrude commented Jun 24, 2020

> btrude, first of all, you are amazing. I got everything working through vast.ai and your Docker image for the 1B model from your instructions above. When I get a P100 instance going and try to use the 5B model, it says Killed while trying to load the priors. I read that the notebook handles this differently than sample.py, and that you can solve the problem either by increasing swap or by editing the code in sample.py to match the notebook's handling of prior loading. Did you run into this problem with the 5B model on vast.ai? I tried changing the size of my swap, but it seems the recommended way of doing that is through the Docker config; vast.ai only has 1 GB of swap by default, and it doesn't seem like you can change it once connected.

After it says "Killed", what does echo $? say? If it's 137, then yeah, you're out of memory and need to pick an instance with more memory. I don't think I've ever had OOM problems, though; the only time I ever saw "Killed" was when I didn't allocate enough disk space.
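
For reference, exit status 137 is 128 + 9, i.e. the process was terminated with SIGKILL (signal 9), which is what the kernel's OOM killer sends. A quick way to see the convention, with sleep standing in for sample.py:

```shell
# Kill a background process with SIGKILL and inspect the exit status.
sleep 30 &
pid=$!
kill -9 "$pid"      # SIGKILL, as the kernel OOM killer would send
wait "$pid"
status=$?
echo "$status"      # prints 137 = 128 + 9 (killed by signal 9)
```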

@LeapGamer

Yes, it was a problem with too little memory. I was able to get it all working by finding an instance with enough memory. Cheers!

@cicinwad

> Wowzers... this seems a little more complicated than I thought. I have the hardware and RAM, and even the choice of iOS vs Mac... but I don't code and I usually don't pirate, so I'm "overencumbered" by this foggy paranoia about this entire thing.

I use an iOS, I know.
