Error when running sampling example #9
Given the pain of getting the correct version of NVIDIA drivers on a system all lined up, I wonder if a Docker image of this repo would help (or alternatively, run this project inside the TF Docker image): https://www.tensorflow.org/install/docker. Setting Docker up for GPU access requires some extra steps (see step 2 in the link above), but was pretty straightforward.
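A minimal sketch of the Docker approach described above, assuming the NVIDIA Container Toolkit is already installed (step 2 of the linked guide); the image tag is illustrative:

```shell
# Sanity-check that the container can see the GPU at all
docker run --gpus all --rm tensorflow/tensorflow:latest-gpu nvidia-smi

# Then drop into a shell inside the container with this repo mounted
docker run --gpus all --rm -it \
  -v "$PWD":/workspace -w /workspace \
  tensorflow/tensorflow:latest-gpu bash
```

If `nvidia-smi` fails inside the container, the GPU runtime setup is the problem, not this repo.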
I'm having the same issue. I agree a Docker image would be great. Does anyone know if NCCL is a dependency?
@carchrae I think running a Docker image on an EC2 instance, if you don't have CUDA on a Mac, is the way to go, right?
@diffractometer - Sounds right. If you do have a CUDA-supported NVIDIA card (but not the CUDA libs installed), you could probably still run the Docker + GPU extensions locally. Otherwise, if you deploy one of these AMIs, it sounds like you get a GPU-enabled Docker pre-installed: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html
@carchrae Awesome, I'll check that out. My other friend said I should just concentrate on getting it running locally in a Jupyter notebook.
I suspect you'd hit the same error in Jupyter, as it seems the code requires CUDA/NCCL: https://github.com/openai/jukebox/blob/master/jukebox/utils/dist_utils.py#L42. I guess this is really a documentation bug (a common one): the code requires an NVIDIA GPU. Looking at the other issues getting reported, I think you also need a GPU with a lot of RAM. I have yet to try it on my card (it has only 6 GB of RAM).
Ah, dang. That makes sense, especially given the results. I'll keep poking... |
Hmm, or maybe not? There is a CPU-only flag (I bet it'll be damn slow though!): jukebox/jukebox/utils/dist_utils.py, line 77 in b4dc344
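Before fighting the NCCL error, it's worth checking whether PyTorch can see a CUDA device at all; a quick sketch, assuming only that `torch` is installed:

```python
import torch

# If this prints False, jukebox's default NCCL/distributed init will fail,
# and you'd need the CPU-only path (or a different backend like gloo).
print(torch.cuda.is_available())
print(torch.cuda.device_count())
```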
Ah, yeah, I saw that line and tried to change it, but every time I ran it the error still bubbled up.
Did you get any more useful error from this output?
Also, love the comment on the next line.
So, I went through the install steps, and the sample seems to work for me (it is still running/downloading stuff). My system: Ubuntu 18.04, CUDA 10.2.89, GTX 1060 with 6 GB.
The project does require a GPU to run; it could work on CPU, but that hasn't been tested and will almost surely be very slow. @maraoz The NCCL error you see is in initialising torch.distributed, which technically isn't needed for sampling but is unfortunately still present in the code. Maybe initialise it with a different backend, e.g. setup_dist_from_mpi(backend="gloo"), or remove distributed/MPI altogether as done here: #36 (comment)
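The repo's `setup_dist_from_mpi` helper does the actual wiring, but as a rough standalone sketch of what switching to the gloo backend means, assuming PyTorch is installed and a single process (address/port are illustrative):

```python
import os
import torch.distributed as dist

# Single-process init with the CPU-friendly "gloo" backend instead of NCCL.
# jukebox's setup_dist_from_mpi derives the equivalent settings from MPI
# environment variables.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)
print(dist.get_backend())  # gloo
dist.destroy_process_group()
```

Unlike NCCL, gloo does not require a GPU, which is why it sidesteps the error even though the distributed init is unnecessary for sampling.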
@prafullasd @carchrae Thank you for your input. It looks like I need to spend a couple of days working on my tooling before I can attempt a build, so I'm going to familiarize myself with notebooks. If there's any way I can help with a Docker build in the meantime, testing at least haha ;) lmk
Wowzers... this seems a little more complicated than I thought. I have the hardware and RAM and even the choice of iOS vs Mac... but I don't code and I usually don't pirate, so I'm "overencumbered" by this foggy paranoia about this entire thing.
@prafullasd How do you initialize with the gloo backend? Is that option passed to sample.py?
@Jimmiexjames I'm having good luck using the Colab notebook, at least just getting it running. I ended up using the paid plan to stop timeouts.
I made a jukebox Docker image after proving that my local 2080 Ti wasn't going to cut it for training. I have only had the opportunity to test it on vast.ai with a 1070 and then 2x V100s, but both sampling and training seem to be working. You can spin up a vast instance for less than a dollar an hour and start messing around with it using
@btrude Thanks for the update. I'm not sure where to actually find that Docker image from your description; can you share please? Cheers.
https://hub.docker.com/r/btrude/jukebox-docker I also added
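Pulling and running a Docker Hub image like the one linked above generally looks like this; the entrypoint and flags are assumptions, so check the image's page for the exact invocation:

```shell
# Fetch the community image named in the comment above
docker pull btrude/jukebox-docker

# Start an interactive shell with GPU access (requires the NVIDIA
# Container Toolkit on the host)
docker run --gpus all --rm -it btrude/jukebox-docker bash
```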
Hi, noob programmer here -- can I run this on a vast.ai server? How? |
Buy credits -> Create -> Edit Image & Config -> Scroll to the last option and click the right-hand Select, then type the name of either of my images into the prompt -> Allocate ~30 GB of disk space (at least 15+ of that is for the models, which get downloaded each time) and click the bottom-most Select button. After that you're on your own 🥼
Thank you so much for your help! I was able to get that far on my own, but now I'm pretty stuck. Do I enter the code in Terminal on my Mac, with some code which points the computing to Vast.ai (I installed all the libraries except CUDA, since I don't have a GPU, and I get an NCCL error)? Or do I enter some code on the Vast.ai server when prompted? Or is it a combination of the two?
I am not necessarily expecting a stranger to just help me out of the blue, so I understand if you don't reply. Even if you just want to tell me what to google, that would be helpful. I'm pretty lost here! Sorry and thank you.
Best,
BW
I should say, I installed the libraries on my Mac. I haven't installed anything on the Vast.ai server yet, since I don't know how or even if I'm supposed to. Just clarifying. Thank you! Best wishes
When using vast.ai, or just Docker on its own, you are using virtual machines that have little or no connection to your local machine. So in this case you don't need to install drivers or any software other than
Follow this guide https://help.github.com/en/github/authenticating-to-github/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent through to this page https://help.github.com/en/github/authenticating-to-github/adding-a-new-ssh-key-to-your-github-account but instead of putting the SSH key into your GitHub account, put it into the SSH key box on this page: https://vast.ai/console/account/
Create a vast instance (you need 16 GB of VRAM for anything more than 1b_lyrics with n_samples ~9, so go for a P100 or V100 if that's what you care about; otherwise a 2080 Ti/1080 Ti will be cheapest, and remember at least 30 GB of disk space or you will get errors and nothing will work), go to https://vast.ai/console/instances/, and wait for it to spin up. Sometimes you have to click the start button quite a few times before it will actually begin (maybe this isn't necessary and it will just do it on its own?). My images should be cached, so if you don't see the blue button transition to "Connect" within a few minutes, then it's most likely broken, and you should destroy and start over until you are given the "Connect" option after it says the image has successfully loaded. (I bring this up because sometimes the instances fail to load and it's not obvious through the UI; this will probably save someone time talking to customer support/wasting credits.)
Click "Connect" and a modal will pop up with an SSH command; copy that command into your terminal and type "yes" when prompted, and then
In order to pass your own dataset, prompt, or original code, or to recover any samples you made, you will have to use
So if you wanted to transfer a file from the default example in this repo's readme to your desktop it would look like this:
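The exact command was lost in the thread export; a typical `scp` invocation for pulling a sample off the instance would look something like this, where the host, port, and remote path are all illustrative (use the host and port from the vast.ai "Connect" modal):

```shell
# Copy a rendered sample from the remote instance to the local desktop.
# -P sets the SSH port shown in the vast.ai Connect modal.
scp -P 2222 root@ssh4.vast.ai:/root/sample_5b/level_0/item_0.wav ~/Desktop/
```

The same command with the arguments reversed uploads a local file (e.g. a prompt WAV) to the instance.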
Anyone just messing with sampling should note that the metadata in sample.py is hard-coded, so you may want to install nano
btrude, first of all, you are amazing. I got everything working through vast.ai and your Docker image for the 1B model from your instructions above. When I get a P100 instance going and try to use the 5B model, it says Killed while trying to load the priors. I read that the notebook handles this differently than sample.py, and that you can solve the problem either by increasing swap or by editing the code in sample.py to match the notebook regarding how it loads the priors. Did you run into this problem with the 5B model on vast.ai? I tried changing the size of my swap, but it seems the recommended way of doing that is through the Docker config. Vast.ai only has 1 GB of swap by default, and it doesn't seem like you can change it once connected.
After it says "Killed" what does
Yes, it was a problem with too little memory. I was able to get it all working by finding an instance with enough memory. Cheers!
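A bare "Killed" like the one above is typically the kernel's OOM killer terminating the process. A quick way to confirm on the instance, assuming a Linux shell (reading the kernel log may require root):

```shell
# Check available RAM and swap before loading the 5B priors
free -h

# Look for OOM-killer messages in the kernel log; || true keeps the
# pipeline from failing when there are no matches or no permission
dmesg 2>/dev/null | grep -iE "killed process|out of memory" || true
```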
I use an iOS, I know.
I followed the Install instructions and then ran the sampling command and got:
I tried googling around about this NCCL error (I have no idea what NCCL is), but couldn't find any solutions. Any idea on how to fix this? Thanks!