Generating Belief Maps using train2/train.py #242

Open
BazUCD opened this issue Apr 6, 2022 · 4 comments

@BazUCD

BazUCD commented Apr 6, 2022

Hi, I am attempting to run the training script train2/train.py and generate the belief maps in order to debug, but I am getting this error:

start: 18:18:30.781464
load data: ['/home/user/Downloads/Spanner2']
load data:
training data: 2000 batches
load models
ready to train!
Traceback (most recent call last):
File "train.py", line 606, in
_runnetwork(epoch,trainingdata)
File "train.py", line 422, in _runnetwork
for batch_idx, targets in enumerate(train_loader):
File "/home/user/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in next
data = self._next_data()
File "/home/user/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/home/user/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/home/user/.local/lib/python3.6/site-packages/torch/_utils.py", line 434, in reraise
raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/user/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/user/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/user/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/user/catkin_pcl_new/src/dope/scripts/train2/utils_dope.py", line 321, in getitem
save=False,
File "/home/user/catkin_pcl_new/src/dope/scripts/train2/utils_dope.py", line 593, in CreateBeliefMap
p = [point[numb_point][1],point[numb_point][0]]
IndexError: list index out of range

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 17249) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2022-04-06_18:18:39
host : user-User
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 17249)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

I am unsure what is causing this error, as I have the correct version of PyTorch installed based on requirements.txt. Are there any common mistakes I could be making?

@TontonTremblay
Collaborator

Could you share an example of a JSON file you are using in your dataset? It looks like, in

p = [point[numb_point][1],point[numb_point][0]]

point is empty or its dimensions are wrong. @mintar refactored the data format a little bit, and I did not check whether it is compatible with train2/train.py. I will try to check soon.

@BazUCD Did you try to use the original training script?
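
One way to check whether point is empty or has the wrong dimensions is to count the keypoints in each annotation file before training. The following is only a minimal sketch, not code from the DOPE repository: the helper names keypoint_report and scan_dataset are hypothetical, and it assumes each annotation JSON has a top-level "objects" list whose entries carry a "projected_cuboid" list of [x, y] pairs.

```python
import json
from pathlib import Path


def keypoint_report(annotation):
    """Return (object_index, point_count, all_points_are_2d) for each object.

    Assumption: objects live under a top-level "objects" key and keypoints
    under "projected_cuboid"; adjust the keys if your files differ.
    """
    if not isinstance(annotation, dict):
        return []
    report = []
    for i, obj in enumerate(annotation.get("objects", [])):
        points = obj.get("projected_cuboid", [])
        all_2d = all(len(p) == 2 for p in points)
        report.append((i, len(points), all_2d))
    return report


def scan_dataset(folder):
    """Print the keypoint count of every object in every .json file under folder."""
    for path in sorted(Path(folder).rglob("*.json")):
        with open(path) as f:
            annotation = json.load(f)
        for idx, count, ok in keypoint_report(annotation):
            print(f"{path.name}: object {idx}: {count} points, 2D pairs: {ok}")
```

Calling scan_dataset on the dataset folder (for example the one passed to train.py) lists the point count per object, which makes an empty or short "projected_cuboid" list easy to spot.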

@BazUCD
Author

BazUCD commented Apr 7, 2022

Hi @TontonTremblay, thanks for the quick reply. Here's an example of one of my .json files, with the associated PNG as well:
000000 span1
image

I've used the original training script and generated some weights, but was unable to detect anything. Following your recommendation in #238, I have been trying to generate the belief maps using train2.

@TontonTremblay
Collaborator

This looks correct, but your object has a symmetry in it. You should look into this section from Martin: https://github.com/NVlabs/Deep_Object_Pose/tree/master/scripts/nvisii_data_gen#handling-objects-with-symmetries

@andrewyguo
Contributor

I encountered a similar issue. The training script expects the "projected_cuboid" field to contain 9 points, the last being the point under "projected_cuboid_centroid".

In your case, you can add something like projected_cuboid_keypoints.append(obj['projected_cuboid_centroid']) right below line 228 in utils_dope.py. I did this and it worked for me.
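
The workaround above can be sketched as a standalone helper. This is an illustration under stated assumptions, not the actual code in utils_dope.py: the function name keypoints_with_centroid and the expected parameter are hypothetical, and only the two field names come from this thread.

```python
def keypoints_with_centroid(obj, expected=9):
    """Return the cuboid keypoints with the centroid appended as the last point.

    Assumption: obj is one object entry from an annotation file with the
    fields "projected_cuboid" and "projected_cuboid_centroid".
    """
    keypoints = [list(p) for p in obj["projected_cuboid"]]
    # Older-style files store only the 8 corners; append the centroid as point 9.
    if len(keypoints) == expected - 1:
        keypoints.append(list(obj["projected_cuboid_centroid"]))
    if len(keypoints) != expected:
        raise ValueError(
            "expected {} keypoints, got {}".format(expected, len(keypoints))
        )
    return keypoints
```

Raising on any other count surfaces a malformed annotation at load time instead of as an IndexError inside a DataLoader worker.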
