trouble in "--policy feudal" #6

Open
huoliangyu opened this issue Apr 19, 2018 · 1 comment

Comments

@huoliangyu
huoliangyu commented Apr 19, 2018

Hi, I would like to use your project, but I ran into trouble with the "--policy feudal" setting. I can run python train.py directly and it works normally with the default "--policy lstm", but when I add this parameter and run python train.py --policy feudal, I get the following output:

[2018-04-19 22:01:28,989] Events directory: /tmp/pong/train_0
[2018-04-19 22:01:29,342] Starting session. If this hangs, we're mostly likely waiting to connect to the parameter server. One common cause is that the parameter server DNS name isn't resolving yet, or is misspecified.
2018-04-19 22:01:29.431565: I tensorflow/core/distributed_runtime/master_session.cc:998] Start master session 0f5becf7698cbfb7 with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:0/cpu:0" inter_op_parallelism_threads: 2
Traceback (most recent call last):
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
    status, run_metadata)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: Key global/FeUdal/worker/rnn/basic_lstm_cell/bias/Adam_1 not found in checkpoint
  [[Node: save/RestoreV2_55 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_55/tensor_names, save/RestoreV2_55/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "worker.py", line 174, in <module>
    tf.app.run()
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "worker.py", line 166, in main
    run(args, server)
  File "worker.py", line 94, in run
    with sv.managed_session(server.target, config=config) as sess, sess.as_default():
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/six.py", line 686, in reraise
    raise value
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
    start_standard_services=start_standard_services)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 273, in prepare_session
    config=config)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 205, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1560, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key global/FeUdal/worker/rnn/basic_lstm_cell/bias/Adam_1 not found in checkpoint
  [[Node: save/RestoreV2_55 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_55/tensor_names, save/RestoreV2_55/shape_and_slices)]]

Caused by op 'save/RestoreV2_55', defined at:
  File "worker.py", line 174, in <module>
    tf.app.run()
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "worker.py", line 166, in main
    run(args, server)
  File "worker.py", line 50, in run
    saver = FastSaver(variables_to_save)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1140, in __init__
    self.build()
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1172, in build
    filename=self._filename)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 688, in build
    restore_sequentially, reshape)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 663, in restore_v2
    dtypes=dtypes, name=name)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): Key global/FeUdal/worker/rnn/basic_lstm_cell/bias/Adam_1 not found in checkpoint
  [[Node: save/RestoreV2_55 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_55/tensor_names, save/RestoreV2_55/shape_and_slices)]]
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Tensor'>):
<tf.Tensor 'report_uninitialized_variables/boolean_mask/Gather:0' shape=(?,) dtype=string>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
['File "worker.py", line 174, in <module>\n tf.app.run()', 'File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run\n _sys.exit(main(_sys.argv[:1] + flags_passthrough))', 'File "worker.py", line 166, in main\n run(args, server)', 'File "worker.py", line 77, in run\n ready_op=tf.report_uninitialized_variables(variables_to_save),', 'File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py", line 175, in wrapped\n return _add_should_use_warning(fn(*args, **kwargs))', 'File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py", line 144, in _add_should_use_warning\n wrapped = TFShouldUseWarningWrapper(x)', 'File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py", line 101, in __init__\n stack = [s.strip() for s in traceback.format_stack()]']
==================================

Could you please tell me what the problem is? Thanks a lot.
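
A minimal diagnostic sketch, not from the original report, assuming TensorFlow 1.x and the /tmp/pong/train_0 events directory shown in the log above: listing the variable keys actually stored in the checkpoint shows whether it was written by an earlier run with the default "--policy lstm", which would explain why a FeUdal key such as global/FeUdal/worker/rnn/basic_lstm_cell/bias/Adam_1 is not found during restore.

import tensorflow as tf

# Hypothetical check, not part of this thread: inspect the checkpoint that the
# Supervisor is trying to restore from. If no key starts with "global/FeUdal/",
# the checkpoint belongs to a different (LSTM-policy) graph.
ckpt_path = tf.train.latest_checkpoint("/tmp/pong/train_0")  # events directory from the log
reader = tf.train.NewCheckpointReader(ckpt_path)
for name in sorted(reader.get_variable_to_shape_map()):
    print(name)

If that turns out to be the case, starting the feudal run from a fresh or cleared events directory should avoid the restore mismatch.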

@lucasliunju

I use "python train.py -p feudal", and the code is running.
