[2023-12-04 01:09:51,860][calvin_env.envs.play_table_env][INFO] - Using calvin_env with commit 1431a46bd36bde5903fb6345e68b5ccc30def666.
[2023-12-04 01:09:51,861][calvin_agent.wrappers.calvin_env_wrapper][INFO] - Initialized PlayTableEnv for device cuda:0
[2023-12-04 01:09:51,876][calvin_agent.evaluation.multistep_sequences][INFO] - Start generating evaluation sequences.
[2023-12-04 01:10:07,176][calvin_agent.evaluation.multistep_sequences][INFO] - Done generating evaluation sequences.
[2023-12-04 01:10:07,180][calvin_agent.models.mcil][INFO] - Start validation epoch 0
Exception in thread IntMsgThr:
Traceback (most recent call last):
File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 300, in check_internal_messages
self._loop_check_status(
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
local_handle = request()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 766, in deliver_internal_messages
return self._deliver_internal_messages(internal_message)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 490, in _deliver_internal_messages
return self._deliver_record(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 437, in _deliver_record
handle = mailbox._deliver_record(record, interface=self)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
interface._publish(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
Exception in thread NetStatThr:
self._send_message(msg)
Traceback (most recent call last):
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self._sendall_with_error_handle(header + data)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
self.run()
BrokenPipeError: [Errno 32] Broken pipe
File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 268, in check_network_status
self._loop_check_status(
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
local_handle = request()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 758, in deliver_network_status
return self._deliver_network_status(status)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 484, in _deliver_network_status
return self._deliver_record(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 437, in _deliver_record
handle = mailbox._deliver_record(record, interface=self)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
interface._publish(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
Exception in thread ChkStopThr:
Traceback (most recent call last):
File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/root/miniconda3/envs/calvin/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 286, in check_stop_status
self._loop_check_status(
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
local_handle = request()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 750, in deliver_stop_status
return self._deliver_stop_status(status)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 468, in _deliver_stop_status
return self._deliver_record(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 437, in _deliver_record
handle = mailbox._deliver_record(record, interface=self)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
interface._publish(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]Error executing job with overrides: ['datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset', 'datamodule/datasets=vision_lang_shm']
Traceback (most recent call last):
File "training.py", line 68, in train
trainer.fit(model, datamodule=datamodule, ckpt_path=chk) # type: ignore
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
results = self._run_stage()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
self._run_train()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1190, in _run_train
self._run_sanity_check()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1262, in _run_sanity_check
val_loop.run()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
output = self._evaluation_step(**kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 390, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/fuyujie/calvin/calvin_models/calvin_agent/models/mcil.py", line 345, in validation_step
else self.language_goal(dataset_batch["lang"])
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1215, in _call_impl
hook_result = hook(self, input, result)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/wandb_torch.py", line 349, in after_forward_hook
wandb.run.summary["graph_%i" % graph_idx] = self
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_summary.py", line 52, in __setitem__
self.update({key: val})
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_summary.py", line 74, in update
self._update(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_summary.py", line 128, in _update
self._update_callback(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 371, in wrapper_fn
return func(self, *args, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1388, in _summary_update_callback
self._backend.interface.publish_summary(summary_record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 259, in publish_summary
pb_summary_record = self._make_summary(summary_record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 237, in _make_summary
json_value = self._summary_encode(item.value, path_from_root)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 210, in _summary_encode
val_to_json(self._run, path_from_root, value, namespace="summary")
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/data_types/utils.py", line 164, in val_to_json
val.bind_to_run(run, key, namespace)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/data_types.py", line 1452, in bind_to_run
super().bind_to_run(*args, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/data_types/base_types/media.py", line 134, in bind_to_run
_datatypes_callback(media_path)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/_globals.py", line 19, in _datatypes_callback
_glob_datatypes_callback(fname)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1417, in _datatypes_callback
self._backend.interface.publish_files(files)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 276, in publish_files
self._publish_files(files)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 378, in _publish_files
self._publish(rec)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
Attempted methods:
① Because I'm in China, I use Clash on my server, so I first suspected a network problem. To check, I ran the demo from the official wandb website:
import random
import wandb
wandb.login()
# Launch 5 simulated experiments
total_runs = 5
for run in range(total_runs):
    # 🐝 1️⃣ Start a new run to track this script
    wandb.init(
        # Set the project where this run will be logged
        project="basic-intro",
        # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
        name=f"experiment_{run}",
        # Track hyperparameters and run metadata
        config={
            "learning_rate": 0.02,
            "architecture": "CNN",
            "dataset": "CIFAR-100",
            "epochs": 10,
        })
    # This simple block simulates a training loop logging metrics
    epochs = 10
    offset = random.random() / 5
    for epoch in range(2, epochs):
        acc = 1 - 2 ** -epoch - random.random() / epoch - offset
        loss = 2 ** -epoch + random.random() / epoch + offset
        # 🐝 2️⃣ Log metrics from your script to W&B
        wandb.log({"acc": acc, "loss": loss})
    # Mark the run as finished
    wandb.finish()
The demo ran without problems.
② Then I tried modifying training.py.
I commented out two places related to the logger:
After that, training started successfully. However, once epoch 1 began (epoch 0 was fine), it became slower and slower, and when the progress bar reached 100% it hung there indefinitely (for at least 15 minutes), like this:
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:50001 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:50001 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
Error executing job with overrides: ['datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset', 'datamodule/datasets=vision_lang_shm', 'trainer.devices=-1']
Traceback (most recent call last):
File "training.py", line 68, in train
trainer.fit(model, datamodule=datamodule, ckpt_path=chk) # type: ignore
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
return function(*args, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1034, in _run
self.strategy.setup_environment()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 153, in setup_environment
self.setup_distributed()
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 204, in setup_distributed
_init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/lightning_lite/utilities/distributed.py", line 237, in _init_dist_connection
torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
File "/root/miniconda3/envs/calvin/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:50001 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:50001 (errno: 98 - Address already in use).
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
/root/miniconda3/envs/calvin/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 8 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Thanks so much for your attention and help!
This doesn't seem to be caused by calvin. Did you try running wandb in dryrun? Setting the environment variable WANDB_MODE="dryrun" should turn off the sync. Alternatively you can also use the tensorboard logger by adding the argument logger=tb_logger when you start a training.
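As a rough illustration of the suggestion above, the environment variable can also be set from Python instead of the shell. This is only a sketch: the helper name `use_offline_wandb` is hypothetical, and it assumes the variable is set before wandb (or the Lightning logger that wraps it) starts a run, since a value set afterwards may not be picked up.

```python
import os


def use_offline_wandb(mode: str = "dryrun") -> str:
    """Set WANDB_MODE for this process unless it was already exported.

    Hypothetical helper for illustration; "dryrun" is the value suggested
    in the reply above.
    """
    # setdefault keeps a value the user exported in the shell, if any.
    os.environ.setdefault("WANDB_MODE", mode)
    return os.environ["WANDB_MODE"]


# Call this at the very top of the entry script, before any run is created.
active_mode = use_offline_wandb()
print(f"WANDB_MODE={active_mode}")
```

Exporting the variable in the shell before launching `training.py` should have the same effect and avoids touching the code at all.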
By default, there are rollout callbacks enabled which are run during the validation, this could be a reason for why it seemed like it got stuck. Try disabling all rollout callbacks by setting the arguments ~callbacks/rollout and ~callbacks/rollout_lh. I can also recommend not using the shared memory dataloader when debugging, so also set datamodule/datasets=vision_lang.
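Put together, the overrides mentioned above might be combined into a single debug command as sketched below. The function name `build_debug_command` is hypothetical, and the data path is simply the one used in this issue; only the override strings themselves come from the reply.

```python
import shlex
from typing import List


def build_debug_command(root_data_dir: str) -> List[str]:
    """Assemble a training command with rollouts and shared memory disabled."""
    return [
        "python",
        "training.py",
        f"datamodule.root_data_dir={root_data_dir}",
        # Plain dataloader instead of the shared-memory one.
        "datamodule/datasets=vision_lang",
        # A leading "~" removes the entry from the configuration.
        "~callbacks/rollout",
        "~callbacks/rollout_lh",
    ]


cmd = build_debug_command("/fuyujie/calvin/dataset/calvin_debug_dataset")
# shlex.join quotes arguments so the line can be pasted into a shell.
print(shlex.join(cmd))
```

If validation still stalls with these overrides, the rollout callbacks are probably not the cause and the slowdown needs to be looked for elsewhere.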
This again doesn't seem to be caused by our code. Did you successfully run other PyTorch projects with distributed training using ddp?
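The `errno: 98 - Address already in use` lines in the multi-GPU log indicate that something was already bound to port 50001 when the process group was initialized, possibly a leftover process from an earlier run. A minimal way to check this independently of PyTorch is sketched below; the helper names `is_port_free` and `find_free_port` are hypothetical and use only the standard library.

```python
import socket


def is_port_free(port: int, host: str = "") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE (errno 98 on Linux): the port is taken.
            return False
        return True


def find_free_port(host: str = "") -> int:
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


print("port 50001 free:", is_port_free(50001))
print("example free port:", find_free_port())
```

If port 50001 turns out to be occupied, stopping the stale process or pointing the run at a different port should resolve the bind error. How the port is chosen in this setup is not shown in the log, so where to change it is left open here.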
Let me describe some problems I encountered and the methods I tried in order to solve them.
Environment:
command: python training.py datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset datamodule/datasets=vision_lang_shm
1. Wandb error
Error:
2. Multi-GPU error
command: python training.py datamodule.root_data_dir=/fuyujie/calvin/dataset/calvin_debug_dataset datamodule/datasets=vision_lang_shm trainer.devices=-1
Error: