Multi-GPU training on same machine is getting stuck #378
Asked by nighting0le01 in Q&A · Answered by yoshitomo-matsubara
Answered by yoshitomo-matsubara on Jul 26, 2023
Replies: 1 comment · 24 replies
It seems you're using your own scripts. Please do not use screenshots to show logs or files; paste them as text instead, for a better search experience so that others can find this discussion when they face similar issues. Also, please complete the other discussions you opened first and do not leave them unattended. Respect my time for OSS as well.
Based on your results with train.py in torchvision, I think the problem is caused by your (Docker) environment, and I do not have the right answer for this.
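As a sanity check that separates torchdistill from the environment, one common approach (a sketch, not taken verbatim from this thread) is to launch torchvision's reference classification script with torchrun; the script path, GPU count, and dataset flags below are assumptions:

```shell
# Sketch: run torchvision's reference training script on 2 GPUs of one machine.
# The repository path and dataset location are assumptions for illustration.
torchrun --nproc_per_node=2 references/classification/train.py \
    --model resnet18 --data-path /path/to/imagenet
```

If this also hangs, the issue is likely in the environment (e.g. NCCL inside the container) rather than in any particular training framework.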
torchdistill no longer supports amp directly; it supports Hugging Face Accelerate instead. See #247
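With Accelerate, mixed precision is enabled through the launcher rather than through torch.cuda.amp in the training script; a minimal sketch, assuming two local GPUs and a script named train.py:

```shell
# Sketch: launch a training script with fp16 mixed precision via Accelerate.
# --num_processes should match the number of local GPUs (assumed here: 2).
accelerate launch --multi_gpu --num_processes 2 --mixed_precision fp16 train.py
```

Alternatively, `accelerate config` can record the same choices (distributed type, process count, mixed precision) in a config file once, so later runs only need `accelerate launch train.py`.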