Graph4Rec mpirun mode: multi-machine CPU distributed training fails to start #518

Open
xbinglzh opened this issue Dec 16, 2022 · 5 comments
xbinglzh commented Dec 16, 2022

How do I launch Graph4Rec on multi-machine CPU with mpirun?

xbinglzh changed the title *Graph4Rec multi-machine CPU distributed launch fails* Dec 16, 2022
xbinglzh (Author)

Starting it manually works fine.

xbinglzh (Author)

1. The IP configuration is as follows:
192.168.12.217:8813
192.168.12.218:8814
192.168.12.219:8815
2. Start the graph engine servers:
/opt/python38paddle/bin/python3 -m pgl.distributed.launch --ip_config ./toy_data/ip_list.txt --conf ./user_configs/metapath2vec.yaml --shard_num 1000 --server_id 0
/opt/python38paddle/bin/python3 -m pgl.distributed.launch --ip_config ./toy_data/ip_list.txt --conf ./user_configs/metapath2vec.yaml --shard_num 1000 --server_id 1
/opt/python38paddle/bin/python3 -m pgl.distributed.launch --ip_config ./toy_data/ip_list.txt --conf ./user_configs/metapath2vec.yaml --shard_num 1000 --server_id 2
3. Submit the distributed training job:
CPU_NUM=12 /opt/python38paddle/bin/fleetrun --log_dir ../../fleet_logs \
    --workers "192.168.12.218:8170,192.168.12.219:8171" \
    --servers "192.168.12.217:8270" \
    dist_cpu_train.py --config ../../user_configs/metapath2vec.yaml \
    --ip ../../toy_data/ip_list.txt
In theory this command is run on all three machines. After the server comes up, two worker nodes are declared, yet the job starts running as soon as the single node 192.168.12.218 is launched. Shouldn't it have to wait for the other node 192.168.12.219:8171 to join?
It feels as if each worker is training on its own, and there is no way to tell whether gradients are actually being exchanged.
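As a quick sanity check that both trainers have actually joined, the worker script can print what the Fleet runtime sees at startup. A minimal sketch, assuming the standard `paddle.distributed.fleet` API and the environment variables exported by `fleetrun` (this is not code from the Graph4Rec repo):

```python
# Minimal sketch: report the trainer layout seen by the Fleet runtime.
# Assumes parameter-server mode launched via fleetrun (step 3 above).
import os
import paddle.distributed.fleet as fleet

fleet.init()  # role (worker/server) is read from the fleetrun environment
if fleet.is_worker():
    # Every worker should report the same total count (2 in this setup).
    print("trainer", fleet.worker_index(), "of", fleet.worker_num(),
          "| endpoints:", os.getenv("PADDLE_TRAINER_ENDPOINTS"))
```

If each machine reports a worker count of 1, the two launches are not seeing each other and are effectively two single-machine jobs.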

xbinglzh (Author)

In distributed mode, shouldn't the returned loss be an array?
sec/batch: 0.149264 | step: 100 | train_loss: 0.485856

Is this actually still running as a single machine?
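In parameter-server mode each worker normally logs its own scalar loss, so a single number per line does not by itself show that the job is single-machine. One way to tell the workers' logs apart is to tag them with the trainer id; a minimal sketch, assuming only the `PADDLE_TRAINER_ID` variable that `fleetrun` exports (the log format below just mirrors the line above, it is not the repo's actual logging code):

```python
# Minimal sketch: prefix log lines with the trainer id set by fleetrun.
import os

trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
# Illustrative values; in dist_cpu_train.py these come from the training loop.
step, train_loss = 100, 0.485856
print(f"[trainer {trainer_id}] step: {step} | train_loss: {train_loss:.6f}")
```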

xbinglzh (Author)

[screenshot]

Yelrose (Collaborator) commented Dec 16, 2022

> The job starts running as soon as node 192.168.12.218 is launched. Doesn't it need to wait for the other node 192.168.12.219:8171 to join?

There does seem to be something wrong here. The code contains many barrier_worker calls, so the workers should be communicating with each other.
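For reference, `barrier_worker` blocks until every declared worker has reached it, so with two workers passed to `--workers` a loop gated by it should not be able to run with only one worker attached. A minimal sketch of that pattern, assuming the standard `paddle.distributed.fleet` API (not the Graph4Rec code itself):

```python
# Minimal sketch: barrier_worker forces all declared workers to rendezvous.
import paddle.distributed.fleet as fleet

fleet.init()
if fleet.is_worker():
    # ... build the program and run fleet.distributed_optimizer(...).minimize(...) ...
    fleet.init_worker()
    # No worker passes this point until all declared workers have reached it,
    # so a single worker cannot start training on its own.
    fleet.barrier_worker()
    # ... training loop with gradient exchange through the parameter server ...
    fleet.stop_worker()
```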

@Yelrose Yelrose self-assigned this Dec 16, 2022