GKE Megatron demo workload #378

MrGeislinger · 2024-06-21T21:11:44Z

Adds the files (scripts, configs, etc.) used by the example guide for running Megatron-LM on GKE (A3 Mega), as described in the README in this directory.

samcmho

Please revise. Also, node0 is not shutting down properly after other nodes have finished. node0's log is stuck running nvidia-smi repeatedly.

sample_workloads/megatron-gke/README.md

MrGeislinger · 2024-06-25T17:37:35Z

> Please revise. Also, node0 is not shutting down properly after other nodes have finished. node0's log is stuck running nvidia-smi repeatedly.

I suspect this comes from a `wait` command that was intended to wait for TensorBoard to finish. The default implementation given in the example guide doesn't launch a TensorBoard server: `embeddedTensorboardTarget` is set to null (see values.yaml), and the launch is conditional on `EMBEDDED_TENSORBOARD_TARGET` (set on line 270 of megatron-example.yaml, checked on line 439). With no TensorBoard process to wait for, the `wait` might instead be waiting on the background process started on line 435 of megatron-example.yaml.

A simple fix might be to remove the `wait`, or even to launch a TensorBoard server by default. However, I'd prefer not to hide this bug for the case where a user sets `embeddedTensorboardTarget` to null.
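To illustrate the suspected cause, here is a minimal reproduction sketch. It is not the code from megatron-example.yaml: the infinite loop is a hypothetical stand-in for the looping nvidia-smi monitor, and GNU coreutils `timeout` is assumed to be available so the demo terminates instead of hanging.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction; not the code from megatron-example.yaml.
# GNU `timeout` caps the demo at 2 seconds so it terminates.
status=0
timeout 2 bash -c '
  # Stand-in for the looping nvidia-smi monitor: never exits on its own.
  while true; do sleep 1; done &
  # No TensorBoard was started, so the bare wait blocks on the monitor loop.
  wait
' || status=$?

# GNU timeout reports 124 when the time limit is hit.
echo "exit status: ${status}"
```

A bare `wait` with no arguments blocks until every background child of the shell has exited, so a single never-ending background loop is enough to keep the script (and the pod) alive.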

MrGeislinger

Commits b0c0342 and 77a523f simplify how users set up topology-aware scheduling. Waiting for the guide documentation review before merging.

sample_workloads/megatron-gke/README.md

- Fixes an issue where, if no TensorBoard instance is launched, node0 would wait on the looping nvidia-smi command and thus never complete. Removing the `wait` command would let the pod complete, but it would likely do so before the background nsys profile commands finish.
- The fix now explicitly waits for the TensorBoard instance only if it exists, then waits for all nsys profile jobs to complete.
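The waiting strategy described above can be sketched roughly as follows. This is a hedged illustration, not the code from commit d819fee: the variable names (`monitor_pid`, `nsys_pids`, `tensorboard_pid`) are hypothetical, `sleep` stands in for the real nsys and TensorBoard processes, and an empty `EMBEDDED_TENSORBOARD_TARGET` is assumed to correspond to `embeddedTensorboardTarget` being null. Stopping the monitor explicitly at the end is an addition of this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; commands are stand-ins, not the code from megatron-example.yaml.
set -u

# Empty here mimics embeddedTensorboardTarget being null (assumed mapping).
EMBEDDED_TENSORBOARD_TARGET="${EMBEDDED_TENSORBOARD_TARGET:-}"

# Stand-in for the looping nvidia-smi monitor; it never exits on its own.
while true; do sleep 1; done &
monitor_pid=$!

# Stand-ins for the background nsys profile jobs; record each PID.
nsys_pids=()
for _ in 1 2; do
  sleep 1 &
  nsys_pids+=("$!")
done

# Launch the TensorBoard stand-in only when a target is configured.
tensorboard_pid=""
if [[ -n "${EMBEDDED_TENSORBOARD_TARGET}" ]]; then
  sleep 1 &
  tensorboard_pid=$!
fi

# Wait for TensorBoard only if it was launched.
if [[ -n "${tensorboard_pid}" ]]; then
  wait "${tensorboard_pid}"
fi

# Wait for the profile jobs by PID, so the monitor loop cannot block the exit.
wait "${nsys_pids[@]}"

# Stop the monitor explicitly (an addition in this sketch).
kill "${monitor_pid}"
wait "${monitor_pid}" 2>/dev/null || true
echo "node0 finished"
```

Passing explicit PIDs to `wait` is the key design choice: the script blocks only on the processes whose completion matters, regardless of what else is running in the background.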

MrGeislinger

Commit d819fee makes sure node0 shuts down properly (waits for TensorBoard if it exists, then waits for just nsys profile jobs)

samcmho

LGTM

GKE Megatron demo workload

b332b44

MrGeislinger requested a review from samcmho June 24, 2024 16:55

samcmho reviewed Jun 25, 2024

MrGeislinger added 2 commits June 25, 2024 13:33

Explicitly use repo files' URLs for Topology-Aware Scheduler

b0c0342

Fix configmap; download files to local dir first

77a523f

MrGeislinger commented Jun 28, 2024

MrGeislinger commented Jun 28, 2024

Uncomment section related to kueue

13e9541

samcmho approved these changes Jul 4, 2024

Clearer instruction in values.yaml to change workload image

5c1704c

samcmho approved these changes Jul 9, 2024

MrGeislinger added 5 commits July 9, 2024 15:10

Update image for network-rx-daemon

8870cfd

Update image for NCCL plugin (tcpxo)

a19eae1

Update README - minor changes from review

d7009bc

Update README - minor changes from review

467b33e

Update README - minor changes from review

66ecaa6

MrGeislinger merged commit 26ad83a into main Jul 12, 2024
2 checks passed

MrGeislinger deleted the victorsvector/megatron-gke branch July 12, 2024 18:27
