GKE Megatron demo workload #378

Merged
merged 11 commits into from
Jul 12, 2024

Conversation

Collaborator

@MrGeislinger MrGeislinger commented Jun 21, 2024

Adds the files (scripts, configs, etc.) used by the example guide on running Megatron-LM on GKE (on A3-Mega), as described in the README in this directory.

Collaborator

@samcmho samcmho left a comment

Please revise. Also, node0 is not shutting down properly after other nodes have finished. node0's log is stuck running nvidia-smi repeatedly.

sample_workloads/megatron-gke/README.md (resolved)
sample_workloads/megatron-gke/README.md (resolved)
sample_workloads/megatron-gke/README.md (outdated, resolved)
@MrGeislinger
Collaborator Author

> Please revise. Also, node0 is not shutting down properly after other nodes have finished. node0's log is stuck running nvidia-smi repeatedly.

I suspect this comes from a `wait` command that was intended to wait for TensorBoard to finish. The default implementation (as given in the example guide) doesn't launch a TensorBoard server: embeddedTensorboardTarget is set to null (see values.yaml), and the launch depends on EMBEDDED_TENSORBOARD_TARGET (line 270 of megatron-example.yaml), as seen in megatron-example.yaml Line 439. With no TensorBoard process to wait on, the `wait` might instead be waiting on the background process (as seen in megatron-example.yaml Line 435).

A simple fix might be to remove the `wait`, or even to launch a TensorBoard server by default. However, I'd prefer not to hide this bug, since it would resurface whenever a user sets embeddedTensorboardTarget to null.
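To illustrate the suspected cause, here is a minimal sketch of how a background loop keeps a bare `wait` from returning. The loop and the `sleep` job are hypothetical stand-ins for the looping nvidia-smi command and the training job; they are not taken from megatron-example.yaml.

```shell
# Hypothetical stand-in for the looping nvidia-smi monitor: never exits on its own.
( while true; do sleep 1; done ) &
MONITOR_PID=$!

# Hypothetical stand-in for the training job, which does finish.
sleep 1 &
TRAIN_PID=$!

# Waiting on one specific PID returns as soon as that job exits.
wait "$TRAIN_PID"

# The monitor is still alive here, so a bare `wait` (no arguments) at this
# point would block forever: it waits for every background child.
if kill -0 "$MONITOR_PID" 2>/dev/null; then
  MONITOR_STATE="running"
else
  MONITOR_STATE="stopped"
fi
echo "monitor is $MONITOR_STATE"

# Clean up so the sketch itself terminates.
kill "$MONITOR_PID"
wait "$MONITOR_PID" 2>/dev/null || true
```

Under these assumptions the script reports that the monitor is still running after the job has finished, which matches the symptom of node0 never completing.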

Collaborator Author

@MrGeislinger MrGeislinger left a comment

Commits b0c0342 & 77a523f simplify the use of topology-aware scheduling for the user. Waiting for the guide documentation review before merging.

sample_workloads/megatron-gke/README.md (resolved)
sample_workloads/megatron-gke/README.md (outdated, resolved)
- Fixes an issue where, if no TensorBoard instance is launched, node0 would
  wait on the looping nvidia-smi command and thus never complete. Removing
  the `wait` command would let the pod complete, but likely before the
  background nsys profile commands have finished.
- The fix now explicitly waits for the TensorBoard instance only if it
  exists. It then waits for all nsys profile jobs to complete.
Collaborator Author

@MrGeislinger MrGeislinger left a comment

Commit d819fee makes sure node0 shuts down properly (it waits for TensorBoard if it exists, then waits only for the nsys profile jobs).
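The shutdown order described here could look roughly like the sketch below. This is not the code from commit d819fee. The variable names (`TENSORBOARD_PID`, `PROFILE_PIDS`, `MONITOR_PID`) and the `sleep` stand-ins are hypothetical, and treating an empty or literal `null` value of `EMBEDDED_TENSORBOARD_TARGET` as "no TensorBoard" is an assumption.

```shell
# Assumption: an empty or literal "null" target means no TensorBoard is launched.
TARGET="${EMBEDDED_TENSORBOARD_TARGET:-}"
TENSORBOARD_PID=""
PROFILE_PIDS=""

# Hypothetical stand-in for the looping nvidia-smi monitor.
( while true; do sleep 1; done ) &
MONITOR_PID=$!

# Hypothetical stand-ins for two background nsys profile jobs.
for i in 1 2; do
  sleep 1 &
  PROFILE_PIDS="$PROFILE_PIDS $!"
done

# Launch the (stand-in) TensorBoard process only when a target is configured.
if [ -n "$TARGET" ] && [ "$TARGET" != "null" ]; then
  sleep 1 &
  TENSORBOARD_PID=$!
fi

# Step 1: wait for TensorBoard only if it exists.
if [ -n "$TENSORBOARD_PID" ]; then
  wait "$TENSORBOARD_PID"
fi

# Step 2: wait for the profile jobs by PID, never for the monitor loop.
PROFILES_DONE=0
for pid in $PROFILE_PIDS; do
  wait "$pid" && PROFILES_DONE=$((PROFILES_DONE + 1))
done
echo "profile jobs finished: $PROFILES_DONE"

# Stop the monitor explicitly so the script can exit.
kill "$MONITOR_PID"
wait "$MONITOR_PID" 2>/dev/null || true
```

Waiting on explicit PIDs rather than calling a bare `wait` is what keeps the monitor loop from blocking completion while still letting the profile jobs finish.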

Collaborator

@samcmho samcmho left a comment

LGTM

@MrGeislinger MrGeislinger merged commit 26ad83a into main Jul 12, 2024
2 checks passed
@MrGeislinger MrGeislinger deleted the victorsvector/megatron-gke branch July 12, 2024 18:27