GKE Megatron demo workload #378
Please revise. Also, node0 is not shutting down properly after the other nodes have finished. node0's log is stuck running nvidia-smi repeatedly.
I suspect that this might be from a Might be simple fix to remove the |
- Fixes an issue where, if no TensorBoard instance is launched, node0 would wait on the looping nvidia-smi command and thus never complete. Removing the `wait` command would let the pod complete, but completion would likely happen before the background nsys profile commands finish.
- The fix now explicitly waits for the TensorBoard instance only if it exists, then waits for all nsys profile jobs to complete.
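
The shutdown ordering described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the script from the PR: the `sleep` commands are stand-ins for the real nvidia-smi loop, nsys profile jobs, and TensorBoard process, and the names `LAUNCH_TENSORBOARD`, `MONITOR_PID`, `NSYS_PIDS`, `TENSORBOARD_PID`, and `SHUTDOWN_STATUS` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: every background command below is a stand-in, not the PR's code.

# Stand-in for the looping nvidia-smi monitor; it never exits on its own.
( while true; do sleep 1; done ) &
MONITOR_PID=$!

# Stand-ins for nsys profile jobs launched in the background.
NSYS_PIDS=""
for i in 1 2; do
  sleep 0.2 &
  NSYS_PIDS="$NSYS_PIDS $!"
done

# TensorBoard is optional; record its PID only if it was launched.
# LAUNCH_TENSORBOARD is a hypothetical switch for this sketch.
TENSORBOARD_PID=""
if [ "${LAUNCH_TENSORBOARD:-0}" = "1" ]; then
  sleep 0.2 &
  TENSORBOARD_PID=$!
fi

# Wait for TensorBoard only if it exists.
if [ -n "$TENSORBOARD_PID" ]; then
  wait "$TENSORBOARD_PID"
fi

# Wait for each profile job by PID. A bare `wait` would also block on the
# monitor loop, which is the hang described in the review comment.
for pid in $NSYS_PIDS; do
  wait "$pid"
done

# The monitor loop is no longer needed once profiling has finished.
kill "$MONITOR_PID" 2>/dev/null || true
SHUTDOWN_STATUS="done"
echo "node0 finished"
```

Waiting on explicit PIDs instead of calling `wait` with no arguments is what keeps the never-ending monitor loop from blocking pod completion.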
Commit d819fee makes sure node0 shuts down properly (waits for TensorBoard if it exists, then waits for just the nsys profile jobs).
LGTM
Files (scripts, configs, etc.) for the example guide on running Megatron-LM on GKE (A3-Mega), as described in the README in this directory.