Questions tagged [distributed-computing]
Using more than one computer, connected to each other by a communication link, to accomplish a common task.
distributed-computing
2,875
questions
0
votes
0
answers
14
views
Reliable protocol that guarantees complete delivery, with no in-order promise
The sender is sending N packets to the receiver.
I want a protocol or method that guarantees delivery, so that each packet is received at least once. It is OK if some packets are received more than once due to ...
0
votes
0
answers
19
views
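A minimal sketch of the at-least-once requirement described in the question above, assuming a simple retransmit-until-acknowledged scheme over a simulated lossy link. The function name `send_at_least_once`, the `loss_rate`, `seed`, and `max_rounds` parameters, and the in-process "channel" are all hypothetical and are not taken from the question; a real protocol would also need timeouts and a real transport.

```python
import random


def send_at_least_once(packets, loss_rate=0.3, seed=0, max_rounds=1000):
    """Simulate retransmit-until-acknowledged delivery over a lossy link.

    Returns (received, deliveries): a dict mapping sequence number to payload
    for everything the receiver saw, and the total number of deliveries,
    counting duplicates.
    """
    rng = random.Random(seed)
    unacked = set(range(len(packets)))  # sequence numbers still awaiting an ack
    received = {}
    deliveries = 0

    for _ in range(max_rounds):
        if not unacked:
            break
        batch = list(unacked)
        rng.shuffle(batch)  # arrival order is arbitrary: no in-order promise
        for seq in batch:
            if rng.random() < loss_rate:
                continue  # data packet lost; it stays unacked and is resent
            received[seq] = packets[seq]
            deliveries += 1
            if rng.random() < loss_rate:
                continue  # ack lost; the resend will arrive as a duplicate
            unacked.discard(seq)

    if unacked:
        raise RuntimeError("gave up before every packet was acknowledged")
    return received, deliveries


if __name__ == "__main__":
    got, total = send_at_least_once(["a", "b", "c", "d"])
    print(sorted(got), total)
```

The sender stops only when every sequence number has been acknowledged, so each packet is delivered at least once; a lost acknowledgement causes a duplicate delivery, which the question says is acceptable.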
Does ZooKeeper preserve order when moving sessions?
In the ZooKeeper book it says:
When a client creates a ZooKeeper handle using a specific language
binding, it establishes a session with the service. The client
initially connects to any server in ...
0
votes
0
answers
29
views
How to securely conduct lottery-like draws with guaranteed randomness without auditing?
Is there an existing algorithm or method to conduct lottery-like draws that ensures secure and truly random results without the need for auditing?
Is there any library to do this?
I searched on the web ...
0
votes
2
answers
34
views
Unable to run code on Multiple GPUs in PyTorch - Usage shows only 1 GPU is being utilized
I am training a Transformer Encoder-Decoder based model for Text summarization. The code works without any errors but uses only 1 GPU when checked with nvidia-smi. However, I want to run it on all the ...
0
votes
0
answers
19
views
Out-of-memory problem when using dist.all_gather
I'm writing code for multi-GPU training, and I need to gather embeddings from different GPUs to calculate the loss and then propagate the gradients back to the different GPUs. However, when the program runs ...
0
votes
0
answers
24
views
I Have Imagination of Futuristic Computing Scenarios. How Can I Get Involved?
I am currently a backend developer specializing in Java-oriented web services, with a bachelor's degree in Computer Science. After working for a few years, I have become deeply interested in diving ...
0
votes
0
answers
25
views
Distributed training model produces NaN values during inference
When I trained my Mamba model on 4 GPUs through DistributedDataParallel, after the first round of training, I executed the validation code. The validation on the cuda:3 process always gave NaN values, and ...
0
votes
0
answers
40
views
Distributed Training using PyTorch
I am using PyTorch's multiprocessing framework to distribute my training across multiple GPUs. I'm doing this over the batch size, so each GPU has its independent batch that it calculates the gradient ...
1
vote
2
answers
100
views
How to reliably implement fan out write pattern?
I'm trying to RELIABLY implement that pattern.
For practical purposes, assume we have something similar to a Twitter clone (in Cassandra and Node.js).
So, user A has 500k followers. When user A posts a ...
0
votes
0
answers
17
views
How to Deploy Replicaset and Custom Images in AWS via Ray Docker Images?
I'm getting started with Ray on an AWS cluster and trying to understand the declarative YAML config as in the Ray GitHub repository. I can see it is possible to directly add the Docker images of Ray on the AWS EC2 ...
1
vote
1
answer
74
views
Using torchrun with AWS SageMaker estimator on multi-GPU node
I would like to run a training job on an ml.p4d.24xlarge machine on AWS SageMaker. I ran into a similar issue described here with significant slowdowns in training time. I understand now that I should run ...
2
votes
0
answers
14
views
Why am I getting a "LM_WRITE_LOG_FAILED ERROR 80000" in GridDB when writing to the log file?
I'm using GridDB for managing a distributed database system and recently encountered the following error while trying to perform operations:
80000 LM_WRITE_LOG_FAILED ERROR
Writing to log file failed. ...
2
votes
0
answers
11
views
Why am I getting a "JC_CONTAINER_NOT_OPENED ERROR 145034" in GridDB when performing operations on a container?
I'm working with GridDB to manage a distributed database and recently encountered the following error while performing operations on a container:
145034 JC_CONTAINER_NOT_OPENED ERROR Status check of ...
0
votes
0
answers
28
views
How do microservices communicate with each other when they are secured with JWT?
I am currently learning microservices architecture. I got to know that you can use JWT, OAuth, and a bunch of other mechanisms to secure microservices, but one thing that confuses me is how they ...
2
votes
0
answers
19
views
Why am I getting a "SYNC_CREATE_CONTEXT_FAILED ERROR 20037" during data synchronization in my GridDB cluster?
I'm working on a distributed system where I need to synchronize data across a cluster of nodes. However, I'm encountering an error during the synchronization process. The error message I get is:
20037 ...