Network glitch reducing etcd cluster availability seriously #7321

Closed
shuting-yst opened this issue Feb 14, 2017 · 22 comments

@shuting-yst

shuting-yst commented Feb 14, 2017

Hi,

I ran into a problem using the etcd API when the network is not stable: I use a single etcd client (configured with 3 endpoints corresponding to the 3 etcd nodes) to get some value from the cluster. But if one of the etcd nodes has a broken network, the request hangs for at least 16 minutes before it finally succeeds.

It's easy to reproduce:

  • use docker containers to start a 3-node etcd cluster
  • create a clientv3 with all nodes' endpoints and issue one request
  • break the network of one etcd node
  • request again (whether it hangs is random, depending on whether the broken node is the one your client is currently connected to)

The code looks like this:

func() {
	clu, err := etcd.NewCluster(3) // use docker run to start an etcd cluster with 3 nodes
	endpoints := []string{}
	for i := 0; i < 3; i++ {
		eps := clu.Client(i).Endpoints()
		endpoints = append(endpoints, eps...)
	}
	logger.Infof("Total endpoints %v\n", endpoints)
	client, err := clientv3.New(clientv3.Config{
		Endpoints:        endpoints,
		AutoSyncInterval: time.Second,
		DialTimeout:      5 * time.Second})
	logger.Infof("Begin to get things")
	// just request once, for the first time
	_, err = client.Get(context.Background(), "noexit")

	for _, mem := range clu.GetMembers() {
		logger.Infof("netdown mem %s", mem.GetName())
		_, err := clu.DisconnectNet(mem.GetName()) // break one node's network

		_, err = client.Get(context.Background(), "noexit") // this call hangs
		logger.Infof("recover mem: %s", mem.GetName())
		_, err = clu.ConnectNet(mem.GetName()) // recover the node's network
	}
}

The etcd containers are started like this:

// create a network
docker network create --driver bridge %s --subnet %s
// create one etcd node  
"docker run -d --net=%s --ip=%s --name %s " + EtcdImagePath +
			" /usr/local/bin/etcd " +
			"--data-dir=data.etcd --name %s " +
			"--initial-advertise-peer-urls http://%s:2380 " +
			"--listen-peer-urls http://%s:2380 " +
			"--advertise-client-urls http://%s:2379 " +
			"--listen-client-urls http://%s:2379 " +
			"--initial-cluster " + cluster + " " +
			"--initial-cluster-state " + clusterState + " " +
			"--initial-cluster-token " + token
// broken network
"docker network disconnect -f %s %s", c.netName, containerID
// recover network
"docker network connect %s %s", c.netName, containerID

I changed context.Background() to context.WithTimeout(context.Background(), 5*time.Second); the result looks like this:

  • after 16 minutes, the Get operation succeeds
  • not every request is terminated after 5s; the interval between failed requests keeps increasing.
I0214 10:44:52.192850    2228 etcd_local.go:394] Disconnect the test container to netowrk. Container: etcd-s1fsqse-node-1. Network: etcd-s1fsqse. Output: . Err: <nil>
I0214 10:44:57.194600    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:02.195953    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:07.199983    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:12.201359    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:17.203196    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:22.206000    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:27.207338    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:32.209470    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:42.249128    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:53.651880    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:46:15.644939    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:46:34.199748    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:47:32.885118    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:49:01.816162    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:49:43.253304    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:51:00.247848    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:51:52.005616    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:53:01.580015    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:54:01.834655    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:55:17.626976    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:56:36.551973    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:57:45.628319    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:58:27.367759    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:59:36.572002    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 11:00:22.016167    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 11:00:22.016486    2228 etcd_test.go:59] connect to other node

Questions:

  1. Why does the client request hang when there is no context cancel or context timeout? Where does it hang, and why does it hang for 16 minutes?
  2. The client has 3 endpoints; when one node is broken, shouldn't the client retry the other endpoints instead of hanging?
  3. When a request is canceled by its context and I simply retry with the same client, why doesn't the client switch to another endpoint even though I set "AutoSyncInterval"? Why does it only switch after 16 minutes?

etcd version: 3.1.0

In my scenario, I grant a lease and call KeepAliveOnce (with a 5s timeout context) for the lease, but the etcd client doesn't switch to another etcd node when I retry KeepAliveOnce, so the lease expires. It's horrible! How should I deal with it? Help, please.

@shuting-yst
Author

It seems that issue #4606 is related to this.

@heyitsanthony
Contributor

  1. Why does the client request hang when there is no context cancel or context timeout? Where does it hang, and why does it hang for 16 minutes?

If a member is partitioned from the rest of the network, a linearizable Get shouldn't succeed; it'll keep retrying. I suspect the auto sync is causing the balancer to select a connection to a new member which is letting the request eventually go through.

  2. The client has 3 endpoints; when one node is broken, shouldn't the client retry the other endpoints instead of hanging?

The client doesn't do this yet. It will only detect a disconnect between the client and the server; it won't blacklist partitioned members.

  3. When a request is canceled by its context and I simply retry with the same client, why doesn't the client switch to another endpoint even though I set "AutoSyncInterval"? Why does it only switch after 16 minutes?

See 1 & 2.

I grant a lease and call KeepAliveOnce (with a 5s timeout context) for the lease, but the etcd client doesn't switch to another etcd node when I retry KeepAliveOnce, so the lease expires

Partitioned endpoint blacklisting won't solve this entirely. The client should follow the same path as losing a connection between the client and server if the lease expires.

@heyitsanthony added this to the v3.2.0-maybe milestone on Feb 14, 2017
@shuting-yst
Author

shuting-yst commented Feb 15, 2017

@heyitsanthony

I suspect the auto sync is causing the balancer to select a connection to a new member which is letting the request eventually go through

I'm sure it has nothing to do with auto sync. At first I did not set 'AutoSyncInterval', and the request still went through after about 16 minutes.

Partitioned endpoint blacklisting won't solve this entirely

Perhaps I wasn't too clear about what I meant; let me try to clear this up a bit. First, I call GrantLease and get a lease with a TTL of 15s. Then I have a goroutine that calls KeepAliveOnce (with a timeout context) every 5s to prevent the lease from expiring. If KeepAliveOnce fails, it just retries until the lease expires. So if the client could blacklist partitioned members in time, KeepAliveOnce would succeed and the lease would stay alive. I think that's the key point to solving the problem.
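
Roughly, the keep-alive loop looks like this (a simplified sketch only; logger and error handling are as in my test code above):

leaseResp, err := client.Grant(context.Background(), 15) // lease with a 15s TTL
if err != nil {
	logger.Infof("Grant failed: %v", err)
	return
}
go func() {
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		_, kaErr := client.KeepAliveOnce(ctx, leaseResp.ID)
		cancel()
		if kaErr != nil {
			// keep retrying; if the client stays stuck on a partitioned
			// member, every retry fails and the lease eventually expires
			logger.Infof("KeepAliveOnce failed: %v", kaErr)
			continue
		}
		time.Sleep(5 * time.Second)
	}
}()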

@heyitsanthony
Contributor

heyitsanthony commented Feb 15, 2017

AutoSync

OK, then it's getting disconnected somehow from some lower layer and choosing a new connection.

KeepAliveOnce

No, that is still not robust. The application must handle lease expiration that's out of its control, otherwise it won't gracefully failover when there is client network loss or arbitrary client delay. Blacklisting does not completely solve it.
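
For example, one way to notice expiration on the application side is to consume the channel returned by KeepAlive rather than relying only on KeepAliveOnce retries (a sketch; leaseID here stands for whatever Grant returned):

// Sketch: the KeepAlive channel is closed once keep-alives can no longer be
// delivered (lease expired, context canceled, client shut down, ...).
kaCh, err := client.KeepAlive(context.Background(), leaseID)
if err != nil {
	// handle the error, e.g. re-grant the lease
}
go func() {
	for range kaCh {
		// responses arrive here while the lease is being kept alive
	}
	// channel closed: assume the lease is gone and rebuild any state that
	// depended on it, instead of trusting that keep-alives always succeed
}()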

@shuting-yst
Author

@heyitsanthony
You are right, I do handle lease expiration, but I want to reduce its likelihood through retries and fault handling. I think system availability should be roughly proportional to the number of etcd nodes, right? Crashing one etcd node has no effect on the system, but if one etcd node's network breaks, every client connected to the partitioned node is affected. System availability degrades just because of one node!

Could you give me some simple suggestions for handling a broken network? How can I handle this effectively on the client side?

@heyitsanthony
Contributor

@shuting-yst the easiest workaround now through the clientv3 client is to have a WithRequireLeader() watch open on a future revision for some key (cf. https://github.com/coreos/etcd/blob/master/proxy/grpcproxy/leader.go). If the leader stays lost for a few seconds, get the current set of endpoints with clientv3.Client.Endpoints(), then use clientv3.Client.SetEndpoints() to cycle through the endpoints until it finds an active leader. This isn't ideal but it'll work for the time being.
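
A rough sketch of that workaround (the function name, the "health" probe key, and the timeouts below are placeholders rather than etcd APIs; the far-future-revision watch mirrors what leader.go does):

// monitorLeader is a sketch: watch for leader loss and rotate the client's
// endpoints when it happens. Error handling, backoff and shutdown are omitted.
// Assumed imports: context, math, time, github.com/coreos/etcd/clientv3.
func monitorLeader(client *clientv3.Client) {
	wctx := clientv3.WithRequireLeader(context.Background())
	// Watch a key nobody writes, starting at a far-future revision, so the
	// only responses that ever arrive are error notifications such as
	// "lost leader" (the same trick proxy/grpcproxy/leader.go uses).
	wch := client.Watch(wctx, "__lostleader", clientv3.WithRev(math.MaxInt64-2))
	for wresp := range wch {
		if wresp.Err() == nil {
			continue
		}
		// Leader lost: try the known endpoints one at a time until a short
		// linearizable Get succeeds, i.e. until we reach a member that can
		// still talk to a leader.
		eps := client.Endpoints()
		for _, ep := range eps {
			client.SetEndpoints(ep)
			ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
			_, err := client.Get(ctx, "health") // any key; only the error matters
			cancel()
			if err == nil {
				break
			}
		}
		client.SetEndpoints(eps...) // restore the full endpoint list
		// a real implementation would re-open the watch here, since the
		// server cancels it when the leader is lost
	}
}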

@shuting-yst
Author

Sorry, I don't think that method works:

  1. the partitioned node is not always the etcd leader, so a watch with WithRequireLeader() does not help when some other node is the one that is unreachable.
  2. when one node is network partitioned, all operations issued by a client connected to that partitioned node hang or fail, so I do not think the client can get a useful set of current endpoints.
@heyitsanthony
Contributor

heyitsanthony commented Feb 17, 2017

the partitioned node is not always the etcd leader, so a watch with WithRequireLeader() does not help when some other node is the one that is unreachable.

That's not how it works. WithRequireLeader will notify the client that it can't get through to the current leader. The partitioned member detects that it lost its ability to communicate with the leader and closes the watch stream with a lost leader error. The client will receive a watch response with Err() set to a lost leader error. Please refer to leader.go.

when one node is network partitioned, all operations issued by a client connected to that partitioned node hang or fail, so I do not think the client can get a useful set of current endpoints.

Endpoints() reads the current endpoints from the client's memory. It's not a network operation (which is why there's no ctx argument).

@shuting-yst
Author

shuting-yst commented Feb 28, 2017

@heyitsanthony sorry, I did an experiment showing that this does not work. The Watch operation hangs just like other operations such as Get. Simple code is shown below:

clu, err := etcd.NewCluster(3) // start an etcd cluster with 3 members, all in containers
defer clu.Terminate()
endpoints := []string{}
for i := 0; i < 3; i++ {
	eps := clu.Client(i).Endpoints()
	endpoints = append(endpoints, eps...)
}
logger.Infof("Total endpoints %v\n", endpoints)
client, err := clientv3.New(clientv3.Config{
	Endpoints:        endpoints,
	AutoSyncInterval: time.Second,
	DialTimeout:      5 * time.Second})
defer client.Close()

go func() {
	for {
		logger.Infof("Start to watch now")
		wch := client.Watch(clientv3.WithRequireLeader(context.Background()), "nowaytoput/")
		wresp, ok := <-wch // whichever etcd node's network is broken, nothing ever arrives here
		logger.Infof("Attention: ok is '%v'. wresp.Err is '%v', wresp is '%v'", ok, wresp.Err(), wresp)

		eps := client.Endpoints()
		logger.Infof("The current endpoints is %v", eps)
		client.SetEndpoints(eps...)
	}
}()
logger.Infof("Begin to get things")
_, err = client.Get(context.Background(), "noexit")

for _, mem := range clu.GetMembers() {
	logger.Infof("netdown mem %s", mem.GetName())
	_, err := clu.DisconnectNet(mem.GetName()) // break the etcd node's network

	utils.Retry(context.Background(), func() error { // retry until success
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		_, err = client.Get(ctx, "noexit")
		if err != nil {
			logger.Infof("Error is %v", err)
		}
		return err
	})
	logger.Infof("recover mem: %s", mem.GetName())
	_, err = clu.ConnectNet(mem.GetName())
}

Furthermore, I did another experiment. When an etcd node's network is broken and the Get operation fails with 'context deadline exceeded', I fetch the endpoints by calling client.Endpoints() and set them back with client.SetEndpoints(). The client did not cycle through the endpoints, and the Get operation never succeeded no matter how long I retried. client.Endpoints() always returns the value I set when creating the client; it never updates.

Finally, I tried separate clients (each client connecting to only one etcd node), with a sad result. The steps:

  1. create an etcd cluster with 3 nodes: ip1:2379, ip2:2379, ip3:2379
  2. create a client list [client1(ip1), client2(ip2), client3(ip3)]
  3. first break the network of ip1; client1.Get fails, but a retry with client2 succeeds
  4. recover the ip1 network and break ip2; at this point, all clients fail!!! Why?
I0228 14:20:05.096928    2981 worker_test.go:83] netdown mem etcd-brnaiso-node-0
I0228 14:20:05.429900    2981 etcd_local.go:400] Disconnect the test container to netowrk. Container: etcd-brnaiso-node-0. Network: etcd-brnaiso.
I0228 14:20:05.430587    2981 worker_test.go:91] Get with client 0
I0228 14:20:10.431464    2981 worker_test.go:95] Error with client 0 .Err is context deadline exceeded
I0228 14:20:10.431558    2981 worker_test.go:91] Get with client 1
I0228 14:20:10.437359    2981 worker_test.go:103] get success
I0228 14:20:10.697388    2981 etcd_local.go:391] Connect the test container to netowrk. Container: etcd-brnaiso-node-0. Network: etcd-brnaiso.
I0228 14:20:10.697474    2981 worker_test.go:105] recover mem: etcd-brnaiso-node-0
I0228 14:20:10.697495    2981 worker_test.go:83] netdown mem etcd-brnaiso-node-1
I0228 14:20:10.998149    2981 etcd_local.go:400] Disconnect the test container to netowrk. Container: etcd-brnaiso-node-1. Network: etcd-brnaiso.
I0228 14:20:10.998336    2981 worker_test.go:91] Get with client 0
I0228 14:20:15.999176    2981 worker_test.go:95] Error with client 0 .Err is context deadline exceeded
I0228 14:20:15.999257    2981 worker_test.go:91] Get with client 1
I0228 14:20:20.999654    2981 worker_test.go:95] Error with client 1 .Err is context deadline exceeded
I0228 14:20:20.999758    2981 worker_test.go:91] Get with client 2
I0228 14:20:26.000978    2981 worker_test.go:95] Error with client 2 .Err is context deadline exceeded
I0228 14:20:26.001137    2981 worker_test.go:91] Get with client 0
I0228 14:20:31.001967    2981 worker_test.go:95] Error with client 0 .Err is context deadline exceeded
I0228 14:20:31.002132    2981 worker_test.go:91] Get with client 1
I0228 14:20:36.003377    2981 worker_test.go:95] Error with client 1 .Err is context deadline exceeded
I0228 14:20:36.003457    2981 worker_test.go:91] Get with client 2
I0228 14:20:41.004113    2981 worker_test.go:95] Error with client 2 .Err is context deadline exceeded
I0228 14:20:41.180908    2981 worker_test.go:91] Get with client 0
I0228 14:20:46.181458    2981 worker_test.go:95] Error with client 0 .Err is context deadline exceeded
I0228 14:20:46.408611    2981 worker_test.go:91] Get with client 1
I0228 14:20:51.410810    2981 worker_test.go:95] Error with client 1 .Err is context deadline exceeded
I0228 14:20:54.525778    2981 worker_test.go:91] Get with client 2
I0228 14:20:59.526727    2981 worker_test.go:95] Error with client 2 .Err is context deadline exceeded
I0228 14:21:05.482730    2981 worker_test.go:91] Get with client 0

Any other suggestions? I'm looking forward to hearing from you.

@heyitsanthony
Contributor

This does not "cycle through the endpoints until it finds an active leader":

eps := client.Endpoints()
logger.Infof("The current endpoints is %v", eps)
client.SetEndpoints(eps...)

It would need to set the endpoints one by one, not all at once; setting the same full list is treated like a no-op since the endpoint list doesn't change.
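
That is, roughly (a sketch only; the key and timeout are just examples):

eps := client.Endpoints() // snapshot of the configured endpoints
for _, ep := range eps {
	client.SetEndpoints(ep) // a single endpoint, so the balancer has to use it
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	_, err := client.Get(ctx, "noexit")
	cancel()
	if err == nil {
		break // reached a member that can see the leader
	}
}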

@shuting-yst changed the title from "Network glitch cause etcd client request hang" to "Network glitch reducing etcd cluster availability seriously" on Mar 1, 2017
@shuting-yst
Author

shuting-yst commented Mar 1, 2017

@heyitsanthony I just tried setting the endpoints one by one; as before, it does not work. The code is:

for _, mem := range clu.GetMembers() {
	logger.Infof("netdown mem %s", mem.GetName())
	_, err := clu.DisconnectNet(mem.GetName())

	i := 0
	utils.Retry(context.Background(), func() error {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		logger.Infof("Get with client %d", i)
		_, err = client.Get(ctx, "noexit")
		cancel()
		if err != nil {
			logger.Infof("Error with client %d .Err is %v", i, err)
			i++
			if i > 2 {
				i = 0
			}
			client.SetEndpoints(endpoints[i])
		}
		return err
	})
	logger.Infof("get success")
	_, err = clu.ConnectNet(mem.GetName())
	logger.Infof("recover mem: %s", mem.GetName())
}

The result is:

I0301 03:36:22.997255    1886 worker_test.go:83] netdown mem etcd-y4ypggv-node-0
I0301 03:36:23.325944    1886 etcd_local.go:400] Disconnect the test container to netowrk. Container: etcd-y4ypggv-node-0. 
I0301 03:36:23.326025    1886 worker_test.go:90] Get with endpoint 172.29.0.20:2379
I0301 03:36:23.329184    1886 worker_test.go:104] get success
I0301 03:36:23.560356    1886 etcd_local.go:391] Connect the test container to netowrk. Container: etcd-y4ypggv-node-0. 
I0301 03:36:23.560454    1886 worker_test.go:106] recover mem: etcd-y4ypggv-node-0
I0301 03:36:23.560490    1886 worker_test.go:83] netdown mem etcd-y4ypggv-node-1
I0301 03:36:23.877492    1886 etcd_local.go:400] Disconnect the test container to netowrk. Container: etcd-y4ypggv-node-1. 
I0301 03:36:23.877635    1886 worker_test.go:90] Get with endpoint 172.29.0.20:2379
I0301 03:36:28.878558    1886 worker_test.go:95] Error with endpoint 172.29.0.20:2379 .Err is context deadline exceeded
I0301 03:36:28.879858    1886 worker_test.go:90] Get with endpoint 172.29.0.21:2379
I0301 03:36:33.880437    1886 worker_test.go:95] Error with endpoint 172.29.0.21:2379 .Err is context deadline exceeded
I0301 03:36:33.880749    1886 worker_test.go:90] Get with endpoint 172.29.0.22:2379
I0301 03:36:38.881204    1886 worker_test.go:95] Error with endpoint 172.29.0.22:2379 .Err is context deadline exceeded
I0301 03:36:38.881296    1886 worker_test.go:90] Get with endpoint 172.29.0.20:2379
I0301 03:36:43.881902    1886 worker_test.go:95] Error with endpoint 172.29.0.20:2379 .Err is context deadline exceeded
I0301 03:36:43.883029    1886 worker_test.go:90] Get with endpoint 172.29.0.21:2379
I0301 03:36:48.883306    1886 worker_test.go:95] Error with endpoint 172.29.0.21:2379 .Err is context deadline exceeded
I0301 03:36:48.883422    1886 worker_test.go:90] Get with endpoint 172.29.0.22:2379
I0301 03:36:53.883962    1886 worker_test.go:95] Error with endpoint 172.29.0.22:2379 .Err is context deadline exceeded
I0301 03:36:54.441005    1886 worker_test.go:90] Get with endpoint 172.29.0.20:2379
I0301 03:36:59.441482    1886 worker_test.go:95] Error with endpoint 172.29.0.20:2379 .Err is context deadline exceeded

I think this is a really serious bug.

@heyitsanthony
Contributor

@shuting-yst this is still with the 3.1.0 client? I see a bug in the balancer on master that breaks endpoint switching if the connection can be established (#7392). I'll write some test cases tomorrow.

@shuting-yst
Author

Yes, it's with version v3.1.0. Thanks! Waiting for your response.

@xiang90
Contributor

xiang90 commented Mar 1, 2017

@shuting-yst Thanks for your patience and for creating tests on this issue. If you feel confident and comfortable, you could actually contribute to the etcd project and help us fix the issue next time.

@shuting-yst
Author

@xiang90 I will be working on projects related to etcd for a long time, and I'd be very glad to help fix this with your guidance.

@shuting-yst
Author

shuting-yst commented Mar 1, 2017

Sorry, I made a mistake.
I found that the ConnectNet method in my test is not correct: 'docker network connect' has an '--ip' argument. If the IP is not set, docker assigns the container a default available IP. So when I recovered the etcd node's network, its IP changed and the client URLs no longer worked.
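
For reference, the fix is to pin the address when reconnecting, along these lines (same placeholder style as the commands above; c.ip is my own field holding the container's original address):

// reconnect the container and keep its original IP
"docker network connect --ip %s %s %s", c.ip, c.netName, containerID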

I fixed the ConnectNet method and redid all the experiments. Results:

  1. Watch with WithRequireLeader still does not work; it does not react to the network being broken.
  2. 'setting endpoints one-by-one' does work, but some operations still hang (especially Watch, which never receives events). A case like this:
client := clientv3.New(...)
go func() {
	// ... when one etcd node's network breaks, set the endpoints one-by-one
}()

go func() {
	for ... {
		client.Put(ctx, "ForWatch", "aa%d") // writes for the watcher to receive
	}
}()

wch := client.Watch(ctx, "ForWatch")
clu.DisconnectNet()
client.Get(context.Background(), "exist") // it hangs
go func() {
	for wresp := range wch {
		// nothing arrives from wch; execution never gets inside this 'for'
	}
}()
...
@heyitsanthony
Contributor

heyitsanthony commented Mar 1, 2017

Watch with WithRequireLeader still does not work; it does not react to the network being broken.

Huh? WithRequireLeader will detect if the server is partitioned from the rest of the cluster; that is precisely what it is designed to do. The server knows if it loses its leader since it will eventually stop receiving heartbeats (this is the same mechanism that triggers leader election). If the client can connect to the server it will be notified that the leader is lost when using WithRequireLeader.

For an example of how to properly use WithRequireLeader to detect leader loss please see https://github.com/coreos/etcd/blob/master/proxy/grpcproxy/leader.go

@shilpamayanna

shilpamayanna commented Apr 13, 2017

@heyitsanthony Is there any update on this issue? I am hitting a similar problem: when the etcd leader node goes down, clients on other nodes fail to access etcd even after a new leader is elected.

@heyitsanthony
Contributor

@shilpamayanna that's not what this issue is about. If the current leader goes down and the clients are configured for multiple endpoints, then they will reconnect to different members. This issue is about clients connecting to partitioned members.

The fix for this issue is slated for the 3.3 release.

@shilpamayanna

@heyitsanthony I guess it is a similar issue. The leader node is down either because it was partitioned from the network or because the node was rebooted. At this point clients on other nodes that are configured with multiple endpoints cannot access etcd; they have to be reconnected from the application. etcd version 3.0.17.

@heyitsanthony
Contributor

@shilpamayanna please file a separate issue explaining how to reproduce the problem. Thanks!

@gyuho
Contributor

gyuho commented Oct 6, 2017

Superseded by #8660.
