
Add coredns proposal #1100

Merged 5 commits on Oct 23, 2017

Conversation

johnbelamaric
Member

Add coredns proposal for https://github.com/kubernetes/features/issues/427
@k8s-ci-robot added the size/M and cncf-cla: yes labels on Sep 19, 2017
@johnbelamaric changed the title from "Add coredns proposal for https://github.com/kubernetes/features/issue…" to "Add coredns proposal" on Sep 19, 2017
@cblecker
Member

@kubernetes/sig-network-proposals

@k8s-ci-robot added the sig/network and kind/design labels on Sep 19, 2017
@thockin
Member

thockin commented Sep 20, 2017

I'm OK with this. Any net change in footprint - CPU or memory required, performance or scale characteristics?

@smarterclayton
Contributor

Seems reasonable to me as well. Also want to see some perf diff info - tail latency on representative workloads.

@smarterclayton
Contributor

Also, add the security implications to the proposal (I see some interesting things in the current codebase), for @kubernetes/sig-auth-api-reviews.

@k8s-ci-robot added the sig/auth and kind/api-change labels on Sep 20, 2017
@smarterclayton
Contributor

On a 6.5k service / 10k pod cluster I was seeing about 18-35 MB in use, and on a 15k service / 18k pod cluster I was seeing slightly higher. No obvious allocations or slow paths on a range of queries. It seems to fare a bit better than most projects that integrate with kube :)

Would be nice if it upgraded to protobuf

* Adding an arbitrary entry inside the cluster domain (for example TXT entries [#38](https://github.com/kubernetes/dns/issues/38))
* Verified pod DNS entries (ensure pod exists in specified namespace)
* Experimental server-side search path to address latency issues [#33554](https://github.com/kubernetes/kubernetes/issues/33554)
* Limit PTR replies to the cluster CIDR [#125](https://github.com/kubernetes/dns/issues/125)
Contributor

Are PTR records for services implemented? I saw coredns/coredns#1074

Member Author

@johnbelamaric Sep 20, 2017

Yes, they are. You have to configure the reverse zone to make it work. That means knowing the service CIDR and configuring that ahead of time (would love to have kubernetes/kubernetes#25533 implemented).

Since reverse DNS zones are on classful boundaries, if you have a classless CIDR for your service CIDR (say, a /12), then you have to widen that to the containing classful network. That leaves a subset of that network open to the spoofing described in kubernetes/dns#125, and so the issue you reference is to fix that.

We still have that issue (PTR hijacking) with CoreDNS for IPs in the pod CIDRs, but we are in a position to fix it if the operator is willing to enable the pods verified mode.
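For illustration, a minimal sketch of the relevant kubernetes plugin stanza; the cluster domain, the /12 service CIDR, and its widening to 10.in-addr.arpa are illustrative assumptions, not values from the proposal:

```
# Sketch only. Assumes a service CIDR of 10.96.0.0/12; since reverse zones sit on
# classful boundaries, the zone is widened to 10.in-addr.arpa, which is what opens
# the spoofing window described in kubernetes/dns#125.
.:53 {
    kubernetes cluster.local 10.in-addr.arpa {
        pods verified               # only answer pod records for pods that actually exist
    }
    proxy . /etc/resolv.conf        # everything else goes to the upstream resolver
}
```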

Contributor

We generally recommend that production users disable pod IP DNS for this reason as well. I would prefer to let pod DNS get deprecated out, since it was intentionally a stopgap.

Thanks for the clarity on the reverse CIDR. One part of this that would be good is to include a sample Corefile in this proposal that implements conformance with the Kube DNS spec. As someone new to the Corefile syntax but deeply familiar with the Kube DNS spec, I had to dig through the code to figure out what I had to set up. A Corefile would go a long way toward making the implications clear.

Member Author

Ok, I have added that. Let me know if I need any more examples (e.g., federation).
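For readers following along here, a hedged sketch of what a kube-dns-conformant Corefile can look like; this is not the exact file added to the proposal, and the cluster domain, ports, and upstream are illustrative assumptions (the proxy plugin reflects the syntax of that era):

```
# Sketch only; the actual Corefile in the proposal may differ.
.:53 {
    errors
    health                                   # liveness endpoint, like kube-dns's healthz
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure                        # kube-dns-compatible pod records
    }
    prometheus :9153                         # metrics, covering the kube-dns sidecar's role
    proxy . /etc/resolv.conf                 # send out-of-cluster names upstream
    cache 30                                 # cache responses with a 30s TTL cap
}
```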


By default, the user experience would be unchanged. For more advanced uses, existing users would need to modify the
ConfigMap that contains the CoreDNS configuration file.

Contributor

Discuss the operational characteristics of CoreDNS versus the existing kube-dns.

Have the existing e2e tests been run against a cluster configured with coredns?

Member Author

Ok, I will add some to the proposal regarding the operational characteristics.

The existing e2e tests have not been run yet (see coredns/coredns#993); we definitely plan to do this.

@johnbelamaric
Member Author

@smarterclayton your comment "would be nice if it upgraded to protobuf" - can you explain further?

We actually can answer gRPC queries, which means protobuf over HTTP/2 and TLS, but of course that requires the client to use it. Right now that is a simple Query method that just wraps an ordinary DNS packet. What we'd really like to do in the future is extend the gRPC API to be bidirectional and allow push-based updates. This is especially useful for headless services where the pods may come and go: clients would be informed of changes immediately, rather than having to re-resolve the SRV queries.

@smarterclayton
Contributor

No, I mean that you should use a version of the Kubernetes client with protobuf serialization. That reduces apiserver load significantly. It's not a requirement for this, but it is worth mentioning in the performance / scale section. I know some people run kube-dns on many, many nodes; OpenShift runs it on every node, so the impact on cluster performance is key.
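For context, "using protobuf" here means configuring the Kubernetes client to request protobuf-encoded responses instead of JSON. A minimal sketch with client-go; the field names are from client-go's rest.Config, and the exact client library CoreDNS used at the time may differ:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the DNS server runs in-cluster; a kubeconfig-based rest.Config
	// works the same way.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}

	// Ask the apiserver for protobuf-encoded responses instead of JSON, which
	// reduces apiserver CPU for watch-heavy clients such as a DNS server.
	cfg.ContentType = "application/vnd.kubernetes.protobuf"
	cfg.AcceptContentTypes = "application/vnd.kubernetes.protobuf,application/json"

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("client ready: %T\n", client)
}
```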

@k8s-ci-robot added the size/L label and removed the size/M label on Sep 20, 2017
Member

@luxas left a comment

Before I review this, is it possible to get graphs of CPU/memory usage plus 50/90/95/98/99th-percentile response-time graphs compared to kube-dns in 100/1000/5000-node clusters?
cc @kubernetes/sig-scalability-feature-requests (we'd probably have to use kubemark)
Also, at what point is more than one replica needed when scaling up the cluster?

It would be really interesting to see that (just reiterating what has been said above)

@johnbelamaric
Member Author

@luxas Even with what we get as a CNCF project from Packet, I don't have the resources to do 1000+ node clusters. Are there some other resources available for this sort of test?

@johnbelamaric
Member Author

Oh, I see. kubemark simulates clusters. That's new to me. We'll look into it.

@smarterclayton
Contributor

Examples are great. I'll have someone poke at coredns against our super dense clusters to get a rough idea of the outcome.

@wojtek-t
Member

wojtek-t commented Oct 9, 2017

> @wojtek-t thanks. CoreDNS will have a watch on services and endpoints, so yeah, we need to measure the impact on memory and CPU with different levels of services and endpoints, as well as different rates of change in those (along with different QPS levels). Any pointers you have on getting kubemark to do this would be helpful.

Kubemark generally sets up a cluster control plane (apiserver, etcd, controller manager, ...) plus a number of "fake nodes" (those nodes pretend to be real nodes, but they do not start real containers, mount volumes, etc.; they just fake those operations). From the apiserver's perspective, though, they behave like real nodes (e.g., they confirm that a pod is running, has been killed, etc.).
See https://github.com/kubernetes/community/blob/master/contributors/devel/kubemark-guide.md for more information.

So once you have kubemark set up, you can run any test or workflow against it. In particular, we run some of our load tests against it:
https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/load.go
which has an option to create services:
https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/load.go#L184

With a 5000-node kubemark cluster, our main test creates ~16k services with 150,000 pods (each of them part of some service). But you can run any test you want against the kubemark control plane.

@castrojo
Member

This change is Reviewable

@johnbelamaric
Member Author

Thanks @wojtek-t , that's helpful. I'll try to catch up with @rajansandeep to see where he is with this.

@rajansandeep
Contributor

Performance test for CoreDNS in Kubernetes

The performance test was done in GCE with the following components:

  • CoreDNS system with machine type: n1-standard-1 (1 CPU, 2.3 GHz Intel Xeon E5 v3 (Haswell))
  • Client system with machine type: n1-standard-1 (1 CPU, 2.3 GHz Intel Xeon E5 v3 (Haswell))
  • Kubemark Cluster with 5000 nodes

CoreDNS and client are running out-of-cluster (due to it being a Kubemark cluster)

The following is the summary of the performance of CoreDNS.
CoreDNS cache was disabled.

| Services (with 1% change per minute*) | Max QPS** | Latency (median) | CoreDNS memory (at max QPS) | CoreDNS CPU (at max QPS) |
| --- | --- | --- | --- | --- |
| 1,000 | 18,000 | 0.1 ms | 38 MB | 95% |
| 5,000 | 16,000 | 0.1 ms | 73 MB | 93% |
| 10,000 | 10,000 | 0.1 ms | 115 MB | 78% |

*We simulated service change load by creating and destroying 1% of services per minute.
**Max QPS with < 1% packet loss.

@johnbelamaric
Member Author

FYI, the latency is so low, and identical across all service counts, that we are double-checking to make sure the test is valid.

@luxas
Member

luxas commented Oct 19, 2017

For reference, what are these numbers for kube-dns? That might be a valuable comparison to perform, possibly encouraging the broader community to switch over sooner rather than later...

@johnbelamaric
Member Author

We're getting real responses at that low latency. The two VMs (CoreDNS and client) are probably on the same physical host or something, though I imagine latency between GCE hosts is extremely low anyway. In any case, it can be taken to mean that:

  1. Number of services does not affect latency; and
  2. Latency inside CoreDNS is very low

As an aside, before we merged coredns/coredns#1149, the latency on the 10k services was really high (like 1 second).

@johnbelamaric
Member Author

@luxas we can do that. The numbers above are uncached. So, we would need to configure dnsmasq to not cache anything (forward all queries to kube-dns) for an equivalent test.

For general cache tests, I would expect at least 30k QPS per core from CoreDNS, based on past experience. I also would not be surprised if the dnsmasq cache were faster than that, based on other testing we have done versus C-based servers. I have no way of estimating the cache hit ratio in a "typical" cluster, so it's hard to evaluate this in a general way. Nonetheless, with the numbers above I think the overall solution with CoreDNS will be better even if the cache performance isn't quite as good.

@johnbelamaric
Member Author

Ok, I think this latest commit covers all the open questions, except running the equivalent kube-dns numbers (which I don't think necessarily belongs here, but we can add it if others wish).

@rajansandeep
Contributor

Performance test for Kubedns + dnsmasq in Kubernetes

The performance test was done in GCE with the following components:

  • Kubedns + dnsmasq system with machine type: n1-standard-1 (1 CPU, 2.3 GHz Intel Xeon E5 v3 (Haswell))
  • Client system with machine type: n1-standard-1 (1 CPU, 2.3 GHz Intel Xeon E5 v3 (Haswell))
  • Kubemark Cluster with 5000 nodes

Kubedns + dnsmasq and client are running out-of-cluster (due to it being a Kubemark cluster)

The following is the summary of the performance of Kubedns + dnsmasq.
Cache was disabled.

| Services (with 1% change per minute*) | Max QPS** | Latency (median) | Kubedns + dnsmasq memory (at max QPS) | Kubedns + dnsmasq CPU (at max QPS) |
| --- | --- | --- | --- | --- |
| 1,000 | 8,000 | 0.2 ms | 45 MB | 85% |
| 5,000 | 7,000 | 0.2 ms | 97 MB | 89% |
| 10,000 | 6,000 | 0.2 ms | 191 MB | 81% |

*We simulated service change load by creating and destroying 1% of services per minute.
**Max QPS with < 1% packet loss.

@smarterclayton
Contributor

smarterclayton commented Oct 23, 2017 via email

@bowei
Member

bowei commented Oct 23, 2017

/lgtm for alpha

For beta, we would want to have a migration story for existing users of kube-dns.

@bowei added the approved and lgtm labels on Oct 23, 2017
@k8s-github-robot

Automatic merge from submit-queue.

@warmchang
Contributor

👍

@fturib mentioned this pull request on Mar 23, 2018
MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
Automatic merge from submit-queue.

Add coredns proposal
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023
Labels
* approved: Indicates a PR has been approved by an approver from all required OWNERS files.
* cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
* kind/api-change: Categorizes issue or PR as related to adding, removing, or otherwise changing an API.
* kind/design: Categorizes issue or PR as related to design.
* kind/feature: Categorizes issue or PR as related to a new feature.
* lgtm: "Looks good to me", indicates that a PR is ready to be merged.
* sig/auth: Categorizes an issue or PR as relevant to SIG Auth.
* sig/autoscaling: Categorizes an issue or PR as relevant to SIG Autoscaling.
* sig/network: Categorizes an issue or PR as relevant to SIG Network.
* size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.