*: allow fully concurrent large read #9384

xiang90 · 2018-03-04T09:28:53Z

This PR fully solve the large read blocking issue.

Previously, a large read can block put for seconds. With this patch, there is no blocking at all.

xiang90 · 2018-03-04T09:33:55Z

/cc @heyitsanthony hmm... cc'd wrong person initially. haha.

xiang90 · 2018-03-04T09:38:34Z

mvcc/index.go

+	for next {
+		i := 0
+
+		ti.RLock()


@heyitsanthony This is for reducing locking time. Or read will still block write for 100ms when ranging over the index. Luckily, our index itself is MVCC. So we can release and acquire again as we wish.

Might be worthwhile to wrap 'f' higher up so it has some way to communicate a continuation so items can be requested in 10k chunks, which would avoid having to break up the lock here-- the implicit loss of atomicity on the index is a little worrying.

codecov-io · 2018-03-04T10:27:47Z

Codecov Report

❗ No coverage uploaded for pull request base (master@20cf7f4). Click here to learn what that means.
The diff coverage is 88.46%.

@@            Coverage Diff            @@
##             master    #9384   +/-   ##
=========================================
  Coverage          ?   69.63%           
=========================================
  Files             ?      376           
  Lines             ?    35247           
  Branches          ?        0           
=========================================
  Hits              ?    24545           
  Misses            ?     8942           
  Partials          ?     1760

Impacted Files	Coverage Δ
etcdserver/v3_server.go	`76.73% <ø> (ø)`
mvcc/index.go	`91.42% <100%> (ø)`
mvcc/backend/concurrent_read_tx.go	`77.77% <77.77%> (ø)`
mvcc/backend/backend.go	`73.51% <87.5%> (ø)`
mvcc/kvstore_txn.go	`67.64% <91.3%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 20cf7f4...6f5716d. Read the comment docs.

gyuho · 2018-03-04T11:09:38Z

mvcc/kvstore_txn.go

@@ -46,7 +63,9 @@ func (tr *storeTxnRead) Range(key, end []byte, ro RangeOptions) (r *RangeResult,
 }

 func (tr *storeTxnRead) End() {
-	tr.tx.Unlock()
+	if tr.txlocked {
+		tr.tx.Unlock()


tr.txlocked = false?

Can this be better encapsulated? Could tr.tx.Unlock() and tr.txlocked = false be coordinated by a function that tr provides so we have to remember both correctly everywhere? Same for below tr.tx.Lock() logic.

heyitsanthony

some suggestions

heyitsanthony · 2018-03-04T23:10:50Z

mvcc/backend/backend.go

+	// ConcurrentReadTx returns a non-blocking read tx that is suitable for large reads.
+	// ConcurrentReadTx call itself will not return until the current BatchTx gets committed to
+	// ensure consistency.
+	ConcurrentReadTx() ReadTx


StaleReadTx? CommittedReadTx? Concurrent doesn't really tell me what it is (or makes me think it is the thread-safe one)

will rename.

heyitsanthony · 2018-03-04T23:16:53Z

mvcc/backend/backend.go

@@ -95,6 +100,8 @@ type backend struct {

 	readTx *readTx

+	concurrentReadTxCh chan chan ReadTx


May want to avoid O(n) broadcast here; can do O(1) by returning <-chan struct{} that closes on the next commit, wait on channel close, fetch the latest committed tx, then acquire some semaphore to limit concurrency. Could return a closed channel if the backend is already fully committed, so no need to wait on commit timer.

heyitsanthony · 2018-03-04T23:18:12Z

mvcc/backend/backend.go

+	for i := 0; i < 100; i++ {
+		select {
+		case rch := <-b.concurrentReadTxCh:
+			rtx, err := b.db.Begin(false)


lazily create tx following each commit?

heyitsanthony · 2018-03-04T23:22:49Z

mvcc/index.go

+	for next {
+		i := 0
+
+		ti.RLock()


Might be worthwhile to wrap 'f' higher up so it has some way to communicate a continuation so items can be requested in 10k chunks, which would avoid having to break up the lock here-- the implicit loss of atomicity on the index is a little worrying.

heyitsanthony · 2018-03-04T23:59:08Z

mvcc/kvstore_txn.go

+	txlocked bool
+
+	// for creating concurrent read tx when the read is expensive.
+	b backend.Backend


it looks like there should be something like storeTxnReadLimited that wraps storeTxnRead

heyitsanthony · 2018-03-05T00:07:52Z

mvcc/kvstore_txn.go

@@ -133,6 +157,15 @@ func (tr *storeTxnRead) rangeKeys(key, end []byte, curRev int64, ro RangeOptions
 		limit = len(revpairs)
 	}

+	if limit > expensiveReadLimit && !tr.txlocked && tr.ro { // first expensive read in a read only transcation


e.g., storeTxnReadLimited could impose a limit+1 on kvindex.Revisions(). If the result is >limit, load by chunk from the committed tx-- it would split up the rlock hold times and spend less memory on revpairs

gyuho · 2018-05-24T23:37:53Z

Marking WIP. We will make sure to benchmark this, and merge before v3.4 release. Thanks!

wenjiaswe · 2018-10-15T18:13:55Z

/cc @jingyih

jpbetz · 2019-02-06T00:31:08Z

mvcc/kvstore_txn.go

@@ -134,6 +158,15 @@ func (tr *storeTxnRead) rangeKeys(key, end []byte, curRev int64, ro RangeOptions
 		limit = len(revpairs)
 	}

+	if limit > expensiveReadLimit && !tr.txlocked && tr.ro { // first expensive read in a read only transcation
+		// too many keys to range. upgrade the read transcation to concurrent read tx.
+		tr.tx = tr.b.CommittedReadTx()


When tr.tx is reassigned, does this leak a bolt.Tx? If it is rolledback, I was unable to figure out where.

Traced through the code. The original tx is manged by the backend and is rolled back as usual when it finishes the current batch iteration. This looks like it works correctly.

jpbetz · 2019-02-07T01:23:29Z

mvcc/kvstore_txn.go

@@ -46,7 +63,9 @@ func (tr *storeTxnRead) Range(key, end []byte, ro RangeOptions) (r *RangeResult,
 }

 func (tr *storeTxnRead) End() {
-	tr.tx.Unlock()
+	if tr.txlocked {
+		tr.tx.Unlock()


Can this be better encapsulated? Could tr.tx.Unlock() and tr.txlocked = false be coordinated by a function that tr provides so we have to remember both correctly everywhere? Same for below tr.tx.Lock() logic.

jpbetz · 2019-02-07T01:27:28Z

mvcc/kvstore_txn.go

@@ -134,6 +158,15 @@ func (tr *storeTxnRead) rangeKeys(key, end []byte, curRev int64, ro RangeOptions
 		limit = len(revpairs)
 	}

+	if limit > expensiveReadLimit && !tr.txlocked && tr.ro { // first expensive read in a read only transcation
+		// too many keys to range. upgrade the read transcation to concurrent read tx.


As an alternative to upgrading the tr.tx, would it make sense to have both a tr.tx and a tr.committedTx? This would limit what is put in the committed read transaction to only expensive reads.

nvm, once we perform a CommittedReadTx there is no point having two transactions.

xiang90 · 2019-02-11T22:06:13Z

@jpbetz it would be great if you can take over this patch :P. Thank you!

jpbetz · 2019-02-11T22:13:20Z

@xiang90 Sure. We mostly understand the surround code now, so @jingyih and I will continue this work. Next steps are to add some instrumentation, do some benchmarks and correctness validation, and then create an etcd image that we'll scalability test against kubernetes.

jpbetz · 2019-02-11T22:19:04Z

We also have started a design doc explaining the change that we'll send out for review shortly.

jpbetz · 2019-02-12T01:10:05Z

Ran a adhoc benchmark and the results look good.

Benchmark setup:

write 100k KVs with 256B keys and 1KB values
1 expensive read per second
100 cheap reads per second (<10keys)
20 writes per second (1 key)

Benchamrk runs:

Master (baseline)
This PR (with large concurrent read optimization)
Master, without expensive reads in mix (regression check baseline)
This PR, without any expensive reads in the mix (regression check)

Benchmark results:

PR vs. master are the same without concurrent reads, no signs of a regression
Without the fix, during expensive reads:
- ~90% of reads are 30x slower, 10% are much slower still (600ms, which is about the time it takes to run a single large concurrent read if nothing else is running)
- ~50% of writes are also 30x slower, and the rest are an even distribution between 100ms-600ms
With the fix, during expensive reads:
- all reads are about ~3x slower than when expensive reads are not running
- all writes are ~4x slower than when expensive reads are not running
With the fix, expensive reads are an expected 0-100ms slower, evenly distributed (which we would expect since we delay them by the 100ms batch interval)

This PR (with large concurrent read optimization)

Fast Read Response time histogram:
0.0007 [1] |
0.0080 [2878] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0154 [21] |
0.0228 [16] |
0.0301 [15] |
0.0375 [14] |
0.0448 [10] |
0.0522 [17] |
0.0596 [13] |
0.0669 [12] |
0.0743 [3] |

Write Response time histogram:
0.0004 [1] |
0.0079 [957] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0153 [11] |
0.0228 [5] |
0.0303 [4] |
0.0378 [4] |
0.0453 [6] |
0.0527 [5] |
0.0602 [2] |
0.0677 [4] |
0.0752 [1] |

Expensive Response time histogram:
2.2631 [1] |∎∎∎∎∎
2.3135 [3] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.3639 [6] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.4143 [5] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.4647 [7] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.5150 [8] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.5654 [1] |∎∎∎∎∎
2.6158 [3] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.6662 [2] |∎∎∎∎∎∎∎∎∎∎
2.7166 [2] |∎∎∎∎∎∎∎∎∎∎
2.7670 [2] |∎∎∎∎∎∎∎∎∎∎

Master (baseline)

Fast Read Response time histogram:
0.0007 [1] |
0.0643 [2608] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.1279 [1] |
0.1915 [0] |
0.2551 [0] |
0.3187 [0] |
0.3823 [6] |
0.4459 [40] |
0.5095 [145] |∎∎
0.5731 [158] |∎∎
0.6367 [41] |

Write Response time histogram:
0.0003 [1] |
0.0633 [607] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.1263 [37] |∎∎
0.1894 [50] |∎∎∎
0.2524 [51] |∎∎∎
0.3154 [51] |∎∎∎
0.3784 [47] |∎∎∎
0.4414 [53] |∎∎∎
0.5044 [57] |∎∎∎
0.5674 [40] |∎∎
0.6304 [6] |

Expensive Response time histogram:
2.0526 [1] |∎∎
2.1056 [0] |
2.1586 [1] |∎∎
2.2116 [1] |∎∎
2.2646 [1] |∎∎
2.3176 [3] |∎∎∎∎∎∎∎∎
2.3706 [7] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.4236 [15] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.4766 [6] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.5296 [2] |∎∎∎∎∎
2.5826 [3] |∎∎∎∎∎∎∎∎

Master without expensive reads in mix (regression check baseline)

Read Response time histogram:
0.0007 [1] |
0.0023 [2157] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0040 [632] |∎∎∎∎∎∎∎∎∎∎∎
0.0056 [95] |∎
0.0072 [42] |
0.0088 [44] |
0.0104 [10] |
0.0121 [12] |
0.0137 [2] |
0.0153 [3] |
0.0169 [2] |

Write Response time histogram:
0.0004 [1] |
0.0019 [946] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0034 [9] |
0.0049 [10] |
0.0064 [13] |
0.0079 [8] |
0.0094 [3] |
0.0108 [8] |
0.0123 [1] |
0.0138 [0] |
0.0153 [1] |

This PR, without any expensive reads in the mix (regression check)

Fast Response time histogram:
0.0008 [1] |
0.0023 [2638] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0038 [222] |∎∎∎
0.0054 [50] |
0.0069 [34] |
0.0085 [30] |
0.0100 [8] |
0.0116 [9] |
0.0131 [6] |
0.0146 [0] |
0.0162 [2] |

Write Response time histogram:
0.0005 [1] |
0.0019 [942] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0033 [12] |
0.0048 [13] |
0.0062 [9] |
0.0076 [8] |
0.0090 [3] |
0.0104 [9] |
0.0118 [1] |
0.0132 [0] |
0.0147 [2] |

xiang90 · 2019-02-12T02:28:46Z

which we would expect since we delay them by the 100ms batch interval

This is also configurable. 100ms might not be a good value for large cluster :P.

xiang90 · 2019-03-06T18:38:04Z

mvcc/index.go

@@ -91,10 +91,11 @@ func (ti *treeIndex) keyIndex(keyi *keyIndex) *keyIndex {
 func (ti *treeIndex) visit(key, end []byte, f func(ki *keyIndex)) {
 	keyi, endi := &keyIndex{key: key}, &keyIndex{key: end}

-	ti.RLock()
-	defer ti.RUnlock()
+	ti.Lock()


@jingyih i think this improvement is still needed? can you make a separate PR for this change?

Sounds good.

xiang90 commented Mar 4, 2018

View reviewed changes

gyuho reviewed Mar 4, 2018

View reviewed changes

heyitsanthony reviewed Mar 5, 2018

View reviewed changes

xiang90 mentioned this pull request Mar 30, 2018

mvcc: Clone for batch index compaction and shorten lock #9511

Merged

gyuho added area/performance stage/investigating labels Apr 12, 2018

xiang90 force-pushed the better_range branch 2 times, most recently from f98b76d to 6f5716d Compare May 24, 2018 19:02

*: allow fully concurrent large read

cdf23e3

xiang90 force-pushed the better_range branch from 6f5716d to cdf23e3 Compare May 24, 2018 21:46

gyuho added this to the etcd-v3.4 milestone May 24, 2018

gyuho added the Release-Note label May 24, 2018

gyuho self-assigned this May 24, 2018

gyuho added WIP and removed stage/investigating labels May 24, 2018

xiang90 mentioned this pull request Oct 11, 2018

etcdserver: read-only range request took too long #10174

Closed

jpbetz suggested changes Feb 7, 2019

View reviewed changes

jpbetz mentioned this pull request Feb 14, 2019

[WIP] *: Fully concurrent large reads #10472

Closed

jingyih mentioned this pull request Mar 6, 2019

mvcc: fully concurrent read #10523

Merged

xiang90 commented Mar 6, 2019

View reviewed changes

xiang90 closed this Jun 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

*: allow fully concurrent large read #9384

*: allow fully concurrent large read #9384

xiang90 commented Mar 4, 2018 •

edited

Loading

xiang90 commented Mar 4, 2018

xiang90 Mar 4, 2018

heyitsanthony Mar 4, 2018

codecov-io commented Mar 4, 2018 •

edited

Loading

gyuho Mar 4, 2018

jpbetz Feb 7, 2019

heyitsanthony left a comment

heyitsanthony Mar 4, 2018

xiang90 Mar 5, 2018

heyitsanthony Mar 4, 2018

heyitsanthony Mar 4, 2018

heyitsanthony Mar 4, 2018

heyitsanthony Mar 4, 2018

heyitsanthony Mar 5, 2018

gyuho commented May 24, 2018

wenjiaswe commented Oct 15, 2018

jpbetz Feb 6, 2019

jpbetz Feb 11, 2019

jpbetz Feb 7, 2019

jpbetz Feb 7, 2019

jpbetz Feb 7, 2019

xiang90 commented Feb 11, 2019

jpbetz commented Feb 11, 2019

jpbetz commented Feb 11, 2019

jpbetz commented Feb 12, 2019

xiang90 commented Feb 12, 2019

xiang90 Mar 6, 2019

jingyih Mar 6, 2019 •

edited

Loading

		@@ -95,6 +100,8 @@ type backend struct {

		readTx *readTx

		concurrentReadTxCh chan chan ReadTx

*: allow fully concurrent large read #9384

*: allow fully concurrent large read #9384

Conversation

xiang90 commented Mar 4, 2018 • edited Loading

xiang90 commented Mar 4, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-io commented Mar 4, 2018 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

heyitsanthony left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gyuho commented May 24, 2018

wenjiaswe commented Oct 15, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xiang90 commented Feb 11, 2019

jpbetz commented Feb 11, 2019

jpbetz commented Feb 11, 2019

jpbetz commented Feb 12, 2019

This PR (with large concurrent read optimization)

Master (baseline)

Master without expensive reads in mix (regression check baseline)

This PR, without any expensive reads in the mix (regression check)

xiang90 commented Feb 12, 2019

Choose a reason for hiding this comment

jingyih Mar 6, 2019 • edited Loading

Choose a reason for hiding this comment

xiang90 commented Mar 4, 2018 •

edited

Loading

codecov-io commented Mar 4, 2018 •

edited

Loading

jingyih Mar 6, 2019 •

edited

Loading