Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: allow fully concurrent large read #9384

Closed
wants to merge 1 commit into from

Conversation

xiang90
Copy link
Contributor

@xiang90 xiang90 commented Mar 4, 2018

This PR fully solve the large read blocking issue.

Previously, a large read can block put for seconds. With this patch, there is no blocking at all.

/cc @heyitsanthony

@xiang90
Copy link
Contributor Author

xiang90 commented Mar 4, 2018

/cc @heyitsanthony hmm... cc'd wrong person initially. haha.

mvcc/index.go Outdated
for next {
i := 0

ti.RLock()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heyitsanthony This is for reducing locking time. Or read will still block write for 100ms when ranging over the index. Luckily, our index itself is MVCC. So we can release and acquire again as we wish.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worthwhile to wrap 'f' higher up so it has some way to communicate a continuation so items can be requested in 10k chunks, which would avoid having to break up the lock here-- the implicit loss of atomicity on the index is a little worrying.

@codecov-io
Copy link

codecov-io commented Mar 4, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@20cf7f4). Click here to learn what that means.
The diff coverage is 88.46%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #9384   +/-   ##
=========================================
  Coverage          ?   69.63%           
=========================================
  Files             ?      376           
  Lines             ?    35247           
  Branches          ?        0           
=========================================
  Hits              ?    24545           
  Misses            ?     8942           
  Partials          ?     1760
Impacted Files Coverage Δ
etcdserver/v3_server.go 76.73% <ø> (ø)
mvcc/index.go 91.42% <100%> (ø)
mvcc/backend/concurrent_read_tx.go 77.77% <77.77%> (ø)
mvcc/backend/backend.go 73.51% <87.5%> (ø)
mvcc/kvstore_txn.go 67.64% <91.3%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 20cf7f4...6f5716d. Read the comment docs.

@@ -46,7 +63,9 @@ func (tr *storeTxnRead) Range(key, end []byte, ro RangeOptions) (r *RangeResult,
}

func (tr *storeTxnRead) End() {
tr.tx.Unlock()
if tr.txlocked {
tr.tx.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tr.txlocked = false?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be better encapsulated? Could tr.tx.Unlock() and tr.txlocked = false be coordinated by a function that tr provides so we have to remember both correctly everywhere? Same for below tr.tx.Lock() logic.

Copy link
Contributor

@heyitsanthony heyitsanthony left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some suggestions

// ConcurrentReadTx returns a non-blocking read tx that is suitable for large reads.
// ConcurrentReadTx call itself will not return until the current BatchTx gets committed to
// ensure consistency.
ConcurrentReadTx() ReadTx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StaleReadTx? CommittedReadTx? Concurrent doesn't really tell me what it is (or makes me think it is the thread-safe one)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will rename.

@@ -95,6 +100,8 @@ type backend struct {

readTx *readTx

concurrentReadTxCh chan chan ReadTx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to avoid O(n) broadcast here; can do O(1) by returning <-chan struct{} that closes on the next commit, wait on channel close, fetch the latest committed tx, then acquire some semaphore to limit concurrency. Could return a closed channel if the backend is already fully committed, so no need to wait on commit timer.

for i := 0; i < 100; i++ {
select {
case rch := <-b.concurrentReadTxCh:
rtx, err := b.db.Begin(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lazily create tx following each commit?

mvcc/index.go Outdated
for next {
i := 0

ti.RLock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worthwhile to wrap 'f' higher up so it has some way to communicate a continuation so items can be requested in 10k chunks, which would avoid having to break up the lock here-- the implicit loss of atomicity on the index is a little worrying.

txlocked bool

// for creating concurrent read tx when the read is expensive.
b backend.Backend
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like there should be something like storeTxnReadLimited that wraps storeTxnRead

@@ -133,6 +157,15 @@ func (tr *storeTxnRead) rangeKeys(key, end []byte, curRev int64, ro RangeOptions
limit = len(revpairs)
}

if limit > expensiveReadLimit && !tr.txlocked && tr.ro { // first expensive read in a read only transcation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g., storeTxnReadLimited could impose a limit+1 on kvindex.Revisions(). If the result is >limit, load by chunk from the committed tx-- it would split up the rlock hold times and spend less memory on revpairs

@gyuho gyuho added this to the etcd-v3.4 milestone May 24, 2018
@gyuho gyuho self-assigned this May 24, 2018
@gyuho
Copy link
Contributor

gyuho commented May 24, 2018

Marking WIP. We will make sure to benchmark this, and merge before v3.4 release. Thanks!

@wenjiaswe
Copy link
Contributor

/cc @jingyih

@@ -134,6 +158,15 @@ func (tr *storeTxnRead) rangeKeys(key, end []byte, curRev int64, ro RangeOptions
limit = len(revpairs)
}

if limit > expensiveReadLimit && !tr.txlocked && tr.ro { // first expensive read in a read only transcation
// too many keys to range. upgrade the read transcation to concurrent read tx.
tr.tx = tr.b.CommittedReadTx()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When tr.tx is reassigned, does this leak a bolt.Tx? If it is rolledback, I was unable to figure out where.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Traced through the code. The original tx is manged by the backend and is rolled back as usual when it finishes the current batch iteration. This looks like it works correctly.

@@ -46,7 +63,9 @@ func (tr *storeTxnRead) Range(key, end []byte, ro RangeOptions) (r *RangeResult,
}

func (tr *storeTxnRead) End() {
tr.tx.Unlock()
if tr.txlocked {
tr.tx.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be better encapsulated? Could tr.tx.Unlock() and tr.txlocked = false be coordinated by a function that tr provides so we have to remember both correctly everywhere? Same for below tr.tx.Lock() logic.

@@ -134,6 +158,15 @@ func (tr *storeTxnRead) rangeKeys(key, end []byte, curRev int64, ro RangeOptions
limit = len(revpairs)
}

if limit > expensiveReadLimit && !tr.txlocked && tr.ro { // first expensive read in a read only transcation
// too many keys to range. upgrade the read transcation to concurrent read tx.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an alternative to upgrading the tr.tx, would it make sense to have both a tr.tx and a tr.committedTx? This would limit what is put in the committed read transaction to only expensive reads.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, once we perform a CommittedReadTx there is no point having two transactions.

@xiang90
Copy link
Contributor Author

xiang90 commented Feb 11, 2019

@jpbetz it would be great if you can take over this patch :P. Thank you!

@jpbetz
Copy link
Contributor

jpbetz commented Feb 11, 2019

@xiang90 Sure. We mostly understand the surround code now, so @jingyih and I will continue this work. Next steps are to add some instrumentation, do some benchmarks and correctness validation, and then create an etcd image that we'll scalability test against kubernetes.

@jpbetz
Copy link
Contributor

jpbetz commented Feb 11, 2019

We also have started a design doc explaining the change that we'll send out for review shortly.

@jpbetz
Copy link
Contributor

jpbetz commented Feb 12, 2019

Ran a adhoc benchmark and the results look good.

Benchmark setup:

  • write 100k KVs with 256B keys and 1KB values
  • 1 expensive read per second
  • 100 cheap reads per second (<10keys)
  • 20 writes per second (1 key)

Benchamrk runs:

  • Master (baseline)
  • This PR (with large concurrent read optimization)
  • Master, without expensive reads in mix (regression check baseline)
  • This PR, without any expensive reads in the mix (regression check)

Benchmark results:

  • PR vs. master are the same without concurrent reads, no signs of a regression
  • Without the fix, during expensive reads:
    • ~90% of reads are 30x slower, 10% are much slower still (600ms, which is about the time it takes to run a single large concurrent read if nothing else is running)
    • ~50% of writes are also 30x slower, and the rest are an even distribution between 100ms-600ms
  • With the fix, during expensive reads:
    • all reads are about ~3x slower than when expensive reads are not running
    • all writes are ~4x slower than when expensive reads are not running
  • With the fix, expensive reads are an expected 0-100ms slower, evenly distributed (which we would expect since we delay them by the 100ms batch interval)

This PR (with large concurrent read optimization)

Fast Read Response time histogram:
0.0007 [1] |
0.0080 [2878] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0154 [21] |
0.0228 [16] |
0.0301 [15] |
0.0375 [14] |
0.0448 [10] |
0.0522 [17] |
0.0596 [13] |
0.0669 [12] |
0.0743 [3] |

Write Response time histogram:
0.0004 [1] |
0.0079 [957] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0153 [11] |
0.0228 [5] |
0.0303 [4] |
0.0378 [4] |
0.0453 [6] |
0.0527 [5] |
0.0602 [2] |
0.0677 [4] |
0.0752 [1] |

Expensive Response time histogram:
2.2631 [1] |∎∎∎∎∎
2.3135 [3] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.3639 [6] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.4143 [5] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.4647 [7] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.5150 [8] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.5654 [1] |∎∎∎∎∎
2.6158 [3] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.6662 [2] |∎∎∎∎∎∎∎∎∎∎
2.7166 [2] |∎∎∎∎∎∎∎∎∎∎
2.7670 [2] |∎∎∎∎∎∎∎∎∎∎

Master (baseline)

Fast Read Response time histogram:
0.0007 [1] |
0.0643 [2608] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.1279 [1] |
0.1915 [0] |
0.2551 [0] |
0.3187 [0] |
0.3823 [6] |
0.4459 [40] |
0.5095 [145] |∎∎
0.5731 [158] |∎∎
0.6367 [41] |

Write Response time histogram:
0.0003 [1] |
0.0633 [607] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.1263 [37] |∎∎
0.1894 [50] |∎∎∎
0.2524 [51] |∎∎∎
0.3154 [51] |∎∎∎
0.3784 [47] |∎∎∎
0.4414 [53] |∎∎∎
0.5044 [57] |∎∎∎
0.5674 [40] |∎∎
0.6304 [6] |

Expensive Response time histogram:
2.0526 [1] |∎∎
2.1056 [0] |
2.1586 [1] |∎∎
2.2116 [1] |∎∎
2.2646 [1] |∎∎
2.3176 [3] |∎∎∎∎∎∎∎∎
2.3706 [7] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.4236 [15] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.4766 [6] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.5296 [2] |∎∎∎∎∎
2.5826 [3] |∎∎∎∎∎∎∎∎

Master without expensive reads in mix (regression check baseline)

Read Response time histogram:
0.0007 [1] |
0.0023 [2157] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0040 [632] |∎∎∎∎∎∎∎∎∎∎∎
0.0056 [95] |∎
0.0072 [42] |
0.0088 [44] |
0.0104 [10] |
0.0121 [12] |
0.0137 [2] |
0.0153 [3] |
0.0169 [2] |

Write Response time histogram:
0.0004 [1] |
0.0019 [946] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0034 [9] |
0.0049 [10] |
0.0064 [13] |
0.0079 [8] |
0.0094 [3] |
0.0108 [8] |
0.0123 [1] |
0.0138 [0] |
0.0153 [1] |

This PR, without any expensive reads in the mix (regression check)

Fast Response time histogram:
0.0008 [1] |
0.0023 [2638] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0038 [222] |∎∎∎
0.0054 [50] |
0.0069 [34] |
0.0085 [30] |
0.0100 [8] |
0.0116 [9] |
0.0131 [6] |
0.0146 [0] |
0.0162 [2] |

Write Response time histogram:
0.0005 [1] |
0.0019 [942] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0033 [12] |
0.0048 [13] |
0.0062 [9] |
0.0076 [8] |
0.0090 [3] |
0.0104 [9] |
0.0118 [1] |
0.0132 [0] |
0.0147 [2] |

@xiang90
Copy link
Contributor Author

xiang90 commented Feb 12, 2019

  • which we would expect since we delay them by the 100ms batch interval

This is also configurable. 100ms might not be a good value for large cluster :P.

@@ -91,10 +91,11 @@ func (ti *treeIndex) keyIndex(keyi *keyIndex) *keyIndex {
func (ti *treeIndex) visit(key, end []byte, f func(ki *keyIndex)) {
keyi, endi := &keyIndex{key: key}, &keyIndex{key: end}

ti.RLock()
defer ti.RUnlock()
ti.Lock()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jingyih i think this improvement is still needed? can you make a separate PR for this change?

Copy link
Contributor

@jingyih jingyih Mar 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@xiang90 xiang90 closed this Jun 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment