
Feat: Threaded MutationsBatcher #722

Merged: 24 commits into main, May 10, 2023

Conversation

@Mariatta (Contributor) commented Jan 12, 2023:

  • Batch mutations in a thread to allow concurrent batching
  • Flush the batch every second
  • Flow control
  • The batcher can now be used as a context manager (see the usage sketch below)
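
A minimal usage sketch of the threaded batcher (assuming the existing table.mutations_batcher() factory and DirectRow API; the flush_interval argument is this PR's addition, and exact parameter names may differ):

from google.cloud import bigtable

client = bigtable.Client(project="my-project")  # illustrative project id
table = client.instance("my-instance").table("my-table")

# Batches are flushed when count/byte thresholds are hit, on the periodic
# one-second flush interval, and once more when the context manager exits.
with table.mutations_batcher(flush_count=100, flush_interval=1) as batcher:
    for i in range(1000):
        row = table.direct_row("row-key-{}".format(i).encode())
        row.set_cell("cf1", b"col", b"value")
        batcher.mutate(row)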

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigtable Issues related to the googleapis/python-bigtable API. size: l Pull request size is large. and removed size: m Pull request size is medium. labels Jan 12, 2023
@Mariatta Mariatta marked this pull request as ready for review January 17, 2023 19:28
@Mariatta Mariatta requested review from a team as code owners January 17, 2023 19:28

mutation_count = len(item._get_mutations())

if mutation_count > MAX_MUTATIONS:
Contributor:

What's the MAX_MUTATIONS constraint here? Is it checking the number of concurrent requests being sent to Bigtable? I don't think throwing an error is the correct behavior here. Instead we should block on adding more elements to the batcher.

Contributor Author:

This is the max number of mutations for a single row (100,000). In this case I think it should raise an error, since we shouldn't split the mutations for one row.

Contributor Author:

Blocking additional elements from entering the batcher is already handled by the Python Queue itself. If we try to queue more than 100 elements, the call blocks and nothing is added until items have been popped from the queue.
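
A small sketch of the blocking behavior the bounded queue provides (the maxsize of 100 follows the comment above; the consumer body is illustrative):

import queue
import threading

q = queue.Queue(maxsize=100)  # put() blocks once 100 items are in flight

def consumer():
    while True:
        item = q.get()  # blocks until an item is available
        # ... send the batch to Bigtable here ...
        q.task_done()

threading.Thread(target=consumer, daemon=True).start()

for i in range(1000):
    q.put(i)  # blocks while the queue is full, throttling the producer
q.join()      # wait until every queued item has been processed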

@igorbernstein2 (Contributor) left a comment:

It seems like we don't really have any flow control here. Do you plan on adding it in a follow-up PR?

Comment on lines 18 to 25
FLUSH_COUNT = 1000
FLUSH_COUNT = 100
MAX_MUTATIONS = 100000
MAX_ROW_BYTES = 5242880 # 5MB
MAX_ROW_BYTES = 20 * 1024 * 1024 # 20MB
MAX_MUTATIONS_SIZE = 100 * 1024 * 1024 # 100MB
Contributor:

Please add comments explaining what these constants control.
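
A sketch of what such comments might look like, based on how the constants are described later in this thread (the exact wording is illustrative):

FLUSH_COUNT = 100  # after this many rows accumulate, send out the batch
MAX_MUTATIONS = 100000  # max mutations allowed for a single row; exceeding this is an error
MAX_ROW_BYTES = 20 * 1024 * 1024  # 20MB; after this many bytes accumulate, send out the batch
MAX_MUTATIONS_SIZE = 100 * 1024 * 1024  # 100MB; max bytes in flight (flow control)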

self.total_size = 0
self.max_row_bytes = max_row_bytes

def get(self, block=True, timeout=None):
Contributor:

When would block=False be used?
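
For reference, the standard-library queue.Queue.get supports a non-blocking mode (a minimal illustration of its semantics):

import queue

q = queue.Queue()
q.put("row")

item = q.get(block=True, timeout=5)  # waits up to 5s for an item, else raises queue.Empty
try:
    q.get(block=False)               # returns immediately; raises queue.Empty when empty
except queue.Empty:
    print("queue drained")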

Comment on lines 277 to 279
exc = future.exception()
if exc:
raise exc
Contributor:

Will this throw the original exception, or will it wrap it? If it's the original exception, then the stack trace will be rooted in the executor. If it's wrapped, then you are leaking implementation details.

In Java we kept the original exception but added a synthetic stack trace of the caller to help callers diagnose where they called the failed RPC.

HBase does something even hackier: it modifies the stack trace and inserts the caller's stack at the top.

I don't know which approach is idiomatic in Python, but we should be intentional here.

Contributor Author:

This will re-raise the original exception. I will look into how to do that in Python.
Can you share a link to how it was done in Java?

Contributor Author:

In this case the exception will have the stack trace all the way to the original spot where it was raised (line 303 below). The exception itself carries a list of individual error codes, so the user can still go through it to find out what failed.
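
A small demonstration of the standard-library behavior being relied on here: re-raising the exception object returned by Future.exception() preserves the original traceback (via the exception's __traceback__), so the failure point inside the worker stays visible:

import concurrent.futures
import traceback

def failing_rpc():
    raise RuntimeError("mutation failed")  # stand-in for a failed Bigtable RPC

with concurrent.futures.ThreadPoolExecutor() as executor:
    future = executor.submit(failing_rpc)
    exc = future.exception()  # waits for completion; returns the exception or None
    if exc:
        try:
            raise exc  # re-raises with the original traceback attached
        except RuntimeError:
            traceback.print_exc()  # shows the frame inside failing_rpc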

responses.append(result)

if has_error:
raise MutationsBatchError(status_codes=response)
Contributor:

I noticed status codes can be converted into exceptions using google.api_core.exceptions.from_grpc_status. Maybe it would be better to raise those directly, to give the full context, rather than using the raw status codes?
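
A sketch of that conversion (assuming google-api-core's from_grpc_status, which maps a grpc.StatusCode plus a message to the matching GoogleAPICallError subclass; the status code and message are illustrative):

import grpc
from google.api_core import exceptions

# Map a raw status code from a MutateRows response entry to a rich exception.
exc = exceptions.from_grpc_status(
    grpc.StatusCode.DEADLINE_EXCEEDED, "row mutation timed out"
)
print(type(exc).__name__)  # DeadlineExceeded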


FLUSH_COUNT = 1000

# Max number of items in the queue. Queue will be flushed if this number is reached
Contributor:

I think these variable names are a bit confusing. Maybe we can refactor them to be:

batch_element_count # After this many elements are accumulated, they will be wrapped up in a batch and sent.
batch_byte_size # After this many bytes are accumulated, they will be wrapped up in a batch and sent.
max_outstanding_elements # After this many elements are sent, block until the previous batch is processed.
max_outstanding_bytes # After this many bytes are sent, block until the previous batch is processed.

What do you think?

if (
self.total_mutation_count >= MAX_MUTATIONS
or self.total_size >= self.max_row_bytes
or self._queue.full()
Contributor:

Why do we need to check if the queue is full? In the batcher code, we flush when the Batcher is full:

if self._rows.full():
    self.flush_async()

I'm not sure the behavior is still correct.

I think full() should only check for 2 things:

  • Number of elements reached the batch element threshold
  • Number of bytes reached the batch bytes threshold

The queue should only be used to block the user from adding more elements when it's full. We shouldn't trigger another flush when the queue is full.
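
A sketch of the narrower full() check this comment describes (threshold names follow the rename suggested earlier in this thread and are illustrative):

def full(self):
    # Only the batch thresholds decide fullness; queue capacity is handled
    # separately by the blocking put().
    return (
        self.total_mutation_count >= self.batch_element_count
        or self.total_size >= self.batch_byte_size
    )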

mutations_count = 0
mutations_size = 0
rows_count = 0
batch_info = BatchInfo()
Contributor:

I'm a bit confused by BatchInfo. Doesn't it duplicate the row_count, mutations_count and mutations_size variables?

@Mariatta (Contributor Author) commented Mar 29, 2023:

It's similar info, but for different "buckets".

There are two "queues":

self._rows: for storing the rows we want to mutate
batch_info: for storing info about the rows that are being mutated while we wait for the result/response from the backend. This gets passed to the batch_completed_callback; that's where we can release the flow control.
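
A sketch of that hand-off (a skeleton only; attribute and method names follow the identifiers quoted in this thread: futures_mapping, flow_control.release, batch_completed_callback; the wiring and helpers are illustrative):

class _Batcher:
    def __init__(self, flow_control, executor):
        self.flow_control = flow_control
        self._executor = executor
        self.futures_mapping = {}

    def _mutate_rows(self, rows):
        pass  # real implementation sends a MutateRows RPC

    def _submit_batch(self, rows, batch_info):  # hypothetical helper
        # Remember which rows are in flight for this future.
        future = self._executor.submit(self._mutate_rows, rows)
        self.futures_mapping[future] = batch_info
        future.add_done_callback(self.batch_completed_callback)

    def batch_completed_callback(self, future):
        # The backend responded: release the in-flight rows so blocked
        # producers can proceed.
        processed_rows = self.futures_mapping[future]
        self.flow_control.release(processed_rows)
        del self.futures_mapping[future]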

self.flow_control.release(processed_rows)
del self.futures_mapping[future]

def flush_rows(self, rows_to_flush=None):
Contributor:

Having both flush() and flush_rows() seems a little confusing, especially since this doesn't seem to be "flushing" from the cache in any way. I could see people calling this with no arguments, thinking it is the main flush function.

Maybe this should be called mutate_rows? Or just made into an internal helper function?

Contributor Author:

Agreed, this is confusing. There is already a public mutate_rows from before, and it has different behavior than this one. I'm changing it to private since users aren't expected to call it manually.


MAX_ROW_BYTES = 20 * 1024 * 1024 # 20MB # after this many bytes, send out the batch

MAX_MUTATIONS_SIZE = 100 * 1024 * 1024 # 100MB # max inflight byte size.
Contributor:

Maybe rename this variable to indicate it's for flow control.

Suggested change
MAX_MUTATIONS_SIZE = 100 * 1024 * 1024 # 100MB # max inflight byte size.
MAX_OUTSTANDING_BYTES = 100 * 1024 * 1024 # 100MB # max inflight byte size.


class FlowControl(object):
def __init__(self, max_mutations=MAX_MUTATIONS_SIZE, max_row_bytes=MAX_ROW_BYTES):
Contributor:

The defaults should be:

Suggested change
def __init__(self, max_mutations=MAX_MUTATIONS_SIZE, max_row_bytes=MAX_ROW_BYTES):
def __init__(self, max_mutations=MAX_OUTSTANDING_ELEMENTS, max_row_bytes=MAX_MUTATION_SIZE):
Contributor Author:

And these numbers are:
MAX_OUTSTANDING_ELEMENTS = 100000
MAX_MUTATION_SIZE = 20 MB
correct?


self.inflight_mutations += batch_info.mutations_count
self.inflight_size += batch_info.mutations_size
self.inflight_rows_count += batch_info.rows_count
Contributor:

I don't think we care about this; it's also not used in is_blocked.

Contributor Author:

I was using this for debugging. I will remove it before merging.

@Mariatta (Contributor Author):

I've made adjustments based on previous reviews and feedback. Please take another look.

@Mariatta Mariatta added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 29, 2023
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 29, 2023
@mutianf (Contributor) left a comment:

LGTM with some nits!

@Mariatta Mariatta mentioned this pull request Apr 5, 2023
4 tasks
@Mariatta Mariatta requested a review from a team as a code owner April 5, 2023 22:44
Mariatta and others added 8 commits April 5, 2023 16:07
Co-authored-by: Mattie Fu <mattiefu@google.com>
- Remove unneeded error
- Make some functions internal
Co-authored-by: Mattie Fu <mattiefu@google.com>
Co-authored-by: Mattie Fu <mattiefu@google.com>
- Remove debugging variable
- Update variable names
@Mariatta Mariatta added the snippet-bot:force-run Force snippet-bot runs its logic label Apr 6, 2023
@snippet-bot snippet-bot bot removed the snippet-bot:force-run Force snippet-bot runs its logic label Apr 6, 2023
@Mariatta Mariatta added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Apr 6, 2023
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Apr 6, 2023
mutations_size: int = 0


class FlowControl(object):
Contributor:

Should this be prefixed with an underscore since this is an internal class?

Contributor Author:

Sure, I will add that in the next commit.

Comment on lines 131 to 134
self.inflight_mutations += batch_info.mutations_count
self.inflight_size += batch_info.mutations_size
self.set_flow_control_status()
self.wait()
Contributor:

Won't this cause a deadlock with a large row? If max_inflight_bytes is 2 and the row size is 4, won't this just get stuck?

Contributor Author:

I adjusted the logic, moving the wait into the flush_async function. If a batch would exceed the flow-control limit, it is still sent through, but subsequent flushes block and wait.
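
A sketch of the adjusted ordering (illustrative only; the helper names here are hypothetical, and the real logic lives in flush_async and the FlowControl class):

def flush_async(self):
    # Drain accumulated rows into batches and submit each one.
    while not self._rows.empty():
        rows, batch_info = self._take_batch()        # hypothetical helper
        self.flow_control.control_flow(batch_info)   # record inflight rows/bytes
        future = self._executor.submit(self._mutate_rows, rows)
        self.futures_mapping[future] = batch_info
        future.add_done_callback(self.batch_completed_callback)
        # Wait *after* submitting: an oversized batch is still sent, so a
        # single large row cannot deadlock against the inflight limit, while
        # the next batch blocks here until enough bytes are released.
        self.flow_control.wait()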

@igorbernstein2 (Contributor) left a comment:

LGTM, but please have Daniel take a look at the API and confirm it works for him in the async client.

@daniel-sanche (Contributor):

Some of the API will need to change for the async client, due to different model classes and asyncio patterns, but the general shape of the solution should be mostly consistent. LGTM

@Mariatta Mariatta merged commit 7521a61 into main May 10, 2023
16 checks passed
@Mariatta Mariatta deleted the batcher-threaded branch May 10, 2023 22:42
mutianf added a commit that referenced this pull request May 11, 2023
gcf-merge-on-green bot pushed a commit that referenced this pull request May 11, 2023
Reverts #722

This PR caused Beam bigtableio.py failures (https://togithub.com/apache/beam/issues/26673) and is blocking the Beam release. We're unclear why it caused the failure, so we will revert this change and cut another release to unblock Beam, then investigate separately.