
src: avoid allocation in the Size method for BASE64URL and BASE64 #53550

Open · wants to merge 9 commits into main
Conversation

@lemire (Member) commented Jun 22, 2024

For large base64 strings, this PR multiplies the performance of Buffer.from(..., "base64") by making the Size function simpler and non-allocating.

This reduces the gap with Bun for large inputs: from over 3x slower to only about 40% slower. For modest inputs, Node.js is still about 2x slower than Bun. Note that both Bun and Node.js use the same underlying library (simdutf) for base64 decoding, so any difference is entirely due to the runtime (not to the base64 decoding per se).

Benchmark (from Bun):

import { bench, run } from "mitata";
function makeBenchmark(size, isToString) {
  const base64Input = Buffer.alloc(size, "latin1").toString("base64");
  const base64From = Buffer.from(base64Input, "base64"); // warm-up decode (result unused)
  bench(`Buffer.from(${size} bytes, 'base64')`, () => {
    Buffer.from(base64Input, "base64");
  });
}
[128, 1024, 32 * 1024, 1024 * 1024 * 8].forEach(s => makeBenchmark(s, false));
await run();
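
To reproduce (setup details are an assumption, not stated in the PR): install mitata with npm install mitata, save the script as bench.mjs, and run it with node bench.mjs.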

Node.js 22:

cpu: Apple M2
runtime: node v22.3.0 (arm64-darwin)

benchmark                                  time (avg)             (min … max)       p75       p99      p999
----------------------------------------------------------------------------- -----------------------------
Buffer.from(128 bytes, 'base64')          110 ns/iter       (101 ns … 457 ns)    110 ns    140 ns    208 ns
Buffer.from(1024 bytes, 'base64')         306 ns/iter       (239 ns … 453 ns)    323 ns    426 ns    453 ns
Buffer.from(32768 bytes, 'base64')      9'087 ns/iter     (7'333 ns … 317 µs)  8'709 ns 27'750 ns 51'041 ns
Buffer.from(8388608 bytes, 'base64')    3'273 µs/iter   (3'144 µs … 3'564 µs)  3'387 µs  3'546 µs  3'564 µs

Node.js with this PR:

cpu: Apple M2
runtime: node v23.0.0-pre (arm64-darwin)

benchmark                                  time (avg)             (min … max)       p75       p99      p999
----------------------------------------------------------------------------- -----------------------------
Buffer.from(128 bytes, 'base64')          111 ns/iter       (101 ns … 470 ns)    112 ns    145 ns    224 ns
Buffer.from(1024 bytes, 'base64')         308 ns/iter       (256 ns … 475 ns)    312 ns    452 ns    475 ns
Buffer.from(32768 bytes, 'base64')      7'384 ns/iter  (7'091 ns … 12'399 ns)  7'268 ns 11'472 ns 12'399 ns
Buffer.from(8388608 bytes, 'base64')    1'431 µs/iter   (1'358 µs … 2'034 µs)  1'453 µs  1'627 µs  2'034 µs

Bun canary (upcoming release):

cpu: Apple M2
runtime: bun 1.1.16 (arm64-darwin)

benchmark                                  time (avg)             (min … max)       p75       p99      p999
----------------------------------------------------------------------------- -----------------------------
Buffer.from(128 bytes, 'base64')          115 ns/iter   (89.42 ns … 1'054 ns)  92.67 ns    880 ns  1'034 ns
Buffer.from(1024 bytes, 'base64')         226 ns/iter       (174 ns … 923 ns)    184 ns    720 ns    913 ns
Buffer.from(32768 bytes, 'base64')      4'045 ns/iter   (3'904 ns … 4'504 ns)  4'083 ns  4'404 ns  4'504 ns
Buffer.from(8388608 bytes, 'base64')      998 µs/iter     (861 µs … 1'590 µs)  1'165 µs  1'460 µs  1'590 µs

To get the Bun results, you need the canary, which you can get with bun upgrade --canary.

@nodejs-github-bot added the buffer, c++, and needs-ci labels Jun 22, 2024
@lemire added the request-ci label Jun 22, 2024
@github-actions removed the request-ci label Jun 22, 2024
@anonrig requested review from @mcollina and @jasnell, June 22, 2024 17:35
@lemire (Member, Author) commented Jun 22, 2024

For people who care about history, Size has been expensive since the very beginning (we trace it back to the initial commit by @isaacs in 2013):

     case BASE64: {
       String::AsciiValue value(str);
       data_size = base64_decoded_size(*value, value.length());
       break;
     }
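
For contrast, the replacement in this PR derives an upper bound from the character count alone, with no copy of the string contents (a sketch of the idea; see the diff for the exact final form):

     case BASE64: {
       // no String::AsciiValue copy: the bound follows from the length alone
       size_t length = str->Length();
       data_size = length % 4 <= 1 ? length / 4 * 3
                                   : length / 4 * 3 + (length % 4) - 1;
       break;
     }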
@lemire requested a review from @anonrig, June 22, 2024 18:52
@mcollina (Member) left a comment:

lgtm

@mcollina added the request-ci label Jun 22, 2024
@github-actions removed the request-ci label Jun 22, 2024

@mildsunrise (Member) commented Jun 22, 2024

for the record, Size() wasn't just doing an unnecessary allocation, it was also returning the wrong size, causing us to allocate a much larger buffer than necessary. simdutf::base64_length_from_binary calculates the length after encoding; what we need here is the reverse. @lemire has replaced the code with this calculation:

str->Length() % 4 <= 1 ? str->Length() / 4 * 3
    : str->Length() / 4 * 3 + (str->Length() % 4) - 1

which is only an upper bound, as it doesn't look at the actual data, and so it assumes the worst case (all characters are data characters, i.e. no padding or whitespace). the math looks correct to me.
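
as a quick sanity check of that math, here is a hypothetical standalone test (not part of the PR) exercising the bound:

#include <cassert>
#include <cstddef>

// upper bound on the decoded size of a base64 input of `len` characters,
// mirroring the expression above
static size_t max_decoded_size(size_t len) {
  return len % 4 <= 1 ? len / 4 * 3
                      : len / 4 * 3 + (len % 4) - 1;
}

int main() {
  assert(max_decoded_size(4) == 3);  // "TWFu" decodes to "Man" (3 bytes)
  assert(max_decoded_size(3) == 2);  // "TWE" (padding stripped) decodes to "Ma"
  assert(max_decoded_size(2) == 1);  // "TQ" (padding stripped) decodes to "M"
  assert(max_decoded_size(1) == 0);  // a lone trailing character yields no byte
}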

bear in mind that while Size() is allowed to return an upper bound, it is expected to return an exact prediction most of the time. if this is not the case, we do another allocation + memcpy to a new backing store with the actual size. for base64url the prediction is usually correct as it rarely has padding, but for base64 it will miss 2/3 of the time (or always, if the input contains whitespace).
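
a minimal sketch of that fallback path (hypothetical names throughout: decode_into and predicted_size stand in for the real node internals):

#include <cstring>
#include <memory>
#include "v8.h"

// sketch only: decode_into() stands in for the simdutf decoding call, and
// predicted_size is what Size() returned.
std::unique_ptr<v8::BackingStore> DecodeWithShrink(v8::Isolate* isolate,
                                                   size_t predicted_size) {
  std::unique_ptr<v8::BackingStore> store =
      v8::ArrayBuffer::NewBackingStore(isolate, predicted_size);
  size_t actual_size = decode_into(store->Data(), predicted_size);
  if (actual_size < predicted_size) {
    // the prediction missed: pay for a second allocation plus a memcpy
    std::unique_ptr<v8::BackingStore> exact =
        v8::ArrayBuffer::NewBackingStore(isolate, actual_size);
    memcpy(exact->Data(), store->Data(), actual_size);
    store = std::move(exact);
  }
  return store;
}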

@lemire (Member, Author) commented Jun 22, 2024

@mildsunrise

As far as I can tell, Node never returned the exact size. Doing so requires scanning the entire input, checking for characters to discard.

@mildsunrise (Member):

wasn't base64_decoded_size supposed to do exactly that?

@mildsunrise (Member):

even if it's not worth it to provide an exact prediction (like you said, it would require scanning the full input), we can still make the prediction exact for "pure" base64 (i.e. containing padding but no whitespace), which is pretty common. String::Write() takes a start argument, so we should be able to efficiently grab the last 2 characters and discard any padding that is present before doing the calculation

@lemire (Member, Author) commented Jun 23, 2024

wasn't base64_decoded_size supposed to do exactly that?

It depends on what you mean by 'supposed'. Maybe that was the intention, but that's not what it did.

It was very simple:

template <typename TypeName>
size_t base64_decoded_size(const TypeName* src, size_t size) {
  // 1-byte input cannot be decoded
  if (size < 2)
    return 0;

  if (src[size - 1] == '=') {
    size--;
    if (src[size - 1] == '=')
      size--;
  }
  return base64_decoded_size_fast(size);
}

prediction is usually correct as it rarely has padding, but for base64 it will miss 2/3 of the time (or always, if the input contains whitespace).

It is an effect which we can measure. Consider this benchmark...

import { bench, run } from "mitata";
function makeBenchmark(size, isToString) {
  const base64Input = Buffer.alloc(size, "latin1").toString("base64");
  const base64Input1 = Buffer.alloc(size - 1, "latin1").toString("base64");
  const base64Input2 = Buffer.alloc(size - 2, "latin1").toString("base64");
  console.log(base64Input.length + " " + base64Input2.length);
  bench(`Buffer.from(${size} bytes, 'base64')`, () => {
    Buffer.from(base64Input, "base64");
  });
  bench(`Buffer.from(${size - 1} bytes, 'base64')`, () => {
    Buffer.from(base64Input1, "base64");
  });
  bench(`Buffer.from(${size - 2} bytes, 'base64')`, () => {
    Buffer.from(base64Input2, "base64");
  });
}
[1024 * 1024 * 8].forEach(s => makeBenchmark(s, false));
await run();

The way it is designed, only one of the three runs will avoid overallocating (with the current PR).

[Screenshot: benchmark results, 2024-06-22]

So it is about 15%.

so we should be able to efficiently grab the last 2 characters and discard any padding that is present before doing the calculation

Let us do that. I pushed a commit to handle the simple case. It can get arbitrarily fancier.
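
roughly, the added check looks like this (a simplified sketch of the commit, assuming the external one-byte fast path; not the verbatim diff):

// only the external one-byte representation exposes its bytes without a
// copy, so only that case inspects the trailing padding characters
if (str->IsExternalOneByte()) {
  const v8::String::ExternalOneByteStringResource* ext =
      str->GetExternalOneByteStringResource();
  const char* data = ext->data();
  const size_t length = ext->length();
  if (length > 1 && data[length - 1] == '=') {
    data_size--;
    if (data[length - 2] == '=') data_size--;
  }
}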

@mildsunrise (Member) commented Jun 23, 2024

It depends what you mean by 'supposed'. Maybe that was the intention, but that's not what it did. [...]

ah, i see, they had the same idea...
edit: for the record, the JS version does the same thing

So it is about 15%.

80-90% on my i7 Linux laptop:

cpu: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
runtime: node v23.0.0-pre (x64-linux)

benchmark                                 time (avg)             (min … max)       p75       p99      p999
---------------------------------------------------------------------------- -----------------------------
Buffer.from(8388608 bytes, 'base64')   3'401 µs/iter   (2'824 µs … 4'462 µs)  3'527 µs  4'382 µs  4'462 µs
Buffer.from(8388607 bytes, 'base64')   3'559 µs/iter   (3'117 µs … 6'086 µs)  3'521 µs  5'503 µs  6'086 µs
Buffer.from(8388606 bytes, 'base64')   1'800 µs/iter   (1'535 µs … 4'361 µs)  1'763 µs  3'679 µs  4'361 µs

Let us do that. I pushed a commit to handle the simple case. It can get arbitrarily fancier.

thanks, that was exactly what I had in mind. if we can efficiently handle a 'normal' input (as in, produced by toString('base64') or the equivalent in most languages) I'm happy :)

I don't think it needs to be any fancier, but I don't see a need to specialize it to external 8-bit strings. as said, Write() takes a start argument, so we can check the last 2 chars without copying the whole string, i.e.

if (str->Length() >= 2) {
  uint16_t trailing[2];
  int written = str->Write(isolate, trailing, str->Length() - 2, 2);
  // ...
}

are you concerned it may still be too heavy of an operation to be worth it?

@lemire (Member, Author) commented Jun 23, 2024

@mildsunrise I kept it simple deliberately. The use case we care about in the benchmark I provided is the external one-byte case.

It is highly likely that other cases could be optimized. It would be good to have corresponding benchmarks. I submit to you that it can be done in distinct PRs.

@lemire added the author ready label Jun 23, 2024
@anonrig added the request-ci label Jun 23, 2024
@github-actions removed the request-ci label Jun 23, 2024
@mildsunrise (Member):

looks good to me, I'll open another PR after this one lands and we can continue the discussion there :)

  case BASE64: {
-   String::Value value(isolate, str);
-   return Just(simdutf::base64_length_from_binary(value.length()));
+   size_t data_size = str->Length() % 4 <= 1
@jasnell (Member) commented on the diff:

Given the duplication here, any chance of moving this into a separate helper function?

@lemire (Member, Author) commented Jun 23, 2024

@mildsunrise

looks good to me, I'll open another PR after this one lands and we can continue the discussion there :)

❤️

@jasnell

Given the duplication here, any chance of moving this into a separate helper function?

This is, of course, a very valid remark.

It is clear that @mildsunrise is planning to do another pass over this code as soon as it lands.

My suggestion is to leave it like it is now and just wait for @mildsunrise to improve it.
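
For concreteness, such a helper could look like this (a hypothetical sketch, not code from the PR):

static inline size_t Base64DecodedUpperBound(size_t length) {
  // upper bound on the decoded size; same expression the PR uses in Size()
  return length % 4 <= 1 ? length / 4 * 3
                         : length / 4 * 3 + (length % 4) - 1;
}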

@mildsunrise (Member):

I'm fine with both options

@mildsunrise (Member):

also, we can discuss this in the next PR, but I wanted to point it out now just in case: we currently only create external strings when their length is above ~1MB. so, for the test cases of 512k and below, the string is actually internal, and thus we do 2 extra copies + allocs.

@kurtextrem:

Not fully on-topic, but also not entirely off-topic: For those not active on Twitter but watching this issue, Leszek Swirski who works on V8 made a thread, mentioning this issue and what can be improved from V8 side too: https://x.com/leszekswirski/status/1805170432753934644.

@lemire (Member, Author) commented Jun 24, 2024

I bet that @mildsunrise can help boost the speed further without any update to v8. :-)

But... yeah... help from v8 is probably needed to get the best possible performance.

@mildsunrise (Member) commented Jun 24, 2024

yes, I don't think we need any update to v8 to get on par with Bun in this specific benchmark, because the strings are guaranteed to be flat. indeed, I was able to get these results by lowering EXTERN_APEX a bit (in addition to the changes already in this PR):

[Screenshot: benchmark results with a lowered EXTERN_APEX, 2024-06-23]

but having V8 expose an iterator over non-flat strings (like the one it already uses internally) would allow us to get the best performance in all cases (except when whitespace is present).

edit: not that I'm actually planning on touching that constant... it was chosen 11 years ago through a microbenchmark very similar to this one, and I don't feel knowledgeable enough to recommend tweaking it just based off another microbenchmark. I can only confirm that GC activity shoots up at 512kB otherwise, but this might be understood as a microbenchmark deficiency

@LeszekSwirski:

Not fully on-topic, but also not entirely off-topic: For those not active on Twitter but watching this issue, Leszek Swirski who works on V8 made a thread, mentioning this issue and what can be improved from V8 side too: https://x.com/leszekswirski/status/1805170432753934644.

Indeed. I've landed a non-copying v8::String::ValueView API in V8 now, so you should be able to use this once you roll V8 past 12.8.218. At that point this simplifies to something like:

String::ValueView view(isolate, str);
data_size = view.length() % 4 <= 1
                ? view.length() / 4 * 3
                : view.length() / 4 * 3 + (view.length() % 4) - 1;
// Check if the string ends with one or two padding characters and adjust the
// size accordingly. Note that the input can contain non-base64 characters, so,
// at best, we can provide an upper bound. A correct size would require
// scanning the entire input. We try to keep the function as fast as possible,
// so we only check the last two characters when the string is one-byte.
if (view.is_one_byte()) {
  if (view.length() > 1) {
    if (view.data8()[view.length() - 1] == '=') {
      data_size--;
      if (view.data8()[view.length() - 2] == '=') {
        data_size--;
      }
    }
  }
}

(I don't include a padding character check for two-byte strings, since the length estimate will likely be wrong anyway due to the almost guaranteed presence of non-base64 characters in the string.)

Labels: author ready, buffer, c++, needs-ci
8 participants