
src: avoid allocation in the Size method for BASE64URL and BASE64 #53550

Open · wants to merge 9 commits into main
Conversation

@lemire (Member) commented Jun 22, 2024

For large base64 strings, this PR multiplies the performance of Buffer.from(..., "base64") by making the Size function simpler and non-allocating.

This reduces the gap with Bun for large inputs: from over 3x slower to only about 40% slower. For modest inputs, Node.js is still about 2x slower than Bun. Note that both Bun and Node.js use the same underlying library (simdutf) for base64 decoding, so any difference is entirely due to the runtime (not to the base64 decoding per se).

Benchmark (from Bun):

import { bench, run } from "mitata";
function makeBenchmark(size, isToString) {
  const base64Input = Buffer.alloc(size, "latin1").toString("base64");
  const base64From = Buffer.from(base64Input, "base64"); // warm-up decode (result unused)
  bench(`Buffer.from(${size} bytes, 'base64')`, () => {
    Buffer.from(base64Input, "base64");
  });
}
[128, 1024, 32 * 1024, 1024 * 1024 * 8].forEach(s => makeBenchmark(s, false));
await run();
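
To reproduce (setup details are an assumption, not stated in the PR): install mitata with npm install mitata, save the script as bench.mjs, and run it with node bench.mjs.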

Node.js 22:

cpu: Apple M2
runtime: node v22.3.0 (arm64-darwin)

benchmark                                  time (avg)             (min … max)       p75       p99      p999
----------------------------------------------------------------------------- -----------------------------
Buffer.from(128 bytes, 'base64')          110 ns/iter       (101 ns … 457 ns)    110 ns    140 ns    208 ns
Buffer.from(1024 bytes, 'base64')         306 ns/iter       (239 ns … 453 ns)    323 ns    426 ns    453 ns
Buffer.from(32768 bytes, 'base64')      9'087 ns/iter     (7'333 ns … 317 µs)  8'709 ns 27'750 ns 51'041 ns
Buffer.from(8388608 bytes, 'base64')    3'273 µs/iter   (3'144 µs … 3'564 µs)  3'387 µs  3'546 µs  3'564 µs

Node.js with this PR:

cpu: Apple M2
runtime: node v23.0.0-pre (arm64-darwin)

benchmark                                  time (avg)             (min … max)       p75       p99      p999
----------------------------------------------------------------------------- -----------------------------
Buffer.from(128 bytes, 'base64')          111 ns/iter       (101 ns … 470 ns)    112 ns    145 ns    224 ns
Buffer.from(1024 bytes, 'base64')         308 ns/iter       (256 ns … 475 ns)    312 ns    452 ns    475 ns
Buffer.from(32768 bytes, 'base64')      7'384 ns/iter  (7'091 ns … 12'399 ns)  7'268 ns 11'472 ns 12'399 ns
Buffer.from(8388608 bytes, 'base64')    1'431 µs/iter   (1'358 µs … 2'034 µs)  1'453 µs  1'627 µs  2'034 µs

Bun canary (upcoming release):

cpu: Apple M2
runtime: bun 1.1.16 (arm64-darwin)

benchmark                                  time (avg)             (min … max)       p75       p99      p999
----------------------------------------------------------------------------- -----------------------------
Buffer.from(128 bytes, 'base64')          115 ns/iter   (89.42 ns … 1'054 ns)  92.67 ns    880 ns  1'034 ns
Buffer.from(1024 bytes, 'base64')         226 ns/iter       (174 ns … 923 ns)    184 ns    720 ns    913 ns
Buffer.from(32768 bytes, 'base64')      4'045 ns/iter   (3'904 ns … 4'504 ns)  4'083 ns  4'404 ns  4'504 ns
Buffer.from(8388608 bytes, 'base64')      998 µs/iter     (861 µs … 1'590 µs)  1'165 µs  1'460 µs  1'590 µs

To get the Bun results, you need the canary, which you can get with bun upgrade --canary.

@nodejs-github-bot added the buffer, c++, and needs-ci labels Jun 22, 2024
@lemire added the request-ci label Jun 22, 2024
@github-actions removed the request-ci label Jun 22, 2024
@anonrig requested review from @mcollina and @jasnell, June 22, 2024 17:35
@lemire (Member, Author) commented Jun 22, 2024

For people who care about history, Size has been expensive since the very beginning (we trace it back to the initial commit by @isaacs in 2013):

     case BASE64: {
       String::AsciiValue value(str);
       data_size = base64_decoded_size(*value, value.length());
       break;
     }
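
For contrast, the replacement in this PR derives an upper bound from the character count alone, with no copy of the string contents (a sketch of the idea; see the diff for the exact final form):

     case BASE64: {
       // no String::AsciiValue copy: the bound follows from the length alone
       size_t length = str->Length();
       data_size = length % 4 <= 1 ? length / 4 * 3
                                   : length / 4 * 3 + (length % 4) - 1;
       break;
     }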
@lemire requested a review from @anonrig, June 22, 2024 18:52
@mcollina (Member) left a comment:

lgtm

@mcollina added the request-ci label Jun 22, 2024
@github-actions removed the request-ci label Jun 22, 2024

@mildsunrise (Member) commented Jun 22, 2024

for the record, Size() wasn't just doing an unnecessary allocation, it was also returning the wrong size, causing us to allocate a much larger buffer than necessary. simdutf::base64_length_from_binary calculates the length after encoding; what we need here is the reverse. @lemire has replaced the code with this calculation:

str->Length() % 4 <= 1 ? str->Length() / 4 * 3
    : str->Length() / 4 * 3 + (str->Length() % 4) - 1

which is only an upper bound, as it doesn't look at the actual data, and so it assumes the worst case (all characters are data characters, i.e. no padding or whitespace). the math looks correct to me.
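
as a quick sanity check of that math, here is a hypothetical standalone test (not part of the PR) exercising the bound:

#include <cassert>
#include <cstddef>

// upper bound on the decoded size of a base64 input of `len` characters,
// mirroring the expression above
static size_t max_decoded_size(size_t len) {
  return len % 4 <= 1 ? len / 4 * 3
                      : len / 4 * 3 + (len % 4) - 1;
}

int main() {
  assert(max_decoded_size(4) == 3);  // "TWFu" decodes to "Man" (3 bytes)
  assert(max_decoded_size(3) == 2);  // "TWE" (padding stripped) decodes to "Ma"
  assert(max_decoded_size(2) == 1);  // "TQ" (padding stripped) decodes to "M"
  assert(max_decoded_size(1) == 0);  // a lone trailing character yields no byte
}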

bear in mind that while Size() is allowed to return an upper bound, it is expected to return an exact prediction most of the time. if this is not the case, we do another allocation + memcpy to a new backing store with the actual size. for base64url the prediction is usually correct as it rarely has padding, but for base64 it will miss 2/3 of the time (or always, if the input contains whitespace).
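
a minimal sketch of that fallback path (hypothetical names throughout: decode_into and predicted_size stand in for the real node internals):

#include <cstring>
#include <memory>
#include "v8.h"

// sketch only: decode_into() stands in for the simdutf decoding call, and
// predicted_size is what Size() returned.
std::unique_ptr<v8::BackingStore> DecodeWithShrink(v8::Isolate* isolate,
                                                   size_t predicted_size) {
  std::unique_ptr<v8::BackingStore> store =
      v8::ArrayBuffer::NewBackingStore(isolate, predicted_size);
  size_t actual_size = decode_into(store->Data(), predicted_size);
  if (actual_size < predicted_size) {
    // the prediction missed: pay for a second allocation plus a memcpy
    std::unique_ptr<v8::BackingStore> exact =
        v8::ArrayBuffer::NewBackingStore(isolate, actual_size);
    memcpy(exact->Data(), store->Data(), actual_size);
    store = std::move(exact);
  }
  return store;
}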

@lemire (Member, Author) commented Jun 22, 2024

@mildsunrise

As far as I can tell, Node never returned the exact size. Doing so requires scanning the entire input, checking for characters to discard.

@mildsunrise (Member):

wasn't base64_decoded_size supposed to do exactly that?

@mildsunrise (Member):

even if it's not worth it to provide an exact prediction (like you said, it would require scanning the full input), we can still make the prediction exact for "pure" base64 (i.e. containing padding but no whitespace), which is pretty common. String::Write() takes a start argument, so we should be able to efficiently grab the last 2 characters and discard any padding that is present before doing the calculation

@lemire (Member, Author) commented Jun 23, 2024

wasn't base64_decoded_size supposed to do exactly that?

It depends on what you mean by 'supposed'. Maybe that was the intention, but that's not what it did.

It was very simple:

template <typename TypeName>
size_t base64_decoded_size(const TypeName* src, size_t size) {
  // 1-byte input cannot be decoded
  if (size < 2)
    return 0;

  if (src[size - 1] == '=') {
    size--;
    if (src[size - 1] == '=')
      size--;
  }
  return base64_decoded_size_fast(size);
}

prediction is usually correct as it rarely has padding, but for base64 it will miss 2/3 of the time (or always, if the input contains whitespace).

It is an effect which we can measure. Consider this benchmark...

import { bench, run } from "mitata";
function makeBenchmark(size, isToString) {
  const base64Input = Buffer.alloc(size, "latin1").toString("base64");
  const base64Input1 = Buffer.alloc(size - 1, "latin1").toString("base64");
  const base64Input2 = Buffer.alloc(size - 2, "latin1").toString("base64");
  console.log(base64Input.length + " " + base64Input2.length);
  bench(`Buffer.from(${size} bytes, 'base64')`, () => {
    Buffer.from(base64Input, "base64");
  });
  bench(`Buffer.from(${size - 1} bytes, 'base64')`, () => {
    Buffer.from(base64Input1, "base64");
  });
  bench(`Buffer.from(${size - 2} bytes, 'base64')`, () => {
    Buffer.from(base64Input2, "base64");
  });
}
[1024 * 1024 * 8].forEach(s => makeBenchmark(s, false));
await run();

The way it is designed, only one of the three runs will avoid overallocating (with the current PR).

[Screenshot: benchmark results, 2024-06-22]

So it is about 15%.

so we should be able to efficiently grab the last 2 characters and discard any padding that is present before doing the calculation

Let us do that. I pushed a commit to handle the simple case. It can get arbitrarily fancier.
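
roughly, the added check looks like this (a simplified sketch of the commit, assuming the external one-byte fast path; not the verbatim diff):

// only the external one-byte representation exposes its bytes without a
// copy, so only that case inspects the trailing padding characters
if (str->IsExternalOneByte()) {
  const v8::String::ExternalOneByteStringResource* ext =
      str->GetExternalOneByteStringResource();
  const char* data = ext->data();
  const size_t length = ext->length();
  if (length > 1 && data[length - 1] == '=') {
    data_size--;
    if (data[length - 2] == '=') data_size--;
  }
}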

@mildsunrise (Member) commented Jun 23, 2024

It depends what you mean by 'supposed'. Maybe that was the intention, but that's not what it did. [...]

ah, i see, they had the same idea...
edit: for the record, the JS version does the same thing

So it is about 15%.

80-90% on my i7 Linux laptop:

cpu: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
runtime: node v23.0.0-pre (x64-linux)

benchmark                                 time (avg)             (min … max)       p75       p99      p999
---------------------------------------------------------------------------- -----------------------------
Buffer.from(8388608 bytes, 'base64')   3'401 µs/iter   (2'824 µs … 4'462 µs)  3'527 µs  4'382 µs  4'462 µs
Buffer.from(8388607 bytes, 'base64')   3'559 µs/iter   (3'117 µs … 6'086 µs)  3'521 µs  5'503 µs  6'086 µs
Buffer.from(8388606 bytes, 'base64')   1'800 µs/iter   (1'535 µs … 4'361 µs)  1'763 µs  3'679 µs  4'361 µs

Let us do that. I pushed a commit to handle the simple case. It can get arbitrarily fancier.

thanks, that was exactly what I had in mind. if we can efficiently handle a 'normal' input (as in, produced by toString('base64') or the equivalent in most languages) I'm happy :)

I don't think it needs to be any fancier, but I don't see a need to specialize it to external 8-bit strings. as said, Write() takes a start argument, so we can check the last 2 chars without copying the whole string, i.e.

if (str->Length() >= 2) {
  uint16_t trailing[2];
  int written = str->Write(isolate, trailing, str->Length() - 2, 2);
  // ...
}

are you concerned it may still be too heavy of an operation to be worth it?

@lemire (Member, Author) commented Jun 23, 2024

@mildsunrise I kept it simple deliberately. The use case we care about in the benchmark I provided is the external one-byte case.

It is highly likely that other cases could be optimized. It would be good to have corresponding benchmarks. I submit to you that it can be done in distinct PRs.

@lemire added the author ready label Jun 23, 2024
@anonrig added the request-ci label Jun 23, 2024
@github-actions removed the request-ci label Jun 23, 2024
@mildsunrise (Member):

looks good to me, I'll open another PR after this one lands and we can continue the discussion there :)

  case BASE64: {
-   String::Value value(isolate, str);
-   return Just(simdutf::base64_length_from_binary(value.length()));
+   size_t data_size = str->Length() % 4 <= 1
@jasnell (Member) commented on the diff:

Given the duplication here, any chance of moving this into a separate helper function?

@lemire (Member, Author) commented Jun 23, 2024

@mildsunrise

looks good to me, I'll open another PR after this one lands and we can continue the discussion there :)

❤️

@jasnell

Given the duplication here, any chance of moving this into a separate helper function?

This is, of course, a very valid remark.

It is clear that @mildsunrise is planning to do another pass over this code as soon as it lands.

My suggestion is to leave it like it is now and just wait for @mildsunrise to improve it.
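
For concreteness, such a helper could look like this (a hypothetical sketch, not code from the PR):

static inline size_t Base64DecodedUpperBound(size_t length) {
  // upper bound on the decoded size; same expression the PR uses in Size()
  return length % 4 <= 1 ? length / 4 * 3
                         : length / 4 * 3 + (length % 4) - 1;
}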

@mildsunrise (Member):

I'm fine with both options

@mildsunrise (Member):

also, we can discuss this in the next PR, but I wanted to point it out now just in case: we currently only create external strings when their length is above ~1MB. so, for the test cases of 512k and below, the string is actually internal, and thus we do 2 extra copies + allocs.

@kurtextrem:

Not fully on-topic, but also not entirely off-topic: For those not active on Twitter but watching this issue, Leszek Swirski who works on V8 made a thread, mentioning this issue and what can be improved from V8 side too: https://x.com/leszekswirski/status/1805170432753934644.

@lemire (Member, Author) commented Jun 24, 2024

I bet that @mildsunrise can help boost the speed further without any update to v8. :-)

But... yeah... help from v8 is probably needed to get the best possible performance.

@mildsunrise (Member) commented Jun 24, 2024

yes, I don't think we need any update to v8 to get on par with Bun in this specific benchmark, because the strings are guaranteed to be flat. indeed, I was able to get these results by lowering EXTERN_APEX a bit (in addition to the changes already in this PR):

[Screenshot: benchmark results with a lowered EXTERN_APEX, 2024-06-23]

but having V8 expose an iterator over non-flat strings (like the one it already uses internally) would allow us to get the best performance in all cases (except when whitespace is present).

edit: not that I'm actually planning on touching that constant... it was chosen 11 years ago through a microbenchmark very similar to this one, and I don't feel knowledgeable enough to recommend tweaking it just based off another microbenchmark. I can only confirm that GC activity shoots up at 512kB otherwise, but this might be understood as a microbenchmark deficiency

@LeszekSwirski:

Not fully on-topic, but also not entirely off-topic: For those not active on Twitter but watching this issue, Leszek Swirski who works on V8 made a thread, mentioning this issue and what can be improved from V8 side too: https://x.com/leszekswirski/status/1805170432753934644.

Indeed. I've landed a non-copying v8::String::ValueView API in V8 now, so you should be able to use this once you roll V8 past 12.8.218. At that point this simplifies to something like:

String::ValueView view(isolate, str);
data_size = view.length() % 4 <= 1
                ? view.length() / 4 * 3
                : view.length() / 4 * 3 + (view.length() % 4) - 1;
// Check if the string ends with one or two padding characters and adjust the
// size accordingly. Note that the input can contain non-base64 characters, so,
// at best, we can provide an upper bound. A correct size would require
// scanning the entire input. We try to keep the function as fast as possible,
// so we only check the last two characters when the string is one-byte.
if (view.is_one_byte()) {
  if (view.length() > 1) {
    if (view.data8()[view.length() - 1] == '=') {
      data_size--;
      if (view.data8()[view.length() - 2] == '=') {
        data_size--;
      }
    }
  }
}

(I don't include a padding character check for two-byte strings, since the length estimate will likely be wrong anyway due to the almost guaranteed presence of non-base64 characters in the string.)

Labels: author ready, buffer, c++, needs-ci
8 participants