In the context of a Babel plugin, I am reading a Buffer from a file to check some content.

I'm looking specifically for the string ɵɵfoobar, escaped as \u0275\u0275foobar.

When I print my buffer with myBuffer.toString(), I can see:

...

[...\u0275\u0275foobar()]
...

But when I check the content, I never get a positive. I've tried the following:

  • myBuffer.includes('ɵɵfoobar')
  • myBuffer.includes('\u0275\u0275foobar')
  • myBuffer.toString().includes('ɵɵfoobar')

Also note that myBuffer.includes('foobar()') returns true.

Any idea what I'm doing wrong?

  • You sure it’s not actual backslashes in the source? myBuffer.includes('\\u0275\\u0275foobar') Anyway, please reduce this to a minimal reproducible example. .subarray, .toString('hex'), and Buffer.from('…', 'hex') can help with that.
    – Ry-
    Commented Jul 8 at 1:08
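
The check suggested in this comment can be sketched as follows. This is only an illustration: the two sample buffers are made-up stand-ins for the file contents (one holds literal backslash escapes, the other holds the real ɵ character), and the variable names are hypothetical.

```javascript
// Assumed file contents A: the ASCII text \u0275 written out literally, twice.
const withLiteralEscapes = Buffer.from('[\\u0275\\u0275foobar()]', 'utf8');
// Assumed file contents B: the real character U+0275 (ɵ), twice.
const withRealCharacters = Buffer.from('[\u0275\u0275foobar()]', 'utf8');

// Searching for literal backslashes only matches contents A.
const literalInA = withLiteralEscapes.includes('\\u0275\\u0275foobar'); // true
const realInA = withLiteralEscapes.includes('\u0275\u0275foobar');      // false
// Searching for the real character only matches contents B.
const realInB = withRealCharacters.includes('\u0275\u0275foobar');      // true

// Hex dump of the 12 bytes just before "foobar" to see what is really stored.
const start = withLiteralEscapes.indexOf('foobar');
const hexBefore = withLiteralEscapes.subarray(start - 12, start).toString('hex');

console.log({ literalInA, realInA, realInB, hexBefore });
// hexBefore is '5c75303237355c7530323735', i.e. the ASCII bytes of \u0275\u0275
```

If the hex dump of the real file starts with 5c 75 at that position, the file contains a backslash followed by "u", not the ɵ character itself.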

1 Answer

The character you are looking for (ɵ) is the code point U+0275. It is encoded as the single 16-bit code unit 0x0275 in UTF-16, while in UTF-8 it is the two-byte sequence 0xC9 0xB5. Since you're looking for it as 0x0275, we know this file you've read into the buffer is encoded in UTF-16.
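
As a quick way to see those byte values, here is a small sketch that encodes the single character ɵ both ways and prints the resulting bytes as hex. The variable names are illustrative only.

```javascript
const ch = '\u0275'; // the character ɵ

// Encode the same character with two different encodings.
const utf8Hex = Buffer.from(ch, 'utf8').toString('hex');     // 'c9b5'
const utf16Hex = Buffer.from(ch, 'utf16le').toString('hex'); // '7502'

console.log(utf8Hex);  // c9b5 (two bytes: 0xC9 then 0xB5)
console.log(utf16Hex); // 7502 (0x0275 stored little-endian: 0x75 then 0x02)
```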

Node.js's Buffer.toString() accepts an encoding as its first parameter, and the default value is 'utf8'. This means that you must pass 'utf16le' as the first argument when calling toString (e.g. buffer.toString('utf16le')) if you want to be able to match against the UTF-16 encoding as \u0275. Note that Node.js only supports the little-endian variant of UTF-16.
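
A minimal sketch of how the encoding argument changes the result, assuming two hypothetical buffers that hold the same text, one stored as UTF-16LE and one stored as UTF-8. Decoding only finds the string when the encoding passed to toString matches how the bytes were actually written.

```javascript
const text = '\u0275\u0275foobar()';
const needle = '\u0275\u0275foobar';

// Assumed inputs: the same text stored with two different encodings.
const utf16Buffer = Buffer.from(text, 'utf16le');
const utf8Buffer = Buffer.from(text, 'utf8');

// Decoding with the matching encoding recovers the text.
const utf16AsUtf16 = utf16Buffer.toString('utf16le').includes(needle); // true
const utf8AsDefault = utf8Buffer.toString().includes(needle);          // true, default is 'utf8'

// Decoding with the wrong encoding yields unrelated characters.
const utf8AsUtf16 = utf8Buffer.toString('utf16le').includes(needle);   // false

// Buffer.includes also takes an encoding for its string argument.
const directSearch = utf16Buffer.includes(needle, 0, 'utf16le');       // true

console.log({ utf16AsUtf16, utf8AsDefault, utf8AsUtf16, directSearch });
```

Decoding UTF-8 bytes as 'utf16le' pairs up unrelated bytes into 16-bit units, so the output looks like text from another script rather than the original string.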

myBuffer.includes('foobar()') returns true because every character in the string 'foobar()' is represented the same way in ASCII as in UTF-8 (e.g. 'f' is encoded as 0x66), and ASCII is a proper subset of UTF-8. Remember, UTF-8 is the default encoding of the Node.js Buffer.toString method, and it is also the default encoding for a string passed to Buffer.includes.
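
To illustrate the ASCII point, the following sketch compares the bytes produced for 'foobar()' under both encodings and then searches for it in a hypothetical UTF-8 buffer that also contains ɵ. The sample buffer is an assumption, not the actual file.

```javascript
const asciiText = 'foobar()';

// The same ASCII-only string yields identical bytes in both encodings.
const asUtf8 = Buffer.from(asciiText, 'utf8');
const asAscii = Buffer.from(asciiText, 'ascii');
const sameBytes = asUtf8.equals(asAscii); // true
const asciiHex = asUtf8.toString('hex');  // '666f6f6261722829'

// Assumed sample contents: ɵɵ followed by the ASCII text.
const sample = Buffer.from('\u0275\u0275foobar()', 'utf8');
const foundAscii = sample.includes(asciiText); // true

console.log({ sameBytes, asciiHex, foundAscii });
```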

If you're curious, this post on ASCII vs Unicode has some great jumping-off points on their differences and on encoding concepts in general.

  • Also using toString('utf16le') just messes up the output. Commented Jul 8 at 1:01
  • @MatthieuRiegler can you elaborate on what "messes up the output" means for your situation? What example string was in the buffer as input, and what outputs are you getting with UTF-8 and UTF-16, respectively?
    – Luke
    Commented Jul 8 at 1:07
  • With utf16, toString returns Chinese characters. Commented Jul 8 at 8:00