Lighthouse misattributing Main-Thread Blocking Time in Run Microtasks #14843

Open
nicjansma opened this issue Mar 1, 2023 · 5 comments

@nicjansma

URL

https://nicj.net

What happened?

(submitted URL is a placeholder URL, not the site I've reproduced this issue on)

As a third-party RUM provider, my team provides a JavaScript RUM library called boomerang.js that is included on our customers' websites. boomerang.js is loaded on every page load and beacons performance analytics information.

Our customers, who are paying for RUM or CDN services, are usually very motivated and proactive in monitoring their site's performance, as you would expect. They will utilize tools like Lighthouse and want to ensure that boomerang.js is not causing any performance side-effects on their sites. We sometimes get Lighthouse reports that highlight boomerang.js in some of the audits. Being a RUM provider, our development team is very interested in ensuring boomerang.js is as performant and non-invasive as possible.

We were recently sent a Lighthouse report by a customer that highlighted boomerang.js as one of the top hits under the Reduce the impact of third-party code audit. I was surprised to see boomerang.js highlighted there, as we hadn't seen blocking time show up in our own self-audits of the boomerang.js runtime. Their report showed 989ms of Main-Thread Blocking Time (though my screenshots below show a smaller amount, 286ms, from my local repro).

Here's an example screenshot from my development machine:

[Screenshot: Lighthouse "Reduce the impact of third-party code" audit listing boomerang.js with 286ms of Main-Thread Blocking Time]

(the rest of the third parties in this audit have been cropped out for brevity and for our customer's privacy)

I started to dig into what was causing the 286ms of Main-Thread Blocking Time in this trace. In other examples, when running Lighthouse locally, I would sometimes see it even higher: 500-1000ms or more.

I had a hard time correlating those 286ms to anything within the trace itself. Once loaded in the Performance tab, I could see our script executing infrequently, but most tasks were 0.1ms or smaller; added up (even at microsecond granularity), the total FunctionCall time was only around ~23ms. Even taking the 4x mobile .cpuMultiplier into account, that would be just 92ms.
We have a few 100ms-interval timers and a rAF callback, most of which complete in under 0.1ms (see the sketch after the screenshot below). The below list is filtered to just our URL:

[Screenshot: Performance panel event list filtered to our URL, showing FunctionCall entries mostly under 0.1ms]
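For context, the pattern described above looks roughly like the following. This is a hypothetical sketch of many tiny, frequent tasks, not actual boomerang.js source; the sampling logic is an illustrative stand-in:

// Hypothetical sketch: frequent, tiny tasks from 100ms-interval timers and rAF.
const samples = [];

// 100ms-interval timer; each callback completes in well under 0.1ms.
setInterval(() => {
  samples.push(performance.now());
}, 100);

// Self-re-arming rAF callback, also sub-0.1ms per invocation.
function onFrame(ts) {
  samples.push(ts);
  requestAnimationFrame(onFrame);
}
requestAnimationFrame(onFrame);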

I additionally took several JavaScript Profiler traces and generally found the same results: at most 25-50ms spent in our JavaScript library according to the sampled profiles.

I eventually dug into the Lighthouse source to understand how the Main-Thread Blocking Time was calculated, and came across third-party-summary.js:

for (const task of mainThreadTasks) {
  const attributableURL = getAttributableURLForTask(task, jsURLs);
  const urlSummary = byURL.get(attributableURL) || {...defaultSummary};

  const taskDuration = task.selfTime * cpuMultiplier;
  // The amount of time spent on main thread is the sum of all durations.
  urlSummary.mainThreadTime += taskDuration;
  // The amount of time spent *blocking* on main thread is the sum of all time longer than 50ms.
  // Note that this is not totally equivalent to the TBT definition since it fails to account for FCP,
  // but a majority of third-party work occurs after FCP and should yield largely similar numbers.
  urlSummary.blockingTime += Math.max(taskDuration - 50, 0);
  byURL.set(attributableURL, urlSummary);
}

So for each task, Lighthouse finds its attributable URL and adds any task duration beyond 50ms to that URL's .blockingTime.

This function gets the source URL from getAttributableURLForTask:

function getAttributableURLForTask(task, jsURLs) {
  const jsURL = task.attributableURLs.find(url => jsURLs.has(url));
  const fallbackURL = task.attributableURLs[0];
  let attributableURL = jsURL || fallbackURL;
  // If we can't find what URL was responsible for this execution, attribute it to the root page
  // or Chrome depending on the type of work.
  if (!attributableURL || attributableURL === 'about:blank') {
    if (BROWSER_TASK_NAMES_SET.has(task.event.name)) attributableURL = 'Browser';
    else if (BROWSER_GC_TASK_NAMES_SET.has(task.event.name)) attributableURL = 'Browser GC';
    else attributableURL = 'Unattributable';
  }
  return attributableURL;
}

The above function looks for any matching 3P script URLs, and falls back to the "first" URL of the task if it can't determine a cause.
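To illustrate (with URLs borrowed from the trace shown later, and a hypothetical jsURLs set standing in for the one Lighthouse builds from script network records): because our script happens to be the first entry in attributableURLs that is also a known JS URL, the find() attributes the entire task to us:

const jsURLs = new Set([
  'https://s2.go-mpulse.net/boomerang/KX3GJ...',
  'https://customer.com/assets/index.js',
  // ...the rest of the page's scripts
]);

const task = {
  attributableURLs: [
    'https://s2.go-mpulse.net/boomerang/KX3GJ...', // boomerang.js happens to be first
    'https://customer.com/xyz.html',
    'https://customer.com/assets/index.js',
  ],
};

// The whole task's selfTime gets attributed to boomerang.js:
const attributableURL = task.attributableURLs.find(url => jsURLs.has(url));
console.log(attributableURL); // 'https://s2.go-mpulse.net/boomerang/KX3GJ...'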

When I looked back at the trace, I think I found the actual cause of the "blocking" task: a Run Microtasks chunk with ~80 items. This looks like it may be the DCL event and all of its related tasks executing:

[Screenshot: Performance panel showing a ~104ms Run Microtasks block containing ~80 microtasks around the DCL event]

The only time our boomerang.js appears in that Run Microtasks of 80 sub-tasks is the very last FunctionCall, at 0.1ms:

[Screenshot: the single 0.1ms boomerang.js FunctionCall within that Run Microtasks block]

In the trace JSON, we see that RunMicrotasks with 80 subtasks:

{
    "args": {
        "microtask_count": 80
    },
    "cat": "v8.execute",
    "dur": 104382,
    "name": "RunMicrotasks",
    "ph": "X",
    "pid": 41000,
    "tdur": 102588,
    "tid": 13620,
    "ts": 1112473559258,
    "tts": 1451432
},

And the previous FunctionCall happened to be our script appearing once:

{
    "args": {
        "data": {
            "columnNumber": 10854,
            "frame": "9D1A3DB0E9E508834136FE2E28B89B67",
            "functionName": "l",
            "lineNumber": 10,
            "scriptId": "41",
            "url": "https://s2.go-mpulse.net/boomerang/KX3GJ..."
        }
    },
    "cat": "devtools.timeline",
    "dur": 347,
    "name": "FunctionCall",
    "ph": "X",
    "pid": 41000,
    "tdur": 339,
    "tid": 13620,
    "ts": 1112473558883,
    "tts": 1451065
},

Debugging this in Node also confirmed that it's just a big bunch of tasks, and our boomerang.js just happens to be the first in the attributableURLs list:

{
  event: {
    args: {},
    cat: 'v8',
    dur: 104379,
    name: 'V8.RunMicrotasks',
    ph: 'X',
    pid: 41000,
    tdur: 102586,
    tid: 13620,
    ts: 1112473559259,
    tts: 1451433
  },
  endEvent: undefined,
  startTime: 1326.394,
  endTime: 1430.773,
  duration: 104.379,
  unbounded: false,
...
  attributableURLs: [
    'https://s2.go-mpulse.net/boomerang/KX3GJ...',
    'https://customer.com/xyz.html',
    'https://customer.com/assets/index.js',
    'https://customer.com/assets/PDP.js',
    'https://customer.com/assets/ImageGallery.js',
    'https://customer.com/assets/vendor.js',
    'https://customer.com/gtm.js?id=GTM-XXX',
    'https://customer.com/gtag/js?id=G-XXX'
  ],
  group: {
    id: 'scriptEvaluation',
    label: 'Script Evaluation',
    traceEventNames: [
      'EventDispatch',
      'EvaluateScript',
      'v8.evaluateModule',
      'FunctionCall',
      'TimerFire',
      'FireIdleCallback',
      'FireAnimationFrame',
      'RunMicrotasks',
      'V8.Execute'
    ]
  },
  selfTime: 83.955
}

If we take that 83.955ms selfTime, multiply by 4 (the cpuMultiplier), and subtract 50ms, we get the 286ms of Main-Thread Blocking Time reported in the original audit.
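A quick worked check of that math, using the values from the trace above:

const selfTime = 83.955;     // RunMicrotasks selfTime from the trace, in ms
const cpuMultiplier = 4;     // Lighthouse's simulated mobile CPU throttling

const taskDuration = selfTime * cpuMultiplier;        // 335.82ms
const blockingTime = Math.max(taskDuration - 50, 0);  // 285.82ms -> reported as ~286ms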

So unfortunately, because our boomerang.js script has one task executing at DCL that took ~0.1ms, we're unluckily being blamed for an 83.955ms microtask chunk that gets shown as 286ms of Main-Thread Blocking Time.

d'oh!

I've run Lighthouse a bunch of times on the same URL; sometimes the Main-Thread Blocking Time is 0ms, and other times it's that much higher value. It's the luck of the draw. I suspect that when it's 0ms, our script isn't the [0]th entry in the attributableURLs or 3P URLs list.

What did you expect?

Ultimately I think the question is how we should deal with a microtask-execution chunk when there are many microtasks from various sources involved. Task attribution is hard, and I'm not sure how to "assign blame" to a Run Microtasks that has a lot of sub-calls to various other scripts/handlers/callbacks/timeouts.

In the above examples where boomerang.js was unlucky, its ~0.1ms task(s) happened to get blamed for the whole chunk of microtask time.

Since this audit's purpose is to highlight third-party scripts that are behaving badly, I'd like to suggest we make a change to the calculation to ensure it's more accurate and repeatable.

I'm not a Chrome or Lighthouse expert so I may be missing some context on how this all works, but with my understanding here are some possible ideas for how we could deal with microtask chunks:

  1. Do nothing! Sometimes 3P scripts will be involved in a chonky microtask run and may be blamed for the whole thing.
  2. Don't include microtask execution chunks in any consideration for .blockingTime calculation for 3P URLs (this may hide real problems).
  3. Dive a level deeper into the microtask execution's children, and treat each child FunctionCall independently for .blockingTime consideration.
  4. Dive a level deeper into the microtask execution's children, and accumulate time in child FunctionCalls grouped by URL. Consider the sum() for each URL for .blockingTime calculation.
  5. Dive a level deeper into the microtask execution's children, and only accumulate time in child FunctionCalls etc. that are related to that specific URL. Combine all related ones for that URL in that Run Microtasks, and if it's > 50(?)% of the total Run Microtasks execution time, declare that URL to be blamed for the whole Run Microtasks.

Personally, I think (3) sounds the most logical -- essentially, group each execution of microtasks by the 3P URLs and weigh their runtime independently, to see if they cross the .blockingTime threshold (50ms). A rough sketch follows.
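Here's what that could look like in third-party-summary.js -- glossing over how Lighthouse actually flattens tasks and computes selfTime, and assuming the task objects expose their children via task.children as in the main-thread-tasks artifact; this is a sketch of the idea, not a tested patch:

for (const task of mainThreadTasks) {
  const isMicrotaskChunk =
    task.event.name === 'RunMicrotasks' || task.event.name === 'V8.RunMicrotasks';
  if (isMicrotaskChunk && task.children.length) {
    // Instead of blaming attributableURLs[0] for the whole chunk, accumulate
    // each child's selfTime against the URL that actually executed it.
    // (The parent's own selfTime would still need a home; omitted here.)
    const perURLTime = new Map();
    for (const child of task.children) {
      const childURL = getAttributableURLForTask(child, jsURLs);
      perURLTime.set(childURL, (perURLTime.get(childURL) || 0) + child.selfTime);
    }

    for (const [url, time] of perURLTime) {
      const urlSummary = byURL.get(url) || {...defaultSummary};
      const childDuration = time * cpuMultiplier;
      urlSummary.mainThreadTime += childDuration;
      // Each URL's share is judged against the 50ms threshold independently.
      urlSummary.blockingTime += Math.max(childDuration - 50, 0);
      byURL.set(url, urlSummary);
    }
    continue;
  }

  // ...existing attribution logic for all other tasks...
}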

I'm happy to take a stab at the code changes for this, if there's agreement on the approach.

What have you tried?

No response

How were you running Lighthouse?

CLI, Chrome DevTools

Lighthouse Version

10.0.1

Chrome Version

113.0.0.0

Node Version

16.19.1

OS

Windows

Relevant log output

No response

@yoavweiss

Thank you so much for the great bug report!!

I think there are 3 questions here:

  1. Why is the browser running all these unrelated functions as a single long task? @noamr mentioned today that this could be what the spec says in case of deferred scripts, but maybe we can change that somehow?
  2. Are these tasks really microtasks? I don't think that e.g. the DCL event handler should be a microtask. Maybe they are being mislabeled?
  3. Why is Lighthouse identifying any particular script as the guilty party here, where it's clearly a group effort? :) @paulirish @patrickhulce may have insights there
@patrickhulce
Collaborator

patrickhulce commented Mar 7, 2023

Heyo 👋 excellent bug report 👍 thank you!

I can help provide some explanation as to how we arrived here. This deck is helpful for explaining how Lighthouse attribution is supposed to work.

To answer @yoavweiss's formulation of the question "Why is Lighthouse identifying any particular script as the guilty party here, where it's clearly a group effort? :)" -- the short answer is that, in our view, long tasks are almost always a group effort, and we attempt to find a reasonable proximate cause.

We've taken the approach that proximate cause from Lighthouse's perspective means who is responsible for scheduling this work rather than performing it.

To many of your questions @nicjansma: we don't really care what actually executed inside the RunMicrotasks at all, and we only use the first executor when we lack instrumentation from Chrome about who scheduled the task (as a best guess, since it's rather likely that whoever scheduled the work will call their own code first before delegating to other scripts/libraries). I would've expected to be able to tie RunMicrotasks back to its scheduling script without this fallback case being triggered, though 🤔, so perhaps that's where a bug has been introduced at some point in the last few years since this was written?
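A hypothetical illustration of that "scheduler vs. performer" distinction (script-a.js/script-b.js and the work function are made up for this example):

// Hypothetical: script-b.js defines the expensive work...
function expensiveLibraryWork() {
  for (let i = 0; i < 1e7; i++); // stand-in for the bulk of the execution time
}

// ...but script-a.js schedules it, so under "scheduling" attribution the
// resulting TimerFire task is blamed on script-a.js, not script-b.js.
setTimeout(() => {
  expensiveLibraryWork();
}, 0);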

Why is the browser running all these unrelated functions as a single long task? @noamr mentioned today that this could be what the spec says in case of deferred scripts, but maybe we can change that somehow?
Are these tasks really microtasks? I don't think that e.g. the DCL event handler should be a microtask. Maybe they are being mislabeled?

+1 to these questions. Combining with the above, it'd be really helpful to understand a human expert's manual assessment of the proximate cause description found in the LH deck to be able to pinpoint where exactly Chrome tracing/Lighthouse fell off the wagon.

@paulirish
Member

paulirish commented Mar 22, 2023

@nicjansma I just want to say Thank You for a phenomenal bug report. I'm sorry we haven't responded yet... and this isn't a proper response either. We've got a few things going on.

But... the short of it is: your bug is legit, and it'll take us a little bit to draft a response. Attribution is a hard problem with indeterminate solutions. But we want attribution/blame to make sense to developers, point the way to actionability, be fair, and treat things consistently across tools. I'm eager to figure out the right solution.

Edit: previously: #7005 #7059 slides #8526

@noamr

noamr commented Mar 23, 2023

  1. Why is the browser running all these unrelated functions as a single long task? @noamr mentioned today that this could be what the spec says in case of deferred scripts, but maybe we can change that somehow?

There could be several reasons why unrelated functions would run in the same task:

  • They are all handlers (then/catch) of the same promise, e.g. a fetch() (see the sketch after this list)
  • They are all event listeners of events that are fired synchronously
  • It's the DCL task, which runs all the available deferred & module scripts plus several event handlers
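A hypothetical illustration of the first case: two unrelated scripts attach handlers to the same promise, and both handlers run back-to-back in a single microtask checkpoint, showing up as one long RunMicrotasks block in the trace ('/api/data' is a placeholder endpoint):

const shared = fetch('/api/data');

// script-a.js
shared.then(() => {
  for (let i = 0; i < 1e7; i++); // stand-in for expensive work
});

// script-b.js
shared.then(() => {
  for (let i = 0; i < 1e7; i++); // runs in the same microtask checkpoint as
});                              // script-a.js's handler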

In this case, it seems like the long task is caused by several event handlers that happen at DCL time, some for DCL itself and some (like the first one) for readystatechange. If you had deferred scripts they would probably make that task even longer.

One of the actual causes of long tasks here btw is DOZENS of style thrashes (coming from calls to getComputedStyle). Tracing this page with LoAF did show this.
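For reference, the kind of pattern that produces those style thrashes looks something like this (hypothetical code, not taken from the page in question):

// Alternating DOM writes and getComputedStyle reads forces a synchronous
// style recalculation on every iteration instead of one batched recalc.
for (const el of document.querySelectorAll('.widget')) {
  el.style.width = '100px';                     // write
  const h = getComputedStyle(el).height;        // read -> forced style recalc
  el.style.height = `${parseFloat(h) + 10}px`;  // write again, repeating the cycle
}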

@nicjansma
Author

Hi team!

Just a note that we're seeing additional cases of customers reporting this issue to us.

In this Lighthouse trace, boomerang.js was being blamed for 3,256ms of Main-Thread Blocking Time.

[Screenshot: Lighthouse third-party audit attributing 3,256ms of Main-Thread Blocking Time to boomerang.js]

Upon looking at the trace, we found an 809.8ms (before the 4x multiplier) Run Microtasks in which boomerang.js executed once for 9.6ms:

[Screenshot: the ~810ms Run Microtasks block containing a single 9.6ms boomerang.js FunctionCall]

Total boomerang.js CPU usage was around 93ms for the whole trace.
