Rethink accessibility scoring #3444

Closed
robdodson opened this issue Sep 29, 2017 · 15 comments

Comments

@robdodson
Contributor

robdodson commented Sep 29, 2017

Currently the accessibility tests are all equally weighted. This means that even if a test is not applicable, it still counts as a pass and artificially inflates the score. As a result, pages in WCAG's "bad" section still get a score above 89%.

One suggestion is to score inapplicable tests at a weight of 0; however, based on this issue, I'm not sure whether aXe actually returns inapplicable tests.

Another thing we should definitely do in the near term is re-weight the tests. aXe itself has criticality ratings, and we can look at httparchive stats to figure out which tests fail most often and maybe boost their weight even more.
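
To make the re-weighting idea concrete, here's a rough sketch (hypothetical TypeScript with invented weights, not Lighthouse's current implementation) of a weighted category score where inapplicable audits count for nothing:

```ts
// Hypothetical illustration of weighted category scoring (not the current
// Lighthouse implementation; the weights here are invented for the example).
interface AuditResult {
  id: string;
  score: number | null; // 1 = pass, 0 = fail, null = not applicable
}

// Heavier weights for audits that are both severe and common.
const WEIGHTS: Record<string, number> = {
  'color-contrast': 3,
  'image-alt': 3,
  'label': 3,
  'tabindex': 1,
};

function categoryScore(audits: AuditResult[]): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const audit of audits) {
    if (audit.score === null) continue; // inapplicable audits get a weight of 0
    const weight = WEIGHTS[audit.id] ?? 1;
    weightedSum += weight * audit.score;
    totalWeight += weight;
  }
  return totalWeight === 0 ? 1 : weightedSum / totalWeight;
}
```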

Ultimately though, an automated tool like aXe can only ever test a subset of accessibility issues, so giving someone a score of 100% can be misleading. It's entirely possible to make something that gets a 100% accessibility score but is still not very usable. For this reason we might consider ditching the accessibility score altogether and replacing it with something else. Maybe just an indicator that there are (minor|major|critical) errors? Open to suggestions here :)

@marcysutton @WilcoFiers

@robdodson
Contributor Author

robdodson commented Sep 29, 2017

Forgot to mention: I think a tool like Tenon.io also weights its score based on page complexity. So if there are 1000 DOM nodes and one failing test, it's considered less severe than if there are 100 DOM nodes and one failing test.
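
As a rough sketch of that kind of density heuristic (hypothetical TypeScript, not Tenon's actual formula):

```ts
// Rough sketch of a density-based severity heuristic (not Tenon's real
// algorithm): the same number of failures matters more on a smaller page.
function issueDensity(failingNodes: number, domNodeCount: number): number {
  return failingNodes / Math.max(domNodeCount, 1);
}

// 1 failure among 1000 nodes => 0.001; 1 failure among 100 nodes => 0.01
console.log(issueDensity(1, 1000), issueDensity(1, 100));
```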

@jnurthen

I think having a percentage-based score is a bad idea: 100% implies the job is done. Perhaps consider reversing it so that the goal is to get to 0 errors. In my mind, having 0 errors doesn't carry the same implication as reaching 100%.

@rpkoller

@jnurthen Maybe something in the vein of how HTML_Sniffer displays results: you get errors, warnings, and notices for the page. You could add some color coding on top, along the lines @robdodson suggested: a red background for too many errors relative to the number of DOM nodes, yellow for better results, and green for the desired result.

@paulirish
Member

> an automated tool like aXe can only ever test a subset of accessibility issues

True, but that's basically the case for everything. For performance, best practices, PWA, or anything else, we'll always be looking at a subset of all worthwhile issues. So I think that just places the onus on the tool developers to be as comprehensive as possible.

> re-weight the tests.

SGTM. I'm totally on board with looking at weighting things differently, even if it includes aspects like DOM node count.

@marcysutton

I know we have work to do in aXe-core for inapplicable results, and on the "WCAG bad page" (which, funny enough, @WilcoFiers said he worked on). I do really appreciate accessibility being given such prominence! Just want to set devs up for success by giving them realistic expectations.

@brendankenny
Member

brendankenny commented Sep 30, 2017

> we can look at httparchive stats to figure out which tests fail most often

Here are the failure rates for the most recent HTTP Archive run (Sept 1–15). This is from running the a11y audits over 427,306 URLs (there were 2,705 URLs for which the audits weren't able to return results due to a variety of errors):

| audit | failure rate |
| --- | --- |
| color-contrast | 73.21% ▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅ |
| link-name | 66.23% ▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅ |
| image-alt | 50.59% ▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅ |
| label | 47.54% ▅▅▅▅▅▅▅▅▅▅▅▅▅▅ |
| html-has-lang | 36.58% ▅▅▅▅▅▅▅▅▅▅ |
| frame-title | 34.07% ▅▅▅▅▅▅▅▅▅▅ |
| duplicate-id | 32.38% ▅▅▅▅▅▅▅▅▅ |
| meta-viewport | 31.82% ▅▅▅▅▅▅▅▅▅ |
| button-name | 18.4% ▅▅▅▅▅ |
| list | 10.71% ▅▅▅ |
| bypass | 7.064% ▅▅ |
| listitem | 4.821% |
| tabindex | 4.476% |
| aria-valid-attr-value | 4.373% |
| aria-required-children | 3.821% |
| input-image-alt | 2.007% |
| document-title | 1.868% |
| aria-allowed-attr | 1.694% |
| definition-list | 1.511% |
| object-alt | 1.393% |
| meta-refresh | 0.7905% |
| aria-roles | 0.6431% |
| html-lang-valid | 0.4907% |
| aria-required-parent | 0.4152% |
| dlitem | 0.3471% |
| aria-required-attr | 0.2822% |
| accesskeys | 0.1678% |
| aria-valid-attr | 0.1378% |
| valid-lang | 0.11% |
| video-description | 0.0094% |
| td-headers-attr | 0.0091% |
| video-caption | 0.0056% |
| layout-table | 0.0016% |
| audio-caption | 0.0002% |
| th-has-data-cells | 0% |

(As stated above, the complementary percentage isn't necessarily the pass rate; the audit may not have been applicable to the page being tested.)
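
For context, the failure rate here is just the share of successfully audited URLs on which an audit failed. A rough sketch of that aggregation (the per-run result shape is invented for illustration, not HTTP Archive's actual schema):

```ts
// Sketch of aggregating a per-audit failure rate from individual runs.
// The `RunResult` shape is hypothetical, used only for illustration.
type RunResult = Record<string, boolean>; // auditId -> did it fail?

function failureRates(runs: RunResult[]): Record<string, string> {
  const failures: Record<string, number> = {};
  for (const run of runs) {
    for (const [auditId, failed] of Object.entries(run)) {
      if (failed) failures[auditId] = (failures[auditId] ?? 0) + 1;
    }
  }
  const rates: Record<string, string> = {};
  for (const [auditId, count] of Object.entries(failures)) {
    rates[auditId] = ((count / runs.length) * 100).toFixed(2) + '%';
  }
  return rates;
}
```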

@WilcoFiers

Wow, that's excellent data! Interesting that one of the rules is at 0%.

As for the topic at hand: this relates to a well-known problem in accessibility, metrics. There just isn't a good way to grade the accessibility of a page. Either you passed, or you didn't.

Having X number of issues, or X number of rules failed, or X number of criteria failed generally doesn't say much about how accessible that page is. A page can have 1 very bad accessibility issue and be a disaster to work with, or 100 trivial problems that users with disabilities can easily work around. The W3C had a whole symposium on the problem of a11y metrics with no solution to speak of: https://www.w3.org/TR/accessibility-metrics-report/

I personally don't much like the percentage approach. The problem I have with it is two-fold. First, 100% quite heavily implies there are no problems, which is an impression we should very much try to avoid, since that's not what having no issues in aXe means. We've solved this in our products by adding an indication that further testing is always necessary. The second problem is that as we add more rules, the numbers change. Going from 100% accessible to 80% accessible because of an update isn't a nice message to receive, and it's hard to explain to someone who doesn't understand the inner workings.

My favourite approach for metrics has been to just use the absolute number of passes and failures per rule as the score, so, two numbers. There is no "highest number", in the sense that you don't imply that hitting 100%, or 10, or A+, or whatever means you're done. It still gives some perspective, because you can gauge passes against failures. And it is relatively easy to understand that new rules will grow the number of tests, which can mean more passes, more failures, or no change at all because a rule was inapplicable.
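
As an illustration of that two-number idea, here's a sketch against axe-core's `passes`/`violations` result arrays as I understand them (each entry has an `id` and a `nodes` array):

```ts
// Sketch of a passes/failures summary per rule, assuming axe-core's result
// object exposes `passes` and `violations` arrays of { id, nodes } entries.
interface AxeRuleResult {
  id: string;
  nodes: unknown[];
}
interface AxeResults {
  passes: AxeRuleResult[];
  violations: AxeRuleResult[];
}

function summarize(results: AxeResults): Record<string, {passes: number; failures: number}> {
  const summary: Record<string, {passes: number; failures: number}> = {};
  for (const rule of results.passes) {
    summary[rule.id] = {passes: rule.nodes.length, failures: 0};
  }
  for (const rule of results.violations) {
    const entry = summary[rule.id] ?? {passes: 0, failures: 0};
    entry.failures = rule.nodes.length;
    summary[rule.id] = entry;
  }
  return summary;
}
```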

Hope that helps!

@alastc

alastc commented Oct 2, 2017

In practice we encourage people to see each issue found as a barrier; how big a barrier it is depends on the context of the user journey.

For example, a keyboard-inaccessible 'add to basket' button is a huge barrier, missing alt-text for the logo in the footer less so.

Would it be possible to flip the metric around to "Barriers found", or something similar?

So at the top of an audit it would show the number of issues found (perhaps with a weighting factor, or split into higher- and lower-severity issues). At the top of the accessibility section it could say something like:

"These checks highlight technical barriers in your app, but please remember to check from a user point of view as well [link to suitable training, like @robdodson & Alice's Udacity course]."

@paulirish
Member

@robdodson let's talk some more about this today. We think we can do some quick fixes here by reweighting the audits within the category. And then we can do some research to sort out how to more dynamically adjust the weightings/score based on the results coming back from aXe.

@robdodson
Contributor Author

Just wanted to provide an update for the folks subscribed to this thread. I think we have a multi-part plan we'd like to enact.

The first step will be to re-weight the scores based on how bad the offending error is. Currently aXe lists all errors as either major or critical, and there's no way to filter out non-applicable tests, so this re-weighting will be pretty subjective. My current thinking is that the stuff that is really egregious, and really common, will be weighted very heavily. I've already started putting together the new weights using the stats Brendan posted above. Along with this work, we'll also add language to the report that clearly explains these audits can only cover a small subset of a11y issues and folks still need to do manual checks. Similar warnings and manual checks already exist in the PWA report, so there is prior art for this:

[Screenshot: Lighthouse report showing three manual-check disclosure widgets]

The second step is to work with the aXe team to filter out non-applicable tests (dequelabs/axe-core#473). This would be very helpful because then we could probably switch back to scoring based on what aXe defines as major vs. critical. If you end up with only one applicable test, and it's critical, and you fail it, you'd get a very bad a11y score.

The third step is to work with the Lighthouse team on a bigger rethink of how we present these results to the user. There has been talk of doing this for other parts of the Lighthouse report, so we can make accessibility part of that larger redesign. Some folks on this thread have said that we shouldn't do scoring at all; however, I've also heard from folks on Slack and in person that they really like the scoring and that it has been helpful inside their larger organizations. I think we'll have to iterate on a few different UI options to see what feels right.
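
Putting the first two steps together, here's a rough sketch of how severity and frequency could drive the weights (hypothetical numbers, not the weights in the actual PR):

```ts
// Hypothetical sketch of deriving audit weights from severity and observed
// failure frequency. The numbers and thresholds are invented for illustration.
interface AuditMeta {
  id: string;
  impact: 'minor' | 'major' | 'critical'; // severity as reported by aXe
  failureRate: number; // e.g. from the HTTP Archive stats above (0..1)
}

const IMPACT_WEIGHT: Record<AuditMeta['impact'], number> = {
  minor: 1,
  major: 2,
  critical: 3,
};

// Audits that are both egregious and common end up with the highest weight.
function auditWeight(meta: AuditMeta): number {
  const frequencyBoost = meta.failureRate > 0.3 ? 2 : 1;
  return IMPACT_WEIGHT[meta.impact] * frequencyBoost;
}
```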

I'll ping this thread again when the PR for the re-weighted scores is up so folks can try it out if they're interested.

@robdodson
Contributor Author

Score re-weighting is in PR now if anyone is interested: #3515

@vinamratasingal-zz

Closing this out since the PR was merged :) If there are any outstanding issues, feel free to re-open.

@robdodson
Contributor Author

Short update: I'm going to see if I can work with the aXe folks at CSUN to look into fixing the non-applicable test results array.

@vinamratasingal-zz

Awesome, thanks Rob! :D

@thibaudcolas

thibaudcolas commented Feb 5, 2021

I know this is three years old, but I thought I should comment nonetheless to say that the "100% is misleading" point still feels true to this day, despite the addition of explanatory text and manual checks in #3834. It's a common misconception that automated checks can be enough, and Lighthouse presenting "no issues" as "100%" only reinforces it, to the point that Lighthouse is very frequently used as the go-to bad example to demonstrate this problem. I realise this is the same as other Lighthouse scores, but it feels worse for this one because of the more fundamental misconception about automation in accessibility testing.

Is there anything else that could be done to bring Lighthouse closer to other accessibility checkers, or otherwise expand upon what it already does to alleviate the confusion?

As a starting point, here is a comparison of how major automated accessibility checkers describe a "perfect score", compared to Lighthouse:

WAVE

0 errors
Congratulations! No errors were detected! Manual testing is still necessary to ensure compliance and optimal accessibility.

Accessibility Insights

Congratulations!
No failed automated checks were found. Continue investigating your website's accessibility compliance through manual testing using Tab stops and Assessment in Accessibility Insights for Web.

Axe

Congratulations!
axe found (0) issues automatically on this page.

Lighthouse

100
Accessibility
These checks highlight opportunities to improve the accessibility of your web app. Only a subset of accessibility issues can be automatically detected so manual testing is also encouraged.


The note could do with a stronger choice of words to start with. Perhaps cite how many issues Axe finds on average to bring the point home. Of course, this is still just a band-aid on the scoring problem; what would really be better is to de-emphasize the "100" and instead steer people towards caring about "0 issues".

Perhaps follow the approach of other checkers: upon reaching a perfect score, display something else that congratulates the tester and clearly guides them to manual tests as the obvious next step.
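
In report terms, that could be as small as a conditional on the issue count (purely a sketch, not Lighthouse's actual report renderer):

```ts
// Sketch of a "perfect score" presentation that emphasizes issues rather than
// a percentage (hypothetical; not Lighthouse's report rendering code).
function a11ySummary(issueCount: number): string {
  if (issueCount === 0) {
    return '0 issues detected automatically. Manual testing is still required ' +
           'to ensure your page is accessible.';
  }
  return `${issueCount} issues detected automatically. Fix these, then continue with manual testing.`;
}
```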
