Async Clipboard API: Read unsanitized HTML.

Author:

Introduction

HTML content is essential for high-fidelity copy/paste between native apps and websites, especially on sites that support document editing. The DataTransfer object's getData method and the Async Clipboard API's read() method differ in how they sanitize HTML during a paste operation. getData returns unsanitized HTML content, whereas read() runs the browser's sanitizer, which strips content (e.g. global <style>s, <script>s, and <meta> tags) from the HTML markup. The result is format loss and a bloated payload, because styles are inlined.

Example of content getting stripped out and styles getting inlined

Clipboard content:

Version:0.9
StartHTML:0000000105
EndHTML:0000000363
StartFragment:0000000141
EndFragment:0000000327
<html>
<body>
<!--StartFragment--><head><script>alert('hello');</script><style> p {font-color: red; background-color: blue;mso-font-charset:0;mso-ignore:padding;mso-rotate:0;}</style></head> <body><p>html text</p></body><!--EndFragment-->
</body>
</html>

After read() was called with the default sanitizer, the HTML markup returned was:

<p style="background-color: blue; color: rgb(0, 0, 0); font-size: medium; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; white-space: normal; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">html text</p>

In the above example, the <script> and <style> tags were removed, the custom mso-* styles were stripped out, and the styles associated with the <p> element were inlined.

These problems mean that web developers may not get the same HTML paste quality and performance from the async clipboard read() API as they do from the DataTransfer object's getData method. This proposal aims to solve these problems so that read() works as well as getData when pasting HTML content.

Goals

  • Preserve copy/paste fidelity when reading/writing the HTML format on the clipboard.
  • Have parity with the existing DataTransfer object's getData method.
  • Allow browsers that write unsanitized HTML content to the clipboard to roundtrip HTML content better.
  • Build on the existing Async Clipboard API, by leveraging existing:
    • Structure, like asynchronous design and ClipboardItem.
    • Protections, like permissions model, and secure-context/active-frame requirements of the API.

Non-goals

  • Modify the design of the original Async Clipboard API where it is not relevant to the unsanitized HTML format.
  • Add support for unsanitized read/write of supported formats other than text/html.
  • Drag-and-Drop APIs.

Additional Background

The HTML format is supported by three APIs:

DataTransfer object's getData

The DataTransfer object can be accessed via the paste event handler and getData can be used to get the clipboard data in a specific format. Authors can call preventDefault to prevent the browser's default paste action and create their own app-specific paste implementation. The getData API does not perform sanitization and always returns unsanitized HTML to the caller. E.g.

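document.addEventListener('paste', function(e) {
    // Returns the raw, unsanitized HTML markup from the clipboard.
    const html = e.clipboardData.getData('text/html');
    // Suppress the browser's default paste so the app can handle `html` itself.
    e.preventDefault();
});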

Copy/paste execCommand

execCommand is used to invoke the copy/paste command which uses the browser's default logic to read/write the clipboard content.

pasteExecCommandBtn.addEventListener("click", function(e) {
  var pasteTarget = document.createElement("textarea");
  pasteTarget.contentEditable = true;
  document.body.appendChild(pasteTarget);
  pasteTarget.focus();
  const result = document.execCommand("paste");
});

Async HTML read APIs

This API is accessed via the navigator.clipboard object and reads HTML from the clipboard asynchronously, without listening for a clipboard event or calling execCommand. This gives web authors more flexibility and better performance than the other APIs. E.g.

paste.onclick = async () => {
    try {
        const clipboardItems = await navigator.clipboard.read();
        const clipboardItem = clipboardItems[0];
        const customTextBlob = await clipboardItem.getType('text/html');
        logDiv.innerText = await customTextBlob.text();
        console.log('Text pasted.');
    } catch(e) {
        console.log('Failed to read clipboard');
    }
};

All of the above-mentioned APIs should allow web authors to read HTML content with equally high fidelity.

Paste HTML content using getData

Chrome

Version:0.9
StartHTML:0000000105
EndHTML:0000000509
StartFragment:0000000141
EndFragment:0000000473
<html>
<body>
<!--StartFragment--><html><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><meta name=ProgId content=Excel.Sheet><meta name=Generator content="Microsoft Excel 15"><style>table {height=10px}</style></head><body link="#0563C1" vlink="#954F72"><p style='color: red; font-style: oblique;'>This text was copied using </p></body></html><!--EndFragment-->
</body>
</html>

Firefox

Version:0.9
StartHTML:00000097
EndHTML:00000499
StartFragment:00000131
EndFragment:00000463
<html><body>
<!--StartFragment--><html><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><meta name=ProgId content=Excel.Sheet><meta name=Generator content="Microsoft Excel 15"><style>table {height=10px}</style></head><body link="#0563C1" vlink="#954F72"><p style='color: red; font-style: oblique;'>This text was copied using </p></body></html><!--EndFragment-->
</body>
</html>

Safari

When content is copied as the text/html MIME type via the setData method, Safari writes both sanitized and unsanitized versions of the HTML content. It writes the unsanitized HTML into a custom WebKit format type (com.apple.Webkit.custom-pasteboard-data), while the built-in public.html format contains the sanitized fragment. When getData is called from a site in the same origin as the copy, the HTML content in the custom WebKit format type is returned, which makes round-tripping possible. For cross-origin sites, the sanitized HTML fragment from the public.html format is returned.

In Chromium & Firefox:

When getData is called, the HTML string is read without sanitization, i.e. global styles, script tags, and meta tags are not removed from the markup. On Windows, the string also contains the header information, which is hardcoded (ui::clipboard_util::HtmlToCFHtml) during copy and then written to the clipboard.
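
A consumer of the unsanitized string on Windows may want to separate that header from the markup. The following is a minimal sketch, assuming the payload has the same shape as the Chrome and Firefox samples above (header lines followed by markup with fragment comments). `parseCfHtml` is a hypothetical helper name, and it locates the fragment via the `<!--StartFragment-->` and `<!--EndFragment-->` comments rather than the byte offsets in the header.

```javascript
// Hypothetical helper (not part of any browser API): splits a Windows HTML
// clipboard payload into its header fields and the copied fragment.
function parseCfHtml(payload) {
  const header = {};
  // Everything before the first '<' is treated as "Key:Value" header lines.
  const markupStart = payload.indexOf('<');
  const headerText = markupStart === -1 ? payload : payload.slice(0, markupStart);
  for (const line of headerText.split(/\r?\n/)) {
    const sep = line.indexOf(':');
    if (sep > 0) {
      header[line.slice(0, sep).trim()] = line.slice(sep + 1).trim();
    }
  }

  // Assumption: the fragment is delimited by the comment markers, as in the
  // samples above. Returns null when the markers are missing.
  const startMarker = '<!--StartFragment-->';
  const endMarker = '<!--EndFragment-->';
  const start = payload.indexOf(startMarker);
  const end = payload.lastIndexOf(endMarker);
  const fragment = start !== -1 && end > start
    ? payload.slice(start + startMarker.length, end)
    : null;

  return { header, fragment };
}
```

Header values are returned as strings (for example `header.StartHTML`), so callers that need the numeric offsets would have to convert them.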

Proposal

This proposal introduces a new unsanitized parameter to the read() method so that HTML content can be read without any loss of information, i.e. read({ unsanitized: ['text/html'] }) returns the content without any sanitization.

IDL changes

dictionary ClipboardUnsanitizedFormats {
    sequence<DOMString> unsanitized;
};

[
    SecureContext,
    Exposed=Window
] interface Clipboard : EventTarget {
    [CallWith=ScriptState]
    Promise<sequence<ClipboardItem>> read(optional ClipboardUnsanitizedFormats formats = {});
};

Read()

Follow the algorithm specified in read(), except for the steps below:

  1. If a text/html representation is present in the ClipboardItem and text/html is present in the unsanitized list, then:
    1. If the size of the unsanitized list is greater than 1, throw a "Reading multiple unsanitized formats is not supported." exception.
    2. If text/html is not at the first position in the unsanitized list, throw a "The unsanitized type {formatName} is not supported." exception.
    3. Otherwise, return the blobData as-is, without any sanitization.
  2. Otherwise, follow the existing sanitization behavior described in step 3.
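
The decision logic in these steps can be sketched in JavaScript as follows. This is only an illustration, not browser code: `shouldSkipSanitization` is a hypothetical name, and the use of a plain `Error` is an assumption, since the steps do not name the exception type.

```javascript
// Hypothetical sketch of the steps above. Returns true when the text/html
// blob should be returned as-is, false when the existing sanitizer applies.
function shouldSkipSanitization(itemTypes, unsanitized = []) {
  const htmlType = 'text/html';
  if (itemTypes.includes(htmlType) && unsanitized.includes(htmlType)) {
    // Step 1.1: only a single unsanitized format may be requested.
    if (unsanitized.length > 1) {
      throw new Error('Reading multiple unsanitized formats is not supported.');
    }
    // Step 1.2: text/html must be the first entry. Kept to mirror the steps,
    // although the length check above already leaves text/html as the only entry.
    if (unsanitized[0] !== htmlType) {
      throw new Error(`The unsanitized type ${unsanitized[0]} is not supported.`);
    }
    // Step 1.3: return the blob data without sanitization.
    return true;
  }
  // Step 2: fall back to the existing sanitization behavior.
  return false;
}
```

For example, `shouldSkipSanitization(['text/html'], ['text/html'])` returns `true`, while omitting the second argument returns `false`.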

JS example

const html_text = new Blob([
    '<html><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><meta name=ProgId content=Excel.Sheet><meta name=Generator content="Microsoft Excel 15">',
    '<style>body {font-family: HK Grotesk; background-color: var(--color-bg);}</style></head><body><div>hello</div></body></html>'
], {type: 'text/html'});

const clipboard_item = new ClipboardItem({
    'text/html': html_text     /* Sanitized format. */
});

await navigator.clipboard.write([clipboard_item]);

// Read the unsanitized HTML format using the `unsanitized` option.
const clipboardItems = await navigator.clipboard.read({ unsanitized: ['text/html'] });
const blobOutput = await clipboardItems[0].getType('text/html');
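
To get from the returned blob to a markup string, a caller could wrap the read in a small helper. The sketch below assumes the `unsanitized` option behaves as proposed here. `readUnsanitizedHtml` is a hypothetical name, and the clipboard object is passed in as a parameter so the function can be exercised with a stub outside the browser.

```javascript
// Hypothetical helper: returns the unsanitized text/html markup from the
// first clipboard item that carries it, or null if no item does.
async function readUnsanitizedHtml(clipboard) {
  const items = await clipboard.read({ unsanitized: ['text/html'] });
  for (const item of items) {
    // ClipboardItem.types lists the MIME types available on the item.
    if (item.types.includes('text/html')) {
      const blob = await item.getType('text/html');
      return blob.text();
    }
  }
  return null;
}
```

In a page this would be called as `await readUnsanitizedHtml(navigator.clipboard)` from a user-initiated event handler. Errors from `read()` (for example, a denied permission) propagate to the caller.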

Privacy and Security

This feature introduces an unsanitized option that returns unsanitized text/html content. This content will be exposed to both native apps and websites.

Websites and native apps already read unsanitized content via the DataTransfer API's getData() method. In this proposal, web authors must explicitly specify the unsanitized option in the async clipboard read() method to access the raw text/html content from the clipboard. The feature relies on the Async Clipboard API, which has a user gesture requirement on top of its existing security measures, to mitigate security and privacy concerns.

For more details see the security-privacy doc.

Some examples of native apps that do sanitization themselves during paste.

User Gesture Requirement

On top of Async Clipboard API requirements for focus, secure context, and permission, use of this API will require a transient user activation, so that the site will not be able to silently read or write clipboard information.

This requirement is now enforced for the Async Clipboard API overall. Notably, Safari already requires a user gesture for all Async Clipboard API interactions.
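
A page can check some of these preconditions itself before calling read(), for example to show a friendlier message. The following is a rough sketch rather than a required pattern. `clipboardReadBlockers` is a hypothetical name, the window object is passed in so the function can be tested with a stub, and the browser still enforces the actual requirements regardless of what this returns. Permission state is deliberately not checked here.

```javascript
// Hypothetical pre-check: lists the requirements that are visibly unmet.
// An empty array does not guarantee that read() will succeed.
function clipboardReadBlockers(win) {
  const blockers = [];
  if (!win.isSecureContext) {
    blockers.push('secure context required');
  }
  if (!win.document.hasFocus()) {
    blockers.push('document must be focused');
  }
  // navigator.userActivation may be missing in older browsers; in that case
  // the activation state is unknown and no blocker is reported for it.
  const activation = win.navigator.userActivation;
  if (activation && !activation.isActive) {
    blockers.push('transient user activation required');
  }
  return blockers;
}
```

In a page, `clipboardReadBlockers(window)` would typically be called at the start of a click handler, just before the clipboard read.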

Permissions

Due to concerns regarding permission fatigue and comprehensibility, and due to the limited utility of a permission, no new permission will be introduced for unsanitized clipboard access. Given that Clipboard API read and write are already permitted, unsanitized clipboard read and write will be permitted as-is.

Alternatives considered

Web custom formats can be used to exchange unsanitized HTML if both the source and target apps support them. However, many native apps do not support web custom formats, so content copied from these apps in the HTML format would still go through the browser's sanitizer in read(), resulting in a loss of fidelity.

Stakeholder Feedback / Opposition

  • Implementers:
  • Stakeholders:
    • Excel Online : Positive
    • Adobe : Positive
    • Google Sheets : Positive

Excel's issues with sanitization

Custom Office styles are stripped out if the default sanitizer is used to read HTML data from the clipboard. The Excel app inserts these styles to preserve Excel-specific semantics. Additional problems are discussed in this doc.

Google Sheets' issues with sanitization

crbug.com/1493388: The empty table cells are dropped because of a bug in the sanitizer.

References & acknowledgements

Many thanks for valuable feedback and advice from:

Reference Documents: