Infer valid text to offer it as text/plain #97

0x7D2B · 2020-10-07T13:34:41Z

To deal with xdg-open issues and some MIME types such as application/csv not getting correctly picked up as text, it might be good to recognize inputs which are valid text and offer them as such.

This commit adds an infer_is_text_plain_utf8 function. It is currently implemented by running an iconv conversion from UTF-8 to UTF-8, which succeeds only if the input is valid UTF-8 text. The function runs only if mime_type_is_text is false. This probably isn't the most efficient implementation, since iconv reads through the entire file; however, it should fail early on binary files, which are the ones more likely to be large.
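As a rough illustration of the approach described above, here is a minimal sketch of a subprocess-based check. The function name file_is_valid_utf8_via_iconv is hypothetical, and the fork/waitpid scaffolding is an assumption about how such a helper could be wired up; only the iconv invocation and the /dev/null redirection come from the diff in this PR.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not the PR's actual function): returns true only
 * if `iconv -f utf-8 -t utf-8 file_path` exits with status 0. */
bool file_is_valid_utf8_via_iconv(const char *file_path) {
    assert(file_path != NULL);

    pid_t pid = fork();
    if (pid < 0) {
        return false; /* could not fork: do not claim the file is text */
    }
    if (pid == 0) {
        /* Child: discard iconv's output, we only care about the exit code. */
        int devnull = open("/dev/null", O_WRONLY);
        if (devnull < 0) {
            _exit(127);
        }
        dup2(devnull, STDOUT_FILENO);
        dup2(devnull, STDERR_FILENO);
        execlp("iconv", "iconv", "-f", "utf-8", "-t", "utf-8",
               file_path, (char *) NULL);
        _exit(127); /* exec failed, e.g. iconv is not installed */
    }

    /* Parent: wait for iconv and inspect how it exited. */
    int status;
    if (waitpid(pid, &status, 0) < 0) {
        return false;
    }
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

In this sketch a missing iconv binary, an unreadable file, and invalid UTF-8 all produce the same answer (false), so the caller simply falls back to not offering text/plain.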

0x7D2B · 2020-10-07T13:37:51Z

src/util/files.c

+        int devnull = open("/dev/null", O_WRONLY);
+        dup2(devnull, STDOUT_FILENO);
+        dup2(devnull, STDERR_FILENO);
+        execlp("iconv", "iconv", "-f", "utf-8", "-t", "utf-8", file_path, NULL);


This could potentially be switched to use iconv(3) instead, or some other UTF-8 validity check.
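For the "some other UTF-8 validity check" option, a hand-rolled in-process validator might look roughly like the sketch below. The name buffer_is_valid_utf8 is hypothetical and nothing here is taken from wl-clipboard; it checks an in-memory buffer for well-formed UTF-8 as defined by RFC 3629, rejecting overlong encodings, surrogates, and truncated sequences.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the len bytes at buf are well-formed UTF-8. */
bool buffer_is_valid_utf8(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);

    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;      /* continuation bytes expected after the lead byte */
        unsigned int min;  /* smallest code point allowed for this length */
        unsigned int cp;   /* code point being decoded */

        if (c < 0x80) {
            i++;           /* plain ASCII */
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; min = 0x80;    cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; min = 0x800;   cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; min = 0x10000; cp = c & 0x07;
        } else {
            return false;  /* stray continuation byte or invalid lead byte */
        }

        if (len - i - 1 < extra) {
            return false;  /* sequence is cut off at the end of the buffer */
        }
        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80) {
                return false;  /* not a continuation byte */
            }
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min) return false;                      /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return false;                 /* beyond Unicode */

        i += 1 + extra;
    }
    return true;
}
```

Note that a strict validator like this accepts NUL bytes, so a file consisting only of zeros would still count as valid UTF-8; whether that should be offered as text is a separate policy question.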

bugaevc · 2020-10-07T13:56:09Z

Hi!

So far, we use some heuristics and additionally hardcode some types that should be treated as text:

wl-clipboard/src/util/string.c

Lines 34 to 51 in 2c3cee1

    
    /* Common script and markup types */
    int common
        = strstr(mime_type, "json") != NULL
        || str_has_suffix(mime_type, "script")
        || str_has_suffix(mime_type, "xml")
        || str_has_suffix(mime_type, "yaml");

    /* Special-case PGP and SSH keys.
     * A public SSH key is typically stored
     * in a file that has a name similar to
     * id_rsa.pub, which xdg-mime misidentifies
     * as being a Publisher file. Note that it
     * handles private keys, which do not have
     * a .pub extension, correctly.
     */
    int special
        = strstr(mime_type, "application/vnd.ms-publisher") != NULL
        || str_has_suffix(mime_type, "pgp-keys");

Wouldn't it be enough to just add CSV to the list?

We could indeed try to detect if a file is actually textual; though I wonder if using iconv is the best/fastest way...
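For comparison, the "just add CSV to the list" route would amount to one more suffix test next to the existing ones. The sketch below is only an illustration: has_suffix is a stand-in for the project's str_has_suffix (whose exact signature is assumed here), and mime_type_looks_like_csv is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Stand-in for wl-clipboard's str_has_suffix(); the real one may differ. */
bool has_suffix(const char *s, const char *suffix) {
    assert(s != NULL && suffix != NULL);
    size_t n = strlen(s);
    size_t m = strlen(suffix);
    return n >= m && strcmp(s + n - m, suffix) == 0;
}

/* Hypothetical extra clause for the heuristic: matches both
 * text/csv and application/csv, since only the suffix is checked. */
bool mime_type_looks_like_csv(const char *mime_type) {
    return has_suffix(mime_type, "csv");
}
```

This covers the case from the PR description, but every new text-like type would need its own clause, which is the maintenance concern raised in the discussion.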

0x7D2B · 2020-10-07T15:15:08Z

I was initially planning to just add the CSV check, but I'm worried about the cat-and-mouse game of keeping up with unusual file types. Perhaps using both the heuristics and the UTF-8 check as a last resort could be good? In that case it could also be useful to introduce negative heuristics to catch things that are guaranteed not to be text, such as audio/*, video/*, etc.
For the actual check, I think there are different ways it could be done. I went with iconv to avoid bringing in the actual UTF-8 checking logic into the code, but there probably are improvements that could be made, such as only checking a limited part of the file.
One thing worth looking into might be using the file command for this check, since it will output the charset when running with the -i flag:

text.txt:  text/plain; charset=us-ascii
text.csv:  application/csv; charset=us-ascii
image.png: image/png; charset=binary
binary:    application/x-pie-executable; charset=binary

If this data is good and consistent, then it's probably possible to get rid of the heuristics altogether and use that, offering anything with a non-binary charset as text.
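To make the "non-binary charset" rule concrete, here is a small sketch that classifies a single line of file -i output such as the ones shown above. The function name file_output_says_text is hypothetical, and the parsing assumes the charset=NAME field always appears in the form shown in these examples; how file is invoked and how its output is captured is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: given one line of `file -i` output, returns true
 * if it reports a charset other than "binary". */
bool file_output_says_text(const char *line) {
    assert(line != NULL);
    static const char key[] = "charset=";

    /* Use the last occurrence, in case the file name itself
     * happens to contain "charset=". */
    const char *p = NULL;
    for (const char *q = line; (q = strstr(q, key)) != NULL; q++) {
        p = q;
    }
    if (p == NULL) {
        return false; /* no charset reported: do not assume text */
    }
    p += sizeof(key) - 1;

    /* The charset name runs until whitespace, ';' or end of line. */
    size_t n = strcspn(p, " \t\r\n;");
    if (n == 0) {
        return false;
    }
    return !(n == 6 && strncmp(p, "binary", 6) == 0);
}
```

With the sample output from this comment, text.txt and text.csv would be offered as text, while image.png and the executable would not.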

bugaevc · 2020-10-07T15:38:56Z

> I was initially planning to just add the CSV check, but I'm worried about the cat and mouse game of keeping up with unusual file types.

Yeah I don't disagree with that; I'm just hesitant to start iconving all the files you try to copy 🙂

> One thing worth looking into might be using the file command for this check, since it will output the charset when running with the -i flag: <...> If this data is good and consistent, then it's probably possible to get rid of the heuristics altogether and use that, offering anything with a non-binary charset as text.

Great find, this looks exactly like what we need, and it seems to handle those cases that are currently hardcoded just fine. I'm inclined to just switch from using xdg-mime to that. But before doing that, I'd need to:

- See how portable its output is (how does file -i behave on other systems?)
- Check if there are any regressions, that is, cases where xdg-mime identifies the file type one way, file -i disagrees, and xdg-mime is right

Perhaps you could help with finding out?

0x7D2B · 2020-10-07T16:07:37Z

At least going by the man pages and the official website, it looks like file should be the same almost everywhere in the *nix world, with the notable exception of OpenBSD. I'm not sure how it compares to xdg-mime, though; I would need to take a closer look. So far it has caught everything I've thrown at it.

Add a UTF-8 validity check

dd3ec1b

0x7D2B marked this pull request as ready for review October 7, 2020 13:34

groner mentioned this pull request Oct 29, 2022

Runtime control of mime_type_is_text() overrides #150

Open

bugaevc mentioned this pull request Jan 5, 2023

Shorten temporary files lifetime, various fixes/refactoring #155

Draft

bugaevc mentioned this pull request Apr 9, 2023

Can not copy some text YaLTeR/wl-clipboard-rs#35

Open

bugaevc mentioned this pull request Nov 7, 2023

dependency in x11-utils? #202

Closed
