Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infer valid text to offer it as text/plain #97

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

0x7D2B
Copy link

@0x7D2B 0x7D2B commented Oct 7, 2020

To deal with xdg-open issues and some MIME types such as application/csv not getting correctly picked up as text, it might be good to recognize inputs which are valid text and offer them as such.

This commit adds an infer_is_text_plain_utf8 function, which is currently implemented by running an iconv conversion from UTF-8 into UTF-8, which will succeed only if the input is valid UTF-8 text. This function will run if mime_type_is_text is false. This probably isn't the most efficient implementation, since iconv will read through the entire file, however it should fail early for binary files more likely to be large in size.

@0x7D2B 0x7D2B marked this pull request as ready for review October 7, 2020 13:34
int devnull = open("/dev/null", O_WRONLY);
dup2(devnull, STDOUT_FILENO);
dup2(devnull, STDERR_FILENO);
execlp("iconv", "iconv", "-f", "utf-8", "-t", "utf-8", file_path, NULL);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can potentially switch this to use iconv(3) instead or some other UTF-8 validity check.

@bugaevc
Copy link
Owner

bugaevc commented Oct 7, 2020

Hi!

So far, we use some heuristics and additionally hardcode some types that should be treated as text:

/* Common script and markup types */
int common
= strstr(mime_type, "json") != NULL
|| str_has_suffix(mime_type, "script")
|| str_has_suffix(mime_type, "xml")
|| str_has_suffix(mime_type, "yaml");
/* Special-case PGP and SSH keys.
* A public SSH key is typically stored
* in a file that has a name similar to
* id_rsa.pub, which xdg-mime misidentifies
* as being a Publisher file. Note that it
* handles private keys, which do not have
* a .pub extension, correctly.
*/
int special
= strstr(mime_type, "application/vnd.ms-publisher") != NULL
|| str_has_suffix(mime_type, "pgp-keys");

Would it be not enough to just add CSV to the list?

We could indeed try to detect if a file is actually textual; though I wonder if using iconv is the best/fastest way...

@0x7D2B
Copy link
Author

0x7D2B commented Oct 7, 2020

I was initially planning to just add the CSV check, but I'm worried about the cat and mouse game of keeping up with unusual file types. Perhaps both the heuristics and the UTF-8 check as a last resort could be good? In that case it could be useful to also introduce negative heuristics to catch things that are guaranteed to not be text, such as audio/* and video/* and etc.
For the actual check, I think there are different ways it could be done. I went with iconv to avoid bringing in the actual UTF-8 checking logic into the code, but there probably are improvements that could be made, such as only checking a limited part of the file.
One thing worth looking into might be using the file command for this check, since it will output the charset when running with the -i flag:

text.txt:  text/plain; charset=us-ascii
text.csv:  application/csv; charset=us-ascii
image.png: image/png; charset=binary
binary:    application/x-pie-executable; charset=binary

If this data is good and consistent, then it's probably possible to get rid of the heuristics altogether and use that, offering anything with a non-binary charset as text.

@bugaevc
Copy link
Owner

bugaevc commented Oct 7, 2020

I was initially planning to just add the CSV check, but I'm worried about the cat and mouse game of keeping up with unusual file types.

Yeah I don't disagree with that; I'm just hesitant to start iconving all the files you try to copy 🙂

One thing worth looking into might be using the file command for this check, since it will output the charset when running with the -i flag: <...> If this data is good and consistent, then it's probably possible to get rid of the heuristics altogether and use that, offering anything with a non-binary charset as text.

Great find, this looks exactly like what we need, and it seems to handle those cases that are currently hardcoded just fine. I'm inclined to just switch from using xdg-mime to that. But before doing that, I'd need to:

  • See how portable its output is (how does file -i behave on other systems?)
  • Check if there are any regressions — where xdg-mime identifies the file type one way and file -i disagrees, and xdg-mime is right

Perhaps you could help with finding out?

@0x7D2B
Copy link
Author

0x7D2B commented Oct 7, 2020

At least going by the man pages and the official website, it looks like file should be the same almost everywhere in the *nix world with the notable exception of OpenBSD. I'm not sure about how it compares to xdg-mime though, would need to take a closer look. It caught everything I threw at it so far though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
2 participants