-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Infer valid text to offer it as text/plain #97
base: master
Are you sure you want to change the base?
Conversation
int devnull = open("/dev/null", O_WRONLY); | ||
dup2(devnull, STDOUT_FILENO); | ||
dup2(devnull, STDERR_FILENO); | ||
execlp("iconv", "iconv", "-f", "utf-8", "-t", "utf-8", file_path, NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can potentially switch this to use iconv(3) instead or some other UTF-8 validity check.
Hi! So far, we use some heuristics and additionally hardcode some types that should be treated as text: wl-clipboard/src/util/string.c Lines 34 to 51 in 2c3cee1
Would it be not enough to just add CSV to the list? We could indeed try to detect if a file is actually textual; though I wonder if using iconv is the best/fastest way... |
I was initially planning to just add the CSV check, but I'm worried about the cat and mouse game of keeping up with unusual file types. Perhaps both the heuristics and the UTF-8 check as a last resort could be good? In that case it could be useful to also introduce negative heuristics to catch things that are guaranteed to not be text, such as audio/* and video/* and etc.
If this data is good and consistent, then it's probably possible to get rid of the heuristics altogether and use that, offering anything with a non-binary charset as text. |
Yeah I don't disagree with that; I'm just hesitant to start
Great find, this looks exactly like what we need, and it seems to handle those cases that are currently hardcoded just fine. I'm inclined to just switch from using
Perhaps you could help with finding out? |
At least going by the man pages and the official website, it looks like |
To deal with xdg-open issues and some MIME types such as application/csv not getting correctly picked up as text, it might be good to recognize inputs which are valid text and offer them as such.
This commit adds an
infer_is_text_plain_utf8
function, which is currently implemented by running an iconv conversion from UTF-8 into UTF-8, which will succeed only if the input is valid UTF-8 text. This function will run ifmime_type_is_text
is false. This probably isn't the most efficient implementation, since iconv will read through the entire file, however it should fail early for binary files more likely to be large in size.