Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mask control characters in filenames #118

Closed
wants to merge 7 commits into from
Closed

Conversation

Larhzu
Copy link
Member

@Larhzu Larhzu commented May 1, 2024

The command line tools print filenames and some other user-specified strings to standard output or standard error. Malicious strings, for example, from filenames could contain control characters that affect the state of the terminal.

These commits add a function to replace the single-byte control characters with question marks. This is simple but hopefully good enough in practice.

@Larhzu Larhzu self-assigned this May 1, 2024
@Larhzu
Copy link
Member Author

Larhzu commented May 2, 2024

Single-byte control character masking isn't enough. At least Konsole and Xfce Terminal (but not uxterm) interpret C1 control codes and CSI sequences in en_US.UTF-8 locale.

$ printf 'foo\u009b3Dbar\n'
bar

$ printf 'a\u0090 Can you see me? \u009cb\n'
ab

A proper masking method must decode multibyte characters. It must tolerate invalid multibyte sequences and restart decoding from the next byte.

extern const char *
tuklib_mask_cntrl(const char *str)
{
static char *mem = NULL;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Threading is utilized, so this src/common function should probably have static thread_local char *mem to avoid future issues.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thread_local needs C11 but the codebase is C99 for now. The function is documented to be thread unsafe. It's called from the main thread only.

The function is really primitive as it doesn't even provide more than one memory slot. That is, one cannot print a message that needs two masked strings. It's easy to expand but the first version doesn't need to do more than is required right now.

/// \file tuklib_cntrl_chars.h
/// \brief Find and replace single-byte control characters
///
/// This is a small and very simple implementation that uses isnctrl(3),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isnctrl => iscntrl

@Larhzu Larhzu force-pushed the mask_cntrl_chars branch from d5e6830 to 6639a08 Compare May 7, 2024 11:51
@Larhzu
Copy link
Member Author

Larhzu commented May 8, 2024

The new version should handle all relevant multibyte character sets, not just UTF-8.

Instead of looking for control characters, it now looks for non-printable characters which is a much stricter check. A possible downside is that an old C library might not recognize newer printable Unicode characters even though the user might be using them already. I suppose it's not a real problem. :-) Gnulib's quotearg looks for printable characters as well.

tuklib_mask_nonprint isn't thread safe although it would be straightforward to add a variant that takes a char **mem argument. That wouldn't require C11's optional feature thread_local from <threads.h> or a compiler-specific thread-local extension (although those are quite widely available). Then it would be possible to have more than one masked string available at the same time in case one needs to print two filenames at once. But the simplest version is enough for now.

I hope I didn't miss any string that should be masked.

@Larhzu Larhzu force-pushed the mask_cntrl_chars branch from 6639a08 to 4468a0e Compare May 13, 2024 20:18
Larhzu added 7 commits May 22, 2024 16:26
In multibyte locales, some control characters are multibyte too, for
example, terminals interpret C1 control characters (U+0080 to U+009F)
that are two bytes as UTF-8. Thus, multibyte character sets have to
be handled.

Instead of checking for control characters with iswcntrl(), this
uses iswprint() to detect printable characters. This is much stricter.

Gnulib's quotearg would do a lot more but I hope such a thing isn't
needed here.

Thanks to Ryan Colyer for the discussion about the problems of
the earlier single-byte-only method.

Thanks to Christian Weisgerber for reporting a bug in an earlier
version of this code.

Thanks to Jeroen Roovers for a typo fix.
Call tuklib_mask_nonprint() on filenames and also on a few other
strings from the command line too. The lack of this feature has been
listed in TODO since 2009: 5f6dddc
This prepares for tuklib_mask_nonprint() from tuklib_mbstr_nonprint.c.
It has locale-specific behavior (LC_CTYPE).
@Larhzu Larhzu force-pushed the mask_cntrl_chars branch from 4468a0e to 3ad4a25 Compare May 22, 2024 17:25
@Larhzu Larhzu closed this in 40e5733 Dec 18, 2024
@Larhzu
Copy link
Member Author

Larhzu commented Dec 18, 2024

I merged this with a few changes:

  • One filename usage wasn't masked in list.c.

  • Don't reject non-printable characters in xz --suffix=.SOME_BAD_CHARS. It would have been problematic especially if LC_CTYPE is C and one tries to use fancy chars in a suffix. Only mask the chars when printing the suffix or derived filename in messages.

  • I added a note to a commit message about xz --robot --list.

  • Handle U+FFFD specially on Windows.

@Larhzu Larhzu deleted the mask_cntrl_chars branch December 22, 2024 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants