What exactly was the point of [ “x$var” = “xval” ]?

In shell scripting you sometimes come across comparisons where each value is prefixed with "x". Here are some examples from GitHub:

if [ "x${JAVA}" = "x" ]; then
if [ "x${server_ip}" = "xlocalhost" ]; then
if test x$1 = 'x--help' ; then

I’ll call this the x-hack.

For any POSIX compliant shell, the value of the x-hack is exactly zero: this comparison works without the x 100% of the time. But why was it a thing?

Online sources like this stackoverflow Q&A are a little handwavy, saying it’s an alternative to quoting (most definitely NOT the case!), pointing towards issues with "some versions" of certain shells, or generally cautioning against the mystic behaviors of especially ancient Unix system without concrete examples.

To determine whether or not ShellCheck should warn about this, and if so, what its long form rationale should be, I decided to dig into the history of Unix with the help of The Unix Heritage Society‘s archives. I was unfortunately unable to peer into the closely guarded world of the likes of HP-UX and AIX, so dinosaur herders beware.

These are the cases I found that can fail.

Left-hand side matches a unary operator

The AT&T Unix v6 shell from 1973, at least as found in PWB/UNIX from 1977, would fail to run test commands whose left-hand side matched a unary operator. This must have been immediately obvious to anyone who tried to check for command line parameters:

% arg="-f"
% test "$arg" = "-f"
syntax error: -f
% test "x$arg" = "x-f"
(true)

This was fixed in the AT&T Unix v7 Bourne shell builtin in 1979. However, test and [ were also available as separate executables, and appear to have retained a variant of the buggy behavior:

$ arg="-f"
$ [ "$arg" = "-f" ]
(false)
$ [ "x$arg" = "x-f" ]
(true)

This happened because the utility used a simple recursive descent parser without backtracking, which gave unary operators precedence over binary operators and ignored trailing arguments.

The "modern" Bourne shell behavior was copied by the Public Domain KornShell in 1988, and made part of POSIX.2 in 1992. GNU Bash 1.14 did the same thing for its builtin [, and the GNU shellutils package that provided the external test/[ binaries followed POSIX, so the early GNU/Linux distros like SLS were not affected, nor was FreeBSD 1.0.

The x-hack is effective because no unary operators can start with x.

Either side matches string length operator -l

A similar issue that survived longer was with the string length operator -l. Unlike the normal unary predicates, this one was only parsed as part as part of an operand to binary predicates:

var="helloworld"
[ -l "$var" -gt 8 ] && echo "String is longer than 8 chars"

It did not make it into POSIX because, as the rationale puts it, "it was undocumented in most implementations, has been removed from some implementations (including System V), and the functionality is provided by the shell", referring to [ ${#var} -gt 8 ].

It was not a problem in UNIX v7 where = took precedence, but Bash 1.14 from 1996 would parse it greedily up front:

$ var="-l"
$ [ "$var" = "-l" ]
test: -l: binary operator expected
$ [ "x$var" = "x-l" ]
(true)

It was also a problem on the right-hand side, but only in nested expressions. The -l check made sure there was a second argument, so you would need an additional expression or parentheses to trigger it:

$ [ "$1" = "-l" -o 1 -eq 1 ]
[: too many arguments
$ [ "x$1" = "x-l" -o 1 -eq 1 ]
(true)

This operator was removed in Bash 2.0 later that year, eliminating the problem.

Left-hand side is !

Another issue in early shells was when the left-hand side was the negation operator !:

$ var="!"
$ [ "$var" = "!" ]
test: argument expected            (UNIX v7, 1979)
test: =: unary operator expected   (bash 1.14, 1996)
(false)                            (pd-ksh88, 1988)
$ [ "x$var" = "x!" ]
(true)

Again, the x-hack is effective by preventing the ! from being recognized as a negation operator.

ksh treated this the same as [ ! "=" ], and ignored the rest of the arguments. This quiety returned false, as = is not a null string. Ksh continues to ignore trailing arguments to this day:

$ [ -e / random words/ops here ]
(true)                              (ksh93, 2021)
bash: [: too many arguments         (bash5, 2021)

Bash 2.0 and ksh93 both fixed this problem by letting = take precedence in the 3-argument case, in accordance with POSIX.

Left-hand side is "("

This is by far my favorite.

The UNIX v7 builtin failed when the left-hand side was a left-parenthesis:

$ left="(" right="("
$ [ "$left" = "$right" ]
test: argument expected
$ [ "x$left" = "x$right" ]
(true)

This happens because the ( takes precedence over the =, and becomes an invalid parenthesis group.

Why is this my favorite? Behold Dash 0.5.4 up until 2009:

$ left="(" right="("
$ [ "$left" = "$right" ]
[: 1: closing paren expected
$ [ "x$left" = "x$right" ]
(true)

That was an active bug when the StackOverflow Q&A was posted.

But wait, there’s more!

Here’s Zsh in late 2015, right before version 5.3:

% left="(" right=")"
% [ "$left" = "$right" ]
(true)
% [ "x$left" = "x$right" ]
(false)

Amazingly, the x-hack could be used to work around certain bugs all the way up until 2015, seven years after StackOverflow wrote it off as an archaic relic of the past!

The bugs are of course increasingly hard to come across. The Zsh one only triggers when comparing left-paren against right-paren, as otherwise the parser will backtrack and figure it out.

Another late holdout was Solaris, whose /bin/sh was the legacy Bourne shell as late as Solaris 10 in 2009. However, this was undoubtedly for compatibility, and not because they believed this was a viable shell. A "standards compliant" shell had been an option for a long time before Solaris 11 dragged it kicking and screaming into 21th century — or at least into the 90s — by switching to ksh93 by default in 2011.

In all cases, the x-hack is effective because it prevents the operands from being recognized as parentheses.

Conclusion

The x-hack was indeed useful and effective against several real and practical problems in multiple shells.

However, the value was mostly gone by the mid-to-late 1990s, and the few remaining issues were cleaned up before 2010 — shockingly late, but still over a decade ago.

The last one managed to stay until 2015, but only in the very specific case of comparing opening parenthesis to a closed parenthesis in one specific non-system shell.

I think it’s time to retire this idiom, and ShellCheck now offers a style suggestion by default.

Epilogue

The Dash issue of [ "(" = ")" ] was originally reported in a form that affected both Bash 3.2.48 and Dash 0.5.4 in 2008. You can still see this on macOS bash today:

$ str="-e"
$ [ \( ! "$str" \) ]
[: 1: closing paren expected     # dash
bash: [: `)' expected, found ]   # bash

POSIX fixes all these ambiguities for up to 4 parameters, ensuring that shells conditions work the same way, everywhere, all the time.

Here’s how Dash maintainer Herbert Xu put it in the fix:

/*
 * POSIX prescriptions: he who wrote this deserves the Nobel
 * peace prize.
 */

Why is /usr/bin/test 4kiB smaller than /usr/bin/[ ?

Reddit user mathisweirdaf posted this interesting observation:

 $ ls -lh /usr/bin/{test,[}
-rwxr-xr-x 1 root root 59K  Sep  5  2019 '/usr/bin/['
-rwxr-xr-x 1 root root 55K  Sep  5  2019  /usr/bin/test

[ and test are supposed to be aliases for each other, and yet there is a 4kiB difference between their GNU coreutils binaries. Why?

First, for anyone surprised: yes, there is a /usr/bin/[. I have a previous post on this subject, but here’s a quick recap:

When you write if [ -e /etc/passwd ]; then .. that bracket is not shell syntax but just a regular command with a funny name. It’s serviced by /usr/bin/[, or (more likely) a shell builtin. This explains a lot of its surprising behavior, e.g. why it’s notoriously space sensitive: [1=2] is no more valid than ls-l/tmp.

Anyways, why is there a size difference? We can compare objdump output to see where the data goes. Here’s an excerpt from objdump -h /usr/bin/[:

                 size                                          offset
15 .text         00006e82  0000000000002640  0000000000002640  00002640  2**4
16 .fini         0000000d  00000000000094c4  00000000000094c4  000094c4  2**2
17 .rodata       00001e4c  000000000000a000  000000000000a000  0000a000  2**5

and here’s objdump -h /usr/bin/test:

15 .text         000068a2  0000000000002640  0000000000002640  00002640  2**4
16 .fini         0000000d  0000000000008ee4  0000000000008ee4  00008ee4  2**2
17 .rodata       00001aec  0000000000009000  0000000000009000  00009000  2**5

We can see that the .text segment (compiled executable code) — is 1504 bytes larger, while .rodata (constant values and strings) is 864 bytes larger.

Most crucially, the increased size of the .text segment causes it to go from the 8000s to the 9000s, crossing a 0x1000 (4096) page size boundary, and therefore shifting all other segments up by 4096 bytes. This is the size difference we’re seeing.

The only nominal difference between [ and test is that [ requires a ] as a final argument. Checking for that would be a very minuscule amount of code, so what are those ~1500 bytes used for?

Since it’s hard to inspect stripped binaries, I built my own copy of coreutils and compared the list of functions in each:

$ diff -u <(nm -S --defined-only src/[ | cut -d ' ' -f 2-) <(nm -S --defined-only src/test | cut -d ' ' -f 2-)
--- /dev/fd/63      2021-02-02 20:21:35.337942508 -0800
+++ /dev/fd/62      2021-02-02 20:21:35.341942491 -0800
@@ -37,7 +37,6 @@
 D __dso_handle
 d _DYNAMIC
 D _edata
-0000000000000099 T emit_bug_reporting_address
 B _end
 0000000000000004 D exit_failure
 0000000000000008 b file_name
@@ -63,7 +62,7 @@
 0000000000000022 T locale_charset
 0000000000000014 T __lstat
 0000000000000014 t lstat
-0000000000000188 T main
+00000000000000d1 T main
 000000000000000b T make_timespec
 0000000000000004 d nslots
 0000000000000022 t one_argument
@@ -142,16 +141,10 @@
 0000000000000032 T umaxtostr
 0000000000000013 t unary_advance
 00000000000004e5 t unary_operator
-00000000000003d2 T usage
+0000000000000428 T usage
 0000000000000d2d T vasnprintf
 0000000000000013 T verror
 00000000000000ae T verror_at_line
-0000000000000008 D Version
-00000000000000ab T version_etc
-0000000000000018 T version_etc_ar
-000000000000042b T version_etc_arn
-000000000000002f R version_etc_copyright
-000000000000007a T version_etc_va
 000000000000001c r wide_null_string.2840
 0000000000000078 T x2nrealloc
 000000000000000e T x2realloc

The major contributors are the version_etc* functions. What do they do?

Well, let’s have a look:

/* The three functions below display the --version information the
   standard way. [...]

These are 260 lines of rather elaborate, internationalized, conditional ways of formatting data that makes up --version output. Together they take about bc <<< "ibase=16; 7A+2F+42B+18+AB+8+99" = 1592 bytes.

What does this mean? Simple. This is what we’re paying an extra 4kb for:

$ /usr/bin/[ --version
[ (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Kevin Braunsdorf and Matthew Bradburn.

[ --version is missing the final ], so the invocation is invalid and the result is therefore implementation defined. GNU is free to let it output version info.

Meanwhile, /usr/bin/test --version is a valid invocation, and POSIX mandates that it simply returns success when the first parameter (--version) is a non-empty string.

This difference is even mentioned in the documentation:

NOTE: [ honors the --help and --version options, but test does not.
test treats each of those as it treats any other nonempty STRING.

Mystery solved!

(Exercise: what would have been the implications of having test support --help and --version in spite of POSIX?)

Zsh and Fish’s simple but clever trick for highlighting missing linefeeds

tl;dr: We look at how Zsh and Fish is able to indicate a missing terminating linefeed in program output when the Unix programming model precludes examining the output itself.

Most shells, including bash, ksh, dash, and ash, will show a prompt wherever the previous command left the cursor when it exited.

The fact that the prompt (almost) always shows up on the familiar left-most column of the next line is because Unix programs universally cooperate to park the cursor there when they exit.

This is done by always making sure to output a terminating linefeed \n (aka newline):

vidar@vidarholen-vm2 ~ $ whoami
vidar
vidar@vidarholen-vm2 ~ $ whoami | hexdump -c
0000000   v   i   d   a   r  \n  

If a program fails to follow this convention, the prompt will end up in the wrong place:

vidar@vidarholen-vm2 ~ $ echo -n "hello world"
hello worldvidar@vidarholen-vm2 ~ $

However, I recently noticed that zsh and fish will instead show a character indicating a missing linefeed, and still start the prompt where you’d expect to find it:

vidarholen-vm2% echo -n "hello zsh"
hello zsh% 
vidarholen-vm2%

vidar@vidarholen-vm2 ~> echo -n "hello fish"
hello fish⏎
vidar@vidarholen-vm2 ~> 

If you’re disappointed that this is what there’s an entire blog post about, you probably haven’t tried to write a shell. This is one of those problems where the more you know, the harder it seems (obligatory XKCD).

If you have a trivial solution in mind, maybe along the lines of if (!output.ends_with("\n")) printf("%\n");, consider the following restrictions*:

  • Contrary to popular belief, the shell does not sit between programs and the terminal. The shell has no ability to intercept or examine the terminal output of programs.
  • The terminal programming model is based on teletypes (aka TTYs), electromechanical typewriters from the early 1900s. They printed letter by letter onto paper, so there is no memory or screen buffer that can be programmatically read back.

Given this, here are some flawed ways to make it happen:

  • The shell could use pipes to intercept all output, and relay it onto the terminal. While it works in trivial cases like whoami, some programs check whether stdout is a terminal and change their behavior, others go over your head and talk to the TTY directly (e.g. ssh‘s password prompt), and some use TTY specific ioctls that fail if the output is not a TTY, such as querying window size or disabling local echo for password input.

  • The shell can ptrace the process to see what it writes where. This has a huge overhead and breaks sudo, ping, and other commands that rely on suid.

  • The shell can create a pseudo-tty (pty), run commands in that, and relay information back and forth much like ssh or script does. This is an annoying and heavy-handed approach, which in its ultimate form would require re-implementing an entire terminal emulator.

  • The shell can use ECMA-48 cursor position reporting features: printf '\e[6n' on a supported terminal will cause the terminal to simulate user input on the form ^[[y;xR where y and x is the row and column. The shell could then read this to figure out where the cursor is. These kinds of round trips are feasible, but somewhat slow and annoying to implement for such a simple feature.

Zsh and Fish instead have a much simpler and far more clever way of doing it:

  1. They always output the missing linefeed indicator, whether or not it’s needed.
  2. They then pad out the line with $COLUMN-1 spaces
  3. This is followed by a carriage return to move to the first column
  4. Finally, they show the prompt.

This solution is very simple because it only requires printing a fixed string before every prompt, but it’s highly effective on all terminals§.

Why?

Let’s pretend our terminal is 10 columns wide and 3 rows tall, and a canonical program just wrote a short string with a trailing linefeed:

[vidar     ]
[|         ]
[          ]

The cursor, indicated by |, is at the start of the line. This is what would happen in step 1 and 2:

[vidar      ]
[%         |]
[           ]

The indicator is shown, and since we have written exactly $COLUMN characters, the cursor is after the last column. Step 3, a carriage return, now moves it back to the start:

[vidar      ]
[|%         ]
[           ]

The prompt now draws over the indicator, and is shown on the same line:

[vidar      ]
[~ $ |      ]
[           ]

The final result is exactly the same as if we had simply written out the prompt wherever the cursor was.

Now, let’s look at what happens when a program does not output a terminating linefeed:

[vidar|     ]
[           ]
[           ]

The indicator is shown, but this time the spaces in step 2 causes the line to wrap all the way around to the next line:

[vidar%     ]
[     |     ]
[           ]

The carriage return moves the cursor back to the start of the next line:

[vidar%     ]
[|          ]
[           ]

The prompt is now shown on that line, and therefore doesn’t overwrite the indicator:

[vidar%     ]
[~ $ |      ]
[           ]

And there you have it. A seemingly simple problem turned out harder than expected, but a clever use of line wrapping made it easy again.

Now that we know the secret sauce, we can of course do the same thing in Bash:

PROMPT_COMMAND='printf "%%%$((COLUMNS-1))s\\r"'

* These same restrictions are reflected in several other aspects of Unix:

  • While useful and often requested, there is no robust way to get the output of the previously executed command.
  • It’s surprisingly tricky to take screenshots/dumps of terminals, and it only works on specific terminals.
  • The phenomenon of background process output cosmetically trashing foreground processes is well known, and yet there’s no solution

§ Fish developer and Hacker News reader ComputerGuru explains that there are many caveats related to various terminals’ line wrapping that make this trickier than shown here.

The curious pitfalls in shell redirections to $((i++))

ShellCheck v0.7.1 has just been released. It primarily has cleanups and bugfixes for existing checks, but also some new ones. The new check I personally find the most fascinating is this one, for an issue I haven’t really seen discussed anywhere before:

In demo line 6:
  cat template/header.txt "$f" > archive/$((i++)).txt
                                             ^
  SC2257: Arithmetic modifications in command redirections
          may be discarded. Do them separately.

Here’s the script in full:

#!/bin/bash
i=1
for f in *.txt
do
  echo "Archiving $f as $i.txt"
  cat template/header.txt "$f" > archive/$((i++)).txt
done

Seasoned shell scripter may already have jumped ahead, tried it in their shell, and found that the change is not discarded, at least not in their Bash 5.0.16(1):

bash-5.0$ i=0; echo foo > $((i++)).txt; echo "$i" 
1

Based on this, you may be expecting a quick look through the Bash commit history, and maybe a plea that we should be kind to our destitute brethren on macOS with Bash 3.

But no. Here’s the demo script on the same system:

bash-5.0$ ./demo
Archiving chocolate_cake_recipe.txt as 1.txt
Archiving emo_poems.txt as 1.txt
Archiving project_ideas.txt as 1.txt

The same is true for source ./demo, which runs the script in the exact same shell instance that we just tested on. Furthermore, it only happens in redirections, and not in arguments.

So what’s going on?

As it turns out, Bash, Ksh and BusyBox ash all expand the redirection filename as part of setting up file descriptors. If you are familiar with the Unix process model, the pseudocode would be something like this:

if command is external:
  fork child process:
    filename := expandString(command.stdout) # Increments i
    fd[1] := open(filename)
    execve(command.executable, command.args)
else:
  filename := expandString(command.stdout)   # Increments i
  tmpFd := open(filename)
  run_internal_command(command, stdout=tmpFD)
  close(tmpFD)

In other words, the scope of the variable modification depends on whether the shell forked off a new process in anticipation of executing the command.

For shell builtin commands that don’t or can’t fork, like echo, this means that the change takes effect in the current shell. This is the test we did.

For external commands, like cat, the change is only visible between the time the file descriptor is set up until the command is invoked to take over the process. This is what the demo script does.

Of course, subshells are well known to experienced scripters, and also described on this blog in the article Why Bash is like that: Subshells, but to me, this is a new and especially tricky source of them.

For example, the script works fine in busybox sh, where cat is a builtin:

$ busybox sh demo
Archiving chocolate_cake_recipe.txt as 1.txt
Archiving emo_poems.txt as 2.txt
Archiving project_ideas.txt as 3.txt

Similarly, the scope may depend on whether you overrode any commands with a wrapper function:

awk() { gawk "$@"; }
# Increments
awk 'BEGIN {print "hi"; exit;}' > $((i++)).txt
# Does not increment
gawk 'BEGIN {print "hi"; exit;}' > $((i++)).txt  

Or if you want to override an alias, the result depends on whether you used command or a leading backslash:

# Increments
command git show . > $((i++)).txt
# Does not increment
\git show . > $((i++)).txt

To avoid this confusion, consider following ShellCheck’s advice and just increment the variable separately if it’s part of the filename in a redirection:

anything > "$((i++)).txt"
: $((i++))

Thanks to Strolls on #bash@Freenode for pointing out this behavior.

PS: While researching this article, I found that dash always increments (though with $((i=i+1)) since it doesn’t support ++). ShellCheck v0.7.1 still warns, but git master does not.

So what exactly is -ffunction-sections and how does it reduce binary size?

If you’d like a more up-to-date version of ShellCheck than what Raspbian provides, you can build your own on a Raspberry Pi Zero in a little over 21 hours.

Alternatively, as of last week, you can also download RPi compatible, statically linked armv6hf binaries of every new commit and stable release.

It’s statically linked — i.e. the executable has all its library dependencies built in — so you can expect it to be pretty big. However, I didn’t expect it to be 67MB:

build@d1044ff3bf67:/mnt/shellcheck# ls -l shellcheck
-rwxr-xr-x 1 build build 66658032 Jul 14 16:04 shellcheck

This is for a tool intended to run on devices with 512MiB RAM. strip helps shed a lot of that weight, and the post-stripped number is the one we’ll use from now on, but 36MB is still more than I expected, especially given that the x86_64 build is 23MB.

build@d1044ff3bf67:/mnt/shellcheck# strip --strip-all shellcheck
build@d1044ff3bf67:/mnt/shellcheck# ls -l shellcheck
-rwxr-xr-x 1 build build 35951068 Jul 14 16:22 shellcheck

So now what? Optimize for size? Here’s ghc -optlo-Os to enable LLVM opt size optimizations, including a complete three hour Qemu emulated rebuild of all dependencies:

build@31ef6588fdf1:/mnt/shellcheck# ls -l shellcheck
-rwxr-xr-x 1 build build 32051676 Jul 14 22:38 shellcheck

Welp, that’s not nearly enough.

The real problem is that we’re linking in both C and Haskell dependencies, from the JSON formatters and Regex libraries to bignum implemenations and the Haskell runtime itself. These have tons of functionality that ShellCheck doesn’t use, but which is still included as part of the package.

Fortunately, GCC and GHC allow eliminating this kind of dead code through function sections. Let’s look at how that works, and why dead code can’t just be eliminated as a matter of course:

An ELF binary contains a lot of different things, each stored in a section. It can have any number of these sections, each of which has a pile of attributes including a name:

  • .text stores executable code
  • .data stores global variable values
  • .symtab stores the symbol table
  • Ever wondered where compilers embed debug info? Sections.
  • Exception unwinding data, compiler version or build IDs? Sections.

This is how strip is able to safely and efficiently drop so much data: if a section has been deemed unnecessary, it’s simple and straight forward to drop it without affecting the rest of the executable.

Let’s have a look at some real data. Here’s a simple foo.c:

int foo() { return 42; }
int bar() { return foo(); }

We can compile it with gcc -c foo.c -o foo.o and examine the sections:

$ readelf -a foo.o
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:        ELF32
  Data:         2's complement, little endian
  Version:      1 (current)
  OS/ABI:       UNIX - System V
  ABI Version:  0
  Type:         REL (Relocatable file)
  Machine:      ARM
[..]

Section Headers:
  [Nr] Name       Type      Addr   Off    Size   ES Flg Lk Inf Al
  [ 0]            NULL      000000 000000 000000 00      0   0  0
  [ 1] .text      PROGBITS  000000 000034 000034 00  AX  0   0  4
  [ 2] .rel.text  REL       000000 000190 000008 08   I  8   1  4
  [ 3] .data      PROGBITS  000000 000068 000000 00  WA  0   0  1
  [ 4] .bss       NOBITS    000000 000068 000000 00  WA  0   0  1
  [..]

Symbol table '.symtab' contains 11 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
   [..]
     9: 00000000    28 FUNC    GLOBAL DEFAULT    1 foo
    10: 0000001c    24 FUNC    GLOBAL DEFAULT    1 bar

There’s tons more info not included here, and it’s an interesting read in its own right. Anyways, both our functions live in the .text segment. We can see this from the symbol table’s Ndx column which says section 1, corresponding to .text. We can also see it in the disassembly:

$ objdump -d foo.o
foo.o:     file format elf32-littlearm

Disassembly of section .text:
00000000 <foo>:
   0:   e52db004   push    {fp}
   4:   e28db000   add     fp, sp, #0
   8:   e3a0302a   mov     r3, #42 ; 0x2a
   c:   e1a00003   mov     r0, r3
  10:   e28bd000   add     sp, fp, #0
  14:   e49db004   pop     {fp}
  18:   e12fff1e   bx      lr

0000001c <bar>:
  1c:   e92d4800   push    {fp, lr}
  20:   e28db004   add     fp, sp, #4
  24:   ebfffffe   bl      0 <foo>
  28:   e1a03000   mov     r3, r0
  2c:   e1a00003   mov     r0, r3
  30:   e8bd8800   pop     {fp, pc}

Now lets say that the only library function we use is foo, and we want bar removed from the final binary. This is tricky, because you can’t just modify a .text segment by slicing things out of it. There are offsets, addresses and cross-dependencies compiled into the code, and any shifts would mean trying to patch that all up. If only it was as easy as when strip removed whole sections…

This is where gcc -ffunction-sections and ghc -split-sections come in. Let’s recompile our file with gcc -ffunction-sections foo.c -c -o foo.o:

$ readelf -a foo.o
[..]
Section Headers:
  [Nr] Name          Type      Addr  Off  Size ES Flg Lk Inf Al
  [ 0]               NULL      00000 0000 0000 00      0   0  0
  [ 1] .text         PROGBITS  00000 0034 0000 00  AX  0   0  1
  [ 2] .data         PROGBITS  00000 0034 0000 00  WA  0   0  1
  [ 3] .bss          NOBITS    00000 0034 0000 00  WA  0   0  1
  [ 4] .text.foo     PROGBITS  00000 0034 001c 00  AX  0   0  4
  [ 5] .text.bar     PROGBITS  00000 0050 001c 00  AX  0   0  4
  [ 6] .rel.text.bar REL       00000 01c0 0008 08   I 10   5  4
  [..]

Symbol table '.symtab' contains 14 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
[..]
12: 00000000    28 FUNC    GLOBAL DEFAULT    4 foo
13: 00000000    28 FUNC    GLOBAL DEFAULT    5 bar

Look at that! Each function now has its very own section.

This means that a linker can go through and find all the sections that contain symbols we need, and drop the rest. We can enable it with the aptly named ld flag --gc-sections. You can pass that flag to ld via gcc using gcc -Wl,--gc-sections. And you can pass that whole thing to gcc via ghc using ghc -optc-Wl,--gc-sections

I enabled all of this in my builder’s .cabal/config:

program-default-options
  gcc-options: -Os -Wl,--gc-sections -ffunction-sections -fdata-sections
  ghc-options: -optc-Os -optlo-Os -split-sections

With this in place, the ShellCheck binary became a mere 14.5MB:

-rw-r--r-- 1 build build 14503356 Jul 15 10:01 shellcheck

That’s less than half the size we started out with. I’ve since applied the same flags to the x86_64 build, which brought it down from 23MB to 7MB. Snappier downloads and installs for all!


For anyone interested in compiling Haskell for armv6hf on x86_64, I spent weeks trying to get cross-compilation going, but in the end (and with many hacks) I was only able to cross-compile armv7. In the end I gave up and took the same approach as with the Windows build blog post: a Docker image runs the Raspbian armv6 userland in Qemu user emulation mode.

I didn’t even have to set up Qemu. There’s tooling from Resin.io for building ARM Docker containers for IoT purposes. ShellCheck (ab)uses this to run emulated GHC and cabal. Everything Just Works, if slowly.

The Dockerfile is available on GitHub as koalaman/armv6hf-builder.

Bash’s white collar eval: [[ $var -eq 42 ]] runs arbitrary code too

Did you know this bash snippet is open to arbitrary code execution from user input?

#!/bin/bash
read -rp "Enter guess: " num
if [[ $num -eq 42 ]]
then
  echo "Correct"
else
  echo "Wrong"
fi

Here’s an example:

$ ./myscript
Enter guess: 42
Correct

$ ./myscript
Enter guess: a[$(date >&2)]+42
Sun Feb  4 19:06:19 PST 2018
Correct

This is not a new discovery or recently introduced vulnerability, but it’s one of the lesser discussed issues in bash scripting. ksh behaves the same way, if you account for the lack of read -p.

The shell evaluates values in an arithmetic context in several syntax constructs where the shell expects an integer. This includes: $((here)), ((here)), ${var:here:here}, ${var[here]}, var[here]=.. and on either side of any [[ numerical comparator like -eq, -gt, -le and friends.

In this context, you can use or pass literal strings like "1+1" and they will evaluated as math expressions: var="1+1"; echo $((var)) outputs "2".

However, as demonstrated, it’s not just a bc style arithmetic expression evaluator, it’s more closely integrated with the shell: a="b"; b="c+1"; c=41; echo $((a)) will show 42.

And any value used as a scalar array subscript in an arithmetic context will be evaluated and interpreted as an integer — including command substitutions.

This issue is generally trivialized or ignored. It’s not pointed out in code reviews or example code, even by those who are well aware of this issue. It would be pretty tedious to use ${var//[!0-9-]/} for every comparison, and it’s often not relevant anyways.

This isn’t necessarily unfair, because arbitrary code execution is only a security issue if it also provides escalation of privileges, like when you’re doing math using values from untrusted files, or with user input from CGI scripts.

The initial example rather involves being on the other side of this airtight hatchway, where you can do the same thing much more easily by just running date instead of ./myscript in the first place.

Contrast this to eval.

It also frequently allows arbitrary code execution, but any use of it — safe or unsafe, on either side of said hatchway — is likely to elicit a number of comments ranging from cautionary to condescending.

"eval is evil", people say, even though no one ever said that "-eq is evil" or "$((foo+1)) considered harmful".

Why such a double standard?

I would argue that eval is considered bad not (just) because it’s unsafe, but because it’s lowbrow. Safety is just the easiest way to argue that people should use better constructs.

eval is considered lowbrow because it’s primary use case is letting newbies do higher level operations without really knowing how the shell works. You don’t have to know the difference between literal and syntactic quotes, or how to work with the order of expansion, as long as you can figure out roughly what you want to run and generate such commands.

Meanwhile, Bash arithmetic contexts are a convenient and useful feature frequently employed and preferred by advanced scripters.

Here’s an example of a script where you can run ./myscript 1 or ./myscript 2 to get "Foo" or "Bar" respectively. You’ve probably seen (and at some point tried) the variable-with-numeric-suffix newbie pattern:

#!/bin/bash
var1="Foo"
var2="Bar"
eval echo "\$var$1"

This example would get a pile of comments about how this is unsafe, how ./myscript '1; rm -rf /' will give you a bad time, so think of the children and use arrays instead:

#!/bin/bash
var[1]="Foo"
var[2]="Bar"
echo "${var[$1]}"

As discussed, this isn’t fundamentally more secure: ./myscript '1+a[$(rm -rf /)]' will still give you a bad time. However, since it now uses the approprate shell features in a proper way, most people will give it a pass.

Of course, this is NOT to say that anyone should feel free to use eval more liberally. There are better, easier, more readable and more correct ways to do basically anything you would ever want to do with it. If that’s not reason enough, the security argument is still valid.

It’s just worth being aware that there’s a double standard.

I have no particular goal or conclusion in mind, so what do you think? Should we complain about eval less and arithmetic contexts more? Should we ignore the issue as a lie-to-children until eval makes it clear a user is ready for The Talk about code injection? Should we all just switch to the fish shell?

Post your thoughts here or on whichever site you came from.