June 09, 2011

Programming Anti-patterns in Shell Scripts

How to create a new file from inside a Shell script?
touch newfile1
# OR
echo > newfile2
# OR
: > newfile3

Which of these is better? "touch" is an external Unix command, while "echo" and ":" (the no-op) are shell built-ins. If you "time" the above operations, you will see that the first is very much slower than the second, which in turn is slightly slower than the third. I'll spare you the reasons.
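A rough way to check this yourself, sketched for Bash. The scratch directory and the iteration count are arbitrary assumptions, and the absolute timings will differ from system to system. The sketch also shows a side effect worth knowing: "echo >" leaves a single newline in the file, while the other two leave it empty.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory for the three test files.
tmpdir=$(mktemp -d)
n=500   # arbitrary iteration count

# External command: one fork/exec per iteration.
time for ((i = 0; i < n; i++)); do touch "$tmpdir/newfile1"; done

# Built-in echo: no fork, but it writes a newline into the file.
time for ((i = 0; i < n; i++)); do echo > "$tmpdir/newfile2"; done

# Built-in no-op with a redirection: no fork, file is truncated to 0 bytes.
time for ((i = 0; i < n; i++)); do : > "$tmpdir/newfile3"; done

# Compare the resulting sizes in bytes.
wc -c "$tmpdir"/newfile*
```

The "time" used here is the Bash keyword, which is what allows it to time a whole loop rather than a single external command.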

In my opinion this is a difference that is not highlighted enough. I have seen thousands of lines of "shell script" which are indistinguishable from "one-liners" typed at the command-line. The reason anyone writes something inside a script and not at the command-line is the assumption that it will be used more than once, and that alone is reason enough for "efficiency" to be a criterion. Nothing slows down a script like a bunch of one-liners in it. And scripts are slow to start with.

Contrary to popular belief, a Shell script is not a file with a bunch of lines each with some combination of Unix commands (called "commands text"). Shells are like all other languages, with many useful features. I have a few unoriginal guidelines for writing scripts (Shell, Perl, Batch, ...):
1. Use environment and internal variables. (Like $HOSTNAME, $MACHTYPE, $PWD in Bash.)
2. Use built-ins and libraries. (Shells have a type command to verify the existence of an equivalent built-in.)
3. Self-parse text. (Shells have built-in string manipulation and substring extraction functions with regexp support.)
4. Combine operations. (Many commands, internal and external, can take multiple arguments.)
5. Check fork rampage and /dev/null redirections. (They are used far more than necessary.)
6. Refer to the manual. (Again.)
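As an illustration, the first five guidelines might look like this in Bash. This is only a sketch: the path value and the directory names are made up for the example, and other shells may need small changes.

```shell
#!/usr/bin/env bash
# Guideline 1: use a variable the shell already maintains instead of running pwd.
here=$PWD

# Guideline 2: ask the shell whether a command is a built-in or an external program.
type echo
type touch

# Guideline 3: parse text with parameter expansion instead of cut/sed/awk.
path=/usr/local/bin/example.sh   # hypothetical sample value
file=${path##*/}                 # strip everything up to the last "/"
ext=${file##*.}                  # strip everything up to the last "."
stem=${file%.*}                  # strip the last "." and what follows it

# Guideline 4: one invocation with several arguments instead of several invocations.
tmpdir=$(mktemp -d)              # hypothetical scratch directory
mkdir "$tmpdir/a" "$tmpdir/b" "$tmpdir/c"

# Guideline 5: test with a built-in instead of forking and discarding output.
# Instead of: echo "$file" | grep '\.sh$' > /dev/null
if [[ $file == *.sh ]]; then
    echo "$file looks like a shell script"
fi
```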

Moving on to the more useful part. Below is a table containing many anti-patterns that can be found -- I came across each of them numerous times -- in Shell scripts along with suggested alternatives. It is a work in progress. I hope you find it useful and I hope you will help add more rows to it. Check out alternatives.sh for detailed examples of the anti-patterns and alternatives. It should work with various other Shells as well with minor changes. Combinations of these anti-patterns are common, and alternatives aren't necessarily the most efficient, nor are they efficient all the time.

30th June, 2011: TimeRatio is the ratio of time taken by Anti-pattern to that by Alternative (so values above 1 mean the anti-pattern is slower), as taken from two different trials, using Korn Shell 93 (dtksh) on Solaris 10 (Sun T5120). I hope these time ratios make it clearer why some of the anti-patterns in particular are to be avoided. My suggestion is to not take these numbers at face value.
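A ratio of this kind could be measured along the following lines. This is a Bash sketch and not the harness used for the published numbers. The sample string, the iteration count, and the choice of the basename row are assumptions. The anti-pattern's elapsed time is divided by the alternative's, so values above 1 mean the anti-pattern is slower, which matches the direction of the figures in the table.

```shell
#!/usr/bin/env bash
string="/usr/local/bin/example.sh"   # hypothetical sample input
n=500                                # arbitrary iteration count

TIMEFORMAT=%R   # make the "time" keyword print only elapsed seconds

# Anti-pattern: a command substitution plus an external basename per iteration.
anti=$( { time for ((i = 0; i < n; i++)); do base=$(basename "$string"); done; } 2>&1 )

# Alternative: parameter expansion, no fork at all.
alt=$( { time for ((i = 0; i < n; i++)); do base=${string##*/}; done; } 2>&1 )

echo "anti-pattern: ${anti}s  alternative: ${alt}s"

# The shell has no floating-point arithmetic, so one awk call does the division.
awk -v a="$anti" -v b="$alt" 'BEGIN {
    if (b > 0) printf "ratio: %.2f\n", a / b
    else print "alternative too fast to measure; raise n"
}'
```

A single run like this is noisy, so repeating the trial a few times and comparing the ratios is a sensible precaution.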

Anti-pattern      Alternative                 TimeRatio1  TimeRatio2
awk '{print}'     read -A                     1.31        1.32
awk | grep        awk                         1.09        1.09
awk | sed         awk                         1.53        1.53
awk | sort        awk                         2.18        2.18
basename          ${string##*/}               2.44        2.44
cat | awk         awk                         1.1         1.1
cat | grep        grep                        1.38        1.36
cat | head        head                        21.59       21.46
cat | read        read                        1.45        1.47
cat | sed         sed                         1.17        1.17
cat | tail        tail                        1.6         1.5
cat | wc          wc                          15.21       15.45
/usr/bin/cd       cd                          407.87      413.4
dirname           ${string%/*}                2.18        2.3
/usr/bin/echo     echo                        99.28       130.35
echo | awk        ($string)                   212.72      159.63
echo | cut        ($string)                   100.83      100.25
echo | cut | cut  substrings                  187.91      185.18
echo | grep       if [[ $string == regexp ]]  381.61      378.55
echo | sed        ${string/find/replace}      341.36      329.14
echo | wc -m      ${#string}                  161.39      165.7
echo | wc -w      ($string); ${#array[*]}     85.69       85.9
grep | awk        grep                        1.07        1.07
grep | grep       sed                         1           1.01
grep | sed        sed                         1.02        1.03
grep | grep -v    sed                         1.01        1
grep | wc         grep -c                     0.99        0.99
head              read, break                 1.2         1.21
head | awk        read -A, break              66.72       68.4
/usr/bin/kill     kill                        3.75        5.79
ls | awk          (ls)                        1.04        1.23
/usr/bin/pwd      $PWD                        305.72      271.46
sed | sed         sed                         1.09        1.05
tail | awk        (tail)                      1.48        1.44
/usr/bin/test     test                        210.81      203.97
/usr/bin/true     TRUE                        496.71      452.64
/usr/bin/ulimit   ulimit                      180.74      167.1
/usr/bin/umask    umask                       180.16      166.78
uname             $ENV                        3.24        3.21
wc | awk          (wc)                        13.53       11.67
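A few of the rows above, written out as a Bash sketch. The sample string and the file contents are invented for the example. The timings in the table were taken with ksh93, so some alternatives (such as read -A) need adjusting for other shells.

```shell
#!/usr/bin/env bash
string="/usr/local/bin/example script.sh"   # hypothetical sample input

# basename / dirname  ->  ${string##*/} / ${string%/*}
base=${string##*/}
dir=${string%/*}

# echo "$string" | wc -m  ->  ${#string}
# (wc -m also counts the newline that echo appends, so it reports one more.)
len=${#string}

# echo "$string" | sed 's/local/opt/'  ->  ${string/find/replace}
moved=${string/local/opt}

# echo "$string" | wc -w  ->  ($string); ${#array[*]}
words=($string)          # unquoted on purpose, so the shell splits on whitespace
count=${#words[*]}

# echo "$string" | grep ...  ->  [[ $string == pattern ]]
kind=other
if [[ $string == *.sh ]]; then
    kind=script
fi

# cat | grep | wc  ->  grep -c, and cat | wc  ->  wc reading the file directly
tmpfile=$(mktemp)        # hypothetical scratch file
printf 'alpha\nbeta\nalpha beta\n' > "$tmpfile"
matches=$(grep -c alpha "$tmpfile")
lines=$(wc -l < "$tmpfile")
rm -f "$tmpfile"

echo "$base | $dir | $len | $moved | $count | $kind | $matches | $lines"
```

The unquoted array assignment also performs filename expansion, so it is only a safe replacement when the string cannot contain glob characters such as "*".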

4 comments:

  1. Hi, I got here from your post on the Cygwin mailing list. Looks like very useful work. I suggest you put things on GitHub or a Wiki so people can contribute additions easily.

    Certainly bash_completion is one of the big inadequacies in Cygwin, and there seems to be a lot of room for improvement. The tricky thing is that it would probably be best to commit the changes upstream of the Cygwin package, but outside Cygwin the fork performance is better so people may be less interested in improving things.

  2. Hi Peter, thank you very much for dropping by and sharing your opinion.

    Not to whine, but one problem seems to be that I'm still so old-school that I haven't figured out how to put things on GitHub or a Wiki. I especially like the idea of putting it on a Wiki, so that others could add more anti-patterns. I'll see what I can do, but if you could point me to a how-to resource I'll be grateful.

    I agree that bash_completion can use a lot of help. I'm trying to contact the package owner directly.

    I agree about the fork performance being better outside Cygwin, but I still think that is a bad reason to ignore the problem, even on the Unices. A fork involves far too much work, including the underestimated, difficult-to-measure context switch (and back). Just to prove the point, I'm now also sharing the actual benefit I noticed in terms of total time taken on Solaris 10, Sun T5120. In this updated table, I would especially like to draw attention to the "echo | xyz" kind of anti-patterns which are among the worst and also the most common.

  3. Hi, sorry to have dropped this. I think if you are not familiar with GitHub and don't have a favorite free wiki host, just use the comments here to collect any more suggestions people might have.

    Did you hear anything from the bash_completion package owner? I think the key will be to find out what upstream code he/she is using to build the package; presumably it is a standard package like the Debian project's version. Then see if you can get interest in refactoring that upstream version. Hopefully you can, but if not, maybe the Cygwin package maintainer can be convinced to at least patch some of the more egregious cases.

  4. Thank you, Peter. I actually checked out GitHub for another project and it was very smooth and easy to use. I'm hosting that other project on GitHub now.

    I first contacted the todo.sh group about anti-patterns, and offered them my version: https://github.com/bsravanin/todo.txt-cli. (On my setup, their test harness took ~8m30s to complete all tests with the original version, and ~6m to complete all tests with my forked version.) The tool hasn't had a new release in a long time, so they're catching up with the other queued requests first. They said that they would consider pushing my changes upstream for a future release, but were apprehensive about sacrificing readability for performance at least in some of the changes I made. I've left it at that for now.

    I didn't contact the bash_completion package maintainer until now, after your reminder (thank you again). I will let you know if there's any progress on that front.
