@c ch01_bin.texi

@node    Theme 1, Theme 2, Acknowledgements, Top
@chapter Theme 1: ``/bin''

@tie{}@image{img/E}very project has to start somewhere, so let us
begin at the beginning.

Guile can be used as a scripting language.  Programs can be written as
plain text files, and then run from the command line by using the
Guile interpreter.  As such, most scripts run on Unix-like shells will
begin with a sha-bang @code{#!} invocation.  And most scripts must
start off doing the same chores: parsing the command line, acting on
the options, and finding the files whose names appeared in the
command-line arguments.

To introduce these mundane concepts, our first theme is @emph{/bin}, that
is, re-implementing some common Unix tools.  This will get us warmed up.

These examples should demonstrate
@itemize
@item
How to set up the sha-bang invocation for Guile scripts run from Unix
shells.
@item
How to handle command line arguments
@item
How to map file names given as command line arguments to their files
@item
How to search for files and directories
@item
How to open files, both as binary data and as encoded text data
@end itemize
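
As a minimal sketch of the first two points, a Guile script might begin
as shown here.  The interpreter path @file{/usr/bin/guile} and the
procedure name @code{main} are assumptions made for illustration;
adjust them to suit your system.

@example
#!/usr/bin/guile \
-e main -s
!#
;; Guile treats everything from the first line down to the closing
;; "!#" as a block comment.  The "\" asks Guile to read the second
;; line for more switches: "-e main" names the entry point and "-s"
;; says that the script file itself follows.

(define (main args)
  ;; ARGS is the whole command line; its first element is the
  ;; script name, so skip it and print the rest, one per line.
  (for-each (lambda (arg)
              (display arg)
              (newline))
            (cdr args)))
@end example

The same list is also available at any time from the procedure
@code{command-line}.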

And so, without further ado, here are the examples.

@menu
* Problem 1::                   echo and cat
* Problem 2::                   ls
* Problem 3::                   LZW Compression
* Problem 4::                   tar file archives
@end menu

@node    Problem 1, Problem 2, Theme 1, Theme 1
@section Problem 1: Echo and Cat

In this problem, two venerable Unix commands are re-implemented in
Scheme: @command{echo} and @command{cat}.  @command{echo} prints out
the command-line arguments, and @command{cat} prints a file to the
terminal.

@subsection The Requirements for `echo' and `cat'

In this problem, as in many of the problems, we'll lay out the
requirements for a program, and then see how our volunteer implemented
the requirements.  For the purpose of this exercise, the requirements
for @command{echo} and @command{cat} will be drawn from the POSIX
standard@footnote{@ref{IEEE 2004}}, with a couple of minor
modifications.

@heading Echo

The @command{echo} script writes its arguments to the standard output,
followed by a <newline>.  If there are no arguments, it just prints a
<newline>.

@command{echo} has no command-line options.  Even @option{--help} and
@option{--version} are not treated as command-line options.

If any of the arguments contain the backslash character (@code{\}),
the argument is modified.  Backslash introduces an escape.  These
escapes are parsed from logical left to right.

@table @code
@item \a
Write an <alert> in place of @code{\a}.
@item \b
Write a <backspace> in place of @code{\b}.
@item \c
Suppress the <newline> that would otherwise be written after the
command-line arguments.  The @code{\c} is not written, any remaining
characters in this argument are not written, and any remaining
arguments are not written.
@item \f
Write a <form-feed> in place of @code{\f}.
@item \n
Write a <newline> in place of @code{\n}.
@item \r
Write a <carriage-return> in place of @code{\r}.
@item \t
Write a <tab> in place of @code{\t}.
@item \v
Write a <vertical-tab> in place of @code{\v}.
@item \\
Write a single backslash character in place of the pair of backslash characters.
@item \0@i{num}
Write an 8-bit character corresponding to @i{num}, an octal number
between octal 0 and octal 377 (decimal 255) inclusive.
@end table

A backslash at the end of a command line argument will not be escaped.
The backslash will be written.  However, the exit value will be 1 in
this case.

A backslash followed by any other character not listed in the table
will not be escaped.  The backslash will be written, and the
character that follows it will be written.  However, in this
case, the exit value will be 1.

For the octal escape @code{\0}, it is important to note that this
value is not an ISO-8859-x position or a Unicode code point but
rather a raw 8-bit byte to be sent unencoded to the standard output.
It is up to the operator, not @command{echo}, to ensure that a
character sequence that is valid for the environment's locale is being
sent.

If a @code{\0} escape is present, but is not followed by an octal
number, the raw byte zero is written.

If a @code{\0} escape is present and is followed by an octal
number of more than three digits, only the first three digits will be
interpreted as being part of the escape.

If a @code{\0} escape is present and its octal value is greater than
377, print nothing. In this case, the exit value will be 1.

An octal escape may not have unnecessary initial zeros.  For example
@itemize
@item
@code{\01} should output raw byte 1
@item 
@code{\001} should output raw byte zero followed by the string ``01''
@item
@code{\0001} should output raw byte zero followed by the string ``001''
@end itemize

The digits 8 and 9 are not part of an octal escape.  For example, the
string @code{\018} shall be output as the raw byte 1 followed by the
character for the numeral 8.
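
The octal rules above can be sketched as a small parser.  This is only
an illustration of one reading of the requirements, not the
volunteer's code; the names @code{octal-digit?} and
@code{parse-octal-escape} are hypothetical.

@example
(define (octal-digit? c)
  (and (char>=? c #\0) (char<=? c #\7)))

;; STR is one command-line argument and START is the index of the
;; character just past a "\0".  Returns two values: the byte to
;; write, or #f when the value exceeds octal 377, and the index of
;; the first character that is not part of the escape.
(define (parse-octal-escape str start)
  (let ((len (string-length str)))
    (if (or (>= start len)
            (not (octal-digit? (string-ref str start)))
            (char=? (string-ref str start) #\0))
        ;; No number follows, or the number would start with an
        ;; unnecessary zero: the escape is raw byte zero.
        (values 0 start)
        ;; Otherwise consume at most three octal digits.
        (let loop ((i start) (value 0))
          (if (and (< i len)
                   (< (- i start) 3)
                   (octal-digit? (string-ref str i)))
              (loop (+ i 1)
                    (+ (* 8 value)
                       (- (char->integer (string-ref str i))
                          (char->integer #\0))))
              (values (if (> value #o377) #f value) i))))))
@end example

A caller would write the returned byte, or write nothing and remember
to exit with 1 when the first value is @code{#f}, and then resume
scanning the argument at the returned index.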

Remember that command-line arguments and file names may contain any
character allowed by the current locale.

In all other cases, the exit value will be zero.

@heading Cat

@command{cat} [OPTION]... [FILE]...

@command{cat} concatenates files or standard input and prints them to
the standard output.

This version of @command{cat} supports three command-line options,
each with a short and a long form.
@table @option
@item -u --unbuffered
Do no buffering.  Write bytes from the input to the standard output
without delay as each character is read.
@item -h --help
Print out command help.
@item -v --version
Print out the program name and version number.
@end table

After the command-line options, a list of file names is expected.  The
contents of the files are printed to standard output.  No character
encoding or decoding of the contents of the files should be performed:
they should be transmitted unmodified.

If the special file name @file{-} (hyphen) is given, at that point the
contents of the standard input will be transmitted to the standard
output.

If one of the files does not exist, or if it cannot be opened, the
program will print a descriptive error message to the standard error
and will return the exit code 1.

Otherwise, the exit code is zero.
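
The byte-for-byte copying and the error handling described above might
be sketched as follows.  This is not the volunteer's solution; the
procedure names @code{copy-port-bytes} and @code{cat-file} are
hypothetical, and the sketch assumes a Guile that provides the
@code{(ice-9 binary-ports)} module and the @code{#:binary} keyword of
@code{call-with-input-file}.

@example
(use-modules (ice-9 binary-ports))

;; Copy every byte from IN to OUT without decoding anything.
(define (copy-port-bytes in out)
  (let loop ((byte (get-u8 in)))
    (unless (eof-object? byte)
      (put-u8 out byte)
      (loop (get-u8 in)))))

;; Print one operand.  Returns #t on success and #f on failure, so
;; that the caller can choose the exit code.
(define (cat-file name)
  (catch 'system-error
    (lambda ()
      (if (string=? name "-")
          (copy-port-bytes (current-input-port)
                           (current-output-port))
          (call-with-input-file name
            (lambda (in)
              (copy-port-bytes in (current-output-port)))
            #:binary #t))
      #t)
    (lambda (key . args)
      ;; system-error-errno expects the key and the arguments
      ;; together in one list.
      (format (current-error-port) "cat: ~a: ~a~%" name
              (strerror (system-error-errno (cons key args))))
      #f)))
@end example

Copying one byte at a time is the simplest way to honor the
@option{-u} behavior; a buffered variant would read larger chunks.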

@heading Rules and Suggestions for the Volunteer

For this exercise, the volunteer was requested to use only Guile's
library functions.  No external libraries were allowed.

The volunteer was also requested to attempt to use one of Guile's
two sets of functions that help parse command-line
options: the @code{ice-9 getopt-long} module and the @code{srfi
srfi-37} module.  As you shall see, the volunteer did manage to
use @code{ice-9 getopt-long} for @command{cat}.
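
For readers who have not met @code{getopt-long} before, here is a
minimal sketch of how the three @command{cat} options could be
declared.  The names @code{option-spec} and @code{parse-cat-args} are
hypothetical, and the volunteer's script may be organized differently.

@example
(use-modules (ice-9 getopt-long))

;; Each entry names a long option, gives its short form, and says
;; that it takes no value.
(define option-spec
  '((unbuffered (single-char #\u) (value #f))
    (help       (single-char #\h) (value #f))
    (version    (single-char #\v) (value #f))))

;; ARGS is the full command line, program name included.  Returns
;; four values: the three flags and the list of file operands.
(define (parse-cat-args args)
  (let ((options (getopt-long args option-spec)))
    (values (option-ref options 'unbuffered #f)
            (option-ref options 'help #f)
            (option-ref options 'version #f)
            ;; The key '() collects the non-option arguments.
            (option-ref options '() '()))))
@end example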

@subsection An Implementation of `echo' and `cat'

Chris K Jester-Young was the volunteer for this section, and here are
his solutions, with some annotations by the editor.

@page
@heading `cat'

First we have @command{cat}.  One interesting thing to note in this
example is the use of @code{catch} to catch system errors that may
arise if files do not exist or cannot be opened.

@verbatiminclude code/cat.scm
@page

@heading `echo'

Next up is @command{echo}. 

@verbatiminclude code/echo.scm
@page

@node    Problem 2, Problem 3, Problem 1, Theme 1
@section Problem 2: `ls'

In this section, we investigate the most famous Unix command of all
time: @command{ls}.  @command{ls} lists files or directories, and
displays their properties.

However, @command{ls} has accumulated dozens of options over the past
decades.  A feature-complete @command{ls} would be too long to make a
usable example.  So, this script is constrained to the most important
command-line options.

The command @command{ls} lists information about files, directories,
and the contents of directories.  Basically, for this challenge, the
script should operate like a limited-functionality version of
POSIX
@command{ls}@footnote{@url{http://pubs.opengroup.org/onlinepubs/009695399/utilities/ls.html,
The POSIX spec for @command{ls}}}.

@subsection The Requirements for a Limited @command{ls}

This script only recognizes a limited set of command-line options:
@itemize
@item
@option{-a} - display all matching files, including those whose name
begins with a period
@item
@option{-l} - use the long output format
@item
@option{-R} - recursively descend into subdirectories
@end itemize
Any other command-line arguments that begin with a hyphen should cause
an ``invalid option'' error, and the program will be terminated with a
non-zero exit code.

The command-line option @option{-R} will recursively print the
contents of any subdirectory encountered.

The command-line option @option{-l} has two effects.  One, information
about the files will be printed in the long format.  Two, when given a
symbolic link to a directory, the command will print information about
the symbolic link itself and not the file or directory to which it
points.

@heading Operands

If a command-line argument does not begin with a hyphen, it is treated
as an operand.

When called without operands, the contents of the current directory
are printed.

Operands must be the names of files, directories, or symbolic
links.  When an operand that is not one of the above is encountered,
the script should print a descriptive error and exit with a non-zero
return code.

If an operand is a file, @command{ls} will print the name of the file.
If an operand is a symbolic link to a file, the command will print the
name of the link.  If an operand is a directory, @command{ls} will
print out the contents of that directory.  If an operand is a symbolic
link to a directory, @command{ls} will print the contents of that
directory, unless the @option{-l} option is given.

When printing the contents of a directory, files and directories
that begin with <period> are usually not printed.  If the command-line
option @option{-a} is given, files and directories that begin with
<period> are printed.

@heading Output

There are two output formats: the default format and the long format.

Within each directory, the files are sorted in case-insensitive
alphabetical order according to the current locale.
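
One way to obtain this ordering, sketched here under the assumption
that the @code{(ice-9 i18n)} module is available, is to sort with
@code{string-locale-ci<?}.  The name @code{sort-names} is
hypothetical.

@example
(use-modules (ice-9 i18n))

;; A real script would first install the user's locale, typically
;; with (setlocale LC_ALL ""), so that the comparison follows it.
(define (sort-names names)
  ;; `sort' returns a new list and leaves NAMES untouched.
  (sort names string-locale-ci<?))
@end example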

In the default format, the filenames are output one per line.  You can
print them out in a columnar format if you like, though.

In the long format, the file information will be printed as follows
@multitable @columnfractions 0.2 0.1 0.68
@headitem
Field @tab Length @tab Description
@item
Type @tab 1 @tab
`d' for directory@*
`-' for regular file@*
`b' for block special file@*
`l' for symbolic link@*
`c' for character special file@*
`p' for fifo
@item
User Read @tab 1 @tab
`r' if readable by the owner@*
`-' otherwise
@item
User Write @tab 1 @tab
`w' if writable by the owner@*
`-' otherwise
@item
User Execute @tab 1 @tab
`S' if the file is not executable and the set-user-ID
mode is set@*
`s' if the file is executable and the set-user-ID mode is
set@*
`x' if the file is executable or the directory is searchable by
the owner@*
`-' otherwise
@item
Group Read @tab 1 @tab
`r' if readable by the group@*
`-' otherwise
@item
Group Write @tab 1 @tab
`w' if writable by the group@*
`-' otherwise
@item
Group Execute @tab 1 @tab
`S' if the file is not executable and
the set-group-ID mode is set@*
`s' if the file is executable and the
set-group-ID mode is set@*
`x' if the file is executable or the
directory is searchable by members of this group@*
`-' otherwise
@item
Other Read @tab 1 @tab
`r' if readable by others@*
`-' otherwise
@item 
Other Write @tab 1 @tab
`w' if writable by others@*
`-' otherwise
@item
Other Execute @tab 1 + space @tab
`T' if the file is a directory and the
search permission is not granted to others and the restricted
deletion flag is set@*
`t' if the file is a directory and the search
permission is granted to others and the restricted deletion flag is
set@*
`x' if the file is executable or the directory is searchable by
others@*
`-' otherwise
@item
Link Count @tab @tab
For a directory, number of immediate
subdirectories it has plus one for itself plus one for its parent.
The link count for a file is one.
@item
Owner Name @tab @tab
@item
Group Name @tab @tab
@item
File Size @tab @tab in bytes
@item
Date & Time @tab @tab
``month day hour:minute'' format if the file has
been modified in the last six months, or ``month day year'' format
otherwise
@item
Pathname @tab @tab
For non-links, the path@*
For links, ``<link name> -> <path to linked file or directory>''
@end multitable
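
The nine permission characters of the long format can be derived from
the permission bits with a little bit-masking.  The following is a
sketch rather than the contributed code: the names @code{bit-set?},
@code{rwx}, and @code{permission-string} are hypothetical, and, as a
simplification, the `t' and `T' flags are shown for any file type
rather than for directories only.

@example
(define (bit-set? perms mask)
  (not (zero? (logand perms mask))))

;; Build one three-character group.  SPECIAL is the set-ID or
;; restricted-deletion bit that changes the execute character to
;; LOWER (execute bit set) or UPPER (execute bit clear).
(define (rwx perms r w x special lower upper)
  (string
   (if (bit-set? perms r) #\r #\-)
   (if (bit-set? perms w) #\w #\-)
   (cond ((and (bit-set? perms special) (bit-set? perms x)) lower)
         ((bit-set? perms special) upper)
         ((bit-set? perms x) #\x)
         (else #\-))))

;; PERMS is an integer such as the one returned by stat:perms.
(define (permission-string perms)
  (string-append
   (rwx perms #o400 #o200 #o100 #o4000 #\s #\S)   ; owner
   (rwx perms #o040 #o020 #o010 #o2000 #\s #\S)   ; group
   (rwx perms #o004 #o002 #o001 #o1000 #\t #\T))) ; others
@end example

For instance, @code{(permission-string #o755)} yields
@code{"rwxr-xr-x"} and @code{(permission-string #o4755)} yields
@code{"rwsr-xr-x"}.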

The exit code should be zero except in those error cases described
above.

For more information about @command{ls}, you can consult The Open
Group Base Specifications Issue 6, or the documentation of any BSD or
GNU version of @command{ls}. 

@heading Rules and Suggestions for the Volunteer

For this challenge, only Guile's library functions have been used.

@subsection An Implementation of `ls'

Jez Ng contributed a script to these specifications.  It is an
interesting solution.

One thing to note is how he has decided to truly minimize the scope of
the procedures by declaring procedures within procedures.

Unsurprisingly, the majority of the script involves getting the format
right for long output.

@page
@verbatiminclude code/ls.scm
@page

@node    Problem 3, Problem 4, Problem 2, Theme 1
@section Problem 3: LZW Compression

Lempel-Ziv-Welch compression is the basis of both the UNIX Compress
program and of GIF encoding.  Today's challenge has two parts.
@itemize
@item
Write `compress' and `uncompress' procedures for LZW compression.
@item
Use them to make `compress' and `uncompress' scripts.
@end itemize

@subsection The Requirements for Compression Procedures and Scripts

First up are the compression procedures.  Good old LZW compression: a
staple problem of undergraduate computer science classes.

@heading @code{lzw-compress} and @code{lzw-uncompress}

@deffn {Guile Procedure} lzw-compress input-bv #:key table-size dictionary
This procedure should take a bytevector presumed to contain 8-bit
unsigned integers, and it should return a bytevector containing 16-bit
unsigned integers in little-endian format.  

@var{input-bv} is the input bytevector.

@var{table-size} is an optional parameter that indicates the maximum
number of entries in the dictionary.  This parameter is limited to the
range 258 - 65536.  The default value of @var{table-size} is 65536.

@var{dictionary} is an optional parameter that modifies the output.
When true, the procedure should return both the output 16-bit
bytevector as well as the dictionary or hash table created by the
compression routine.  Since the formation of the dictionary is up to
the implementer, the output format of the dictionary is unspecified.
@end deffn

Probably the best write-up on LZW compression is the one by Mark Nelson
over at @uref{http://marknelson.us/2011/11/08/lzw-revisited/}. Refer
to that article for details on LZW compression.

It is possible to fill up the dictionary.  In that case, one continues
to use the dictionary as it is, without adding new entries.

As is common, the first 256 entries in the dictionary -- entries #0 to
#255 -- are initialized to the single-byte values 0 to 255.  Entry #256
is not to be used.  Entries #257 to #(table-size - 1) will contain the
multi-byte entries in the dictionary.
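
To make the dictionary rules concrete, here is a compact sketch of the
compression loop.  It is not the contributed implementation: the names
@code{lzw-codes} and @code{codes->bytevector} are hypothetical, the
@var{dictionary} option is omitted, and the dictionary is assumed to be
a hash table keyed on pairs of a prefix code and the next byte.

@example
(use-modules (rnrs bytevectors))

;; Return the list of LZW codes for INPUT-BV.
(define* (lzw-codes input-bv #:key (table-size 65536))
  (let ((dict (make-hash-table))
        (len (bytevector-length input-bv)))
    (if (zero? len)
        '()
        (let loop ((i 1)
                   (prefix (bytevector-u8-ref input-bv 0))
                   (next-code 257)     ; entry 256 stays unused
                   (codes '()))
          (if (= i len)
              (reverse (cons prefix codes))
              (let* ((byte (bytevector-u8-ref input-bv i))
                     (key (cons prefix byte))
                     (code (hash-ref dict key)))
                (cond
                 ;; Known sequence: keep extending it.
                 (code
                  (loop (+ i 1) code next-code codes))
                 ;; New sequence and room left: record it.
                 ((< next-code table-size)
                  (hash-set! dict key next-code)
                  (loop (+ i 1) byte (+ next-code 1)
                        (cons prefix codes)))
                 ;; Dictionary full: emit without recording.
                 (else
                  (loop (+ i 1) byte next-code
                        (cons prefix codes))))))))))

;; Pack the codes as 16-bit little-endian unsigned integers.
(define (codes->bytevector codes)
  (let ((bv (make-bytevector (* 2 (length codes)))))
    (let loop ((codes codes) (offset 0))
      (unless (null? codes)
        (bytevector-u16-set! bv offset (car codes)
                             (endianness little))
        (loop (cdr codes) (+ offset 2))))
    bv))
@end example

With this sketch, the seven bytes of ``ABABABA'' compress to the four
codes 65, 66, 257, and 259.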

@deffn {Guile Procedure} lzw-uncompress input-bv #:key table-size dictionary
Similarly, this procedure takes @var{input-bv}, the bytevector created
by @code{lzw-compress}, and an optional table size, and returns the
8-bit unsigned bytevector of uncompressed data.  @var{dictionary},
when true, causes the procedure to also return its dictionary or hash
table.
@end deffn

Daniel Hartwig contributed an implementation of these compression
routines.

There are a couple of interesting techniques of which to take note.
First, if you C programmers have ever wondered how to create a static
variable in a function, @code{make-serial-number-generator} shows the
Scheme analog of that technique.
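
In general terms, the technique is to close over a local variable.
The following sketch illustrates the idea; it is not necessarily the
definition used in the contributed code, and the starting value is an
assumption.

@example
;; Each call to make-serial-number-generator creates a fresh,
;; private counter.  The returned procedure is the only code that
;; can see or change it, much like a C static local variable.
(define (make-serial-number-generator)
  (let ((current 0))
    (lambda ()
      (set! current (+ current 1))
      current)))

(define next-serial (make-serial-number-generator))
@end example

Successive calls to @code{(next-serial)} return 1, then 2, and so on,
while a second generator made the same way would start again at 1.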

@page
@verbatiminclude code/lzw.scm
@page


@heading The `compress' and `uncompress' scripts

Once the procedures are working, it is a simple task to write scripts
that use them.  So we'll write scripts that are simplified versions of
the Unix commands `compress' and `uncompress'.  These scripts will
manipulate files with the following format.

Each file will begin with a 3 byte header.
@itemize
@item
Byte 1: @code{#x1F}
@item
Byte 2: @code{#x9D}
@item
Byte 3: Dictionary size, given as an 8-bit unsigned number between 9
and 16 inclusive.  A value of @var{n} indicates a dictionary size of
2^@var{n}, that is, between 2^9 and 2^16 entries.
@end itemize

The rest of the file is the LZW-compressed 16-bit binary data stored
in little-endian format.
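
As an illustration of the header layout, the following sketch builds
and checks the three header bytes.  The names @code{make-z-header} and
@code{z-header-table-size} are hypothetical and do not come from the
contributed scripts.

@example
(use-modules (rnrs bytevectors))

;; Build the 3-byte header for a dictionary of 2^BITS entries.
(define (make-z-header bits)
  (if (and (integer? bits) (exact? bits) (<= 9 bits 16))
      (u8-list->bytevector (list #x1F #x9D bits))
      (error "bits must be an integer from 9 to 16:" bits)))

;; Validate HEADER and return the dictionary size it announces.
(define (z-header-table-size header)
  (let ((bits (and (>= (bytevector-length header) 3)
                   (= (bytevector-u8-ref header 0) #x1F)
                   (= (bytevector-u8-ref header 1) #x9D)
                   (bytevector-u8-ref header 2))))
    (if (and bits (<= 9 bits 16))
        (expt 2 bits)
        (error "incorrect file header"))))
@end example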

(Note that this may not be compatible with your operating system's
version of @command{compress}.  The @command{compress} file format is
not 100% consistent across platforms.)

@example
compress [-v] [-b bits] [name ...]
@end example

For each filename, @command{compress} will create an LZW-compressed
version of the input file.  The compressed file will have the same
filename as the input file with the ".Z" extension appended to it.  If
the compression is successful and the output file is successfully
written, the input file will be deleted.

If no filenames are given, @command{compress} will take the contents
of stdin and send the compressed data to stdout.

The optional @option{-b} @code{bits} parameter will indicate the
maximum size of the dictionary.  If @code{bits} is given, it must be
between 9 and 16, indicating maximum dictionary sizes of
@code{2^bits}.

If the optional @option{-v} parameter is given, the script should
print to stdout the compression ratio for each file processed.  If no
file was specified and this program is thus compressing stdin to
stdout, this flag is ignored.

Compress should fail with appropriate error messages if any of the
following problems occur

@itemize
@item
The command-line has unknown options or is otherwise incorrect
@item
The command line argument after a @option{-b} is out of range, non-numeric,
or missing.
@item
The file associated with an input filename does not exist or is
unreadable
@item
An input filename has a ".Z" suffix
@item
Writing the output file would overwrite a file that already exists
@item
Writing to disk fails for any reason
@item
Erasing the input file on completion fails for any reason
@end itemize

If an error occurs, the script should return the exit code 1.
Otherwise it returns the exit code 0.

@example
uncompress [-v] [name ...]
@end example

@command{uncompress} will create an uncompressed version of a file
generated by @command{compress}.  The uncompressed file will have the
same filename as the input file with the ".Z" extension removed.  If
the uncompression is successful and the output file is successfully
written, the input file will be deleted.

Also, like @command{compress}, if no filenames are given,
@command{uncompress} takes the contents of stdin and uncompresses them
to stdout.

If the optional @option{-v} parameter is given, the script should
print to stdout the compression ratio for each file processed.  If no
file was specified and thus this program is uncompressing stdin to
stdout, this flag is ignored.

Uncompress should fail with appropriate error messages if any of the
following problems occur

@itemize
@item
The command-line has unknown options or is otherwise incorrect
@item
The file header is incorrect
@item
The bits parameter in the file header is out of range
@item
The file associated with the input filename does not exist or is
unreadable
@item
The input compressed data is incorrect or corrupt, which can be
detected by receiving an index that is not yet in the dictionary, or
if an index value exceeds the number of entries in the dictionary as
specified in the header, or if the last entry in the file is not a
complete 16-bit integer
@item
The input file does not end in ".Z"
@item
The output file would overwrite a file that already exists
@item
Writing to disk fails for any reason.
@item
Erasing the input file on completion fails for any reason
@end itemize

If an error occurs, the script should return the exit code 1.
Otherwise it returns the exit code 0.

@heading @code{compress} and @code{uncompress}

Daniel Hartwig contributed @code{compress} and @code{uncompress}
scripts.  As you can imagine, the majority of the scripts do
unglamorous tasks such as checking options, filenames and the like.

@page
Here's @code{compress}
@verbatiminclude code/compress
@page
Here's @code{uncompress}
@verbatiminclude code/uncompress

@node    Problem 4,  , Problem 3, Theme 1
@section Problem 4: tar file archives

This challenge is to create a script that takes a list of filenames
and that generates a @emph{ustar}-format archive file.  This archive
file format is compatible with common POSIX tools.

The @emph{ustar} interchange format is one of the simpler formats used
for archive files that contain multiple files along with their
metadata.

We are going to create a script that creates @emph{ustar}-format
files. But, to keep things simple, we are only going to use a small
subset of the functionality that @emph{ustar} files can provide.  The
result should be readable by common @command{tar} and @command{pax}
tools.

@subsection @command{ustar} Script

The @command{ustar} script will have a simple calling structure.

@command{ustar} @code{archive file1 .. filen}

It will create a new archive containing the files indicated on the
command line.

The script will have to handle many error conditions, including but
not limited to
@itemize
@item
filename contains characters not in the ustar-string's character set
@item
file part of filename is longer than 100 characters
@item
path part of filename is longer than 155 characters
@item 
file is a symbolic link, fifo, directory or any other type of
non-normal file
@item 
file's uname or gname contains characters not in ustar-string's
character set
@item 
file's uname or gname are longer than 31 characters
@item 
file length is greater than 8,589,934,591 bytes, (octal 77777777777)
@item 
file's UID or GID is greater than 2,097,151 (octal 7777777)
@item 
system errors about inability to open, write, or close files.
@end itemize

@subsection The @emph{rustar} File Format

First, I will describe our restricted @emph{ustar} file format, which
I'm going to dub @emph{rustar} for @emph{restricted ustar}, just so
that we're clear that I'm talking about something more specific than
the @emph{ustar} format.

@heading File Structure

A @emph{rustar} file contains a set of @emph{logical records}.  Each
logical record represents the contents of a file plus its metadata.
The logical records appear sequentially in the file, one after
another, and there is no global header in the file.  At the end of the
file is a footer.

@heading Logical Records

Each logical record consists of two parts, a @emph{header} segment,
and the contents of the file, also known as the @emph{data} segment.  Of these,
only the @emph{header} requires a detailed explanation.

@heading Header

The header segment is a 512 byte block that contains metadata for a
file.  The block is broken up into 17 fields of fixed length.  Each
field contains data in one of three types.

@heading Header Types

Here we describe the three types that can appear in a header.  Each
type has the annotation @code{[N]}.  The @emph{N} indicates that the
field has a fixed size of @emph{N} bytes.

@enumerate
@item
@code{rustar-string[N]} is a fixed-width string that contains only the
codepoints listed below.  It is stored in the ASCII encoding, and, if
necessary, is right padded with NULL bytes to ensure it occupies the
whole of its @emph{N} bytes.  NULL bytes can only appear at the end of
the string.  The string need not end with NULL bytes if it fills the
whole of its fixed width.

The list of allowed codepoints is
@itemize
@item
U+20 to U+22
@item
U+25 to U+3F
@item
U+41 to U+5A
@item
U+5F
@item
U+61 to U+7A
@item
and U+00, but U+00 can only be followed by more U+00.
@end itemize

@item
@code{rustar-0string[N]} --- note the `0' --- is a fixed-width string
with the same format and restrictions as a @code{rustar-string[N]} but
with an additional restriction: it must end with at least one NULL byte.

@item
@code{rustar-number[N]} is an unsigned integer stored as a fixed-width
string.  The string contains the text representation of the
integer in octal format.  The last byte (and only the last byte) of
the string must be NULL.  The string is left-padded with the `0'
character to ensure the number occupies the whole of its fixed width
buffer.

For example, a @code{rustar-number[8]} field for the integer 10 will
be the string ``0000012'' followed by one byte of NULL. 12 octal
equals 10 decimal.
@end enumerate
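
The @code{rustar-number[N]} encoding can be sketched in a few lines.
The procedure name @code{rustar-number} and its argument order are
hypothetical, and the sketch assumes that @var{n} is a non-negative
exact integer.

@example
;; Encode N as WIDTH - 1 octal digits, left-padded with the `0'
;; character, followed by a single NULL byte.
(define (rustar-number n width)
  (let* ((digits (number->string n 8))
         (room (- width 1)))
    (if (> (string-length digits) room)
        (error "number does not fit in the field:" n width)
        (string-append
         (make-string (- room (string-length digits)) #\0)
         digits
         (string #\nul)))))
@end example

Here @code{(rustar-number 10 8)} produces the seven characters
``0000012'' followed by one NULL, matching the example above.  The
overflow check also covers the size and ID limits listed among the
error conditions, since a 12-byte field holds at most eleven octal
digits and an 8-byte field at most seven.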

@heading Header Fields

The 17 fields in the 512 byte header block of a logical record are

@multitable @columnfractions .25 .25 .50
@headitem
Field @tab Format @tab
Description
@item
Name @tab string[100] @tab
The filename by itself, with no directory information. The path
separator character (U+2F), is not allowed.
@item
Mode @tab number[8] @tab
A bitfield of the permissions.  See below.
@item
UID @tab number[8] @tab
The User ID of the file
@item
GID @tab number[8] @tab
The Group ID of the file
@item
Size @tab number[12] @tab
The length of the file in bytes
@item
mtime @tab number[12] @tab
The 32-bit integer modification time of the file.
@item
Checksum @tab number[8] @tab
256 + the sum of all the bytes in this header except the checksum
field.
@item
Typeflag @tab string[1] @tab
Always ``0''.
@item
Link name @tab string[100] @tab
Always 100 bytes of NULL.
@item
Magic @tab 0string[6] @tab
The string ``ustar'' plus a NULL.
@item
Version @tab string[2] @tab
The string ``00''.
@item 
uname @tab 0string[32] @tab
The uname of the file.
@item
gname @tab 0string[32] @tab
The gname of the file
@item
Dev-Major @tab number[8] @tab
Always zero.
@item
Dev-Minor @tab number[8] @tab
Always zero.
@item
Prefix @tab string[155] @tab
Path information for this file.  If this file has no additional path
information, this is all NULL.  Directory separation is represented by
`/' forward slash.  The slash at the end is assumed, and should not be
included explicitly.@footnote{For example: prefix ``foo'' + name
``bar'' forms ``foo/bar''.  Prefix ``foo/'' + name ``bar'' forms
``foo//bar''.  Don't do that.}
@item
Padding @tab 0string[12] @tab
12 bytes of NULL.
@end multitable
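
The checksum rule can be sketched as follows.  The field sizes in the
table place the checksum field at byte offsets 148 to 155 of the
header (100 + 8 + 8 + 8 + 12 + 12 = 148).  The names
@code{checksum-offset}, @code{checksum-length}, and
@code{header-checksum} are hypothetical.

@example
(use-modules (rnrs bytevectors))

(define checksum-offset 148)
(define checksum-length 8)

;; HEADER is the 512-byte header block as a bytevector.  Start the
;; sum at 256 and skip the eight bytes of the checksum field.
(define (header-checksum header)
  (let loop ((i 0) (sum 256))
    (cond ((= i (bytevector-length header)) sum)
          ((and (>= i checksum-offset)
                (< i (+ checksum-offset checksum-length)))
           (loop (+ i 1) sum))
          (else
           (loop (+ i 1)
                 (+ sum (bytevector-u8-ref header i)))))))
@end example

The constant 256 is what the eight checksum bytes would contribute if
each held a space character (8 times 32), which is how the
@emph{ustar} format defines the calculation.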

The mode bitfield is a standard permissions bitfield:
@itemize
@item
0x001 execute permission for 'other'
@item
0x002 write permission for 'other'
@item
0x004 read permission for 'other'
@item
0x008 execute permission for 'group'
@item
0x010 write permission for 'group'
@item
0x020 read permission for 'group'
@item
0x040 execute permission for 'owner'
@item
0x080 write permission for 'owner'
@item
0x100 read permission for 'owner'
@item
0x200 (unused)
@item
0x400 set-group-ID (setgid)
@item
0x800 set-user-ID (setuid)
@end itemize

@heading Data
After the 512-byte header block, the binary contents of the file are
stored.  The data segment is NULL-padded so that it ends on a 512-byte
block boundary.
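
The amount of padding follows directly from the file size.  A small
sketch, with the hypothetical names @code{block-size},
@code{padding-length}, and @code{padded-length}:

@example
(define block-size 512)

;; Number of NULL bytes needed after SIZE bytes of data.  Scheme's
;; `modulo' takes the sign of the divisor, so the result is always
;; in the range 0 to 511.
(define (padding-length size)
  (modulo (- size) block-size))

;; Total length of the data segment, padding included.
(define (padded-length size)
  (+ size (padding-length size)))
@end example

A 1-byte file therefore needs 511 bytes of padding, while an empty
file or a file of exactly 512 bytes needs none.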

@heading Footer
The footer is 1024 bytes of NULL that appears at the end of the file.

@page
@heading The Archive Script

Jez Ng contributed a script that meets the above requirements quite
nicely.  One thing to note here is the use of the SRFI 26 forms
@code{cut} and @code{cute}.  These let you, in effect, pass a subset of
the required parameters to a procedure.  In a later call, you can supply
the remaining parameters to the procedure and then truly call it.
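
For readers new to SRFI 26, here is a brief sketch of @code{cut},
separate from the contributed script.  The names @code{add-ten} and
@code{prefix-all} are made up for illustration.

@example
(use-modules (srfi srfi-26))

;; (cut + 10 <>) is shorthand for (lambda (x) (+ 10 x)).
(define add-ten (cut + 10 <>))

;; Each <> marks a parameter still to be supplied, in
;; left-to-right order.
(define (prefix-all prefix names)
  (map (cut string-append prefix "/" <>) names))
@end example

Here @code{(add-ten 5)} returns 15, and
@code{(prefix-all "foo" '("a" "b"))} returns the list of
@code{"foo/a"} and @code{"foo/b"}.  The @code{cute} variant differs
only in that the expressions that are not slots are evaluated once,
when the specialized procedure is created, rather than on every call.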

@verbatiminclude code/tar.scm

Later, Mark Weaver contributed a more featureful script that handles
almost all of the capabilities of the @code{ustar} archive format.  It
handles directories and links as well as files.  Also, he uses a very
common hack to allow longer path names: he puts whatever part of the
path will fit into the 100-character field for the filename.
You can find his script in the appendix (@pxref{ustar Archives}).