gztool.1

.\"                                      Hey, EMACS: -*- nroff -*-
.\" (C) Copyright 2021 Roberto S. Galende <roberto.s.galende@gmail.com>,
.\"
.\" First parameter, NAME, should be all caps
.\" Second parameter, SECTION, should be 1-8, maybe w/ subsection
.\" other parameters are allowed: see man(7), man(1)
.TH gztool 1 "Nov 21 2024" "gztool v1.8.0"
.\" Please adjust this date whenever revising the manpage.
.\"
.\" Some roff macros, for reference:
.\" .nh        disable hyphenation
.\" .hy        enable hyphenation
.\" .ad l      left justify
.\" .ad b      justify to both left and right margins
.\" .nf        disable filling
.\" .fi        enable filling
.\" .br        insert line break
.\" .sp <n>    insert n+1 empty lines
.\" for manpage-specific macros, see man(7)
.SH NAME
gztool \- extract random-positioned data from gzip files, even like `tail -f`
.SH SYNOPSIS
.B gztool
.RI \ [\ [-[abLnsv]\ #]\ [-[1..9]AcCdDeEfFhilpPqrRStTwWxXzZ|u[cCdD]]\ [-I\ <INDEX>]\ ]\ "files"\ ...
.br

Note that actions `-bcStT` proceed to an index file creation (if
none exists) INTERLEAVED with data flow. As data flow and
index creation occur at the same time there's no waste of time.
Also you can interrupt actions at any moment and the remaining
index file will be reused (and completed if necessary) on the
next gztool run over the same data.
.SH DESCRIPTION
\fBgztool\fP is a GZIP files indexer, compressor and data retriever.
It can create small indexes for gzipped files and use them
for quick and random-positioned data extraction.

\fBgztool\fP can extract random-positioned data from gzip files with no penalty,
including gzip tailing like with `tail -f`.

\fBgztool\fP creates an index file (.gzi) for every gzip it treats,
and this action is interleaved with compression/uncompression
so there's no waste of time. Any action can be interrupted at
any moment, and the remaining index will be reused on next runs.

Extraction is possible from any byte (or line) position
in the uncompressed data using `-b` (or `-L` for lines).

If the uncompressed file is a text file, then using
the `-x` modifier the index will take care of lines, so later \fBgztool\fP can be
requested to extract data from a particular line with `-L`.

See the full listing of capabilities on \fBINTERNALS\fp.
.BR
.SH OPTIONS
.TP
.BR \fBfiles\fP
One or more files. If no file is indicated, standard input is used.
.TP
.BR \-[1..9]
compression factor to use with `-[c|u[cC]]`, from best speed (`-1`) to best compression (`-9`). Default is `-6`.
.TP
.BR \-a\ #
Await # seconds between reads when `-[ST]|Ec`. Default is 4 s.
.TP
.BR \-A
modifier for `-[rR]` to indicate the range of bytes/lines in absolute values, instead of the default incremental values.
.TP
.BR \-b\ #
extract data from indicated uncompressed byte position of
gzip file (creating or reusing an index file) to STDOUT.
Accepts '0', '0x', and suffixes 'kmgtpe' (^10) or 'KMGTPE' (^2).
.TP
.BR \-C
always create a 'Complete' index file, ignoring possible errors.
.TP
.BR \-c
compress a file like with gzip, creating an index at the same time.
.TP
.BR \-d
decompress a file like with gzip.
.TP
.BR \-D
do not delete original file when using `-[cd]`.
.TP
.BR \-e
if multiple files are indicated, continue on error (if any).
.TP
.BR \-E
end processing on first GZIP end of file marker at EOF.
Nonetheless with `-c`, `-E` waits for more data even at EOF.
.TP
.BR \-f
force file overwriting if destination file already exists.
.TP
.BR \-F
force index creation/completion first, and then action:
if `-F` is not used, index is created interleaved with actions.
.TP
.BR \-h
print brief help; `-hh` prints this help.
.TP
.BR \-i
create index for indicated gzip file (For 'file.gz' the default 
index file name will be 'file.gzi'). This is the default action.
.TP
.BR \-I\ string
index file name will be the indicated string.
.TP
.BR \-l
check and list info contained in indicated index file.
`-ll` and `-lll` increase the level of index checking detail.
.TP
.BR \-L\ #
extract data from indicated uncompressed line position of
gzip file (creating or reusing an index file) to STDOUT.
Accepts '0', '0x', and suffixes 'kmgtpe' (^10) or 'KMGTPE' (^2).
.TP
.BR \-n\ #
indicates that the first byte on compressed input is #, not 1,
and so truncated compressed inputs can be used if an index exists.
.TP
.BR \-p
indicates that the gzip input stream may be composed of various
incorrectly terminated GZIP streams, and so then a careful
\fIPatching\fP of the input may be needed to extract correct data.
.TP
.BR \-P
like `-p`, but when used with `-[ST]` implies that checking
for errors in stream is made as quick as possible as the gzip file
grows. Warning: this may lead to some errors not being patched.
.TP
.BR \-q
extract data as indicated with `-[bL]` of a still-growing gzip file
and continue Supervising & extracting to STDOUT even after eof.
.TP
.BR \-r\ #
(range): Number of bytes to extract when using `-[bL]`.
Accepts '0', '0x', and suffixes 'kmgtpe' (^10) 'KMGTPE' (^2).
.TP
.BR \-R\ #
(Range): Number of lines to extract when using `-[bL]`.
Accepts '0', '0x', and suffixes 'kmgtpe' (^10) 'KMGTPE' (^2).
.TP
.BR \-s\ #
span in uncompressed MiB between index points when
creating the index. By default is `10`.
.TP
.BR \-S
Supervise indicated file.
Create a growing index,
for a still-growing gzip file. (`-i` is implicit).
.TP
.BR \-t
tail (extract last bytes) to STDOUT on indicated gzip file.
.TP
.BR \-T
tail (extract last bytes) to STDOUT on indicated still-growing
gzip file, and continue Supervising & extracting to STDOUT.
.TP
.BR \-u\ [cCdD]
utility to compress (`-u c`) or decompress (`-u d`)
zlib-format files to STDOUT. Use `-u C` and `-u D`
to manage raw compressed files. No index involved.
.TP
.BR \-v\ #
output verbosity
from `0` (none) to `5` (nuts). Default is `1` (normal).
.TP
.BR \-w
wait for creation if file doesn't exist, when using `-[cdST]`.
.TP
.BR \-W
do not Write index to disk. But if one is already available
read and use it. Useful if the index is still under a `-S` run.
.TP
.BR \-x
create index with line number information (win/*nix compatible).
.br
Please, note that gztool's index counts last line even if the last char isn't a newline char - whilst `wc` command will not count it in this case!.
.br
This is implicit unless `-X` or `-z` are indicated.
.TP
.BR \-X
like `-x`, but newline character is '\\r' (old mac).
.TP
.BR \-z
create index without line number information.
.TP
.BR \-Z
adjust index points to a byte boundary: no previous byte needed.
.br
.SH QUICK EXAMPLE
Extract data from 1 GiB byte (byte 2^30) on,
from `myfile.gz` to the file `myfile.txt`. Also gztool will
create (or reuse, or complete) an index file named `myfile.gzi`:

.BR \ \ \ \ $\ gztool\ -b\ 1G\ myfile.gz\ >\ myfile.txt
.br

.SH MORE EXAMPLES
.br
* Make an index for `test.gz`. The index will be named `test.gzi`:

.BR \ \ \ \ $\ gztool\ -i\ test.gz
.br


* Make an index for `test.gz` with name `test.index`, using `-I`:

.BR \ \ \ \ $\ gztool\ -I\ test.index\ test.gz
.br

* Also `-I` can be used to indicate the complete path to an index in another directory. This way the directory where the gzip file resides could be read-only and the index be created in another read-write path:

.BR \ \ \ \ $\ gztool\ -I\ /tmp/test.gzi\ test.gz
.br

* Retrieve data from uncompressed byte position 1000000 inside test.gz. Index file will be created \fBat the same time\fP (named `test.gzi`):

.BR \ \ \ \ $\ gztool\ -b\ 1m\ test.gz
.br


* \fBSupervise an still-growing gzip file and generate the index for it on-the-fly\fP. The index file name will be `openldap.log.gzi` in this case. `gztool` will execute until interrupted (it can also stop at first end-of-gzip data with `-E`):

.BR \ \ \ \ $\ gztool\ -S\ openldap.log.gz
.br


* The previous command can be sent to background and with no verbosity, so we can forget about it:

.BR \ \ \ \ $\ gztool\ -v0\ -S\ openldap.log.gz\ &
.br


Creating and index for all "*gz" files in a directory:

.BR \ \ \ \ $\ gztool\ -i\ *gz
.br


* Extract data from `project.gz` byte at 1 GiB to STDOUT, and use `grep` on this output. Index file name will be `project.gzi`:

.BR \ \ \ \ $\ gztool\ -b\ 1G\ project.gz\ |\ grep\ -i\ "balance\ =\ "
.br


* Please, note that STDOUT is used for data extraction with `-bcdtT` modifiers, so an explicit command line redirection is needed if output is to be stored in a file:

.BR \ \ \ \ $\ gztool\ -b\ 99m\ project.gz\ >\ uncompressed.data
.br


* Extract data from a gzipped file which index is still growing with a `gztool -S` process that is monitoring the (still-growing) gzip file: in this case the use of `-W` will not try to update the index on disk so the other process is not disturb! (Note that `gztool` always tries to update the index used if it thinks it's necessary):

.BR \ \ \ \ $\ gztool\ -Wb\ 100k\ still-growing-gzip-file.gz\ >\ mytext
.br


* Extract data from line 10 million, to STDOUT:

.BR \ \ \ \ $\ gztool\ -L\ 10m\ compressed_text_file.gz
.br


* Nonetheless note that if in the precedent example an index was previously created for the gzip file without the `-x` parameter (or not using `-L`), \fBas it doesn't contain line numbering info\fP, `gztool` will complain and stop. This can be circumvented by telling `gztool` to use another new index file name (`-I`), or even not using anyone at all with `-W` (do not write index) and an index file name that doesn't exists (in this case `None` - it won't be created because of `-W`), and so ((just) this time) the gzip will be processed from the beginning:

.BR \ \ \ \ $\ gztool\ -L\ 10m\ -WI\ None\ compressed_text_file.gz
.br

* Extract data from line 10 million, to STDOUT, on a still-growing file (`-q`), so if that line number still has not been reached, `gztool` will patiently wait until the gzip file grows enough to store that line, and show just it (`-R 1`). `-q` to wait for content can be used with both `-L` (line numbers) and `-b` (byte positions):

.BR \ \ \ \ $\ gztool\ -q\ -L\ 10m\ -R\ 1\ compressed_text_file.gz
.br

* Extract all data from a \fBrsyslog's veryRobustZip\fP (//www.rsyslog.com/doc/v8-stable/configuration/modules/omfile.html#veryrobustzip) that contains dirty data. This *corrupted-gzip-files* can arise when using \fBrsyslog's veryRobustZip omfile option\fP and the process that is logging is abruptly terminated and then restarted - this produces an incorrectly-terminated-gzip stream that is followed by another gzip stream \fBin the same file\fP. `gzip` (nor `zlib`) cannot read this files beyond the point of error. But `gztool` can correctly extract all data (and only good data) using `-p` (*patch*) parameter:

.BR \ \ \ \ $\ gztool\ -p\ -b0\ compressed_text_file.gz
.br

This creates, as usual, the index file `compressed_text_file.gzi`. In order to not create it, `-W` (\fIdo not Write index\fP) can be used:

.BR \ \ \ \ $\ gztool\ -pWb0\ compressed_text_file.gz
.br

Note that `-p` can require up to twice the time for decompression, because it performs two decompression processes: the usual one, and another one that is performed \fBin advance\fP of the usual and which is the one that detects errors, marks them, and finds new entry points to end/begin the decompression circumventing the problems.
.br
Note also that these \fIcorrupted-gzip-files\fP should be always decompressed with `-p` parameter, even if a `gztool` index file exists for them, because the index file stores entry points, but does not store where do errors occur in the `gzip` file.
That said, if the `-[bL]` point of extraction is beyond the point(s) of error in the `gzip` file and an index file exists, then the decompression can proceed fine without `-p`, as the index points stored in the index file are always clean.
.br


* When tailing an still-growing gzip file (`-T`) that could contain errors at some point, one may still want to obtain output from the gzip stream as soon as possible - this is what the patching option `-P` is for (like `-p` but capitalized): with `-p` `gztool` decompress the stream about 48 kiB ahead of the output that is actually shown/written in order to catch possible gzip-stream errors ahead of output, and so maintain always a clean output without error-introduced artifacts. This has the side effect that output must always wait for that 48 kiB of data to be available in advance, which if the file grows slowly can take a very long time. With `-P` the buffer-ahead restriction is relaxed to just as few bytes as available before reaching end-of-file and waiting for new data, so responsiveness is as quick as without `-p`. The side effect of `-P` is that depending on the gzip file some errors may lead to incorrect output being shown/written - though in this case a "\fBPATCHING WARNING\fP" would be shown (to stderr).

.BR \ \ \ \ $\ gztool\ -PT\ application_log.gz
.br

The same applies to `-S` though in this case there's no output, as only the index is being constructed:

.BR \ \ \ \ $\ gztool\ -PS\ application_log.gz
.br


* To tail to stdout, \fIlike a\fP `tail -f`, an still-growing gzip file (an index file will be created with name `still-growing-gzip-file.gzi` in this case):

.BR \ \ \ \ $\ gztool\ -T\ still-growing-gzip-file.gz
.br


* More on files still being "Supervised" (`-S`) by another `gztool` instance: they can also be tailed \fIà la\fP `tail -f` without updating the index on disk using `-W`:

.BR \ \ \ \ $\ gztool\ -WT\ still-growing-gzip-file.gz
.br


* Compress (`-c`) an still growing (`-E`) file: in this case both `still-growing-file.gz` and `still-growing-file.gzi` files will be created \fIon-the-fly\fP as the source file grows. Note that in order to terminate compression, Ctrl+C must be used to kill gztool: this results in an incomplete-gzip-file as per GZIP standard, but this is not important as it will contain all the source data, and both `gzip` and `gztool` (or any other tool) can correctly and completely decompress it:

.BR \ \ \ \ $\ gztool\ -Ec\ still-growing-file
.br


* If you have an \fIincomplete\fP index file (it just does not have the length of the source data, as it didn't correctly finish) and want to make it complete and so that the length of the uncompressed data be stored, just unconditionally \fIcomplete\fP it with `-C` with a new `-i` run over your gzip file: note that as the existent index data is used (in this case the file `my-incomplete-gzip-data.gzi`), only last compressed bytes are decompressed to complete this action:

.BR \ \ \ \ $\ gztool\ -Ci\ my-incomplete-gzip-data.gz
.br


* Decompress a file like with gzip (`-d`), but do not delete (`-D`) the original one: Decompressed file will be `myfile`. Note that gzipped file \fBmust\fP have a ".gz" extension or `gztool` will complain:

.BR \ \ \ \ $\ gztool\ -Dd\ myfile.gz
.br


* Decompress a file that does not have ".gz" file extension, like with gzip (`-d`):

.BR \ \ \ \ $\ cat\ mycompressedfile\ |\ gztool\ -d\ >\ my_uncompressed_file
.br


* Show internals of all index files in this directory. `-e` is used not to stop the process on the first error, if a `*.gzi` file is not a valid gzip index file. The `-ll` list option repetition will show data about each index point. `-lll` also decompress each point's window to ensure index integrity:

.BR \ \ \ \ $\ gztool\ -ell\ *.gzi
.br


If `gztool` finds the gzip file companion of the index file, some statistics are shown, like the index/gzip size ratio, or the ratio of compression of the gzip file. 
Also, if the gzip is complete, the size of the uncompressed data is shown. This number is interesting if the gzip file is bigger than 4 GiB, in which case `gunzip -l` cannot correctly calculate it as it is limited to a 32 bit counter (see //tools.ietf.org/html/rfc1952#page-5), or if the gzip file is in `bgzip` format, in which case `gunzip -l` would only show data about the first block (< 64 kiB).
.br
Note that `gztool -l` tries to guess the companion gzip file of the index looking for a file with the same name, but without the `i` of the `.gzi` file name extension, or without the `.gzi`. But the gzip file name can also be directly indicated with this format:

.BR \ \ \ \ $\ gztool\ -l\ -I\ index_filename\ gzip_filename
.br

In this latter case only a pair of index+gzip filenames can be indicated with each use.
.br


* Use a truncated gzip file (100000 first bytes are removed: (not zeroed, removed); if they're zeroed cautions are the same, but `-n` is not needed), to extract from byte 20 MiB, \fBusing a previously generated index\fP: as far as the `-b` parameter refers to a byte \fBafter\fP an index point (See `-ll`) and `-n` be less than that needed first index point, this is always possible. In this case \fI-I gzip_filename.gzi\fP is implicit:


.BR \ \ \ \ $\ gztool\ -n\ 100001\ -b\ 20M\ gzip_filename.gz
.br

Take into account that, as shown, the first byte of the truncated `gzip_filename.gz` file is numbered \fB100001\fP, that is, the bytes retain the order number in which they appear in the original file (that's the reason why it is not the \fB1\fP byte).
.br
Please, note that index point positions at index file \fBmay require also the previous byte\fP to be available in the truncated gzip file, as gzip stream is not byte-rounded but a stream of pure bits. Thus \fIif you're thinking on truncating a gzip file, please do it always at least by one byte before the indicated index point in the gzip\fP - as said, it may not be needed, but in 7 of 8 cases it is needed. \fBAnother option is to use `-Z` when creating the index, as indicated below.\fP
.br

* Create an index for a gzip file in which every index entry point is adjusted to byte boundary, so no previous byte (bits) is needed. Note that in general the byte at which the index entry point begins does not represent a clear cut point as the gzip window needs up to 7 bits from the previous byte. This is so because \fBgzip\fP is a bit-level stream compressor. With `-Z` the cut point is always clean and no bits from the previous byte are required. This will result in index points spaced by more than `-s` bytes between then, and so, may be, less points in the index. But this is completely safe and sound.

        $ gztool -Z my_gzip_file.gz

`-Z` exists since gztool \fBv1.6.0\fP.

* Since v1.5.0, using `-[fW]` (`-f`: force index overwriting; `-W`: do not write index) with `-[ST]` (`-S`: create index on still-growing gzip file; `-T`: tail and continue decompressing to stdout) indicates `gztool` to continue operations even after the source file is overwritten. If using `-f`, the index file will be overwritten. For example:


.BR \ \ \ \ $\ gztool\ -WT\ log_filename.gz
.br
.BR \ \ \ \ ...
.br
.BR \ \ \ \ File\ overwriting\ detected\ and\ restarting\ decompression...
.br
.BR \ \ \ \ Processing\ 'log_filename.gz' ...
.br

.SH INTERNALS
By default gzip-compressed files cannot be accessed in random mode: any byte required at position N requires the complete gzip file to be decompressed from the beginning to the N byte.   
Nonetheless Mark Adler, the author of zlib (//github.com/madler/zlib), provided years ago a cryptic file named `zran.c` (//github.com/madler/zlib/blob/master/examples/zran.c) that creates an "index" of "windows" filled with 32 kiB of uncompressed data at different positions along the un/compressed file, which can be used to initialize the zlib library and make it behave as if compressed data begin there.   

\fBgztool\fP builds upon zran.c to provide a useful command line tool. 
Also, some optimizations has been made:

.br
* \fBgztool\fP can correctly read \fIincomplete gzip-concatenated-files\fP (using `-p`), that is, a gzip composed of a concatenation of `gzip` files, some of which are not correctly terminated. This can happen, for example, when using \fIrsyslog's veryRobustZip omfile option\fP (//www.rsyslog.com/doc/v8-stable/configuration/modules/omfile.html#veryrobustzip) and the process that is logging is abruptly terminated and then restarted.
.br

* \fBgztool\fP can store line numbering information in the index (use only if source data is text!), and retrieve data from a specific line number using `-L`. (Using `-[xXz]` when creating the index selects Unix new line format (default), old Mac new line format, or no line information respectively.)
.br

* \fBgztool\fP can \fBSupervise an still-growing gzip file\fP (for example, a log created by rsyslog directly in gzip format) and generate the index on-the-fly, thus reducing in the practice to zero the time of index creation. See `-S`.
.br

* extraction of data and index creation are interleaved, so there's no waste of time for the index creation.
.br

* \fBindex files are reusable\fP, so they can be stopped at any time and reused and/or completed later.
.br

* an \fIex novo\fP index file format has been created to store the index
.br

* span between index points is raised by default from 1 MiB to 10 MiB, and can be adjusted with `-s` (\fIspan\fP).
.br

* windows \fBare compressed\fP in file
.br

* windows are not loaded in memory unless they're needed, so the application memory footprint is fairly low (< 1 MiB)
.br

* \fBgztool\fP can compress files (`-c`) and at the same time generate an index that is about 10-100 times smaller than if the index is generated after the file has already been compressed with gzip.
.br

* \fBCompatible with `bgzip` files\fP (//www.htslib.org/doc/bgzip.html)
.br

* \fBCompatible with complete `gzip` concatenated files\fP
.br

* \fBCompatible with rsyslog's veryRobustZip omfile option\fP (variable-short-uncompressed complete-gzip-block sizes)
.br

* data can be provided from/to stdin/stdout
.br

* \fBgztool\fP can be used to remotely retrieve just a small part of a bigger gzip compressed file and successfully decompress it locally. See //unix.stackexchange.com/questions/429197/#541903 . Just note that the \fBgztool\fP \fIindex file\fP must be also available.
.br

.SH PROJECT HOME PAGE
//github.com/circulosmeos/gztool
.SH SEE ALSO
.BR gzip (1),
.BR gunzip (1),
.BR zlib (3)
.SH AUTHOR
This program was written by Roberto S. Galende <roberto.s.galende@gmail.com>
on work by Mark Adler's zlib (examples/zran.c) and is copyrighted under zlib licence terms.
.br