README

# @(#)README	2.4.0-JCCH
========================== JCC-H specific info =========================

JCC-H is a version of TPC-H rife with Join Crossing Correlations (JCC). 

You get into situations where 5 keys join with 20% of the fact tables, 
which is skew that MPP systems tend not to like. This skew, only appears 
when certain selections are made on the dimension table (often 
indirectly, the selection may also be on another dimension table a few 
joins away). This means current optimizers will typically not be able 
to tell that this will happen.

This was published in TPCTC 2017, the benchmarking workshop at VLDB.

Paper here: https://ir.cwi.nl/pub/27429/27429.pdf

This software is a 100% compatible TPC-H benchmark. It generates very 
intricate correlations and skew in the datagen when you add the flag. 

usage: dbgen -k [any other options, see below]
       generates the correlated+skewed TPC-H dataset

For the query generation (qgen), it supports two modes of operation: 
get "normal" parameters where JCC-H behaves without skew, very similar 
to normal TPC-H. And a "skewed" parameter set: identical 22 TPC-H queries, 
just different constants, but now the join and scan skew is everywhere.

usage: qgen -k [any other options, see below]
       generates the 'skewed' TPC-H parameter bindings
       omitting generates the 'normal' TPC-H parameter bindings

(normal != default, with default being what TPC-H qgen generates,
 though it is supposed to be similar. Please use normal qgen to get the 
 default parameters).

========================= TCC-H normal readme =========================

Table of Contents
===================
 0. What is this document?
 1. What is DBGEN?
 2. What will DBGEN create?
 3. How is DBGEN built?
 4. Command Line Options for DBGEN
 5. Building Large Data Sets with DBGEN
 6. DBGEN limitations and compliant usage
 7. Sample DBGEN executions
 8. What is QGEN?
 9. What will QGEN create?
10. How is QGEN built?
11. Command Line Options for QGEN
12. Query Template Syntax
13. Sample QGEN executions and Query Templates
14. Environment variable
15. Version Numbering in DBGEN and QGEN
16. Validated Platforms

0. What is this document?

This is the general README file for DBGEN and QGEN, the data-
base population and executable query text generation programs 
used in the TPC-H benchmark. It covers the proper use 
of DBGEN and QGEN. For information on porting the utility to your 
particular platform see Porting.Notes.

1. What is DBGEN?

DBGEN is a database population program for use with the TPC-H benchmark.  
It is written in ANSI 'C' for portability, and has 
been successfully ported to over a dozen different systems. While the 
TPC-H specification allow an implementor to use any utility 
to populate the benchmark database, the resultant population must exactly 
match the output of DBGEN. The source code has been provided to make the 
process of building a compliant database population as simple as possible.

2. What will DBGEN create?

Without any command line options, DBGEN will generate 8 separate ascii
files. Each file will contain pipe-delimited load data for one of the
tables defined in the TPC-H database schema. The default tables 
will contain the load data required for a scale factor 1 database. By 
default the file will be created in the current directory and be 
named <table>.tbl. As an example, customer.tbl will contain the 
load data for the customer table.

When invoked with the '-U' flag, DBGEN will create the data sets to be 
used in the update functions and the SQL syntax required to delete the 
data sets. The update files will be created in the same directory as 
the load data files and will be named "u_<table>.set". The delete 
syntax will be written to "delete.set". For instance, the data set to 
be used in the third query set to update the lineitem table will be 
named "u_lineitem.tbl.3", and the SQL to remove those rows will be 
found in "delete.3". The size of the update files can be controlled 
with the '-r' flag.

3. How is DBGEN built?

Create an appropriate makefile, using makefile.suite as a basis, 
and type make.  Refer to Porting.Notes for more details and for 
suggested compile time options.

4. Command Line Options for DBGEN

DBGEN's output is controlled by a combination of command line options
and environment variables. Command line options are assumed to be single
letter flags preceded by a minus sign. They may be followed by an
optional argument.

option  argument    default     action
------  --------    -------     ------
-k                              generate correlated and skewed data

-h                              Display a usage summary

-f      none                    Force. Existing data files will be
                                overwritten.

-F      none        yes         Flat file output.

-D      none                    Direct database load. ld_XXXX() routines
                                must be defined in load_stub.c

-s      <scale>     1           Scale of the database population. Scale
                                1.0 represents ~1 GB of data

-T      <table>                 Generate the data for a particular table
                                ONLY. Arguments: p -- part/partuspp, 
                                c -- customer, s -- supplier, 
                                o -- orders/lineitem, n -- nation, r -- region,
                                l -- code (same as n and r),
                                O -- orders, L -- lineitem, P -- part, 
                                S -- partsupp

-O      d                       Generate SQL for delete function 
                                instead of key ranges

-O      f                       Allow over-ride of default output file 
                                names

-O      h                       Generate headers in flat ascii files.
                                hd_XXX routines must be defined in 
                                load_stub.c

-O      m                       Flat files generate fixed length records

-O      r                       Generate key ranges for the UF2 update 
                                function

-O      v                       Verify data set without generating it.

-r      <percentage>     10     Scale each udpate file to the given 
                                percentage (expressed in basis points)
                                of the data set

-v      none                    Verbose. Progress messages are 
                                displayed as data is generated.

-n      <name>                  Use database <name> for in-line load

-C      <children>              Use <children> separate processes to 
                                generate data

-S      <n>                     Generate the <n>th part of a multi-part load
                                or update set

-U      <updates>               Create a specified number of data sets
                                in flat files for the update/delete 
                                functions

-i      <n>                     Split the inserted rows in an refresh pair 
								between <n> files

-d      <n>                     Split the deleted rows in an refresh pair
								between <n> files

5. DBGEN limitations and compliant usage

DBGEN is meant to be a robust population generator for use with the 
TPC-H benchmark. It is hoped that DBGEN will make it easier 
to experiment with and become proficient in the execution of TPC decision 
support benchmarks.  As a result, it includes a number of command line 
options which are not, strictly speaking, necessary to generate a compliant 
data set for a TPC-D run. In addition, some command line options will accept 
arguments which result in the generation of NON-COMPLIANT data sets. Options 
which should be used with care include:

-s -- scale factor. TPC-H runs are only compliant when run against SF's 
      of 1, 10, 100, 300, 1000, 3000, 10000, 30000, 100000
-r -- refresh percentage. TPC-H runs are only compliant when run with 
      -r 10, the default.

6. Sample DBGEN executions

DBGEN has been built to allow as much flexibility as possible, but is
fundementally intended to generate two things: a database population 
against which the queries in TPC-H can be run, and the updates 
that are used during the update functions in TPC-H. Here are 
some sample uses of DBGEN.

  1. To generate the database population for the qualification database
	dbgen -s 1
  2. To generate the lineitem table only, for a scale factor 10 database,
     and over-write any existing flat files:
	dbgen -s 10 -f -T L
  4. To geterate a 100GB data set in 1GB pieces, generate only the part and 
     partsupplier tables, and include some progress reports along the way:
	dbgen -s 100 -S 1 -C 100 -T p -v (to generate the first 1GB file)
	dbgen -s 100 -S 2 -C 100 -T p -v (to generate the second 1GB file)
        (and so on, incrementing the argument to -S each time)
  5. To generate the update files needed for a 4 stream run of the throughput
     test at 100 GB, using an existing set of seed files from an 8 process 
     load:
	dbgen -s 100 -U 4 -C 8
     

7. What is QGEN?

QGEN is a query generation program for use with the TPC-H benchmark.
It is written in ANSI 'C' for portability, and has been successfully
ported to over a dozen different systems. While the benchmark specifications
allow an implementor to use any utility to create the benchmark query
sets, QGEN has been provided to make the process of building
a benchmark implementation as simple as possible.

8. What will QGEN create?

QGEN is a filter, triggered by :'s. It does line-at-a-time reads of its
input (more on that later), scanning for :foo, where foo determines the
substitution that occurs. Including:

:<int>          replace with the appropriate value for parameter <int>
:b              replace with START_TRAN (from tpcd.h)
:c              replace with SET_DBASE (from tpcd.h)
:n<int>         replace with SET_ROWCOUNT(<int>) (from tpcd.h)
:o              replace with SET_OUTPUT (from tpcd.h)
:q              replace with query number
:s              replace with stream number
:x              replace with GEN_QUERY_PLAN (from tpcd.h)

Qgen takes an assortment of command line options, controlling which of these
options should be active during the translation from template to EQT, and a
list of query "names". It then translates the template found in
$DSS_QUERY/<name>.sql and puts the result of stdout.

Here is a sample query template:

{  Sccsid:     @(#)1.sql        9.1.1.1     1/25/95  10:51:56  }
:n 0
:o
select
 l_returnflag,
 l_linestatus,
 sum(l_quantity) as sum_qty,
 sum(l_extendedprice) as sum_base_price,
 sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
 sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
 avg(l_quantity) as avg_qty,
 avg(l_extendedprice) as avg_price,
 avg(l_discount) as avg_disc,
 count(*) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval :1 day
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

And here is what is generated:
$ qgen -d 1

{return 0 rows}

select
 l_returnflag,
 l_linestatus,
 sum(l_quantity) as sum_qty,
 sum(l_extendedprice) as sum_base_price,
 sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
 sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
 avg(l_quantity) as avg_qty,
 avg(l_extendedprice) as avg_price,
 avg(l_discount) as avg_disc,
 count(*) as count_order
from lineitem
where l_shipdate <= date('1998-12-01') - interval (90)  day to day
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

See "Query Template Syntax" below for more detail on converting your prefered query
phrasing for use with QGEN.

9. How is QGEN built?

QGEN is built by the same makefile that creates DBGEN. If the makefile
is successfully creating DBGEN, no further compilation modifications
should be necessary. You may need to modify some of the options which
allow QGEN to integrate with your preferred query tool. Refer to
Porting.Notes for more detail.

10. Command Line Options for QGEN

Like DBGEN, QGEN is controlled by a combination of command line options
and environment variables (See "Environment Variables", below for more
detail).  Command line options are assumed to be single
letter flags preceded by a minus sign. They may be followed by an
optional argument.

option  argument    default     action
------  --------    -------     ------
-k                              generate 'skewed' parameters that bring about all the 
                                problems with correlations and skew. Omitting this will 
                                generate (non-default) parameters that mostly avoid 
                                these problems and behave rather similar to default TPC-H

-c      none                    Retain comments in translation of template to
                                EQT

-d      none                    Default. Use the parameter substitutions
                                required for query validation

-h                              Display a usage summary

-i      <file>                  Use contents of <file> to init a query stream

-l      <file>                  Save query parameters to <file>

-n      <name>                  Use database <name> for queries

-N                              Always use default rowcount, and ignore :n directives

-o      <path>                  Save query n's output in <path>/n.<stream>
                                Uses -p option, and uses :o tag

-p      <stream>                Use the query permutation defined for
                                stream <stream>. If this option is
                                omited, EQT will be generated for the
                                queries named on the command line.

-r      <n>                     Seed the rnadom number generator with <n>

-s      <n>                     Set scale to <n> for parameter 
                                substitutions.

-t      <file>                  Use contents of <file> to complete a query 
                                stream

-T      none                    Use time table format for date substitution

-v      none                    Verbose. Progress messages are 
                                displayed as data is generated.

-x      none                    Generate a query plan as part of query
                                execution.

11. Query Template Syntax

QGEN is a simple ASCII text filter, meant to translate query generalized
query syntax("query template") into the executable query text(EQT) re-
quired by the benchmarks. It provides a number of shorthands and syntactic 
extensions that allow the automatic generation of query parameters and some 
control over the operation of the benchmark implementation.

QGEN first strips all comments from the query template, recognizing both
{comment} and --comment styles. Next it traverses the query template
one line at a time, locating required substitution points, called
parameter tags. The values substituted for a given tag are summarized
below.  QGEN does not support nested substitutions. That is, if
the text substituted for tag itself contains a valid tag the second tag
will not be expanded.

Tag             Converted To            Based on
===             ============            ========
:c		database <dbname>;(1)   -n from the command line
:x              set explain on;(1)      -x from the command line
:<number>       paremeter <number>
:s              stream number
:o              output to outpath/qnum.stream;(1)
					-o from command line, -s from 
                                        command line
:b              BEGIN WORK;(1)          -a from comand line
:e              COMMIT WORK(1)          -a from command line
:q              query number
:n <number>                             sets rowcount to be returned 
                                        to <number>, unless -N appears on the command line

Notes:
   (1)  This is Informix-specific syntax. Refer to Porting.Notes for
   tailoring the generated text to your database environment.
   
12. Sample QGEN executions and Query Templates

QGEN translates generic query templates into valid SQL. In addition, it 
allows conditional inclusion of the commands necessary to connect to a 
database, produce diagnostic output, etc. Here are some sample of QGEN
usage, and the way that command line parameters and the query templates 
interact to produce valid SQL.

  Template, in $DSS_QUERY/1.sql:
            :c
            :o
            select count(*) from foo;
            :x
            select count(*) from lineitem
              where l_orderdate < ':1';

  1. "qgen 1", would produce:
      select count(*) from foo;
      select count(*) from lineitem 
        where l_orderdate < '1997-01-01'; 
   Assuming that 1 January 1997 was a valid substitution for parameter 1.

  2. "qgen -d -c dss1 1, would produce:
      database dss1;
      select count(*) from foo;
      select count(*) from lineitem 
        where l_orderdate < '1995-07-18'; 
   Assuming that 18 July 1995 was the default substitution for parameter 1,
    and using Informix syntax.

  3. "qgen -d -c dss1 -x -o somepath 1, would produce:
      database dss1;
      output to "somepath/1.0"
      select count(*) from foo;
      set explain on;
      select count(*) from lineitem 
        where l_orderdate < '1995-07-18'; 
   Assuming that 18 July 1995 was the default substitution for parameter 1,
    and using Informix syntax.
 

13. Environment Variables

Enviroment variables are used to control features of DBGEN and QGEN 
which are unlikely to change from one execution to another.

Variable    Default     Action
-------     -------     ------
DSS_PATH    .           Directory in which to build flat files
DSS_CONFIG  .           Directory in which to find configuration files
DSS_DIST    dists.dss   Name of distribution definition file
DSS_QUERY   .           Directory in which to find query templates

14. Version Numbering in DBGEN and QGEN

DBGEN and QGEN use a common version numbering algorithm. Each executable
is stamped with a version number which is displayed in the usage messages
available with the '-h' option. A version number is of the form:

   V.R.P.M
   | | | |
   | | | |
   | | | |
   | | |  -- modification: alphabetic, incremented for any trivial changes 
   | | |                   to the source (e.g, porting ifdef's)
   | |  ---- patch level:  numeric, incremented for any minor bug fix
   | |                     (e.g, qgen parameter range)
   | ------- release:      numeric, incremented for each minor revision of the
   |                       specification
   |-------- version:      numeric, incremented for each major revision of the 
                           specification

An implementation of TPC-H is valid only if it conforms to the 
following version usage rules:

  -- The Version of DBGEN and QGEN must match the integer portion of the 
     current specification revision

15. The current revisions are:
    DBGEN: 2.4.0
    QGEN:  2.4.0

16. Validated Platforms
    The following platforms have been validated to produce the reference 
    data set for TPC-H 2.4.0
    Processor	Operating System (version)  Compiler (version)		Compiler Flags
    ----------------------------------------------------------------------------
    POWER5 		AIX 64-bit (5.3)			  C for AIX Compiler, v7  -q64 (no -g)
    IA-64		HPUX 64-bit ()		icc 		
    Linux 32-bit ()	gcc