Test data generation plays a critical role in evaluating system performance, validating accuracy, identifying bugs, enhancing reliability, assessing scalability, ensuring regulatory compliance, training machine learning models, and supporting CI/CD processes. It enables the discovery of potential issues and helps ensure that systems operate as intended across diverse scenarios.
The AWS Glue Test Data Generator provides a configurable framework for test data generation using AWS Glue PySpark serverless jobs. The required test data description is fully configurable through a YAML configuration file.
The source code and deployment instructions are accessible through this link: GitHub Code Repository
The Test Data Generation Framework currently supports the following types:
-
Unique Key Generator
This generator produces formatted unique values that can be used as a partition key. You can specify a prefix and the number of leading zeros if required.
-
Child Key Generator
This generator produces a child key referencing the primary key. This is useful for generating multi-level hierarchical data. You can specify the number of levels and how many nodes to generate per level.
-
String Data Generator
This generator produces String data type with various mechanisms:
-
Random Strings: you can specify the number of characters and the type of generated characters: numeric, alphabetic, or alphanumeric. This can be used for generating random serial numbers, ordinal data, codes, identity numbers, etc.
-
Strings from a Dictionary: you can provide a dictionary of words for the generator to pick from at random. This can be used to generate categorical columns with a predefined set of values such as order status, product types, marital status, gender, etc.
-
Strings from a Pattern: you can provide a generic pattern for your string data. This can be used to generate fake emails, formatted phone numbers, comments, address-like data, etc.
-
Integer Data Generator
This generator produces random integer data from a specified range.
-
Float/Double Data Generator
This generator produces random float/double data from an expression. This can be used to generate float values such as salary, temperature, profit, statistical data, etc.
-
Internet Address Data Generator
This generator produces random IP addresses. This can be used to generate IP address ranges for testing applications used for internet traffic monitoring or filtering.
-
Date Data Generator
This generator produces random dates from a configurable date range.
-
Close Date Data Generator
This generator produces random dates from a configurable start date column and a range. This can be used to generate dates within specific intervals, such as a support ticket close date, deceased date, expiration date, etc.
The Test Data Generator is based on a PySpark library that is invoked as a PySpark AWS Glue job. All generator settings are configured through a YAML-formatted file stored in the S3 artefacts bucket. Deployment to the AWS account is done using the AWS Cloud Development Kit (CDK):
-
AWS CDK generates the CloudFormation template and deploys it in the hosting AWS account.
-
CloudFormation creates:
-
The artefacts S3 Bucket and uploads the TDG PySpark library and YAML configuration file into it.
-
The TDG PySpark Glue job
-
The service IAM role required by the TDG PySpark Glue job.
-
The TDG PySpark Glue job is invoked to generate the test data.
-
Clone the GitHub repository in your local development environment
-
Set the following environment variables:
AWS_ACCOUNT
to the AWS account ID where you intend to deploy the Test Data Generator
AWS_REGION
to the AWS region ID where you intend to deploy the Test Data Generator
- Use aws configure to configure the AWS CLI with the access key to the AWS account
- If the account is not CDK bootstrapped, you need to run the following command:
cdk bootstrap
- Open a terminal in the workspace path and run the following CDK command to deploy the solution:
$<workspace-path>/AWSGluePysparkTDG> cdk deploy
The Test Data Generator is configured through the YAML file TDG_configuration_file.yml
found in the artefacts bucket at the following path:
s3://tdg-artefacts-<account-id>/tgd_glue_job/Config/TDG_configuration_file.yml
Number of desired generated records
Descriptor of the generated record fields/columns. You can configure the following data types:
- Unique Key Generator
ColumnName: Column name
Generator: key_generator
DataDescriptor:
Prefix: (optional) prefix to the key generated values
LeadingZeros: (optional) number of digits used to format the key values. Key values are padded with leading zeros to produce a fixed number of digits
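As a rough illustration of how Prefix and LeadingZeros shape the generated keys, the following pure-Python sketch formats sequential integers. The function name `format_key` and the sample prefix are hypothetical, and treating LeadingZeros as the total digit width is an assumption based on the description above. The framework itself runs on PySpark and may behave differently.

```python
def format_key(index: int, prefix: str = "", leading_zeros: int = 0) -> str:
    """Format a sequential index as prefix + zero-padded number (assumed behaviour)."""
    # zfill pads with zeros up to the requested width; wider numbers are left intact.
    return f"{prefix}{str(index).zfill(leading_zeros)}"


keys = [format_key(i, prefix="CUST", leading_zeros=6) for i in range(1, 4)]
print(keys)  # ['CUST000001', 'CUST000002', 'CUST000003']
```

A fixed width keeps keys sortable as strings, which is one reason to set LeadingZeros when the values are used as a partition key.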
- Child Key Generator
ColumnName: Column name
Generator: child_key_generator
DataDescriptor:
Prefix: prefix should match the parent key prefix
LeadingZeros: should match the parent key LeadingZeros
ChildCountPerSublevel: a list of the number of nodes per hierarchy sub-level. For example, the following list describes three levels of hierarchy, where level 1 has 10 nodes, level 2 has 100 nodes, and level 3 has 1000 nodes:
- 10
- 100
- 1000
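To make the hierarchy idea concrete, here is a minimal pure-Python sketch of one possible interpretation of ChildCountPerSublevel: nodes are numbered sequentially level by level, and each node below level 1 references a randomly chosen node from the level above. The function `assign_parents` and this assignment rule are assumptions made for illustration, not the framework's actual algorithm.

```python
import random


def assign_parents(child_counts, seed=0):
    """Return {node_id: parent_id}; level-1 nodes have parent None (assumed rule)."""
    rng = random.Random(seed)
    parents = {}
    previous_level = []
    next_id = 1
    for count in child_counts:
        current_level = list(range(next_id, next_id + count))
        for node in current_level:
            # Each node points at a random node of the level directly above it.
            parents[node] = rng.choice(previous_level) if previous_level else None
        previous_level = current_level
        next_id += count
    return parents


parents = assign_parents([10, 100, 1000])
print(len(parents))  # 1110 nodes in total
```

Combined with matching Prefix and LeadingZeros values, the parent IDs could then be formatted the same way as the parent key column.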
- String Data Generator
1. Strings from a Dictionary
ColumnName: Column name
Generator: string_generator
DataDescriptor:
Values: a list of string values.
2. Strings from a Pattern
ColumnName: Column name
Generator: string_generator
DataDescriptor:
Pattern: a pattern of expressions separated by #. Available expressions:
- Constant strings: can be any constant string such as: Contact Details, @, Title:, etc.
- Random Numbers: ^N, for example to specify 8 digits: ^N8
- Random Alphabetic Strings: ^A, for example to specify a random string of 10 characters: ^A10
- Random Alphanumeric Strings: ^X, for example to specify a random alphanumeric string of 5 characters: ^X5
For example, the following pattern
Contact Details: Email: #^X8#__#^N2#@#^A4#.#^A3# Phone: #^N8
will result in the following sample values:
Contact Details: Email: [email protected] Phone: 9643728
Contact Details: Email: [email protected] Phone: 84716259
Contact Details: Email: [email protected] Phone: 4651938
3. Random Strings
ColumnName: Column name
Generator: string_generator
DataDescriptor:
Random: 'True'
NumChar: length of generated alphanumeric strings
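The pattern syntax described under Strings from a Pattern can be sketched in plain Python as follows. This is an illustrative interpreter only: the function `expand_pattern`, the character pools, and the mixed-case letters are assumptions, and the real generator's character set and padding behaviour may differ.

```python
import random
import re
import string

# Assumed character pools for the ^N, ^A and ^X pattern codes.
POOLS = {
    "N": string.digits,
    "A": string.ascii_letters,
    "X": string.ascii_letters + string.digits,
}
TOKEN = re.compile(r"\^([NAX])(\d+)")


def expand_pattern(pattern: str, rng: random.Random) -> str:
    """Expand a '#'-separated pattern into one random string."""
    parts = []
    for segment in pattern.split("#"):
        match = TOKEN.fullmatch(segment)
        if match:
            pool, length = POOLS[match.group(1)], int(match.group(2))
            parts.append("".join(rng.choice(pool) for _ in range(length)))
        else:
            parts.append(segment)  # anything else is kept as a constant string
    return "".join(parts)


rng = random.Random(42)
print(expand_pattern("Email: #^X8#__#^N2#@#^A4#.#^A3# Phone: #^N8", rng))
```

Splitting on # first keeps the grammar simple: every segment is either exactly one random-token expression or a literal.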
- Integer Data Generator
ColumnName: Column name
Generator: integer_generator
DataDescriptor:
Range: lower value, upper value
- Float/Double Data Generator
ColumnName: Column name
Generator: float_generator
DataDescriptor:
Expression: SQL expression such as: rand(42) * 3000
- Date Data Generator
ColumnName: Column name
Generator: date_generator
DataDescriptor:
StartDate: start date of the date range in the format DD/MM/YYYY
EndDate: end date of the date range in the format DD/MM/YYYY
- Close Date Data Generator
ColumnName: Column name
Generator: close_date_generator
DataDescriptor:
StartDateColumnName: column name of the generated open date
CloseDateRangeInDays: maximum span from the open date in days
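The relationship between the date generator and the close date generator can be sketched in plain Python. The helper names `random_date` and `random_close_date` are hypothetical, and the uniform sampling and inclusive bounds are assumptions. The actual job computes these columns in PySpark.

```python
import random
from datetime import date, datetime, timedelta

DATE_FORMAT = "%d/%m/%Y"  # DD/MM/YYYY, as used by StartDate and EndDate


def random_date(start: str, end: str, rng: random.Random) -> date:
    """Pick a random date between start and end, both inclusive (assumed)."""
    first = datetime.strptime(start, DATE_FORMAT).date()
    last = datetime.strptime(end, DATE_FORMAT).date()
    return first + timedelta(days=rng.randint(0, (last - first).days))


def random_close_date(open_date: date, max_days: int, rng: random.Random) -> date:
    """Pick a close date at most max_days after the open date."""
    return open_date + timedelta(days=rng.randint(0, max_days))


rng = random.Random(7)
opened = random_date("01/01/2023", "31/12/2023", rng)
closed = random_close_date(opened, 30, rng)
print(opened.strftime(DATE_FORMAT), closed.strftime(DATE_FORMAT))
```

Deriving the close date from the open date column, rather than sampling it independently, guarantees that a ticket is never closed before it was opened.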
- Internet Address Data Generator
ColumnName: Column name
Generator: ip_address_generator
DataDescriptor:
IpRanges: a list of ranges for the four numeric parts of the IP address, each in the form lower value, upper value. For example:
- 9,10
- 1,254
- 1,128
- 2,20
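As a small illustration of how the four IpRanges entries could map onto an address, the sketch below draws each octet independently from its range. The function `random_ip` is hypothetical, and treating both bounds as inclusive is an assumption.

```python
import random


def random_ip(ip_ranges, rng: random.Random) -> str:
    """Build a dotted IPv4 address from four inclusive (low, high) ranges."""
    if len(ip_ranges) != 4:
        raise ValueError("expected one range per octet")
    return ".".join(str(rng.randint(low, high)) for low, high in ip_ranges)


# The example ranges 9,10 / 1,254 / 1,128 / 2,20 expressed as tuples.
ranges = [(9, 10), (1, 254), (1, 128), (2, 20)]
print(random_ip(ranges, random.Random(1)))
```

With these ranges every generated address starts with 9 or 10, which is handy for exercising filters that target a specific network block.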
The list of targets for the generator. The generator performs automatic data type conversion for every specified target. Currently, the generator supports the following targets:
- S3 Buckets
target: S3
attributes:
BucketArn: S3 bucket ARN including the prefix
mode: S3 bucket writing mode (overwrite, append)
header: include a header in the generated data (True, False)
delimiter: CSV file delimiter
- DynamoDB tables
target: Dynamodb
attributes:
dynamodb.output.tableName: DynamoDB table name
dynamodb.throughput.write.percent: throughput write percent
From the AWS Glue Console:
- Navigate to Data Integration and ETL > AWS Glue Studio > Jobs
- Select the TestDataGeneratorJob job and press Run Job
- Once the job completes successfully, check for the generated data in the configured targets.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
- Mohamed Elbishbeashy - Wrote the initial version.